Related words in a corpus
Posted by probreasoning in Uncategorized on January 16, 2011
I will attempt to describe a simple approach to finding related words in a large corpus, following the problem as described by Joseph Turian. We have documents and a total of
unique words among them.
First, let us assume that two related words do not occur independently, i.e., if the (marginal) probabilities of occurrence of related words and
in any document are
and
, then the probability of occurrence of both is
. Framing this as a likelihood ratio, the likelihood ratio for words
being related is
,
where denotes the number of documents with certain words, e.g.,
denotes the number of documents containing the word
but not word
.
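To make the document counts concrete, here is a minimal sketch in Python. It tallies the four document counts for a word pair and then computes a standard log-likelihood ratio (G-test) against independence. This is a swapped-in, textbook statistic and not necessarily the exact ratio used in this post; the function names, and the assumption that each document is represented as a set of words, are mine.

```python
import math
from typing import Iterable, Set, Tuple


def cooccurrence_counts(docs: Iterable[Set[str]], a: str, b: str) -> Tuple[int, int, int, int]:
    """Count documents containing both words, only a, only b, and neither."""
    n_ab = n_a_only = n_b_only = n_neither = 0
    for words in docs:
        has_a, has_b = a in words, b in words
        if has_a and has_b:
            n_ab += 1
        elif has_a:
            n_a_only += 1
        elif has_b:
            n_b_only += 1
        else:
            n_neither += 1
    return n_ab, n_a_only, n_b_only, n_neither


def log_likelihood_ratio(n_ab: int, n_a_only: int, n_b_only: int, n_neither: int) -> float:
    """G statistic, 2 * sum(O * ln(O / E)), with E computed under independence."""
    n = n_ab + n_a_only + n_b_only + n_neither
    if n == 0:
        return 0.0
    observed = ((n_ab, n_a_only), (n_b_only, n_neither))
    row = (n_ab + n_a_only, n_b_only + n_neither)   # word a present / absent
    col = (n_ab + n_b_only, n_a_only + n_neither)   # word b present / absent
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                g += o * math.log(o * n / (row[i] * col[j]))
    return 2.0 * g


# Hypothetical toy corpus: "cat" and "dog" always appear together.
docs = [{"cat", "dog"}, {"cat", "dog"}, {"fish"}, {"fish"}]
score = log_likelihood_ratio(*cooccurrence_counts(docs, "cat", "dog"))
```

A larger score means the observed counts are further from what independence would predict. The statistic is symmetric, so on its own it does not say whether the words attract or repel each other.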
I then wanted a simple model to test whether two words have the same pattern of occurrence. While this is not a generative model, it leads to a simple hypothesis test for this second-degree co-occurrence pattern. For any hidden word $u$, given a document with present words and absent words
, where
, and
, the probability of observing word
in the document is
.
This leads to a Fisher’s noncentral hypergeometric distribution for (abusing notation a little bit):
The denominators cancel out, and the integral reduces to gamma functions. Cutting to the chase, the test reduces to
Finally, we may choose to rank related words by a combination . The results I got can be downloaded from here.
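For readers unfamiliar with Fisher's noncentral hypergeometric distribution mentioned above, here is a minimal sketch of its standard probability mass function. It is the textbook form only: the parameter names (`m1`, `m2`, `n`, `odds`) are my own, and how the present and absent words of this post map onto them is an assumption left to the reader.

```python
from math import comb


def fisher_nchg_pmf(k: int, m1: int, m2: int, n: int, odds: float) -> float:
    """P(X = k) when n items are drawn, k from a group of size m1 and
    n - k from a group of size m2, with odds ratio `odds` favouring group 1."""
    lo, hi = max(0, n - m2), min(n, m1)  # support of the distribution
    if not lo <= k <= hi:
        return 0.0

    def weight(x: int) -> float:
        # Unnormalised mass: C(m1, x) * C(m2, n - x) * odds**x
        return comb(m1, x) * comb(m2, n - x) * odds ** x

    normaliser = sum(weight(x) for x in range(lo, hi + 1))
    return weight(k) / normaliser


# With odds = 1 this reduces to the ordinary (central) hypergeometric pmf.
p_central = fisher_nchg_pmf(1, m1=3, m2=2, n=2, odds=1.0)
# Odds above 1 shift probability mass toward larger k.
p_shifted = fisher_nchg_pmf(2, m1=3, m2=2, n=2, odds=4.0)
```

Computing the normaliser by direct summation is fine for small counts; for large corpora a log-space version would be safer numerically.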
If someone would like to collaborate to make this more complete, e.g., by using a single generative model for both the co-occurrence frequency and co-occurrence pattern measures, kindly leave a comment.