Related words in a corpus

I will attempt to describe a simple approach to find related words in a large corpus. I follow the problem described by Joseph Turian. We have D documents and a total of  W unique words among them.

First, let us assume that two related words occur in a non-independent fashion. i.e., if the (marginal) probabilities of occurrence of related words u and v in any document are P(u) and P(v), then the probability of  occurrence of both is P(u,v) \neq P(u) P(v). Framing this as a likelihood ratio, the likelihood ratio for words u,v being related is

\lambda_1 = \frac{\int_0^1 \int_0^1 p_u^{n_{u \neg v}}  p_v^{n_{v \neg u}} (p_u  p_v)^{n_{uv}} ((1-p_u)(1-p_v))^{n_{\neg u \neg  v}} d p_u d p_v}{\int \int \int_{\sum p_i = 1} p_1^{n_{u \neg  v}}  p_2^{n_{v \neg u}}  p_3^{n_{uv}} p_4^{n_{\neg u \neg v}} d p_1 d p_2d   p_4 d p_4}

\Rightarrow \lambda_1 = \frac{\Gamma(n_{u \neg v}+1)\Gamma(n_{u v}+1)\Gamma(n_{\neg u \neg v}+1)\Gamma(n_{v \neg u}+1)}{\beta(n_u+1,n_{\neg u}+1)\beta(n_v+1,n_{\neg v}+1) \Gamma(\lvert D \rvert +4)},

where n_{.} denotes the number of documents with certain words, eg., n_{u \neg v} denotes the number of documents containing the word u but not word v.

I then wanted a simple model to test whether two words have the same pattern of occurrence. While this is is not a generative model, it leads to a simple hypothesis test for this second degree co-occurrence pattern. For any hidden word $u$, given a document with present words w_p and absent words w_q, where W =  w_p \cup w_q \cup \{ u \}, and w_p \cap w_q = w_p \cap \{ u \} =  w_q \cap \{ u \}=0, the probability of observing word u in the document is

\sum_{i \in w_p} p_{ui}.

This leads to a Fisher’s non-central hyper geometric distribution for n_{ui} (abusing notation a little bit):

P(\{n_{ui}\} | \{p_{ui}\}) = \frac{\prod_{i \in W / \{ u,v \}} {n_{i} \choose n_{ui} } p_{ui}^{n_{ui}}}{\sum_{ \{m_{ui}\} \forall \sum_i m_{ui} = n_i}\prod_{i \in W / \{ u,v \}} {n_{i} \choose m_{ui} } p_{ui}^{m_{ui}}}

\lambda_2 = \frac{\int ... \int_{\sum_i q_i=1}P(\{n_{ui}\} | \{p_i\}) P(\{n_{vi}\} | \{p_i\}) \prod_i d p_i}{ (\int ... \int_{\sum_i p_i=1}P(\{n_{ui}\} | \{p_i\}) \prod_i d p_i)(\int ... \int_{\sum_i q_i=1}P(\{n_{vi}\} | \{q_i\}) \prod_i d q_i) }

The denominators cancel out, and the integral reduces to gamma functions. Cutting to the chase, the test reduces to (click on the equation below for a better view)

Finally, we may choose to rank related words by a combination \log \lambda_1 + \log \lambda_2. The results I got can be downloaded from here.

If someone would like to collaborate to make this more complete. e.g., by using one generative model for the co-occurrence frequency and co-occurrence pattern measures, kindly leave a comment.

1 Comment

Follow

Get every new post delivered to your Inbox.