• Assumptions behind the Probabilistic Relevance Framework
    • Relevance is a property of an individual document.
    • Relevance is binary: a document is either relevant or not.

Basic model

Assumptions

  1. Conditional independence between a document's terms, given relevance
    • Debatable, but shown to be robust in practice
  2. Only query terms matter for ranking the documents
    • Reasonable assumption in the absence of other information

Model

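$$P(rel|d, q) \approx \sum_q U_i(tf_i)\\ \propto_q \sum_{q, tf_i>0} w_i$$

Here $\propto_q$ denotes rank equivalence for a fixed query: both sides order the documents identically.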

$$U_i(x) = \log \frac{P(TF_i=x|rel)}{P(TF_i=x|\sim rel)}, \\ w_i = U_i(tf_i) - U_i(0) = \log \frac{P(TF_i=tf_i|rel)P(TF_i=0|\sim rel )}{P(TF_i=tf_i|\sim rel )P(TF_i=0|rel)}$$

Comments

  • The advantage of using $w_i$ instead of $U_i$ is that we only need to compute the score for documents that contain at least one query term.
  • The model is not restricted to terms and term frequencies; any attribute of the document or of the query-document pair can be included.
    • Any discrete property with a natural zero can be handled using the $w_i$ form of the weight.
    • If we want to include a property without such a natural zero, we need to revert to the $U_i$ form.
  • The formulas above suffice if we are only interested in the ranking problem. They do not yield accurate probabilities, because of the approximations and transformations performed to arrive at the simplified equation.
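
The ranking rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `score`, `rank` and `toy_weight`, as well as the weights and documents, are hypothetical and chosen only to show that a document without any query term never needs to be scored.

```python
def score(query_terms, doc_tf, weight):
    """Sum w_i over the query terms that occur in the document (tf_i > 0)."""
    total = 0.0
    for term in set(query_terms):          # each query term is counted once
        tf = doc_tf.get(term, 0)
        if tf > 0:                         # absent terms contribute nothing
            total += weight(term, tf)
    return total


def rank(query_terms, docs, weight):
    """Score only documents containing at least one query term, best first."""
    scored = [
        (doc_id, score(query_terms, doc_tf, weight))
        for doc_id, doc_tf in docs.items()
        if any(doc_tf.get(t, 0) > 0 for t in query_terms)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-term weights standing in for w_i (assumption: they ignore tf).
TOY_WEIGHTS = {"probabilistic": 2.0, "relevance": 1.5, "model": 0.5}


def toy_weight(term, tf):
    return TOY_WEIGHTS[term]


# Hypothetical collection: document id -> {term: term frequency}.
docs = {
    "d1": {"probabilistic": 1, "model": 3},
    "d2": {"relevance": 2},
    "d3": {"cooking": 4},                  # no query term, so never scored
}
query = ["probabilistic", "relevance", "model"]

print(rank(query, docs, toy_weight))       # [('d1', 2.5), ('d2', 1.5)]
```

In a real system the candidate filter would typically come from an inverted index rather than a scan over all documents; the scan here only keeps the sketch short.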

The Binary Independence Model

Assumptions

  1. TF_i is a binary variable. That is, the term is either present or absent.
    • Define t_i as the event that the term is present in the document

Model

  • Under the binary assumption w_i becomes

    $$w_i^{BIM} = \log \frac{P(t_i|rel)(1-P(t_i|\sim rel))}{(1-P(t_i|rel))P(t_i|\sim rel)}$$

  • We can estimate the probabilities in this weight from the following quantities:

    • N: Size of the whole collection
    • n_i: Number of documents in the collection containing t_i
    • R: Relevant set size (i.e., number of documents judged relevant)
    • r_i: Number of judged relevant docs containing t_i

      $$P(t_i|rel)=r_i/R, \quad P(t_i|\sim rel)=(n_i - r_i)/(N-R)$$

  • Using the estimates above and a small 0.5 correction for robustness, we have

    $$w_i^{BIM} = \log \frac{(r_i + 0.5)(N-R-n_i+r_i+0.5)}{(n_i-r_i+0.5)(R-r_i+0.5)}$$

    • The resulting formula is the well-known Robertson/Spärck Jones weight, also denoted as w_i^{RSJ}
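
As a rough sketch, the RSJ weight can be computed directly from the four counts. The function name `rsj_weight`, the input checks, and the example counts are assumptions made for illustration only.

```python
import math


def rsj_weight(N, n_i, R, r_i):
    """Robertson/Sparck Jones weight with the 0.5 correction.

    N   : size of the whole collection
    n_i : number of documents containing the term
    R   : number of documents judged relevant
    r_i : number of judged relevant documents containing the term
    """
    # Basic consistency checks on the counts (an addition of this sketch).
    if not (0 <= r_i <= R <= N and r_i <= n_i <= N and n_i - r_i <= N - R):
        raise ValueError("inconsistent counts")
    numerator = (r_i + 0.5) * (N - R - n_i + r_i + 0.5)
    denominator = (n_i - r_i + 0.5) * (R - r_i + 0.5)
    return math.log(numerator / denominator)


# Hypothetical counts: the term occurs in 8 of 10 relevant documents
# but in only 50 of 1000 documents overall, so the weight is positive.
print(rsj_weight(N=1000, n_i=50, R=10, r_i=8))

# Hypothetical counts: a common term that no relevant document contains
# receives a negative weight.
print(rsj_weight(N=1000, n_i=500, R=10, r_i=0))
```

The 0.5 correction keeps every factor strictly positive, so the logarithm is defined even when a count is zero.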

Absence of relevance information

  • If we assume that the entire collection is non-relevant, we can set r_i = R = 0 in the formula above, which leads to a close approximation of the classical idf

    $$w_i^{IDF} = \log \frac{N - n_i+0.5}{n_i+0.5}$$
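
A short sketch of this idf-like weight follows. The function name `idf_weight` and the collection sizes are hypothetical. One property worth noting, which follows directly from the formula, is that the weight turns negative once a term occurs in more than half of the collection.

```python
import math


def idf_weight(N, n_i):
    """idf-like weight obtained from the RSJ form with r_i = R = 0."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))


N = 1000  # hypothetical collection size
for n_i in (10, 100, 600):
    # Rarer terms get larger weights; n_i > N / 2 gives a negative weight.
    print(n_i, idf_weight(N, n_i))
```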

Comments

  • We assumed above that any non-judged document is non-relevant.
  • The absence-of-relevance-information assumption is equivalent to saying that P(t_i|rel) = 0.5.
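
One way to see the last comment is the following worked step. It assumes that the 0.5 correction corresponds to the smoothed estimates $(r_i + 0.5)/(R + 1)$ and $(n_i - r_i + 0.5)/(N - R + 1)$; these forms are inferred from the corrected weight and are not stated explicitly in these notes.

$$P(t_i|rel) \approx \frac{r_i + 0.5}{R + 1} = \frac{0.5}{1} = 0.5 \quad \text{when } r_i = R = 0$$

$$\log \frac{P(t_i|rel)}{1 - P(t_i|rel)} = \log \frac{0.5}{0.5} = 0 \quad \Rightarrow \quad w_i^{BIM} = \log \frac{1 - P(t_i|\sim rel)}{P(t_i|\sim rel)}$$

$$P(t_i|\sim rel) \approx \frac{n_i + 0.5}{N + 1} \quad \Rightarrow \quad w_i^{BIM} = \log \frac{N - n_i + 0.5}{n_i + 0.5} = w_i^{IDF}$$

Under these assumptions the relevant-set factor drops out entirely, and only the collection statistics N and n_i remain.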