COMP6714 Information Retrieval and Web Search
Question 1 (10 marks)
Write down the pseudo-code for answering the Boolean query NOT A AND B. You need to
use the function skipTo(id) wherever possible.
You can refer to the following skeleton code. Note that it is by no means the complete
pseudo-code; you should add lines and/or modify existing lines to complete this
task.
Algorithm 1: Q1(p1, p2)
    answer ← ∅;
    while ... do
        if docID(p1) > docID(p2) then
            ;
        else
            if docID(p1) < docID(p2) then
                ;
            else
                ;
    return answer;
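For illustration only, here is one possible completion of the skeleton, written as plain
Python over sorted docID lists. This is a sketch of one correct approach, not necessarily
the expected answer; skip_to below is a stand-in for the skipTo(id) primitive over a
posting list with a cursor.

    import bisect

    def skip_to(postings: list, i: int, target: int) -> int:
        """Stand-in for skipTo(id): first position >= target, from cursor i."""
        return bisect.bisect_left(postings, target, lo=i)

    def not_a_and_b(a: list, b: list) -> list:
        """Answer 'NOT A AND B': docIDs in b that do not appear in a."""
        answer, i, j = [], 0, 0
        while j < len(b):
            if i >= len(a) or a[i] > b[j]:
                answer.append(b[j])       # docID(p2) cannot occur in A
                j += 1
            elif a[i] < b[j]:
                i = skip_to(a, i, b[j])   # advance p1 with skipTo
            else:                         # equal docIDs: in A, so excluded
                i += 1
                j += 1
        return answer

    # not_a_and_b([2, 5, 9], [1, 2, 3, 9, 11]) -> [1, 3, 11]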
Question 2 (14 marks)
Consider applying γ-encoding to a posting list of n document IDs (within the range of
[1, N]).
Prove that:
• For a value x, its γ-encoded value takes at most 2 log2(x) + 1 bits (a small
worked sketch follows below).
• The compressed posting list (using γ codes on the gaps) takes at most
n · log2(2N²/n²) bits.
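As a concrete illustration of the first claim, a small Python sketch of γ-encoding
(gamma_encode is my own helper name): the unary length part takes ⌊log2 x⌋ + 1 bits and
the offset takes ⌊log2 x⌋ bits, giving 2⌊log2 x⌋ + 1 ≤ 2 log2(x) + 1 bits in total.

    def gamma_encode(x: int) -> str:
        """Return the gamma code of x >= 1 as a bit string."""
        offset = bin(x)[3:]               # binary of x with the leading '1' dropped
        length = '1' * len(offset) + '0'  # unary code of floor(log2(x))
        return length + offset

    # gamma_encode(9) == '1110001': 7 bits = 2 * floor(log2(9)) + 1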
Question 3 (14 marks)
(a) Consider a dictionary that contains v terms, where each term has exactly l
characters. Assume we build both a permuterm index and a bi-gram index for the
dictionary. What are the sizes of these two indexes (do not include the size of
the dictionary and the inverted lists), respectively? You may assume that a
pointer (to a term in the dictionary) takes 4 bytes and that all terms contain
only characters from a to z.
(b) Consider the query P*Q*R, where P, Q, and R are sequences of characters. Briefly
describe how a permuterm index can be used to efficiently answer this query (one
possible approach is sketched after part (c)).
(c) The above query can also be answered without using list intersection. Briefly
describe how this can be done, and give a reason why this might be more efficient
than the previous query processing method.
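For part (b), a rough sketch of the standard approach (my own illustration;
permuterm_index is a hypothetical dict mapping each rotation of term$ to the term, and a
real index would answer the prefix lookup with a sorted structure such as a B-tree
rather than a linear scan):

    def lookup_p_q_r(permuterm_index: dict, p: str, q: str, r: str) -> set:
        """Answer P*Q*R: look up the rotation R$P*, then post-filter for Q."""
        key = r + '$' + p
        candidates = {term for rot, term in permuterm_index.items()
                      if rot.startswith(key)}   # terms with prefix P and suffix R
        # post-filter: Q must occur between the P prefix and the R suffix
        return {t for t in candidates if q in t[len(p):len(t) - len(r)]}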
Question 4 (14 marks)
Consider the logarithmic merge algorithm for dynamic index construction. The sub-
indexes created on the disk are: I3, I2, I1, and I0 (note: I0 is the smallest sub-index on
the disk). Assume the current in-memory index is full and needs to be dumped to the
disk.
(a) What are the sub-indexes after dumping the current in-memory index to the disk?
(b) How many sub-indexes will the algorithm create to index a document collection of
size |C| when the memory size is M?
(c) Let |C| = 14 · M. What is the total number of times sub-indexes are merged? You
need to include merges between a sub-index on disk and the in-memory index.
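To make the mechanics concrete, here is a toy simulation of logarithmic merging (my own
sketch, not the expected working): sub-index I_k holds 2^k · M postings, and dumping a
full in-memory index cascades merges while I_0, I_1, ... already exist on disk. It counts
each pairwise merge once, including the merge of the in-memory index with I_0.

    def dump(levels: set) -> int:
        """Dump a full in-memory index; return how many merges it triggers."""
        merges, k = 0, 0
        while k in levels:
            levels.remove(k)   # merge the incoming index with I_k ...
            merges += 1        # ... producing an index of the next size
            k += 1
        levels.add(k)          # write the (possibly merged) index as I_k
        return merges

    levels, total = set(), 0
    for _ in range(14):        # |C| = 14 * M  =>  14 in-memory dumps
        total += dump(levels)
    print(sorted(levels), total)   # remaining sub-index levels, merge count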
Question 5 (10 marks)
The figure below shows the output of two information retrieval systems on the same two
queries in a competitive evaluation. The top 10 ranks are shown. Crosses correspond
to documents judged relevant by a human judge; dashes correspond to irrelevant
documents. Assume that System 1 retrieved all the relevant documents for
both queries.
System 1:

    Rank   Q1   Q2
    1      X    X
    2      X    -
    3      X    -
    4      X    -
    5      -    -
    6      -    X
    7      -    -
    8      -    -
    9      X    X
    10     X    X

System 2:

    Rank   Q1   Q2
    1      X    X
    2      X    -
    3      X    -
    4      -    X
    5      X    X
    6      X    -
    7      -    -
    8      -    -
    9      X    -
    10     -    -
(a) Explain the following evaluation metrics and give results for query Q2 for both
systems.
1. Precision at rank 8.
2. Recall at precision 1/3.
(b) Give the formula for mean average precision (MAP), and calculate MAP for both
systems.
(c) Consider Q1 for System 1. Compute the interpolated precisions at recall levels 0.5
and 0.8, respectively.
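A quick sketch (my own, not the marking scheme) of the metrics involved: MAP is the mean,
over all queries, of the average precision (AP) of each query, where AP averages the
precision values at the ranks of the relevant documents.

    def precision_at(ranking: list, k: int) -> float:
        return sum(ranking[:k]) / k

    def average_precision(ranking: list, total_relevant: int) -> float:
        hits, ap = 0, 0.0
        for i, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                ap += hits / i          # precision at each relevant rank
        return ap / total_relevant

    # System 1, Q1 (X = 1, - = 0); the question states System 1 retrieved all
    # relevant documents, so total_relevant can be read off its own list.
    s1_q1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
    print(precision_at(s1_q1, 8), average_precision(s1_q1, sum(s1_q1)))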
Question 6 (14 marks)
Consider the following documents and the ground-truth of their relevance to the query
{x1, x3}. In which order will these documents be ranked under the BIM model with
add-1/2 smoothing? You need to justify your answer.
Document Relevant? x1 x2 x3
D1 T 1 1 1
D2 T 0 1 0
D3 F 1 0 0
D4 T 0 0 1
D5 F 0 1 0
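For reference, a sketch of one common formulation of the BIM retrieval status value with
add-1/2 smoothing (my own illustration using the RSJ log-odds weights; the course's
exact variant and expected layout of the working may differ):

    import math

    docs = {'D1': {'x1', 'x2', 'x3'}, 'D2': {'x2'}, 'D3': {'x1'},
            'D4': {'x3'}, 'D5': {'x2'}}
    relevant = {'D1', 'D2', 'D4'}
    query = {'x1', 'x3'}
    N, S = len(docs), len(relevant)

    weights = {}
    for t in query:
        df = sum(t in terms for terms in docs.values())   # doc frequency of t
        s = sum(t in docs[d] for d in relevant)           # relevant docs with t
        p = (s + 0.5) / (S + 1)                           # P(t | relevant)
        u = (df - s + 0.5) / (N - S + 1)                  # P(t | non-relevant)
        weights[t] = math.log(p * (1 - u) / (u * (1 - p)))

    rsv = {d: sum(weights[t] for t in query if t in terms)
           for d, terms in docs.items()}
    print(sorted(rsv, key=rsv.get, reverse=True))   # ranking by RSV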
Question 7 (10 marks)
Suppose we have a document collection with an extremely small vocabulary of only
6 words w1, w2, ..., w6. The following table shows the estimated background language
model p(w|C) using the whole collection of documents (2nd column) and the word counts
for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word
w in document di. Let Q = {w1, w2, w3, w4, w5, w6} be a query.
Word p(w|C) c(w, d1) c(w, d2)
w1 0.800 2 7
w2 0.100 3 1
w3 0.025 1 1
w4 0.025 2 1
w5 0.025 2 0
w6 0.025 0 0
(a) Suppose we do not smooth the language models for d1 and d2. Compute the
likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2). (Do not
compute the log-likelihood. You should use scientific notation, e.g., 0.0061
should be written as 6.1 × 10⁻³.) Which document would be ranked higher?
(b) Suppose we now smooth the language models for d1 and d2 using the Jelinek-Mercer
smoothing method with λ = 0.8 (i.e., p(w|d) = λ · pmle(w|Md) + (1 − λ) · pmle(w|Mc)).
Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and
p(Q|d2). (Do not compute the log-likelihood. You should use scientific
notation.) Which document would be ranked higher?
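A sketch of the query-likelihood computation (my own illustration of the formulas stated
in the question; setting λ = 1 recovers the unsmoothed MLE case of part (a)):

    p_C = {'w1': 0.800, 'w2': 0.100, 'w3': 0.025,
           'w4': 0.025, 'w5': 0.025, 'w6': 0.025}
    counts = {'d1': {'w1': 2, 'w2': 3, 'w3': 1, 'w4': 2, 'w5': 2, 'w6': 0},
              'd2': {'w1': 7, 'w2': 1, 'w3': 1, 'w4': 1, 'w5': 0, 'w6': 0}}
    query = ['w1', 'w2', 'w3', 'w4', 'w5', 'w6']

    def likelihood(d: str, lam: float = 1.0) -> float:
        """p(Q|d) with p(w|d) = lam * p_mle(w|M_d) + (1 - lam) * p(w|C)."""
        total = sum(counts[d].values())
        prob = 1.0
        for w in query:
            prob *= lam * counts[d][w] / total + (1 - lam) * p_C[w]
        return prob

    for d in ('d1', 'd2'):
        print(d, likelihood(d), likelihood(d, lam=0.8))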
Question 8 (14 marks)
Consider the basic architecture of a crawler in the figure below.
(a) Explain why the “content seen?” module is needed before the “Dup URL elim”
module.
(b) Assume that the search engine uses the MinHash algorithm to detect near-duplicate
documents. Given a document with the following hashed shingles: {1, 7, 15, 81}, and
two universal hash functions:
• h1(x) = ((7x + 1) mod 31) mod 13
• h2(x) = ((18x + 26) mod 31) mod 13
What is the resulting MinHash signature of the document?
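As a sketch of the computation (my own illustration): under MinHash with k hash
functions, the signature is the k-vector whose i-th component is the minimum of h_i over
the document's shingles.

    shingles = [1, 7, 15, 81]
    h1 = lambda x: ((7 * x + 1) % 31) % 13
    h2 = lambda x: ((18 * x + 26) % 31) % 13
    signature = [min(h(x) for x in shingles) for h in (h1, h2)]
    print(signature)   # one component per hash function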
END OF EXAM PAPER