成功案例设置

COMP6714 ASSIGNMENT

Q1. (25 marks)
Consider the following pseudo code which performs list intersection based on the divide-
and-conquer paradigm. Note that the input lists are not necessarily sorted.
Algorithm 1: Intersect(A, B)
1 if . . . then
/* Deal with the boundary case */
2 . . . ;
3 return . . . ;
4 else
/* Recursively break down each list into two parts and recurse */
5 . . . ;
6 return . . . ;
(1) Complete the above pseudo code. You can assume that you can invoke the following
member methods on a List object L:
• L.len returns the length of the list L.
You can also use the usually indexing and slicing operation on the list (as in
python).
(2) Think of a method to divide each input list into k sub-lists (k ≥ 2) without changing
the main logic of the algorithm you implemented in the first part. You should be
able to describe the only change succinctly.
Q2. (25 marks)
Consider the scenario of dynamic inverted index construction. Assume that t sub-indexes
(each of M pages) will be created if one chooses the no-merge strategy.
(1) Show that if the logarithmic merge strategy is used, it will result in at most dlog2 te
sub-indexes.
(2) Prove that the total I/O cost of the logarithmic merge is O(t ·M · log2 t).
1
2 DUE ON 20:59 29 NOV, 2019 (FRI)
Q3. (25 marks)
The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned
documents in a ranked list of 20 documents retrieved in response to a query from a collection
of 10, 000 documents. The top of the ranked list is on the left of the list. This list shows 6
relevant documents. Assume that there are 8 relevant documents in total in the collection.
R R N N N N N N R N R N N N R N N N N R
(Note that spaces above are just added to make the list easier to read)
(1) What is the precision of the system on the top-20?
(2) What is the F1 on the top-20?
(3) What is/are the uninterpolated precision(s) of the system at 25% recall?
(4) What is the interpolated precision at 33% recall?
(5) Assume that these 20 documents are the complete result set of the system. What
is the MAP for the query?
Assume, now, instead, that the system returned the entire 10, 000 documents in a ranked
list, and these are the first 20 results returned.
(6) What is the largest possible MAP that this system could have?
(7) What is the smallest possible MAP that this system could have?
(8) In a set of experiments, only the top-20 results are evaluated by hand. The result
in (5) is used to approximate the range (6) to (7). For this example, how large (in
absolute terms) can the error for the MAP be by calculating (5) instead of (6) and
(7) for this query?
Q4. (25 marks)
Suppose we have a document collection with an extremely small vocabulary with only
6 words w1, w2, . . . , w6. The following table shows the estimated background language
model p(w|C) using the whole collection of documents (2nd column) and the word counts
for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word w
in document di. Let Q = {w1, w2, w3, w4, w5, w6} be a query.
Word p(w|C) c(w, d1) c(w, d2)
w1 0.800 2 7
w2 0.100 3 1
w3 0.025 1 1
w4 0.025 2 1
w5 0.025 2 0
w6 0.025 0 0
(1) Suppose we do not smooth the language model for d1 and d2. Compute the likeli-
hood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute
the log-likelihood. You should use the scientific notation (e.g., 0.0061 should be
6.1× 10−3) Which document would be ranked higher?
COMP6714 ASSIGNMENT 1 3
(2) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer
smoothing method with λ = 0.8 (i.e., p(w|d) = λ·pmle(w|Md)+(1−λ)·pmle(w|Mc)).
Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2)
(Do not compute the log-likelihood. You should use the scientific notation) Which
document would be ranked higher?

留学ICU™️ 留学生辅助指导品牌

在线客服 7*24 全天为您提供咨询服务

咨询电话(全球): +86 17530857517

客服QQ：2405269519

微信咨询：zz-x2580

服务项目

服务流程

客户保障

关于我们

微信订阅号