DUE ON 20:59 29 NOV, 2019 (FRI)

Q1. (25 marks) Consider the following pseudocode, which performs list intersection based on the divide-and-conquer paradigm. Note that the input lists are not necessarily sorted.

Algorithm 1: Intersect(A, B)
1  if ... then        /* Deal with the boundary case */
2      ...;
3      return ...;
4  else               /* Recursively break down each list into two parts and recurse */
5      ...;
6      return ...;

(1) Complete the above pseudocode. You can assume that you can invoke the following member method on a List object L:
    • L.len returns the length of the list L.
You can also use the usual indexing and slicing operations on the list (as in Python).
(2) Think of a method to divide each input list into k sub-lists (k ≥ 2) without changing the main logic of the algorithm you implemented in the first part. You should be able to describe the only required change succinctly.

Q2. (25 marks) Consider the scenario of dynamic inverted index construction. Assume that t sub-indexes (each of M pages) will be created if one chooses the no-merge strategy.
(1) Show that if the logarithmic merge strategy is used, it will result in at most ⌈log₂ t⌉ sub-indexes.
(2) Prove that the total I/O cost of the logarithmic merge is O(t · M · log₂ t).

Q3. (25 marks) The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list is on the left of the list. This list shows 6 relevant documents. Assume that there are 8 relevant documents in total in the collection.

R R N N N N N N R N R N N N R N N N N R

(Note that the spaces above are only added to make the list easier to read.)
(1) What is the precision of the system on the top-20?
(2) What is the F1 on the top-20?
(3) What is/are the uninterpolated precision(s) of the system at 25% recall?
(4) What is the interpolated precision at 33% recall?
(5) Assume that these 20 documents are the complete result set of the system. What is the MAP for the query?

Assume, now, instead, that the system returned the entire 10,000 documents in a ranked list, and these are the first 20 results returned.
(6) What is the largest possible MAP that this system could have?
(7) What is the smallest possible MAP that this system could have?
(8) In a set of experiments, only the top-20 results are evaluated by hand. The result in (5) is used to approximate the range from (6) to (7). For this example, how large (in absolute terms) can the error for the MAP be when calculating (5) instead of (6) and (7) for this query?

COMP6714 ASSIGNMENT 1

Q4. (25 marks) Suppose we have a document collection with an extremely small vocabulary of only 6 words w1, w2, ..., w6. The following table shows the estimated background language model p(w|C) using the whole collection of documents (2nd column) and the word counts for documents d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word w in document di. Let Q = {w1, w2, w3, w4, w5, w6} be a query.

Word   p(w|C)   c(w, d1)   c(w, d2)
w1     0.800    2          7
w2     0.100    3          1
w3     0.025    1          1
w4     0.025    2          1
w5     0.025    2          0
w6     0.025    0          0

(1) Suppose we do not smooth the language models for d1 and d2. Compute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2). (Do not compute the log-likelihood. You should use scientific notation; e.g., 0.0061 should be written as 6.1 × 10⁻³.) Which document would be ranked higher?
(2) Suppose we now smooth the language models for d1 and d2 using the Jelinek-Mercer smoothing method with λ = 0.8 (i.e., p(w|d) = λ · p_mle(w|Md) + (1 − λ) · p_mle(w|Mc)). Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2). (Do not compute the log-likelihood. You should use scientific notation.) Which document would be ranked higher?
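As a reminder of the definitions used in Q3, the snippet below computes precision, recall, and F1 on a toy ranked list. The list and the total-relevant count here are invented for illustration and are NOT the data from the question.

```python
# Toy illustration of precision, recall, and F1 on a ranked list.
# The ranking and total_relevant below are made up; they are NOT Q3's data.
ranked = ["R", "N", "R", "N"]   # hypothetical top-4 ranking
total_relevant = 3              # hypothetical number of relevant docs in the collection

retrieved_relevant = ranked.count("R")
precision = retrieved_relevant / len(ranked)     # fraction of retrieved docs that are relevant
recall = retrieved_relevant / total_relevant     # fraction of relevant docs that were retrieved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

The same definitions, applied to each top-k prefix of the ranked list, yield the precision-at-recall and MAP quantities asked for in parts (3)-(8).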
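The Jelinek-Mercer formula from Q4(2) can be sketched as follows. The λ, background probabilities, and counts below are invented for illustration; they are deliberately NOT the values from the table, so this is a sketch of the mechanics, not the assignment answer.

```python
# Toy illustration of query likelihood under Jelinek-Mercer smoothing:
#   p(w|d) = lam * p_mle(w|Md) + (1 - lam) * p(w|C)
# All numbers below are hypothetical; they are NOT the assignment's table.
lam = 0.5                              # hypothetical smoothing weight
p_background = {"w1": 0.6, "w2": 0.4}  # hypothetical p(w|C)
counts = {"w1": 3, "w2": 1}            # hypothetical c(w, d)
doc_len = sum(counts.values())

def p_smoothed(w):
    """Interpolate the document MLE with the background model."""
    p_mle = counts.get(w, 0) / doc_len
    return lam * p_mle + (1 - lam) * p_background[w]

# Query likelihood is the product of per-word smoothed probabilities.
query = ["w1", "w2"]
likelihood = 1.0
for w in query:
    likelihood *= p_smoothed(w)
print(f"{likelihood:.3e}")
```

Note that without smoothing (the setting of Q4(1)), any query word with a zero count in a document drives that document's entire likelihood to zero, which is what the interpolation with p(w|C) prevents.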