Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: zz-x2580
COMP6714 Project: Westlaw alike
queries
As presented in the lecture, Westlaw is a popular commercial information retrieval system.
You can search for documents by Boolean Terms and Connector queries. For example:
STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
where STATUTE, ACTION, FEDERAL, TORT, CLAIM are the search terms and space, /S, /2, /3
are the connectors.
In this project, you are going to implement a retrieval system in Python3 called
SimplyBoolean, which you have already encountered in the assignment. As a core
requirement for this project, you must implement SimplyBoolean using a positional index.
SimplyBoolean is a retrieval system that supports Westlaw alike queries. It supports the
following reduced set of connectors:
" ", space, +n, /n, +s, /s, &
as well as parentheses. Note that the connectors of your query will be processed in exactly
the order above. Further details of these connectors can be found in the Quick Reference
Guide of WestLaw available from WebCMS.
Different to Westlaw, SimplyBoolean does not support various forms of search terms,
except a normal search term (i.e. single-word, those without " ") and a phrase.
Term matching (including terms in a phrase) in SimplyBoolean follows the below:
Search in SimplyBoolean is case insensitive.
Full stops for abbreviations are ignored. e.g., U.S., US are the same.
Singular/Plural is ignored. e.g., cat, cats, cat's, cats' are all the same.
Tense is ignored. e.g., breaches, breach, breached, breaching are all the same.
A sentence can only end with a full stop, a question mark, or an exclammation mark.
Except the above, all other punctuation should be treated as token dividers.
All (whole) numeric tokens such as years, decimals, integers are ignored. You should
not index these tokens and hence should not consider them for proximity queries
such as +n. E.g. you should not index '123' (wholly numeric) but should index 'abc123'
(partially numeric).
You are provided with approximately 1000 small documents (named with their document
IDs) available in ~cs6714/reuters/data. You can find these files by logging into CSE
machines and going to folder ~cs6714/reuters/data. Your submitted project will be tested
against a similar collection of up to 1000 documents (i.e., we may replace some of these
documents to avoid any hard-coded solutions).
Your submission must include 2 main programs: index.py and search.py as described below.
The Indexer
$ python3 index.py [folder-of-documents] [folder-of-indexes]
where [folder-of-documents] is the path to the directory for the collection of documents to
be indexed and [folder-of-indexes] is the path to the directory containing the index file(s)
be created. All the files in [folder-of-documents] should be opened as read-only, as you
may not have the write permission for these files. If [folder-of-indexes] does not exist,
create a new directory as specified. You may create multiple index files although too many
index files may slow down your performance. The total size of all your index files generated
shall not exceed 20MB (which should be plenty for this project).
After the indexing is completed, it will output the total number of documents, the total
number of tokens (after any preprocessing and filtering) to be indexed, and the total
number of terms to be indexed. The following example illustrates the required input and
output formats:
$ python3 index.py ~/Desktop/MyDataFolder ./MyTestIndex
Total number of documents: 672
Total number of tokens: 638321
Total number of terms: 13297
Note: the output of index.py ends with one newline ('\n') character.
Searching
$ python3 search.py [folder-of-indexes]
where [folder-of-indexes] is the path to the directory containing the index file(s) that are
generated by the indexer. After the above command is executed, it will accept a search
query from the standard input and output the result to the standard output as a sequence
of document names (the same as their document IDs) one per line and sorted in an
ascending order by their numeric values (e.g., 72 will be output before 125). It will then
continue to accept the search queries from the standard input and output the results to the
standard output until the end (i.e., a Ctrl-D). The following example illustrates the required
input and output formats:
$ python3 search.py ~/Proj/MyTestIndex
company inc & revenue
9
17
33
185
share +5 investor & US
3
67
271
365
499
625
$
Chaining Mixed Connectors
Example:
a b /s c
Explanation: As per the WestLaw guide, following the precedence rules for this example,
OR (the space) has higher priority. Because OR is a boolean connector, we can re-write the
query into an equivalent form below:
(a /s c) (b /s c)
Chaining Non-boolean Connectors
Example:
a +n b /s c
Explanation: The connector precedence here lies with '+n' first, then '/s'. This query can be
understood as the equivalent of doing (a +n b) first, then only among the documents (and
more importantly, their postings), we will output the documents for which (a /s c) (for the
same posting 'a') or (b /s c) (for the same posting 'b') is true. To further explain in english,
the query wants documents where either:
there is 'a' which precedes 'b' by at most 'n' terms, that same occurrence of 'a' is in a
sentence with 'c'
there is 'a' which precedes 'b' by at most 'n' terms, that same occurrence of 'b' is in a
sentence with 'c'
Another example:
a +n (b /s c)
Explanation: With the presence of parentheses, this query will have (b /s c) processed first.
From the resulting document postings, we will output the documents for which (a +n b)
(for the same posting 'b') or (a +n c) (for the same posting 'c') is true. To further explain in
english, the query wants documents where either:
there is 'b' in a sentence with 'c', that same occurrence of 'b' occurs at most 'n' terms
after 'a'
there is 'b' in a sentence with 'c', that same occurrence of 'c' occurs at most 'n' terms
after 'a'
Marking
This assignment is worth 30 points.
Your submission will be tested and marked on CSE linux machines using Python3.
Therefore, please make sure you have tested your solution on these machines using
Python3 before you submit. You will not receive any marks if your program does not work
on CSE linux machines and only works in other environment such as your own laptop.
Full marks will be awarded to submissions that follow this specification and pass all the test
cases.
Although we do not measure the runtime speed, your indexing program will be terminated
if it does not end after one minute, and you will receive zero marks for the project (since
we cannot get the index generated successfully for further testing); and your search
program will be terminated if it does not end after 10 seconds per search query, and you
will receive zero marks for that search query.
There will be test cases for each connector in addition to the test cases for mixing several
connectors. Therefore, if you are unable to implement all the required connectors, try your
best to implement as many as you can.
Submission
Deadline: Monday 7th November 12:00pm AEST (noon).
The penalty for late submission of assignments will be 5% (of the worth of the assignment)
subtracted from the raw mark per day of being late. In other words, earned marks will be
lost. No assignments will be accepted later than 5 days after the deadline.
Use the give command below to submit the assignment:
give cs6714 proj *.py
Make sure to use classrun to check your submission to make sure that you have submitted
all the required files for index.py and search.py to run properly.
6714 classrun -check proj
Plagiarism
The work you submit must be your own work. Group submissions will not be allowed. Your
program must be entirely your own work. Plagiarism detection software will be used to
compare all submissions pairwise (including submissions for similar assignments in
previous years, if applicable) and serious penalties will be applied, particularly in the case of
repeat offences.
Do not copy ideas or code from others.
Do not use a publicly accessible repository or allow anyone to see your code.
Please refer to the Student Conduct section in the course outline to help you understand
what plagiarism is and how it is dealt with at UNSW.
Finally, reproducing, publishing, posting, distributing or translating this assignment is an
infringement of copyright and will be referred to UNSW Student Conduct and Integrity for
action.