COMP6714 Project Specification
COMP6714 Project: Westlaw alike queries

As presented in the lecture, Westlaw is a popular commercial information retrieval system.
You can search for documents by Boolean Terms and Connector queries. For example:
STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

where STATUTE, ACTION, FEDERAL, TORT, CLAIM are the search terms and space, /S, /2, /3
are the connectors.
In this project, you are going to implement a retrieval system in Python3 called
SimplyBoolean, which you have already encountered in the assignment. As a core
requirement for this project, you must implement SimplyBoolean using a positional index.
SimplyBoolean is a retrieval system that supports Westlaw alike queries. It supports the
following reduced set of connectors:
" ", space, +n, /n, +s, /s, &

as well as parentheses. Note that the connectors of your query will be processed in exactly
the order above. Further details of these connectors can be found in the Quick Reference
Guide of WestLaw available from WebCMS.
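The precedence order above can be illustrated with a small recursive-descent parser. This is only a sketch under stated assumptions: the names `parse`, `kind`, `TOKEN` and `LEVELS`, and the nested-tuple output format, are hypothetical and not part of the specification. It assumes connectors listed earlier bind more tightly, that a space between two operands is an implicit OR, and that the query is well formed (no error handling for unbalanced parentheses).

```python
import re

# Phrase | parenthesis or & | +n, /n, +s, /s | plain search term
TOKEN = re.compile(r'"[^"]*"|[()&]|[+/](?:\d+|[sS])(?=[\s()]|$)|[^\s()&"]+')

# Connector classes from loosest to tightest binding (assumed from the order above).
LEVELS = ["&", "/s", "+s", "/n", "+n", "or"]


def kind(tok):
    """Classify the next token as a connector class, or None at a group end."""
    if tok is None or tok == ")":
        return None
    if tok == "&":
        return "&"
    if re.fullmatch(r"[+/]\d+", tok):
        return tok[0] + "n"
    if re.fullmatch(r"[+/][sS]", tok):
        return tok.lower()
    return "or"  # another operand follows directly: the space acts as OR


def parse(query):
    """Return a nested tuple (connector, left, right) for a well-formed query."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def operand():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = level(0)
            pos += 1  # skip the closing ")"
            return node
        if tok.startswith('"'):
            return ("phrase", tok.strip('"').lower().split())
        return tok.lower()

    def level(i):
        nonlocal pos
        if i == len(LEVELS):
            return operand()
        node = level(i + 1)
        while kind(peek()) == LEVELS[i]:
            if LEVELS[i] == "or":
                op = "or"  # implicit connector, nothing to consume
            else:
                op = tokens[pos].lower()
                pos += 1
            node = (op, node, level(i + 1))
        return node

    return level(0)
```

For example, `parse("a b /s c")` groups the OR first and yields `("/s", ("or", "a", "b"), "c")`.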
Unlike Westlaw, SimplyBoolean does not support the various forms of search terms; it
supports only a normal search term (i.e. a single word, without " ") and a phrase.
Term matching (including terms in a phrase) in SimplyBoolean follows these rules:
Search in SimplyBoolean is case insensitive.
Full stops for abbreviations are ignored. e.g., U.S., US are the same.
Singular/Plural is ignored. e.g., cat, cats, cat's, cats' are all the same.
Tense is ignored. e.g., breaches, breach, breached, breaching are all the same.
A sentence can only end with a full stop, a question mark, or an exclamation mark.
Apart from the above, all other punctuation should be treated as token dividers.
All (whole) numeric tokens such as years, decimals, integers are ignored. You should
not index these tokens and hence should not consider them for proximity queries
such as +n. E.g. you should not index '123' (wholly numeric) but should index 'abc123'
(partially numeric).
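The rules above could be approximated with a regex-based tokenizer like the sketch below. The function name `tokenize`, the `normalise` hook and all the patterns are assumptions made for illustration. In particular, the abbreviation heuristic (two or more single letters each followed by a full stop) and the apostrophe handling are guesses, and singular/plural and tense folding is left to whatever stemmer or lemmatiser is passed in as `normalise`.

```python
import re

# Assumed heuristic: "U.S." style abbreviations are runs of letter + full stop.
ABBREVIATION = re.compile(r"\b(?:[A-Za-z]\.){2,}")
# Wholly numeric tokens (integers, decimals, 1,000-style), not parts of words.
WHOLE_NUMBER = re.compile(r"(?<![A-Za-z0-9])\d+(?:[.,]\d+)*(?![A-Za-z0-9])")
# Possessive endings: cat's -> cat, cats' -> cats.
POSSESSIVE = re.compile(r"'s\b|'", re.IGNORECASE)
# A token, or one of the three sentence terminators.
TOKEN_OR_END = re.compile(r"[A-Za-z0-9]+|[.?!]")


def tokenize(text, normalise=lambda term: term):
    """Yield (term, position, sentence_number) for each indexable token."""
    # Drop full stops inside abbreviations so they do not end a sentence.
    text = ABBREVIATION.sub(lambda m: m.group(0).replace(".", ""), text)
    # Remove wholly numeric tokens before their "." can look like a sentence end.
    text = WHOLE_NUMBER.sub(" ", text)
    text = POSSESSIVE.sub("", text)

    position = sentence = 0
    for match in TOKEN_OR_END.finditer(text):
        token = match.group(0)
        if token in ".?!":
            sentence += 1
            continue
        if token.isdigit():  # defensive: skip any remaining numeric token
            continue
        yield normalise(token.lower()), position, sentence
        position += 1
```

Because numeric tokens are dropped before positions are assigned, they do not count towards proximity distances such as +n. With the default identity `normalise`, plurals such as "dollars" are kept unchanged.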
You are provided with approximately 1000 small documents (named with their document
IDs) available in ~cs6714/reuters/data. You can find these files by logging into CSE
machines and going to folder ~cs6714/reuters/data. Your submitted project will be tested
against a similar collection of up to 1000 documents (i.e., we may replace some of these
documents to avoid any hard-coded solutions).
Your submission must include 2 main programs: index.py and search.py as described below.

The Indexer
$ python3 index.py [folder-of-documents] [folder-of-indexes]
where [folder-of-documents] is the path to the directory for the collection of documents to
be indexed and [folder-of-indexes] is the path to the directory in which the index file(s) will
be created. All the files in [folder-of-documents] should be opened as read-only, as you
may not have the write permission for these files. If [folder-of-indexes] does not exist,
create a new directory as specified. You may create multiple index files although too many
index files may slow down your performance. The total size of all your index files generated
shall not exceed 20MB (which should be plenty for this project).
After the indexing is completed, it will output the total number of documents, the total
number of tokens (after any preprocessing and filtering) to be indexed, and the total
number of terms to be indexed. The following example illustrates the required input and
output formats:
$ python3 index.py ~/Desktop/MyDataFolder ./MyTestIndex
Total number of documents: 672
Total number of tokens: 638321
Total number of terms: 13297
Note: the output of index.py ends with one newline ('\n') character.
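As a rough illustration of the indexing step, the sketch below builds an in-memory positional index from already-tokenized documents, formats the three summary lines, and writes a single index file. The names `build_index`, `summary` and `save_index`, the `index.json` file name, and the choice of JSON as the on-disk format are all assumptions; the specification does not prescribe an index layout.

```python
import json
import os
from collections import defaultdict


def build_index(docs):
    """docs maps a document ID to its list of preprocessed tokens.

    Returns (index, total_tokens), where index maps
    term -> {doc_id: [positions]}.
    """
    index = defaultdict(lambda: defaultdict(list))
    total_tokens = 0
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
            total_tokens += 1
    # Convert to plain dicts so the result serialises cleanly.
    return {term: dict(postings) for term, postings in index.items()}, total_tokens


def summary(num_docs, total_tokens, num_terms):
    """Format the three required lines (print() adds the final newline)."""
    return (
        f"Total number of documents: {num_docs}\n"
        f"Total number of tokens: {total_tokens}\n"
        f"Total number of terms: {num_terms}"
    )


def save_index(index, folder, filename="index.json"):
    """Create the index folder if needed and write one JSON index file."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w") as handle:
        json.dump(index, handle)
    return path
```

A caller would then run something like `print(summary(len(docs), total_tokens, len(index)))`, so that the output ends with exactly one newline.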
Searching
$ python3 search.py [folder-of-indexes]
where [folder-of-indexes] is the path to the directory containing the index file(s) that are
generated by the indexer. After the above command is executed, it will accept a search
query from the standard input and output the result to the standard output as a sequence
of document names (the same as their document IDs) one per line and sorted in an
ascending order by their numeric values (e.g., 72 will be output before 125). It will then
continue to accept the search queries from the standard input and output the results to the
standard output until the end of input (i.e., Ctrl-D). The following example illustrates the required
input and output formats:
$ python3 search.py ~/Proj/MyTestIndex
company inc & revenue
9
17
33
185
share +5 investor & US
3
67
271
365
499
625
$
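The read-evaluate-print loop described above might look like the following sketch. The names `serve` and `format_results` are hypothetical, and `evaluate` stands in for whatever query evaluator the submission implements. It is assumed here that `evaluate` returns an iterable of document IDs whose values convert with `int`.

```python
import sys


def format_results(doc_ids):
    """One document ID per line, in ascending numeric order (72 before 125)."""
    return "".join(f"{doc_id}\n" for doc_id in sorted(doc_ids, key=int))


def serve(evaluate, stdin=sys.stdin, stdout=sys.stdout):
    """Answer one query per input line until end of input (Ctrl-D)."""
    for line in stdin:
        query = line.strip()
        if not query:
            continue  # assumption: blank lines are ignored
        stdout.write(format_results(evaluate(query)))
        stdout.flush()
```

Sorting with `key=int` matters when IDs are held as strings, since a plain string sort would place "125" before "72".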
Chaining Mixed Connectors
Example:
a b /s c

Explanation: As per the WestLaw guide, following the precedence rules for this example,
OR (the space) has higher priority. Because OR is a boolean connector, we can re-write the
query into an equivalent form below:
(a /s c) (b /s c)
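The equivalence of the two forms can be checked on toy postings. In this sketch, postings are represented as a dict from document ID to a set of (position, sentence) pairs; that representation and the helper names `union` and `same_sentence` are assumptions made for illustration only.

```python
def union(left, right):
    """OR: merge the postings of both operands, document by document."""
    docs = left.keys() | right.keys()
    return {d: left.get(d, set()) | right.get(d, set()) for d in docs}


def same_sentence(left, right):
    """/s: keep postings from either side that share a sentence with the other."""
    out = {}
    for doc in left.keys() & right.keys():
        shared = {s for _, s in left[doc]} & {s for _, s in right[doc]}
        kept = {p for p in left[doc] | right[doc] if p[1] in shared}
        if kept:
            out[doc] = kept
    return out


# Toy postings: doc_id -> {(position, sentence)}
A = {1: {(0, 0)}, 2: {(0, 0)}}
B = {2: {(3, 1)}, 3: {(1, 0)}}
C = {1: {(2, 0)}, 2: {(4, 1)}, 3: {(5, 2)}}

direct = same_sentence(union(A, B), C)                       # a b /s c
rewritten = union(same_sentence(A, C), same_sentence(B, C))  # (a /s c) (b /s c)
```

On this data both `direct` and `rewritten` contain exactly documents 1 and 2, with identical surviving postings.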
Chaining Non-boolean Connectors
Example:
a +n b /s c
Explanation: The connector precedence here lies with '+n' first, then '/s'. This query can be
understood as the equivalent of doing (a +n b) first, then only among the documents (and
more importantly, their postings), we will output the documents for which (a /s c) (for the
same posting 'a') or (b /s c) (for the same posting 'b') is true. To further explain in English,
the query wants documents where either:
there is an 'a' which precedes 'b' by at most 'n' terms, and that same occurrence of 'a' is
in a sentence with 'c'; or
there is an 'a' which precedes 'b' by at most 'n' terms, and that same occurrence of 'b' is
in a sentence with 'c'.
Another example:
a +n (b /s c)
Explanation: With the presence of parentheses, this query will have (b /s c) processed first.
From the resulting document postings, we will output the documents for which (a +n b)
(for the same posting 'b') or (a +n c) (for the same posting 'c') is true. To further explain in
English, the query wants documents where either:
there is a 'b' in a sentence with 'c', and that same occurrence of 'b' occurs at most 'n'
terms after 'a'; or
there is a 'b' in a sentence with 'c', and that same occurrence of 'c' occurs at most 'n'
terms after 'a'.
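Both chained examples depend on each connector passing its surviving postings, not just document IDs, to the next connector. The sketch below shows one way to do that. The posting representation (document ID mapped to a set of (position, sentence) pairs) and the helper names `plus_n` and `same_sentence` are assumptions. The nested loops are quadratic per document and are written for clarity, not speed.

```python
def plus_n(left, right, n):
    """+n: keep pairs where a left posting precedes a right posting by 1..n terms."""
    out = {}
    for doc in left.keys() & right.keys():
        kept = set()
        for l in left[doc]:
            for r in right[doc]:
                if 0 < r[0] - l[0] <= n:
                    kept.update((l, r))  # both postings survive
        if kept:
            out[doc] = kept
    return out


def same_sentence(left, right):
    """/s: keep pairs of postings that fall in the same sentence."""
    out = {}
    for doc in left.keys() & right.keys():
        kept = set()
        for l in left[doc]:
            for r in right[doc]:
                if l[1] == r[1]:
                    kept.update((l, r))
        if kept:
            out[doc] = kept
    return out


# Toy postings: doc_id -> {(position, sentence)}
# doc 1: "a b. c"   doc 2: "a x. a b c"   doc 3: "a c. a b"   doc 4: "b a c"
A = {1: {(0, 0)}, 2: {(0, 0), (2, 1)}, 3: {(0, 0), (2, 1)}, 4: {(1, 0)}}
B = {1: {(1, 0)}, 2: {(3, 1)}, 3: {(3, 1)}, 4: {(0, 0)}}
C = {1: {(2, 1)}, 2: {(4, 1)}, 3: {(1, 0)}, 4: {(2, 0)}}

q1 = same_sentence(plus_n(A, B, 1), C)  # a +1 b /s c
q2 = plus_n(A, same_sentence(B, C), 1)  # a +1 (b /s c)
```

Document 3 is rejected by the first query because the 'a' that shares a sentence with 'c' is not the 'a' that precedes 'b'. Document 4 matches only the second query, through the 'c' that follows 'a'.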
Marking
This assignment is worth 30 points.
Your submission will be tested and marked on CSE linux machines using Python3.
Therefore, please make sure you have tested your solution on these machines using
Python3 before you submit. You will not receive any marks if your program does not work
on CSE linux machines and only works in other environments such as your own laptop.
Full marks will be awarded to submissions that follow this specification and pass all the test
cases.
Although we do not measure the runtime speed, your indexing program will be terminated
if it does not end after one minute, and you will receive zero marks for the project (since
we cannot get the index generated successfully for further testing); and your search
program will be terminated if it does not end after 10 seconds per search query, and you
will receive zero marks for that search query.
There will be test cases for each connector in addition to the test cases for mixing several
connectors. Therefore, if you are unable to implement all the required connectors, try your
best to implement as many as you can.
Submission
Deadline: Monday 7th November 12:00pm AEST (noon).
The penalty for late submission of assignments will be 5% (of the worth of the assignment)
subtracted from the raw mark per day of being late. In other words, earned marks will be
lost. No assignments will be accepted later than 5 days after the deadline.
Use the give command below to submit the assignment:
give cs6714 proj *.py
Use classrun to check your submission and make sure that you have submitted
all the files required for index.py and search.py to run properly.
6714 classrun -check proj
Plagiarism
The work you submit must be your own work. Group submissions will not be allowed. Your
program must be entirely your own work. Plagiarism detection software will be used to
compare all submissions pairwise (including submissions for similar assignments in
previous years, if applicable) and serious penalties will be applied, particularly in the case of
repeat offences.
Do not copy ideas or code from others.
Do not use a publicly accessible repository or allow anyone to see your code.
Please refer to the Student Conduct section in the course outline to help you understand
what plagiarism is and how it is dealt with at UNSW.
Finally, reproducing, publishing, posting, distributing or translating this assignment is an
infringement of copyright and will be referred to UNSW Student Conduct and Integrity for
action.