Project 2: Book Rating Prediction
Task: Build a classifier to predict the rating of books
Due: Group Registration: Friday 5 May, 5pm
Stage I: Friday 19 May, 5pm
Stage II: Friday 26 May, 5pm
Submission: Stage I: Report (PDF) and code to Canvas; test outputs to Kaggle in-class competition
Stage II: Peer reviews and reflection to Canvas
Marks: The Project will be marked out of 20, and will contribute 20% of your total mark.
Groups: Groups of 1 or 2, with commensurate expectations for each (see Sections 2 and 5).
1 Overview
The goal of this Project is to build and critically analyse supervised Machine Learning methods to predict the
ratings of books based on their titles, authors, descriptions and other features. There are three levels of rating,
3, 4 and 5, for each book.
This assignment aims to reinforce the largely theoretical lecture concepts surrounding data representation, classifier construction, evaluation and error analysis, by applying them to an open-ended problem. You will also
have an opportunity to practice your general problem-solving skills, written communication skills, and critical
thinking skills.
2 Deliverables
This project has two stages. The deliverables of each stage are listed below. More details about the deliverables
are given in Submission (Section 5).
Stage I:
1. Report: an anonymous written report, of 1,300-1,800 words (for a group of one person) or 2,000-2,500
words (for a group of two people).
2. Output: the output of your classifiers, comprising the label predictions for test instances, submitted to
the Kaggle in-class competition described below.
3. Code: one or more programs, written in Python, which implement machine learning models to make
predictions and evaluate the results.
Stage II:
1. Peer review: reviews of two reports written by other students, 200-300 words each (for a group of one
person) or 300-400 words each (for a group of two people).
2. Reflection: a written reflection piece of 400 words. This deliverable is individual work.
3 Data
The book information was collected from Goodreads, a platform that allows users to search its
database of books, rate books and write reviews. The data files for this project are available via Canvas, and are
described in a corresponding README.
In our dataset, each book contains:
• Book features: name, authors, publish year, publish month, publish day, publisher,
language, page numbers, and description.
• Text features: produced by various text encoding methods for name, authors, and description.
Each text feature is provided as a single file whose rows correspond to the rows of the book features file.
• Class label: the rating of a book (3 possible levels: 3, 4 or 5).
You will be provided with a training set and a test set. The training set contains the book features, text features,
and the rating label, which is the “class label” of our task. The test set only contains the book and text
features without labels.
The files provided are:
• book_rating_train.csv: the book features and class label of training instances.
• book_rating_test.csv: the book features of test instances.
• book_text_features_*.zip: the preprocessed text features for training and test sets, one zipped file for each
text encoding method. Details about using these text features are provided in the README.
4 Task
You are expected to develop Machine Learning models to predict the rating of a book based on its features (e.g.
name, authors, description, publish year etc.). You will explore effective features, implement and compare
different machine learning models and conduct error analysis for this task.
Various machine learning techniques have been (or will be) discussed in this subject (0R, Naive Bayes, Decision
Trees, kNN, SVM, neural networks, etc.); many more exist. You may use any machine learning method you
consider suitable for this problem. You are strongly encouraged to make use of machine learning software
and/or existing libraries (such as sklearn) in your attempts at this project.
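As an illustration of using an existing library, the sketch below compares a 0R baseline with Naive Bayes in sklearn. It is a minimal sketch only: the data is synthetic and generated purely for illustration (it is not the project data), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption, NOT the project data):
# 200 instances, 5 numeric features, labels drawn from the three rating levels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.choice([3, 4, 5], size=200, p=[0.2, 0.7, 0.1])

# 0R: always predict the most frequent class seen during training.
zero_r = DummyClassifier(strategy="most_frequent").fit(X, y)
zero_r_acc = zero_r.score(X, y)

# Gaussian Naive Bayes, fitted and scored on the same data for comparison.
nb = GaussianNB().fit(X, y)
nb_acc = nb.score(X, y)

print(f"0R accuracy: {zero_r_acc:.3f}")
print(f"Naive Bayes accuracy: {nb_acc:.3f}")
```

Both scores here are computed on the training data, so they only show how the API fits together; a baseline such as 0R is most informative when compared against other models on held-out data.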
In addition to different learning algorithms, there are many different ways to encode text for these algorithms.
The files in book_text_features_*.zip are some possible representations of the name, authors and description of
books that we have provided. For example, one of the encoding methods is CountVectorizer in sklearn,
which converts text documents into “Bag of Words” – the documents are described by word occurrences while
ignoring the relative position information of the words. You can use these representations to develop your
classifiers, but please also feel free to extract your own features from the raw book features according to your
needs. Just keep in mind that any data representation you use for the text in the training set will need to be able
to generalise to the test set.
You are expected to complete the following two phases for this task:
• Training-evaluation phase: the holdout or cross-validation approaches can be applied to the training
data provided.
• Test phase: the trained classifiers will be evaluated on the unlabelled test data. The predicted labels of
test cases should be submitted as part of the Stage I deliverable.
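The two evaluation approaches named in the training-evaluation phase can be sketched as follows. This is a minimal sketch on synthetic data (not the project data); the choice of a decision tree, the 80/20 split and the 5 folds are assumptions made for illustration rather than requirements.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, NOT the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.choice([3, 4, 5], size=300, p=[0.25, 0.6, 0.15])

# Holdout: keep 20% of the labelled data aside for evaluation,
# stratified so each split has a similar class distribution.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
holdout_acc = clf.score(X_val, y_val)

# Cross-validation: 5 stratified folds, each used once for evaluation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=cv
)

print(f"Holdout accuracy: {holdout_acc:.3f}")
print(f"Cross-validation accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

Holdout is cheaper to run, while cross-validation uses every labelled instance for evaluation once and gives a spread of scores rather than a single number.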
5 Submission
The report, code, peer reviews and reflections should be submitted via Canvas; the predictions on test data
should be submitted to Kaggle.
5.1 Individual vs. Team Participation
You have the option of participating individually, or in a group of two. In the case that you opt to participate
individually, you will be required to implement at least 2 and up to 4 distinct Machine Learning models.
Groups of two will be required to implement at least 4 and up to 5 distinct Machine Learning models, of
which one is to be an ensemble model – stacking based on the other models. The report length requirement also
differs, as detailed below:
Group size    Distinct models required    Report length
1             2–4                         1,300–1,800 words
2             4–5                         2,000–2,500 words
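For groups of two, a stacking ensemble built on other models could be sketched with sklearn's StackingClassifier as below. This is one possible arrangement under stated assumptions, not a prescribed design: the data is synthetic (not the project data), and the choice of base models and of logistic regression as the meta-classifier is illustrative only.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, NOT the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.choice([3, 4, 5], size=300, p=[0.25, 0.6, 0.15])

# Base models whose predictions become the inputs of the meta-classifier.
base_models = [
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

# The meta-classifier is trained on out-of-fold predictions of the base
# models (cv=5), which limits leakage from the base models' training data.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
preds = stack.predict(X[:10])
print(preds)
```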
Group Registration
If you wish to form a group of 2, only one of the members needs to register by Friday 5 May 5:00pm, via the
form “Project 2 Group Registration” on Canvas. For a group of 2, only one of the members needs to submit
deliverables.
Note that once you have signed up for a given group, you will not be allowed to change groups. If you do not
register before the deadline above, we will assume that you will be completing the assignment as an individual,
even if you were in a two-person group for Assignment