COMP30027 Report – Book Rating Predictions
Anonymous
1. Introduction
In today's highly developed world, the
world of literature has largely migrated
from physical books to online platforms.
These platforms provide book lovers with a
treasure trove of titles. From the reviews
of thousands of readers, we can study and
analyse important information such as book
ratings, descriptions and publishers.
Machine learning techniques can now be used
to predict book ratings, which can assist
authors, publishers and marketers in
identifying potential audiences and tailoring
marketing strategies to maximise reader
engagement.
The aim of this report is to analyse different
features, such as book titles, authors and
descriptions, and to build a supervised
machine learning model that predicts book
ratings. The names of authors, descriptions
and titles are extracted for sentiment
analysis. The project is divided into
sections that use attribute correlations and
sentiment analysis of the ‘Text’ fields (book
names, authors, descriptions and publishers),
as well as the language of each book, to
predict a book's rating at one of three
levels: 3, 4 or 5. The report trains
classifiers using different techniques and
analyses the results with regard to these
attributes.
2. Methodology
2.1 Data Pre-processing
The training and testing CSV files provide a
number of features. Manual inspection shows
that the data contains unwanted stop words,
words in different languages, non-words and
so on. To enhance the performance of the
classifier, the pre-processing steps below
were carried out.
2.1.1 Case-folding
The extracted raw data contains alphabetic
text in both upper and lower case. In this
step, all upper-case characters are converted
to lower case.
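As an illustration of this step, a minimal
sketch is given below. It assumes each text
field is available as a plain Python string;
the function name `case_fold` and the sample
titles are hypothetical and do not come from
the dataset.

```python
def case_fold(text: str) -> str:
    """Convert every upper-case character in `text` to lower case."""
    return text.lower()


# Made-up sample titles, used only to illustrate the step.
sample_titles = ["The Great Gatsby", "WAR AND PEACE"]
folded_titles = [case_fold(title) for title in sample_titles]
print(folded_titles)
```

After this step, tokens such as "The" and
"the" are treated as the same token.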
2.1.2 Removing punctuation and numbers
After case-folding, the text still contains
numerical values and punctuation. These
digits, symbols and non-ASCII characters
convey no value or meaning in the data, so
they are treated as less valuable information
and removed.
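One possible way to carry out this step is
sketched below with a regular expression. It
assumes the text has already been case-folded
and that only ASCII letters and whitespace
should be kept; the function name and the
sample string are hypothetical.

```python
import re


def remove_punctuation_and_numbers(text: str) -> str:
    """Keep only ASCII letters and single spaces in `text`."""
    # Replace every character that is not an ASCII letter or
    # whitespace (digits, punctuation, non-ASCII symbols) with a space.
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse the resulting runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", letters_only).strip()


# Made-up sample string, used only to illustrate the step.
sample = "harry potter #1: the philosopher's stone (1997)"
cleaned = remove_punctuation_and_numbers(sample)
print(cleaned)
```

Replacing the removed characters with a space,
rather than deleting them outright, stops
neighbouring words from being joined together.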
2.1.3 Removing stop words
Common English stop words are removed, as
these words do not convey any specific
meaning. Removing such low-information words
reduces the size of the dataset, so the
training time required is also reduced
because fewer tokens are involved.
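A minimal sketch of stop-word removal is given
below. The small `STOP_WORDS` set is a
hypothetical stand-in chosen for illustration;
the stop-word list actually used in the
project is not specified here, and a fuller
list from an NLP library could be substituted.

```python
# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Drop every whitespace-separated token found in `stop_words`."""
    tokens = text.split()
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


# Made-up sample string, already case-folded and cleaned.
sample = "the lord of the rings is a story of courage"
filtered = remove_stop_words(sample)
print(filtered)
print(len(sample.split()), "tokens before,", len(filtered.split()), "after")
```

The drop in token count is what shortens the
training time described above.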