COMP30027 Machine Learning
Submission: Source code (in Python) and written responses
Marks: The Project will be marked out of 20, and will contribute 20% of your total mark.
This will be equally weighted between implementation and responses to the questions.
Groups: You may choose to form a group of 1 or 2.
Groups of 2 will respond to more questions, and will be expected to produce commensurately more implementation.
Overview
In this Project, you will implement a supervised naïve Bayes learner and evaluate it on a variety of supervised datasets. You will then use your observations to respond to some conceptual questions about naïve Bayes.
Naïve Bayes classifiers
There are some suggestions for implementing your learner in the “Naïve Bayes” and “Discrete and Continuous” lectures, but ultimately, the specifics of your implementation are up to you. Your implementation must be able to perform the following functions:
• preprocess() the data by reading it from a file and converting it into a useful format for
training and testing
• train() by calculating prior probabilities and likelihoods from the training data and using
these to build a naive Bayes model
• predict() classes for new items in a test dataset (for the purposes of this assignment, you
can re-use the training data as a test set)
• evaluate() the prediction performance by comparing your model’s class outputs to ground
truth labels
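To make the expected interface concrete, one possible skeleton is sketched below. This is illustrative only: the CSV reading, the assumption that the class label is the last column, and accuracy as the evaluation metric are example choices, not requirements.

    import csv

    def preprocess(filename):
        """Read a CSV file; return a list of instances and a list of labels.
        Assumes the class label is in the last column."""
        X, y = [], []
        with open(filename) as f:
            for row in csv.reader(f):
                if row:                  # skip blank lines
                    X.append(row[:-1])   # attribute values (still strings)
                    y.append(row[-1])    # class label
        return X, y

    def train(X, y):
        """Estimate priors and per-attribute likelihoods; return them as the model."""
        ...

    def predict(model, X):
        """Return the most probable class for each instance in X."""
        ...

    def evaluate(y_true, y_pred):
        """Return the proportion of predictions that match the ground truth."""
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return correct / len(y_true)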
Your implementation should be able to handle both nominal and numeric attribute types in the same dataset. You can assume numeric attributes are Gaussian-distributed. When handling discrete attributes, you should implement some type of smoothing to ensure the likelihoods are greater than zero. Your implementation should actually compute the priors, likelihoods, and posterior probabilities for the naïve Bayes model and may not simply call an existing implementation such as GaussianNB from scikit-learn.
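For illustration, these two cases might be handled with helpers like the following. This is a sketch, not a prescribed design: the add-alpha scheme shown is Laplace smoothing, and the arguments (counts, totals, number of attribute values) are quantities you would collect from your training data.

    import math

    def gaussian_log_pdf(x, mu, sigma):
        """Log of the Gaussian density N(mu, sigma^2) evaluated at x."""
        return -0.5 * math.log(2 * math.pi * sigma ** 2) \
               - (x - mu) ** 2 / (2 * sigma ** 2)

    def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
        """Laplace (add-alpha) estimate of P(attribute value | class):
        all counts are offset by alpha, so no likelihood is ever zero."""
        return (count + alpha) / (class_total + alpha * n_values)

Working with log-probabilities, as in gaussian_log_pdf, also avoids numerical underflow when many likelihoods are multiplied together.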
Data
For this assignment, we have adapted some of the classification datasets available from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.html). In all of these datasets, the task is classification, but the attribute types vary:
Datasets with nominal attributes only:
• breast-cancer-wisconsin
• mushroom
• lymphography
Datasets with numeric attributes only:
• wdbc
• wine
Datasets with ordinal attributes only:
• car
• nursery
• somerville
Datasets with a mix of attribute types:
• adult
• bank
These datasets vary in terms of number of instances and number of classes, in addition to the
number and type of attributes. More information is provided in the README file included with the
datasets. You are not required to use all of these datasets in your submission; however, it is strongly recommended that you use multiple datasets to answer the questions below. Different datasets will produce different results, so if you only test your algorithm on one or two datasets, you may arrive at an incorrect conclusion based on too small a sample.
Questions
The following questions are designed to pique your curiosity when running your classifier(s) over the given datasets:
1. Try discretising the numeric attributes in these datasets and treating them as discrete variables in the naïve Bayes classifier. You can use a discretisation method of your choice and group the numeric values into any number of levels (around 3 to 5 levels would probably be a good starting point; see the first sketch after this list). Does discretising the variables improve classification performance, compared to the Gaussian naïve Bayes approach? Why or why not?
2. Implement a baseline model (e.g., random or 0R; see the second sketch after this list) and compare the performance of the naïve Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline performance.
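For question 1, a minimal equal-width binning helper might look like the following. This is a sketch assuming numpy; equal-frequency or any other discretisation scheme is equally valid.

    import numpy as np

    def equal_width_bins(values, n_bins=4):
        """Map each numeric value to a bin index in {0, ..., n_bins - 1}
        using equal-width intervals over the observed range."""
        values = np.asarray(values, dtype=float)
        edges = np.linspace(values.min(), values.max(), n_bins + 1)
        # digitize against the interior edges gives indices 0 .. n_bins - 1
        return np.digitize(values, edges[1:-1])

Note that bin edges should be computed from the training data and then re-used on the test data, so both are discretised consistently.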
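For question 2, a 0R (majority-class) baseline takes only a few lines. This sketch assumes the label and instance lists have the format used by the earlier skeleton.

    from collections import Counter

    def zero_r(y_train, X_test):
        """0R baseline: always predict the most common class in the training data."""
        majority = Counter(y_train).most_common(1)[0][0]
        return [majority] * len(X_test)

Because 0R's accuracy equals the relative frequency of the majority class, it makes a useful reference point when discussing how class imbalance affects the apparent performance of naïve Bayes on each dataset.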