MATH38161 Multivariate Statistics and Machine Learning
Coursework
November 2024
Overview
The coursework is a data analysis project with a written report. You will apply skills
and techniques acquired from Week 1 to Week 8 to analyse a subset of the FMNIST
dataset.
In completing this coursework, you should primarily use the techniques and methods
introduced during the course. The assessment will focus on your understanding and
demonstration of these techniques in alignment with the learning outcomes, rather
than the accuracy or exactness of the final results.
The project report will be marked out of 30. The marking scheme is detailed below.
You have twelve days to complete this coursework, with a total workload of approximately 10 hours (including preliminary coursework tasks).
Format
• Software: You should mainly use R to perform the data analysis. You may use
built-in functions from R packages or implement the algorithms in your own
code.
• Report: You may use any document preparation system of your choice but the
final document must be a single PDF in A4 format. Ensure that the text in the
PDF is machine-readable.
• Content: Your report must include the complete analysis in a reproducible format,
integrating the computer code, figures, and text in one document.
• Title Page: Show your full name and your University ID on the title page of your
report.
• Length: Recommended length is 8 pages of content (single sided) plus title
page. Maximum length is 10 pages of content plus the title page. Any content
exceeding 10 pages will not be marked.
Submission process and deadline
• The deadline for submission is 11:59pm, Friday 29 November 2024.
• Submission is online on Blackboard (through Gradescope).
Academic Integrity and Use of AI Tools
This is an individual coursework. Your analysis and report must be completed
independently, including all computer code. Note that, according to University
guidance, output generated by AI tools is considered work created by another person.
• Citations: Acknowledge all sources, including AI tools used to support text and
code writing.
• Ethics: Use sources in an academically appropriate and ethical manner. Do not
copy verbatim, and cite the original authors rather than second- or third-level
sources.
• Accuracy: Be mindful that sources, including Wikipedia and AI tools, may contain
non-obvious errors.
Copying and plagiarism (passing off someone else’s work as your own) are very
serious offences and will be strictly prosecuted. For more details see the “Guidance
to students on plagiarism and other forms of academic malpractice” available at
https://documents.manchester.ac.uk/display.aspx?DocID=2870 .
Coursework tasks
Analysis of the FMNIST data using principal component analysis (PCA) and Gaussian mixture models (GMMs)
The Fashion-MNIST (FMNIST) dataset contains 70,000 grayscale images of fashion products
categorised into 10 distinct classes. More information is available on Wikipedia and
GitHub.
The data set to be analysed in this coursework is a subset of the full FMNIST data and
contains 10,000 images, each with dimensions of 28 by 28 pixels, resulting in a total of
784 pixels per image. Each pixel is represented by an integer value ranging from 0 to
255. You can download this data subset as “fmnist.rda” (7.4 MB) from Blackboard.
load("fmnist.rda") # load sampled FMNIST data set
dim(fmnist$x) # dimension of features data matrix (10000, 784)
## [1] 10000 784
range(fmnist$x) # range of feature values (0 to 255)
## [1] 0 255
Here is a plot of the first 15 images:
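par(mfrow=c(3,5), mar=c(1,1,1,1)) # 3 x 5 grid of panels with small margins
for (k in 1:15) { # first 15 images
  m = matrix(fmnist$x[k,], nrow=28, byrow=TRUE) # reshape 784 pixels to 28 x 28
  # rotate so the image is upright; map 0 to white and 255 to black
  image(t(apply(m, 2, rev)), col=grey(seq(1,0,length.out=256)), axes=FALSE)
}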
Each sample is assigned one label, represented by an integer from 0 to 9 (stored as an
R factor with 10 levels):
fmnist$label[1:15] # first 15 labels
## [1] 7 1 4 8 1 4 7 1 2 0 7 0 8 1 6
## Levels: 0 1 2 3 4 5 6 7 8 9
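The integer labels can be mapped to readable class names. The following is a minimal sketch, assuming this subset uses the standard label encoding from the public Fashion-MNIST documentation (0 = T-shirt/top, ..., 9 = Ankle boot); the coursework data file itself does not state this. The names `class_names`, `label` and `label_name` are illustrative, and the 15 labels printed above are re-typed as a stand-in for `fmnist$label[1:15]` so that the snippet runs without the data file.

```r
# Class names from the public Fashion-MNIST documentation
# (assumed to apply to this subset as well)
class_names <- c("T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                 "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot")

# First 15 labels as printed above, rebuilt as a factor with levels 0..9
# (stand-in for fmnist$label[1:15])
label <- factor(c(7, 1, 4, 8, 1, 4, 7, 1, 2, 0, 7, 0, 8, 1, 6), levels = 0:9)

# Recover the original integers 0..9 from the factor, then index the names
# (R vectors are 1-based, hence the + 1)
label_name <- class_names[as.integer(as.character(label)) + 1]
print(label_name)

# Class frequencies; on the full subset one would use table(fmnist$label)
print(table(factor(label_name, levels = class_names)))
```

Note that `as.integer()` applied directly to a factor returns the level codes 1 to 10 rather than the labels 0 to 9, which is why the conversion goes through `as.character()` first.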
Task 1: Dimension reduction for FMNIST data using principal component analysis (PCA)
The following steps are suggested guidelines to help structure your analysis but are not
meant as assignment-style questions. Integrate your work as part of a cohesive report
with a logical narrative.
• Do some research to learn more about the FMNIST data.
• Compute the 784 principal components from the 784 original pixel variables.
• Compute and plot the proportion of variation attributed to each principal component.
• Create a scatter plot of the first two principal components. Use the known labels
to colour the scatter plot.
• Construct the correlation loadings plot.
• Interpret and discuss the result.
• Save the first 10 principal components of all 10,000 images to a data file for Task 2.
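The computational steps listed above could be sketched in base R roughly as follows. This is only one possible approach using `prcomp()`, not a model solution. To keep the snippet runnable without the data file it uses a small simulated matrix; in the coursework `X` would be `fmnist$x` and `labels` would be `fmnist$label`. The simulated data, the object names and the output file name `fmnist_pc10.rds` are assumptions made for illustration.

```r
set.seed(1)

# Stand-in data so the sketch runs without fmnist.rda (assumption).
# In the coursework: X <- fmnist$x; labels <- fmnist$label
n <- 300; p <- 20
labels <- factor(sample(0:2, n, replace = TRUE))
X <- matrix(rnorm(n * p), n, p)
X[, 1:5] <- X[, 1:5] + 2 * as.integer(labels)  # shift some variables by class

# PCA on the centred data (covariance-based; scaling is left off because
# constant columns, e.g. always-zero pixels, cannot be standardised)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of total variance attributed to each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)
plot(pve, type = "h", xlab = "Principal component",
     ylab = "Proportion of variance")

# Scatter plot of the first two principal components, coloured by label
cols <- rainbow(nlevels(labels))
plot(pca$x[, 1], pca$x[, 2], col = cols[as.integer(labels)], pch = 20,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(labels), col = cols, pch = 20, cex = 0.7)

# Correlation loadings: correlation of each original variable with PC1 and PC2
# (zero-variance columns would give NA and may need to be dropped first)
corload <- cor(X, pca$x[, 1:2])
plot(corload, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1, pch = 20,
     xlab = "Correlation with PC1", ylab = "Correlation with PC2")
theta <- seq(0, 2 * pi, length.out = 200)
lines(cos(theta), sin(theta), col = "grey")  # unit circle for reference

# Keep the first 10 principal components and save them for Task 2
scores10 <- pca$x[, 1:10]
out_file <- file.path(tempdir(), "fmnist_pc10.rds")  # hypothetical file name
saveRDS(scores10, out_file)
```

The saved scores can be read back later with `readRDS(out_file)`. Writing to `tempdir()` is only a convenience for this sketch; a file in the project directory would be the more natural choice for a reproducible report.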
Task 2: Analysis of the FMNIST data set using Gaussian mixture models (GMMs)
Using all 784 pixel variables for cluster analysis is computationally impractical. In
this task, use the first 10 (or fewer) principal components saved in Task 1 instead of the
original 784 pixel variables. Again, these steps serve as guidelines. Integrate this work
into your report so that it follows logically from Task 1.
• Cluster the data using Gaussian mixture models (GMMs).
• Find out how many clusters can be identified.
• Interpret and discuss the results.
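To illustrate the steps above, here is a minimal base-R sketch of fitting GMMs by the EM algorithm and comparing the number of clusters by BIC. It also shows why the dimension matters: each full-covariance component in d dimensions has d(d+1)/2 covariance parameters, which is 307,720 for d = 784 but only 55 for d = 10. The helper names `log_dmvnorm` and `fit_gmm`, the k-means initialisation, the small ridge term and the simulated stand-in data are all assumptions made for illustration, not a prescribed method. In practice an R package such as mclust, whose `Mclust()` function fits GMMs and selects among them by BIC, may be more convenient.

```r
# Log-density of N(mu, Sigma) at each row of X, via the Cholesky factor of Sigma
log_dmvnorm <- function(X, mu, Sigma) {
  R <- chol(Sigma)                                # Sigma = t(R) %*% R
  Z <- backsolve(R, t(X) - mu, transpose = TRUE)  # solves t(R) %*% Z = t(X) - mu
  -0.5 * colSums(Z^2) - sum(log(diag(R))) - 0.5 * ncol(X) * log(2 * pi)
}

# EM for a K-component GMM with full covariance matrices (illustrative helper)
fit_gmm <- function(X, K, n_iter = 200, tol = 1e-8, ridge = 1e-6) {
  n <- nrow(X); d <- ncol(X)
  # Hard initial assignment: k-means for K >= 2, a single group for K = 1
  init <- if (K == 1) rep(1L, n) else kmeans(X, centers = K, nstart = 5)$cluster
  resp <- outer(init, 1:K, "==") * 1              # n x K responsibilities
  loglik_old <- -Inf
  for (iter in 1:n_iter) {
    # M-step: update weights, means and covariances from the responsibilities
    Nk <- colSums(resp)
    logdens <- matrix(0, n, K)
    for (k in 1:K) {
      mu_k <- colSums(resp[, k] * X) / Nk[k]
      Xc <- sweep(X, 2, mu_k)
      Sigma_k <- crossprod(Xc * sqrt(resp[, k])) / Nk[k] + ridge * diag(d)
      logdens[, k] <- log(Nk[k] / n) + log_dmvnorm(X, mu_k, Sigma_k)
    }
    # E-step: new responsibilities, using log-sum-exp for numerical stability
    m <- apply(logdens, 1, max)
    lse <- m + log(rowSums(exp(logdens - m)))
    resp <- exp(logdens - lse)
    loglik <- sum(lse)
    if (abs(loglik - loglik_old) < tol * abs(loglik)) break
    loglik_old <- loglik
  }
  # Free parameters: mixing weights + means + full covariance matrices
  n_par <- (K - 1) + K * d + K * d * (d + 1) / 2
  list(K = K, loglik = loglik, bic = -2 * loglik + n_par * log(n),
       cluster = max.col(resp, ties.method = "first"))
}

set.seed(1)
# Stand-in for the saved principal component scores: three simulated 2-D clusters
centres <- rbind(c(0, 0), c(6, 0), c(0, 6))
X <- do.call(rbind, lapply(1:3, function(j)
  sweep(matrix(rnorm(200), ncol = 2), 2, centres[j, ], "+")))

# Fit K = 1..4 and compare by BIC (here defined so that lower is better)
fits <- lapply(1:4, function(K) fit_gmm(X, K))
bic <- sapply(fits, function(f) f$bic)
best <- fits[[which.min(bic)]]
print(round(bic, 1))
print(table(best$cluster))
```

BIC is defined here as -2 log-likelihood plus a penalty, so smaller values indicate a better trade-off between fit and complexity; mclust reports BIC with the opposite sign, so larger values are preferred there. With labelled data, a cross-tabulation such as `table(best$cluster, labels)` is one way to relate the clusters found to the known classes.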
Structure of the report
Your report should be structured into the following sections:
1. Dataset
2. Methods
3. Results and Discussion
4. References
In Section 1 provide some background and describe the data set. In Section 2 briefly
introduce the method(s) you are using to analyse the data. In Section 3 run the analyses
and present and interpret the results. Show all your R code so that your results are
fully reproducible. In Section 4 list all journal articles, books, Wikipedia entries, GitHub
pages and other sources you refer to in your report.
Marking scheme
The project report will be assessed out of 30 points based on the following rubrics.
Criterion: Description of data (6 marks)
• Excellent (5-6 marks): Provides a clear and thorough overview of the FMNIST dataset, detailing the image structure, pixel data, and its context within multivariate analysis.
• Good (3-4 marks): Provides a clear overview of the dataset with some context; minor details may be missing.
• Adequate (1-2 marks): Basic description of the dataset with limited context; lacks important details.
• Insufficient (0 marks): Little to no description provided.
Criterion: Description of Methods (6 marks)
• Excellent (5-6 marks): Clearly and thoroughly explains PCA and GMMs, their purposes, and how they apply to this dataset.
• Good (3-4 marks): Provides a clear explanation of PCA and GMMs, with minor gaps in clarity or relevance.
• Adequate (1-2 marks): Basic explanation of methods with limited detail or relevance to the course techniques.
• Insufficient (0 marks): Lacks clear explanations of the methods.
Criterion: Results and Discussion (12 marks)
• Excellent (10-12 marks): Correctly applies PCA and GMMs, presents clear and informative visualisations, and provides a coherent and insightful interpretation of the results.
• Good (7-9 marks): Accurately applies PCA and GMMs with mostly clear visuals and reasonable interpretation; minor improvements needed.
• Adequate (4-6 marks): Basic application of techniques, limited or unclear visuals, minimal interpretation.
• Insufficient (0-3 marks): Incorrect application of techniques, with little to no interpretation.
Criterion: Overall Presentation of Report (6 marks)
• Excellent (5-6 marks): Report is well-organised, clear, and professionally formatted, with a logical narrative and adherence to page limits.
• Good (3-4 marks): Report is generally clear and organised, with minor structural or formatting issues.
• Adequate (1-2 marks): Report lacks coherence or has significant formatting issues; may not meet all format requirements.
• Insufficient (0 marks): Report lacks structure and clarity, does not meet formatting requirements.