1. Task Description: This task helps students understand and master the main stages of data mining through hands-on practice: data acquisition and loading, data cleaning, classification analysis, association rule analysis, clustering analysis, and anomaly detection. By completing this task, students will learn how to handle real-world data, apply data mining algorithms, and interpret analysis results.
2. Task Objectives:
• Master big data input and preprocessing
• Understand and apply data cleaning techniques
• Be familiar with common classification algorithms and conduct classification analysis
• Master methods for association rule mining and their application scenarios
• Understand the concept of clustering analysis and its application to different datasets
• Explore methods and applications of anomaly detection
3. Task Steps and Requirements:
3.1 Data Acquisition and Input
Obtain a dataset containing more than 10,000 records from an open data platform (e.g., Kaggle, UCI datasets). The dataset can be from different domains such as social media, healthcare, finance, or e-commerce.
Recommended datasets include:
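Kaggle (https://www.kaggle.com/datasets): Titanic dataset (classic classification task), House Prices dataset (regression analysis), Customer Reviews dataset (sentiment analysis). Note that the Titanic and House Prices training sets each contain fewer than 10,000 records, so verify the record count against the requirement above before committing to a dataset.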
UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets): Adult dataset (classification task to predict income level), Bank Marketing dataset (financial behavior analysis), Online Retail dataset (e-commerce transaction data).
Use Python to load the data into an appropriate analysis environment (e.g., Spyder).
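The loading step can be sketched as follows. This is a minimal illustration that assumes pandas is installed; the inline CSV text, its column names, and the helper name load_dataset are hypothetical stand-ins for a file downloaded from Kaggle or UCI.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file. In practice, pass a file
# path (for example "my_dataset.csv") instead of a StringIO buffer.
SAMPLE_CSV = """age,job,balance,subscribed
34,admin,1200,yes
51,technician,,no
29,services,300,no
"""


def load_dataset(source, sep=","):
    """Read a delimited text file into a DataFrame and print a brief overview."""
    frame = pd.read_csv(source, sep=sep)
    print(f"Loaded {len(frame)} rows and {frame.shape[1]} columns")
    print(frame.dtypes)
    return frame


df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df.isna().sum())  # missing values per column, useful before cleaning
```

Some downloaded files use a delimiter other than a comma, which is why sep is exposed as a parameter. Checking len(df) right after loading is also a quick way to confirm the record-count requirement.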
3.2 Data Cleaning
Inspect the data for missing values, duplicates, and outliers.
Apply data cleaning techniques (e.g., imputing missing values, removing duplicates, handling outliers) to ensure the data quality meets analysis standards.
Normalize or standardize the data to facilitate further data analysis and mining.
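One possible cleaning pipeline for numeric columns is sketched below. The choices here (median imputation, clipping at 1.5 times the interquartile range, and z-score standardization) are assumptions made for illustration, not required methods, and the toy data and the function name clean are hypothetical.

```python
import numpy as np
import pandas as pd


def clean(df, numeric_cols):
    """Drop duplicates, impute medians, clip IQR outliers, then z-score standardize."""
    out = df.drop_duplicates().copy()
    for col in numeric_cols:
        out[col] = out[col].astype(float)
        # Impute missing values with the column median (robust to outliers).
        out[col] = out[col].fillna(out[col].median())
        # Clip values that fall outside 1.5 * IQR of the quartiles.
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        # Standardize to zero mean and unit variance.
        std = out[col].std(ddof=0)
        out[col] = (out[col] - out[col].mean()) / std if std > 0 else 0.0
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "age": [25, 25, 40, np.nan, 31, 38],
    "income": [3000, 3000, 5200, 4100, 3900, 250000],
})
cleaned = clean(raw, ["age", "income"])
print(cleaned)
```

When the cleaned data later feeds a classifier, it is usually safer to compute the imputation and scaling statistics on the training split only and reuse them on the test split.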
3.3 Classification Analysis and Algorithms
Choose an appropriate classification task (e.g., predicting purchase intent based on user behavior).
Apply at least two different classification algorithms to the dataset (e.g., Decision Tree, Support Vector Machine, K-Nearest Neighbors).
Compare the performance of the algorithms using the confusion matrix and metrics such as accuracy and recall.
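A comparison of two classifiers might look like the sketch below. It assumes scikit-learn is available and uses synthetic data from make_classification in place of a real dataset; the hyperparameters (max_depth=5, n_neighbors=5) are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    # KNN is distance-based, so features are scaled inside a pipeline.
    "K-Nearest Neighbors": make_pipeline(StandardScaler(),
                                         KNeighborsClassifier(n_neighbors=5)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
    print(name)
    print(results[name]["confusion_matrix"])
    print(f"accuracy={results[name]['accuracy']:.3f} "
          f"recall={results[name]['recall']:.3f}")
```

Evaluating every model on the same held-out test split keeps the comparison fair. On imbalanced data, recall often tells a different story from accuracy, which is worth discussing in the report.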
3.4 Association Rule Analysis
Apply association rule mining (e.g., Apriori algorithm) to the dataset to uncover potential relationships between data items.
Present the discovered association rules in tabular form and explain their practical value in real-world scenarios.
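To make the Apriori idea concrete, here is a small pure-Python sketch that mines frequent itemsets level by level and then derives rules with support, confidence, and lift. The basket data, the thresholds, and the function names apriori and rules are hypothetical; for a full-size dataset a dedicated library would likely be more practical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in tsets for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates whose support reaches the threshold.
        level = {}
        for cand in current:
            sup = sum(cand <= t for t in tsets) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Derive (antecedent, consequent, support, confidence, lift) tuples."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    lift = conf / frequent[consequent]
                    out.append((set(antecedent), set(consequent), sup, conf, lift))
    return out


baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
frequent = apriori(baskets, min_support=0.6)
found = rules(frequent, min_confidence=0.8)

print(f"{'antecedent':<12}{'consequent':<12}{'support':>8}{'conf':>8}{'lift':>8}")
for ante, cons, sup, conf, lift in found:
    print(f"{', '.join(sorted(ante)):<12}{', '.join(sorted(cons)):<12}"
          f"{sup:>8.2f}{conf:>8.2f}{lift:>8.2f}")
```

With these toy baskets, the only rule that passes both thresholds is beer → diapers (support 0.60, confidence 1.00, lift 1.25). A lift above 1 suggests the two items appear together more often than they would if they were independent.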
3.5 Clustering Analysis
Perform clustering analysis (e.g., K-Means or hierarchical clustering) to identify natural groupings within the data.
Visualize the clustering results, analyze the characteristics of each cluster, and explain the significance of these clusters in a business context.
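A K-Means workflow could be sketched as below, assuming scikit-learn and pandas are available. The synthetic blobs, the feature names, and the use of the silhouette score to pick the number of clusters are illustrative assumptions; other selection criteria (such as the elbow method) are equally acceptable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for a real dataset.
X, _ = make_blobs(n_samples=600, centers=3, n_features=2, cluster_std=1.0,
                  random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Try several cluster counts and keep the one with the best silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
best_k = max(scores, key=scores.get)

final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_scaled)

# Profile each cluster by its mean feature values in the original units.
profile = pd.DataFrame(X, columns=["feature_1", "feature_2"])
profile["cluster"] = final.labels_
print(f"best k = {best_k}")
print(profile.groupby("cluster").mean())
```

The per-cluster means give a starting point for describing what characterizes each group. A scatter plot of the two features colored by final.labels_ would cover the visualization requirement.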
3.6 Anomaly Detection
Apply anomaly detection algorithms (e.g., Isolation Forest, density-based anomaly detection) to identify anomalies in the data.
Explain the possible meaning of these anomalies, such as potential fraud or equipment failures.
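As an illustration of Isolation Forest, the sketch below assumes scikit-learn and NumPy are available. It generates synthetic "normal" points and injects three obviously extreme points; the data and the contamination value of 0.01 are assumptions chosen for the example, and a real dataset would need its own estimate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 500 ordinary observations plus three hand-made extreme points (hypothetical).
normal = rng.normal(loc=50.0, scale=5.0, size=(500, 2))
injected = np.array([[120.0, 5.0], [-30.0, 110.0], [200.0, 200.0]])
X = np.vstack([normal, injected])

detector = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = detector.fit_predict(X)    # -1 marks an anomaly, 1 marks a normal point
scores = detector.score_samples(X)  # lower scores are more anomalous

anomaly_idx = np.where(labels == -1)[0]
print(f"Flagged {len(anomaly_idx)} of {len(X)} points")
for i in anomaly_idx:
    print(i, X[i].round(1), round(float(scores[i]), 3))
```

Inspecting the flagged rows in their original context is what makes it possible to judge whether an anomaly reflects fraud, an equipment failure, or simply a data entry error.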
3.7 Summary Report
Write a comprehensive report that describes each step of the process, methods used, analysis of results, and conclusions drawn from the data.
The report should include exploratory data analysis results, a performance comparison of the classification algorithms, the discovered association rules, clustering results, and insights from anomaly detection.
4. Submission Requirements:
Code files must be thoroughly commented to ensure readability.
The data analysis report should be submitted in PDF format, with a minimum of 2000 words, and include visual charts (e.g., line charts, pie charts, scatter plots).
The final code and report must be submitted before the course deadline.
5. Grading Criteria:
Data Preprocessing and Cleaning: 20%
Classification Analysis and Algorithm Comparison: 20%
Association Rule Mining: 15%
Clustering Analysis: 15%
Anomaly Detection: 10%
Summary Report: 20%
6. Additional Notes:
The report should include in-depth analysis of the results rather than a simple presentation of numbers. Clearly explain the reasoning behind each step and its business significance.
I wish you all the best in your learning journey, and I look forward to seeing your outstanding work!