Data Mining Course Tasks

1. Task Description: This task helps students understand and master the stages of data mining through hands-on practice: big data input, data cleaning, classification analysis and algorithms, association analysis, clustering analysis, and anomaly detection. By completing this task, students will learn how to handle real-world data, apply data mining algorithms, and interpret the analysis results.

 

2. Task Objectives:

Master big data input and preprocessing

Understand and apply data cleaning techniques

Become familiar with common classification algorithms and conduct classification analysis

Master methods for association rule mining and their application scenarios

Understand the concept of clustering analysis and its application to different datasets

Explore methods and applications of anomaly detection

 

3. Task Steps and Requirements:

3.1 Data Acquisition and Input

Obtain a dataset containing more than 10,000 records from an open data platform (e.g., Kaggle or the UCI Machine Learning Repository). The dataset may come from any domain, such as social media, healthcare, finance, or e-commerce.

Recommended datasets include:

Kaggle (https://www.kaggle.com/datasets): Titanic dataset (classic classification task), House Prices dataset (regression analysis), Customer Reviews dataset (sentiment analysis).

UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets): Adult dataset (classification task to predict income level), Bank Marketing dataset (financial behavior analysis), Online Retail dataset (e-commerce transaction data).

Use Python to load the data into an appropriate analysis environment (e.g., Spyder).
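
As an illustration of this step, the following is a minimal sketch that reads CSV data with Python's standard csv module and checks the record count against the size requirement. The inline sample and the names load_records and meets_size_requirement are hypothetical placeholders, not part of the task; many students would use pandas.read_csv for the same purpose.

```python
import csv
import io

# Hypothetical inline sample standing in for a downloaded CSV file.
SAMPLE_CSV = """age,income,label
39,52000,no
50,,yes
39,52000,no
"""


def load_records(handle):
    """Read CSV rows into a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle))


def meets_size_requirement(rows, minimum=10_000):
    """Check the task's 'more than 10,000 records' requirement."""
    return len(rows) > minimum


records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), "records; columns:", list(records[0].keys()))
print("large enough for the task:", meets_size_requirement(records))
```

With a real file, the same function can be called as load_records(open_file), where open_file comes from open(path, newline="", encoding="utf-8") and path is wherever the dataset was saved.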

3.2 Data Cleaning

Inspect the data for missing values, duplicates, and outliers.

Apply data cleaning techniques (e.g., imputing missing values, removing duplicates, handling outliers) to ensure the data quality meets analysis standards.

Normalize or standardize the data to facilitate further data analysis and mining.
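
The cleaning steps above can be sketched in plain Python as follows. This is only one possible approach under stated assumptions: missing values are represented as None and imputed with the median, outliers are clipped to the 1.5 × IQR fences, and scaling is min-max normalization. The function names and the sample column are hypothetical.

```python
import statistics


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clean_column(values):
    """Impute None with the median, then clip values to the 1.5*IQR fences."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]
    q1, _, q3 = statistics.quantiles(filled, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [min(max(v, low), high) for v in filled]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing value and one extreme value.
ages = [10, 12, None, 11, 13, 1000]
scaled = min_max_scale(clean_column(ages))
print(scaled)
```

Clipping is used here instead of deleting outlier rows so that the record count stays unchanged; whether to clip, delete, or keep outliers depends on the dataset and should be justified in the report.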

3.3 Classification Analysis and Algorithms

Choose an appropriate classification task (e.g., predicting purchase intent based on user behavior).

Apply at least two different classification algorithms to the dataset (e.g., Decision Tree, Support Vector Machine, K-Nearest Neighbors).

Compare the performance of the algorithms using metrics such as the confusion matrix, accuracy, and recall.
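
To make the comparison concrete, here is a small sketch that trains two simple classifiers on toy data and computes a confusion matrix, accuracy, and recall for each. K-Nearest Neighbors is one of the algorithms named above; the nearest-centroid classifier is swapped in only to keep the example short and is not one of the listed options. All data and function names are hypothetical, and in a real submission a library such as scikit-learn would normally supply the models and metrics.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def centroid_predict(train_X, train_y, x):
    """Assign x to the class whose mean feature vector is closest."""
    groups = {}
    for features, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(features)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def evaluate(y_true, y_pred, positive=1):
    """Return the 2x2 confusion matrix [[TN, FP], [FN, TP]], accuracy, and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = len(pairs) - tp - fn - fp
    return {
        "confusion": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical toy data: two features, binary label.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(2, 2), (6, 5), (3, 3), (7, 7)]
test_y = [0, 1, 0, 1]

results = {
    "k-NN": evaluate(test_y, [knn_predict(train_X, train_y, x) for x in test_X]),
    "nearest centroid": evaluate(
        test_y, [centroid_predict(train_X, train_y, x) for x in test_X]
    ),
}
for name, metrics in results.items():
    print(name, metrics)
```

Recall is reported alongside accuracy because on imbalanced data a model can reach high accuracy while missing most positive cases.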

3.4 Association Rule Analysis

Apply association rule mining (e.g., Apriori algorithm) to the dataset to uncover potential relationships between data items.

Present the discovered association rules in tabular form and explain their practical value in real-world scenarios.
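
The idea behind Apriori can be sketched in a few lines: itemsets are searched level by level, and a candidate of size k is counted only if all of its subsets of size k-1 are already frequent. The version below is a compact teaching sketch, not an optimized implementation; the baskets, thresholds, and function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori-style search; returns {itemset: support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = [frozenset([item]) for item in {i for t in transactions for i in t}]
    frequent, k = {}, 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent sets into size-k candidates, then prune any candidate
        # that has an infrequent (k-1)-subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (set(antecedent), set(itemset - antecedent), support, confidence)
                    )
    return rules


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
freq = frequent_itemsets(baskets, min_support=0.4)
found = association_rules(freq, min_confidence=0.7)
for antecedent, consequent, support, confidence in found:
    print(f"{sorted(antecedent)} -> {sorted(consequent)}  "
          f"support={support:.2f}  confidence={confidence:.2f}")
```

Each printed line corresponds to one row of the rule table the report should contain: antecedent, consequent, support, and confidence.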

3.5 Clustering Analysis

Perform clustering analysis (e.g., K-Means or hierarchical clustering) to identify natural groupings within the data.

Visualize the clustering results, analyze the characteristics of each cluster, and explain the significance of these clusters in a business context.
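
For reference, K-Means alternates between two steps until the centroids stop moving: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below shows that loop in plain Python on hypothetical two-dimensional data; the function name, the seed parameter, and the random choice of initial centroids are assumptions made for illustration.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Basic K-Means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids


# Hypothetical customers described by two already-scaled features.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves carry no meaning, so the report should describe clusters by their centroid values and member characteristics rather than by label.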

3.6 Anomaly Detection

Apply anomaly detection algorithms (e.g., Isolation Forest, density-based anomaly detection) to identify anomalies in the data.

Explain the possible meaning of these anomalies, such as potential fraud or equipment failures.
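
As a simple illustration of the density-based idea, the sketch below scores each point by its mean distance to its k nearest neighbours and flags the highest-scoring fraction as anomalies. This is a simplified stand-in chosen for brevity, not Isolation Forest or a full density-based method such as LOF; the readings, parameter names, and thresholds are hypothetical.

```python
import math


def knn_scores(points, k=2):
    """Mean distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_anomalies(points, k=2, contamination=0.1):
    """Flag the points with the largest scores; contamination is the assumed anomaly share."""
    scores = knn_scores(points, k)
    n_flag = max(1, round(contamination * len(points)))
    cutoff = sorted(scores, reverse=True)[n_flag - 1]
    return [s >= cutoff for s in scores]


# Hypothetical sensor readings: nine similar points and one far away.
readings = [
    (1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0), (1.2, 1.1),
    (0.8, 0.9), (1.1, 1.2), (0.9, 1.1), (1.0, 0.8), (9.0, 9.0),
]
flags = flag_anomalies(readings)
print([p for p, is_anomaly in zip(readings, flags) if is_anomaly])
```

Points in sparse regions get large scores because their neighbours are far away. The contamination value is a modelling assumption and should be justified from domain knowledge in the report.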

3.7 Summary Report

Write a comprehensive report that describes each step of the process, methods used, analysis of results, and conclusions drawn from the data.

The report should include exploratory data analysis results, performance comparison of classification analysis, association rules, clustering results, and insights from anomaly detection.

 

4. Submission Requirements:

Code files must be fully annotated to ensure readability.

The data analysis report should be submitted in PDF format, with a minimum of 2000 words, and include visual charts (e.g., line charts, pie charts, scatter plots).

Final code and report should be submitted before the course deadline.

 

5. Grading Criteria:

Data Preprocessing and Cleaning: 20%

Classification Analysis and Algorithm Comparison: 20%

Association Rule Mining: 15%

Clustering Analysis: 15%

Anomaly Detection: 10%

Summary Report: 20%

6. Additional Notes:

The report should analyze the results in depth rather than simply present numbers. For each step, clearly explain the reasoning behind it and its business significance.

I wish you all the best in your learning journey and look forward to seeing your outstanding work!

 
