CS439: Introduction to Data Science Fall 2024
Problem Set 1
项目类别:计算机

Late Policy: The homework is due on 10/11 (Friday) at 11:59pm. We will release the solutions

of the homework on Canvas on 10/16 (Wednesday) 11:59pm. If your homework is submitted to

Canvas before 10/11 11:59pm, there will no late penalty. If you submit to Canvas after 10/11

11:59pm and before 10/16 11:59pm (i.e., before we release the solution), your score will be

penalized by 0.9k

, where k is the number of days of late submission. For example, if you

submitted on 10/14, and your original score is 80, then your final score will be 80*0.93

=58.32

for 14-11=3 days of late submission. If you submit to Canvas after 10/16 11:59pm (i.e., after we

release the solution), then you will earn no score for the homework.  

General Instructions

Submission instructions: These questions require thought but do not require long answers.

Please be as concise as possible. You should submit your answers as a writeup in PDF format,

for those questions that require coding, write your code for a question in a single source code

file, and name the file as the question number (e.g., question_1.java or question_1.py), finally,

put your PDF answer file and all the code files in a folder named as your Name and NetID (i.e.,

Firstname-Lastname-NetID.pdf), compress the folder as a zip file (e.g., Firstname-LastnameNetID.zip),

and submit the zip file via Canvas.

For the answer writeup PDF file, we have provided both a word template and a latex template

for you, after you finished the writing, save the file as a PDF file, and submit both the original

file (word or latex) and the PDF file.

Questions

1. Map-Reduce (35 pts)

Write a MapReduce program in Hadoop that implements a simple “People You Might Know”

social network friendship recommendation algorithm. The key idea is that if two people have a

lot of mutual friends, then the system should recommend that they connect with each other.

Input: Use the provided input file hw1q1.zip.

The input file contains the adjacency list and has multiple lines in the following format:

<User><TAB><Friends>

Here, <User> is a unique integer ID corresponding to a unique user and <Friends> is a commaseparated

list of unique IDs corresponding to the friends of the user with the unique ID <User>.

Note that the friendships are mutual (i.e., edges are undirected): if A is friend with B, then B is

also friend with A. The data provided is consistent with that rule as there is an explicit entry for

each side of each edge.

Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends

N = 10 users who are not already friends with U, but have the largest number of mutual friends

in common with U.

Output: The output should contain one line per user in the following format:

<User><TAB><Recommendations>

where <User> is a unique ID corresponding to a user and <Recommendations> is a commaseparated

list of unique IDs corresponding to the algorithm’s recommendation of people that

<User> might know, ordered by decreasing number of mutual friends. Even if a user has

fewer than 10 second-degree friends, output all of them in decreasing order of the number of

mutual friends. If a user has no friends, you can provide an empty list of recommendations. If

there are multiple users with the same number of mutual friends, ties are broken by ordering

them in a numerically ascending order of their user IDs.

Also, please provide a description of how you are going to use MapReduce jobs to solve this

problem. We only need a very high-level description of your strategy to tackle this problem.

Note: It is possible to solve this question with a single MapReduce job. But if your solution

requires multiple MapReduce jobs, then that is fine too.

What to submit:

(i) The source code as a single source code file named as the question number (e.g.,

question_1.java).

(ii) Include in your writeup a short paragraph describing your algorithm to tackle this problem.

(iii) Include in your writeup the recommendations for the users with following user IDs:

924, 8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.

2. Association Rules (35 pts)

Association Rules are frequently used for Market Basket Analysis (MBA) by retailers to

understand the purchase behavior of their customers. This information can be then used for many different purposes such as cross-selling and up-selling of products, sales promotions,

loyalty programs, store design, discount plans and many others.

Evaluation of item sets: Once you have found the frequent itemsets of a dataset, you need to

choose a subset of them as your recommendations. Commonly used metrics for measuring

significance and interest for selecting rules for recommendations are:

2a. Confidence (denoted as conf(A → B)): Confidence is defined as the probability of

occurrence of B in the basket if the basket already contains A:

conf(A → B) = Pr(B|A),

where Pr(B|A) is the conditional probability of finding item set B given that item set A is

present.

留学ICU™️ 留学生辅助指导品牌
在线客服 7*24 全天为您提供咨询服务
咨询电话(全球): +86 17530857517
客服QQ:2405269519
微信咨询:zz-x2580
关于我们
微信订阅号
© 2012-2021 ABC网站 站点地图:Google Sitemap | 服务条款 | 隐私政策
提示:ABC网站所开展服务及提供的文稿基于客户所提供资料,客户可用于研究目的等方面,本机构不鼓励、不提倡任何学术欺诈行为。