Statistical Methods for Data Science DATA7202
Assignment 2 (Weight: 25%)
Please answer the questions below. For theoretical questions, you should present rigorous proofs
and appropriate explanations. Your report should be visually appealing and all questions should
be answered in the order of their appearance. For programming questions, you should present your
analysis of data using Python, Matlab, or R, as a short report, clearly answering the objectives
and justifying the modeling (and hence statistical analysis) choices you make, as well as discussing
your conclusions. Do not include excessive amounts of output in your reports. All the code should
be copied into the appendix and the sources should be packaged separately and submitted on the
blackboard in a zipped folder with the name:
"student_last_name.student_first_name.student_id.zip".
For example, suppose that the student name is John Smith and the student ID is 123456789.
Then, following the pattern above, the zipped file name will be Smith.John.123456789.zip.
1. [15 Marks (see details below)] Consider the Hitters data set from Assignment 1 (given in
Hitters.csv) and recall that our objective is to predict a hitter’s salary via linear models.
(a) [10 Marks] Apply Principal Component Regression (PCR) with every possible number of
principal components. Using 10-fold cross-validation, plot the mean squared error as
a function of the number of components and determine the optimal number of components.
(b) [5 Marks] Apply the Lasso method and plot the 10-fold cross-validation mean squared
error as a function of λ. Determine the best λ and the corresponding mean squared error.
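The two cross-validation loops in this question can be sketched as follows. This is a minimal illustration on synthetic data, not on Hitters.csv; the variable names, the simulated coefficients, and the choice of scikit-learn are assumptions of the sketch. Note that scikit-learn calls the Lasso penalty `alpha` rather than λ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and response (NOT the Hitters data).
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=1.0, size=n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# PCR: standardize, keep the first m principal components, then least squares.
pcr_mse = []
for m in range(1, p + 1):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())
    scores = cross_val_score(pcr, X, y, cv=cv, scoring="neg_mean_squared_error")
    pcr_mse.append(-scores.mean())  # scores are negated MSEs
best_m = int(np.argmin(pcr_mse)) + 1

# Lasso: LassoCV fits a grid of penalties and stores the per-fold MSE in mse_path_.
# For brevity the predictors are standardized once, using all rows.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=cv).fit(Xs, y)
lasso_mse = lasso.mse_path_.mean(axis=1)  # mean over folds, one value per penalty
best_lambda = lasso.alpha_
best_lasso_mse = lasso_mse.min()

print("PCR: best number of components =", best_m)
print("Lasso: best penalty =", best_lambda, "CV MSE =", best_lasso_mse)
```

Plotting `pcr_mse` against `range(1, p + 1)` and `lasso_mse` against `lasso.alphas_` (for example with matplotlib, on a log scale for the penalty) would give the two curves the question asks for.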
2. [15 Marks] Consider the data given in ships.csv. There are 34 observations that contain
a ship type (coded 1-5 for A, B, C, D and E), year of construction (1=1960-64, 2=1965-69,
3=1970-74, 4=1975-79), period of operation (1=1960-74, 2=1975-79), months of service (63 to
20,370), and the response variable, the number of damage incidents, which ranges from 0 to 53.
Construct a Poisson regression model and report the coefficients (for type, construction,
operation, and months) and the corresponding 95% CIs. You can use the statsmodels.api
module.
3. [30 Marks (see details below)] A soft drink bottler is analyzing vending machine service routes
in his distribution system. He is interested in predicting the amount of time required by the
route driver to service the vending machines in an outlet. This service activity includes stocking
the machine with beverage products and minor maintenance or housekeeping. The industrial
engineer responsible for the study has suggested that the two most important variables affecting
the delivery time are the number of cases of product stocked and the distance walked by the
route driver. The engineer has collected 25 observations on delivery time (minutes), number of
cases and distance walked (feet). The data is in the file “softdrink.csv”.
(a) [10 Marks] Compute the multiple regression of Time on Cases and Distance. State the
fitted model, the estimated residual standard deviation, and the P-values for the overall
model and each of the two predictors.
(b) [10 Marks] Obtain residual plots and the histogram of the residuals. Comment on these.
(c) [10 Marks] There is an observation in this data set which is extremely influential according
to Cook’s distance. Which observation is it? Display a Cook’s distance plot to determine
the Cook’s distance of the next most influential observation.
4. [25 Marks] Conjugate categorical random variable analysis: Consider n iid categorical random
variables $Y_i$ ($i = 1, \ldots, n$), each with p.m.f.
\[
P(y \mid p_1, \ldots, p_k) = \prod_{j=1}^{k} p_j^{\mathbb{1}\{y = j\}}.
\]
Suppose that the prior for $\theta = \{p_1, \ldots, p_k\}$ is the Dirichlet distribution
$\mathrm{Dirichlet}(\alpha_0^{(1)}, \ldots, \alpha_0^{(k)})$. Namely,
\[
p(p_1, \ldots, p_k \mid \alpha_0^{(1)}, \ldots, \alpha_0^{(k)}) \propto \prod_{j=1}^{k} p_j^{\alpha_0^{(j)} - 1}.
\]
Derive the posterior distribution of θ.
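The shape of the conjugacy argument can be sketched as below. The count notation $n_j$ (the number of observations equal to category $j$) is introduced here for convenience and is not part of the question; a full answer would also justify each step.

```latex
\begin{align*}
n_j &= \sum_{i=1}^{n} \mathbb{1}\{y_i = j\}, \qquad j = 1, \ldots, k, \\
P(y_1, \ldots, y_n \mid p_1, \ldots, p_k)
  &= \prod_{i=1}^{n} \prod_{j=1}^{k} p_j^{\mathbb{1}\{y_i = j\}}
   = \prod_{j=1}^{k} p_j^{n_j}, \\
p(p_1, \ldots, p_k \mid y_1, \ldots, y_n)
  &\propto \prod_{j=1}^{k} p_j^{n_j} \, \prod_{j=1}^{k} p_j^{\alpha_0^{(j)} - 1}
   = \prod_{j=1}^{k} p_j^{\alpha_0^{(j)} + n_j - 1}.
\end{align*}
```

The last expression has the same form as the Dirichlet kernel, with each parameter $\alpha_0^{(j)}$ replaced by $\alpha_0^{(j)} + n_j$.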
5. [15 Marks (see details below)] Consider sampling from the 2-dimensional pdf
\[
f(x, y) = c \, e^{-(xy + x + y)}, \qquad x > 0, \; y > 0,
\]
for some normalization constant c, using a Gibbs sampler. Let $(X, Y) \sim f$.
(a) [5 Marks] Find the conditional pdf of X given Y = y, and the conditional pdf of Y given
X = x.
(b) [10 Marks] Write working code that implements the Gibbs sampler and outputs 1000
points that are approximately distributed according to f .
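One possible shape for such a sampler is sketched below, using only the Python standard library. It relies on the observation that, for fixed y, f(x, y) is proportional to exp(-(1 + y) x) in x, which is an exponential density with rate 1 + y (and symmetrically for y given x). The function name, the starting point, and the burn-in length are assumptions of this sketch, not requirements from the question.

```python
import random


def gibbs_sample(n_samples=1000, burn_in=500, seed=0):
    """Alternate draws from X | Y = y and Y | X = x for f(x, y) ∝ exp(-(xy + x + y))."""
    rng = random.Random(seed)
    x, y = 1.0, 1.0  # arbitrary positive starting point (assumption)
    samples = []
    for t in range(burn_in + n_samples):
        # X | Y = y is exponential with rate 1 + y.
        x = rng.expovariate(1.0 + y)
        # Y | X = x is exponential with rate 1 + x.
        y = rng.expovariate(1.0 + x)
        if t >= burn_in:  # discard the initial draws
            samples.append((x, y))
    return samples


points = gibbs_sample()
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
print(len(points), "points; sample means:", mean_x, mean_y)
```

Because consecutive draws are correlated, the output is only approximately distributed according to f; a longer burn-in or thinning could be used if the chain appears slow to mix.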