DATA7202 Statistical Methods for Data Science

Please answer the questions below. For theoretical questions, you should present rigorous proofs and appropriate explanations. Your report should be visually appealing and all questions should be answered in the order of their appearance. For programming questions, you should present your analysis of data using Python, Matlab, or R, as a short report, clearly answering the objectives and justifying the modeling (and hence statistical analysis) choices you make, as well as discussing your conclusions. Do not include excessive amounts of output in your reports. All the code should be copied into the appendix and the sources should be packaged separately and submitted on the blackboard in a zipped folder with the name:

"student_last_name.student_first_name.student_id.zip".

For example, suppose that the student name is John Smith and the student ID is 123456789. Then, the zipped file name will be John.Smith.123456789.zip.

1. [15 Marks] Repeat the advertisement exercise with the following changes.

(a) The data is generated via the following data generation mechanism: Xi ~ U(0, 1) for i ∈ {1, 2, 3}, where U(0, 1) stands for the continuous uniform distribution over the interval [0, 1]. However, we require that X1 + X2 + X3 = 1, that is, the explanatory variables stand for percentages of the budget.

(b) In addition, the model for Y is as follows:

Y = 0.5X1 + 3X2 + 5X3 + 5X2X3 + 2X1X2X3 + W, (1)

where W ~ U(0, 1).

Similar to the original example, generate train and test sets of size N = 1000. Fit the linear regression and the random forest (RF) models to the data. For the linear regression, make an inference about the coefficients; specifically, comment on the contributions of the different advertisement types to sales. Use the linear model and the RF (with 500 trees) to make predictions on the test set, and report the corresponding mean squared errors.

When constructing the datasets, please use seeds “1” and “2” for the train and the test sets, respectively.
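
The data generation mechanism in (a) and (b) could be sketched along the following lines. This is only an illustration under stated assumptions, not a prescribed solution: the text does not say how the constraint X1 + X2 + X3 = 1 is enforced, so the sketch assumes that three independent U(0, 1) draws are rescaled by their sum. The function name `make_data` is hypothetical, and NumPy's `default_rng` is just one possible way to apply the seeds. The random forest fit is left out to keep the sketch free of extra dependencies.

```python
import numpy as np

def make_data(n, seed):
    """Draw (X, y) from model (1); the rescaling step is an assumption."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=(n, 3))
    # Assumption: enforce X1 + X2 + X3 = 1 by dividing each row by its sum.
    x = u / u.sum(axis=1, keepdims=True)
    x1, x2, x3 = x.T
    w = rng.uniform(0.0, 1.0, size=n)  # noise term W ~ U(0, 1)
    y = 0.5 * x1 + 3 * x2 + 5 * x3 + 5 * x2 * x3 + 2 * x1 * x2 * x3 + w
    return x, y

# Seeds 1 and 2 for the train and test sets, as the question requests.
X_train, y_train = make_data(1000, seed=1)
X_test, y_test = make_data(1000, seed=2)

# Least-squares fit without an intercept column: because each row of X sums
# to 1, an explicit intercept would be perfectly collinear with X1, X2, X3.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Mean squared error of the linear fit on the test set.
test_mse = float(np.mean((y_test - X_test @ beta) ** 2))
print("coefficients:", beta)
print("test MSE:", test_mse)
```

Dropping the intercept is one way to deal with the collinearity caused by the budget constraint; another would be to keep the intercept and drop one of the three predictors. The choice changes how the coefficients are interpreted, so it should be stated in the report.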

2. [10 Marks] Consider the following variant of the cross-validation procedure.

(i) Using the available data, find a subset of “good” predictors that show correlation with the response variable.

(ii) Using these predictors, construct a model (for regression or classification).

(iii) Use cross-validation to estimate the model prediction error.

Is this a good method? Do you expect to obtain the true prediction error? Explain your answer.
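
One way to explore this question empirically is a small simulation in which the predictors are pure noise, so any classifier has a true error rate of 50%. The sketch below compares the procedure in (i) to (iii), where predictors are screened once on all the data, against a variant where the screening is repeated inside each fold. Everything in it is an assumption made for illustration: the sample sizes, the nearest-centroid classifier, and the helper names `top_k_by_correlation`, `nearest_centroid_predict`, and `cv_error` are not part of the assignment.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Indices of the k columns of X with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class with the closer training mean."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, k, n_folds, select_inside_folds, rng):
    """Cross-validated misclassification rate with two screening options."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    # Procedure (i)-(iii): screen predictors once, using all the data.
    cols = None if select_inside_folds else top_k_by_correlation(X, y, k)
    mistakes = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        if select_inside_folds:
            # Alternative: screen using the training part of the fold only.
            cols = top_k_by_correlation(X[train_idx], y[train_idx], k)
        pred = nearest_centroid_predict(
            X[train_idx][:, cols], y[train_idx], X[test_idx][:, cols]
        )
        mistakes += int((pred != y[test_idx]).sum())
    return mistakes / n

rng = np.random.default_rng(0)
n, p, k, n_folds, n_reps = 50, 1000, 10, 5, 20
y = np.repeat([0, 1], n // 2)  # balanced labels, independent of X

screen_first, screen_in_fold = [], []
for _ in range(n_reps):
    X = rng.standard_normal((n, p))  # predictors carry no signal about y
    screen_first.append(cv_error(X, y, k, n_folds, False, rng))
    screen_in_fold.append(cv_error(X, y, k, n_folds, True, rng))

screen_first_err = float(np.mean(screen_first))
screen_in_fold_err = float(np.mean(screen_in_fold))
print("CV error, screening on all data:   ", screen_first_err)
print("CV error, screening inside folds:  ", screen_in_fold_err)
```

Comparing the two printed averages with the known 50% error rate gives a concrete starting point for the explanation the question asks for.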

3. [5 Marks] Suppose that we observe X1, ..., Xn ~ F. We model F as a normal distribution with mean μ and standard deviation σ. For this problem, determine the hypothesis class

H = {f(x, θ); θ ∈ Θ},

and state explicitly what θ and Θ are.

4. [15 Marks] Let H be a class of binary classifiers over a set Z. Let D be an unknown distribution over X, and let g be a target hypothesis in H. Show that the expected value of Loss_T(g) over the choice of T equals Loss_D(g), namely,

E_T Loss_T(g) = Loss_D(g).
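
A numerical check can make the identity concrete before attempting the proof, although it is no substitute for it. The sketch below is built entirely on assumptions chosen for illustration: D is taken to be U(0, 1), the labels come from a hypothetical threshold rule `f`, the classifier `g` is a different fixed threshold rule, and the loss is the 0-1 loss. Under these choices Loss_D(g) = P(0.3 < X ≤ 0.5) = 0.2, and the average of Loss_T(g) over many training sets T should be close to that value.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical labelling rule (an assumption for this illustration)."""
    return (x > 0.5).astype(int)

def g(x):
    """Fixed classifier whose loss is being evaluated."""
    return (x > 0.3).astype(int)

m = 20          # size of each training set T
n_sets = 20000  # number of independent training sets

# Each row is one training set T of m points drawn i.i.d. from D = U(0, 1).
x = rng.uniform(0.0, 1.0, size=(n_sets, m))

# Loss_T(g): fraction of points in T on which g disagrees with the labels.
loss_T = (g(x) != f(x)).mean(axis=1)

# Monte Carlo estimate of E_T Loss_T(g).
avg_loss_T = float(loss_T.mean())

# Loss_D(g): g and f disagree exactly when 0.3 < x <= 0.5.
loss_D = 0.2

print("average training loss:", avg_loss_T)
print("true loss:            ", loss_D)
```

The individual values of Loss_T(g) vary considerably from one training set to another; only their average matches Loss_D(g), which is exactly what the statement claims.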

5. [15 Marks (see details below)] Consider the following dataset.

Now, suppose that we would like to consider two models.
