SOST 30062: Data Science Modelling

Statistical learning
▶ a set of tools
▶ for understanding data
Introduction
Introduction
Premises
1. Many statistical learning (SL) methods are relevant in a wide range of settings
2. No single SL method will perform well in all possible applications
3. The internal mechanisms of the methods we will cover are complex (and interesting), but we will not need to master them to apply SL techniques successfully
4. The focus is on real-life applications.
Introduction
Two types of tools:
▶ Supervised: Build a statistical model for predicting or estimating an output based on one or more inputs
▶ Unsupervised: Applied to situations with only inputs, where one wants to learn relationships and structure from the data
Supervised learning: Overview
Motivating example
[Figure: three panels plotting Wage against Age, Year, and Education (1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree)]
Motivating example
In the example
▶ Years of education, age and year are the input variables; these are also called predictors, independent variables, features, or covariates
▶ Wage is the output variable, also called the response variable or dependent variable
Despite the variation/noise, the figures suggest there are some overall
relationships between the inputs and the output.
That overall relationship is what interests us.
Some notation
▶ Inputs are denoted X1, X2, . . . , Xp
▶ Outputs are denoted Y
▶ A unit (person, firm, village, etc.) is denoted i, and there are N of these units, so i = 1, 2, . . . , N
▶ Unit i's value for variable j is xij (where j = 1, 2, . . . , p).
We believe there is some relationship between Y and X = (X1, X2, . . . , Xp), which can be written, in general form, as
Y = f(X1, X2, . . . , Xp) + ε ≡ f(X) + ε
▶ f is some fixed but unknown function
▶ ε is a random error term
▶ ε is independent of X1, X2, . . . , Xp
▶ ε has mean 0
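This setup can be made concrete by simulating the data-generating process. In the sketch below the single input, the true f and the error distribution are all arbitrary choices made for illustration; in practice f is unknown:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 200                          # number of units
X = rng.uniform(0, 10, size=n)   # a single input X1 (p = 1 for simplicity)

def f(x):
    # the true (in practice unknown) function; this shape is an arbitrary choice
    return 2 + 0.5 * x

eps = rng.normal(0, 1, size=n)   # error term: independent of X, mean 0
Y = f(X) + eps                   # the observed output

print(Y.shape)
```

The analyst only ever sees `X` and `Y`; estimating f means recovering something close to `f` from those pairs alone.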
Why estimate f?
Prediction
Values of the inputs are known, but the output is not observed at the time. If f can be estimated, then we can predict Y given the levels of X1, . . . , Xp:
Ŷ = f̂(X1, . . . , Xp) ≡ f̂(X)
The accuracy of the prediction depends on
▶ Reducible error: In general f̂ ≠ f, but this error can be reduced by using the most appropriate statistical method
▶ Irreducible error: Recall Y = f(X) + ε, where ε cannot be predicted. No matter how well we estimate f, we cannot reduce the error introduced by ε
Why estimate f?
Prediction
More formally, the average squared error is a valid measure of the accuracy of the prediction, E(Y − Ŷ)². It can be shown that

E(Y − Ŷ)² = E[f(X) − f̂(X)]² + V(ε) = Reducible + Irreducible   (1)
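Equation (1) can be checked by simulation. In this sketch the true f, a deliberately imperfect f̂, and V(ε) = 1 are all arbitrary choices; with a large sample, the average squared prediction error should approximately equal the reducible part plus V(ε):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
sigma = 1.0                            # sd of eps, so V(eps) = sigma**2 = 1
X = rng.uniform(0, 10, size=n)

f = lambda x: 2 + 0.5 * x              # true f (arbitrary for the sketch)
f_hat = lambda x: 1.5 + 0.55 * x       # a deliberately imperfect estimate

Y = f(X) + rng.normal(0, sigma, size=n)
Y_hat = f_hat(X)

total = np.mean((Y - Y_hat) ** 2)            # E(Y - Yhat)^2
reducible = np.mean((f(X) - f_hat(X)) ** 2)  # E[f(X) - fhat(X)]^2
irreducible = sigma ** 2                     # V(eps)

print(round(total, 2), round(reducible + irreducible, 2))
```

Even a perfect f̂ would leave an average squared error of V(ε); only the first term can be driven towards zero.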
Why estimate f?
Inference
How do X1, X2, . . . , Xp affect Y?
▶ Which predictors are associated with the response?
▶ Can the relationship between the inputs and the output be approximated using a specific model?
How is f estimated?
There are i = 1, . . . , n units, j = 1, . . . , p inputs and one output.
Let yi be unit i's value for the output.
Let xij be unit i's value for input j.
Let's put these values into a vector xi = (xi1, xi2, . . . , xip)′.
The full dataset is the set {(x1, y1), (x2, y2), . . . , (xn, yn)}. In SL this is called the training data.
We will use this training data to estimate (learn) the unknown
function f .
How is f estimated?
Parametric methods
1. First, assume a functional form (a shape) for f
2. Then find a procedure that uses the training data to fit (or train) the model
Example:
1. Assume
f(X) = β0 + β1 · X1 + β2 · X2 + . . . + βp · Xp
The model specifies everything; the only unknown bits are the parameters β0, β1, . . . , βp.
2. Find β0, β1, . . . , βp; the most common way of doing this is Ordinary Least Squares.
This method is parametric in the sense that the problem of finding f is
reduced to estimating a small set of parameters.
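A minimal sketch of this two-step recipe, using simulated training data so that the "true" parameters are known and we can see that Ordinary Least Squares recovers them (the sample size, inputs and coefficient values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data with p = 2 inputs
n = 500
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -0.5])               # beta0, beta1, beta2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, size=n)

# Step 1: assume f(X) = beta0 + beta1*X1 + beta2*X2 (the linear form above)
# Step 2: fit by Ordinary Least Squares -- add a column of ones for beta0
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(beta_hat, 1))
```

The whole problem of "finding f" has collapsed into estimating three numbers, which is exactly what makes the method parametric.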
How is f estimated?
Parametric methods
Pros
▶ They tend to rely on models that can be estimated quickly and easily
▶ Parametric models are easy to interpret, so they are good for inference
Cons
▶ The choice of functional form is subjective
▶ The model we choose will usually not match the true f, and this will lead to poor inference and prediction if the differences are too big
To address the last point, we can devise more flexible parametric models, but this will generally reduce the interpretability of the model and might lead to overfitting.
Example
The motorcycle data. We have fitted a model of the form
Acceleration = β0 + β1 · Milliseconds + ε
[Figure: Acceleration against Milliseconds for the motorcycle data, with the fitted straight line]
How is f estimated?
Non-Parametric methods
They do not make explicit assumptions about the functional form of f. Instead, they try to estimate an f that gets as close as possible to the data points without being too wiggly or rough.
▶ These methods can potentially fit a wider range of shapes for f
▶ They avoid the risk of misspecification1
However,
▶ The results produced by these methods are more difficult to interpret
▶ Since they do not rely on prior information (in the form of parameters), they need much more data (information) to work optimally
▶ They normally rely on tuning parameters that determine the amount of smoothing; these tuning parameters need to be chosen before estimation.
1While in a parametric model the proposed shape might be far from the true f, this is avoided in nonparametric methods, which do not impose any a priori shape on f.
Example
The motorcycle data. We have fitted a nonparametric model (a local linear regression).
[Figure: Acceleration against Milliseconds for the motorcycle data, with the local linear fit]
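A local linear fit like the one above can be sketched by hand: at each target point x0 we run a weighted least-squares regression, with weights from a kernel centred at x0. The bandwidth is a tuning parameter of the kind mentioned earlier. The kernel, bandwidth and simulated data below are arbitrary choices for the sketch (the motorcycle data itself is not reproduced here):

```python
import numpy as np

def local_linear(x_train, y_train, x0, bandwidth):
    """Local linear regression estimate of f(x0), using a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)   # kernel weights
    X = np.column_stack([np.ones_like(x_train), x_train - x0])
    # weighted least squares: X'WX beta = X'Wy; the intercept is the fit at x0
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y_train)
    return beta[0]

# Simulated data from a nonlinear f -- no functional form is assumed in the fit
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(0, 0.2, size=300)

f_hat = [local_linear(x, y, x0, bandwidth=0.3) for x0 in (2.0, 5.0)]
print(np.round(f_hat, 1))
```

Note that nothing in the fitting step assumes the sine shape; the estimate tracks it purely because nearby data points are given large weights, which is what makes the method nonparametric.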

