SOST 30062: Data Science Modelling
Statistical learning
- a set of tools
- for understanding data
Introduction
Main text:
Click HERE for a link to the website.
Introduction
Premises
1. Many statistical learning (SL) methods are relevant in a wide range
of settings
2. No single SL method will perform well in all possible applications
3. The internal mechanisms of the methods we will cover are complex
(and interesting), but we will not need them to apply SL techniques
successfully
4. The focus is on real-life applications.
Introduction
Two types of tools:
- Supervised: Build a statistical model for predicting or estimating
an output based on one or more inputs
- Unsupervised: Apply to situations with only inputs, where one wants
to learn relationships and structure from such data
Supervised learning: Overview
Motivating example
[Figure: Wage data, three panels. Wage vs Age; Wage vs Year (2004-2008); Wage vs Education level (1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree)]
Motivating example
In the example
- Years of education, age and year are input variables. Also called
predictors, independent variables, features, covariates
- Wage is the output variable, also called the response variable or
dependent variable
Despite the variation/noise, the figures suggest there are some overall
relationships between the inputs and the output.
That overall relationship is what interests us.
Some notation
- Inputs are denoted X1, X2, . . . , Xp
- Outputs are denoted Y
- A unit (person, firm, village, etc.) is denoted i, and there are N of
these units, so i = 1, 2, . . . , N
- Unit i's value for variable j is xij (where j = 1, 2, . . . , p).
We believe there is some relationship between Y and
X = (X1, X2, . . . , Xp), which can be written, in general form, as
Y = f(X1, X2, ..., Xp) + ε ≡ f(X) + ε
- f is some fixed but unknown function
- ε is a random error term
- ε is independent of X1, X2, . . . , Xp
- ε has 0 mean
Why estimate f?
Prediction
Values of the inputs are known, but the output is not observed at the time.
If f can be estimated, then we can predict Y given the levels of
X1, . . . , Xp:
Ŷ = f̂(X1, . . . , Xp) ≡ f̂(X)
The accuracy of the prediction depends on
- Reducible error: In general f̂ ≠ f, but this error can be reduced by
using the most appropriate statistical method
- Irreducible error: Recall Y = f(X) + ε, where ε cannot be
predicted. No matter how well we estimate f, we cannot reduce
the error introduced by ε
Why estimate f?
Prediction
More formally, the average squared error, E(Y − Ŷ)², is a valid measure
of the accuracy of the prediction.
It can be shown that
E(Y − Ŷ)² = E[f(X) − f̂(X)]² + V(ε)
          = Reducible + Irreducible    (1)
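A small simulation can illustrate the decomposition in (1). The sketch below (Python; the function f, the noise level, and the two candidate estimates are all hypothetical choices) shows that a wrong f̂ pays a reducible penalty, while even the best possible estimate f̂ = f cannot push the average squared error below V(ε):

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) function f and irreducible noise (hypothetical choices)
f = lambda x: 2.0 + 3.0 * x
noise_sd = 1.0  # so V(eps) = 1.0

# Simulate many (X, Y) pairs from Y = f(X) + eps
n = 100_000
x = rng.uniform(0, 10, n)
y = f(x) + rng.normal(0, noise_sd, n)

# A deliberately wrong estimate f_hat: reducible error > 0
f_hat_bad = lambda x: 2.5 + 2.8 * x
# The best possible estimate f_hat = f: reducible error = 0
f_hat_best = f

mse_bad = np.mean((y - f_hat_bad(x)) ** 2)
mse_best = np.mean((y - f_hat_best(x)) ** 2)

# mse_best is close to V(eps) = 1.0; mse_bad exceeds it by the reducible part
print(mse_best, mse_bad)
```

Even with f̂ = f, the average squared error stays at roughly V(ε): that is the irreducible floor.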
Why estimate f?
Inference
How do X1, X2, . . . , Xp affect Y?
- Which predictors are associated with the response?
- Can the relationship between inputs and output be approximated
using a specific model?
How is f estimated?
There are i = 1, . . . , n units, j = 1, . . . , p inputs and one output.
Let yi be unit i's value for the output.
Let xij be unit i's value for input j.
Let's put these values into a vector xi = (xi1, xi2, ..., xip)′.
The full dataset is the set {(x1, y1), (x2, y2), ..., (xn, yn)}. In SL this is
called the training data.
We will use this training data to estimate (learn) the unknown
function f .
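A minimal sketch of this layout in Python (the numbers are hypothetical): the training data become an n × p input matrix whose row i holds xi, paired with an output vector y.

```python
import numpy as np

# Hypothetical training data: n = 4 units, p = 3 inputs
# Row i of X holds x_i = (x_i1, x_i2, x_i3); y[i] is unit i's output
X = np.array([[12.0, 45.0, 2004.0],
              [16.0, 38.0, 2006.0],
              [14.0, 51.0, 2005.0],
              [18.0, 29.0, 2008.0]])
y = np.array([95.0, 140.0, 110.0, 160.0])

n, p = X.shape
print(n, p)  # 4 units, 3 inputs
```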
How is f estimated?
Parametric methods
1. First, assume a functional form (a shape) for f (for instance, a linear model)
2. Find a procedure that uses the training data to fit (or train) the
model
Example:
1. Assume
f(X) = β0 + β1 ·X1 + β2 ·X2 + ...+ βp ·Xp
The model specifies everything; the only unknown bits are the
parameters βj, j = 0, 1, ..., p.
2. Find the βj, j = 0, 1, ..., p; the common way of doing this is
Ordinary Least Squares.
This method is parametric in the sense that the problem of finding f is
reduced to estimating a small set of parameters.
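The two steps can be sketched in Python (the data are synthetic and the coefficient values are hypothetical): assume the linear form, then recover the βj from the training data by Ordinary Least Squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data: n units, p = 2 inputs, linear truth (hypothetical)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])       # beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.3, n)

# Step 1: assumed functional form f(X) = b0 + b1*X1 + b2*X2
# Step 2: fit (train) the model by Ordinary Least Squares
X_design = np.column_stack([np.ones(n), X])  # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)  # close to beta_true
```

Finding f has been reduced to estimating three numbers, which is exactly what makes the method parametric.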
How is f estimated?
Parametric methods
Pros
- They tend to rely on models that can be estimated quickly and easily
- Parametric models are easy to interpret, so they are good for
inference
Cons
- The choice of the functional form is subjective
- The model we choose will usually not match the true f, and this will
lead to poor inference and prediction if the differences are too big
To address the last point, we can devise flexible parametric models, but
this will generally reduce the interpretability of the model and might
lead to a problem of overfitting
Example
The motorcycle data: we have fitted a model of the form
Acceleration = β0 + β1 ·Milliseconds+ ε
[Figure: Fitted linear model; Acceleration vs Milliseconds, motorcycle data]
How is f estimated?
Non-Parametric methods
Do not make explicit assumptions about the functional form of f
Instead, they try to estimate an f that gets as close as possible to the
data points without being too wiggly or rough.
- These methods can potentially fit a wider range of shapes for f
- They avoid the risk of misspecification¹
However
- The results produced by these methods are more difficult to
interpret
- Since they do not rely on prior information (in the form of
parameters), they need a lot more data (information) to work
optimally
- They normally rely on tuning parameters that determine the
amount of smoothing; these tuning parameters need to be chosen
before estimation.
¹While in a parametric model the proposed shape might be far away from f, this
is avoided in nonparametric methods, which do not impose any a priori shape on f.
Example
The motorcycle data: we have fitted a nonparametric model (a local
linear regression)
[Figure: Fitted local linear regression; Acceleration vs Milliseconds, motorcycle data]
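A minimal sketch of local linear regression in Python (synthetic data stand in for the motorcycle data; the variable names and the bandwidth value are assumptions): at each evaluation point x0 we fit a weighted least squares line, with weights from a Gaussian kernel whose bandwidth is the tuning parameter controlling the amount of smoothing.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data standing in for the motorcycle data (hypothetical values)
x = np.sort(rng.uniform(0, 10, 150))
y = np.sin(x) + rng.normal(0, 0.3, 150)

def local_linear(x0, x, y, bandwidth):
    """Fit a line by weighted least squares around x0, with Gaussian
    kernel weights; `bandwidth` is the tuning parameter that sets the
    amount of smoothing. Returns the fitted value at x0."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    X = np.column_stack([np.ones_like(x), x - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]  # the local intercept is the fit at x0

grid = np.linspace(0.5, 9.5, 50)
fit = np.array([local_linear(x0, x, y, bandwidth=0.5) for x0 in grid])
```

No functional form is assumed for f: the estimate simply follows the data locally, and a larger bandwidth produces a smoother (less wiggly) curve.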