1. Introduction
1.1. About this practical
This practical builds on the previous practical, using R but also Excel to make calculations and compare the relative support data provide for models in a set of multiple candidate models. We will be using the UKJanClimate.csv file we also used last week. Please refer to the handout and answers from Practical 4 for R functions and libraries for correlation analysis and simple linear regression.
1.2. Getting started in R
Before you start the data analysis in R, ensure that you have:
· Created a folder on your computer for this week’s practical
· Set the working directory for the practical work in R Studio
· Read the ‘UKJanClimate.csv’ and ‘deer51.csv’ files into R.
2. Multiple linear regression
2.1. A ‘Full Model’ for Minimum Temperature
We are building on Part 4 of practical 4 by first fitting multiple linear regression models for minTemp. Later we will fit multiple linear regression models for maxTemp and rainfall. First, we need to check our independent variables (lat, long and elevation) for colinearity (i.e. correlations between independent variables). As we will be using the same independent variables for each dependent variable we need to do this only once. You can do this using the function rcorr and the package Hmisc like we did in practical 4.
Q1. Are there any correlations between the independent variables that we need to worry about?
If you are happy that all independent variables can be used in the same multiple linear regression model, fit a multiple linear regression model for minTemp using all three independent (predictor) variables. We fit multiple linear regression models in R, the same way we fitted simple linear regression models during practical 4 using the lm() and summary() functions.
full.model <- lm(minTemp ~ VARIABLE1 + VARIABLE2 + VARIABLE_N, data = Jan.data)
summary(full.model)
To fit the multiple linear regression model, you will need to replace the placeholders for the independent variables with the variable names in our ‘Jan.data’ dataset.
Q2. Do the results for multiple linear regression model match your results for simple linear regression? If so, how?
Q3. From your diagnostic plots do you think the assumptions of multiple linear regression have been met?
2.2. A ‘Reduced Model’ for Minimum Temperature
Next, fit another multiple regression model for minTemp but with only the independent variables in the last model that had coefficients that were statistically significantly different from zero. From the lecture we know that the Elevation variable is not normally distributed and that this leads to problems with the residuals from a model that includes it. So as well as fitting a model with only Latitude and Elevation we use the log of Elevation. Yet, furthermore, because one of the elevation values is equal to zero we will need to take log(Elev + 1), as log(0) is not possible.
We can do this quite simply within the line of code for our regression model. For example:
min.reduced <- lm(minTemp ~ log(Elev + 1) + Lat, data = Jan.data)
Complete the table below for both this ‘reduced’ model and the previous ‘full’ model to compare them.
Model Statistic |
Full Model |
Reduced Model |
||
|
β |
p-value |
β |
p-value |
Constant |
xx.xxx |
xx.xxx |
xx.xxx |
xx.xxx |
Elevation |
xx.xxx |
xx.xxx |
|
|
Latitude |
xx.xxx |
xx.xxx |
xx.xxx |
xx.xxx |
Longitude |
xx.xxx |
xx.xxx |
|
|
Log(Elevation + 1) |
|
|
xx.xxx |
xx.xxx |
|
|
|
|
|
Model R2 |
xx.xxx |
xx.xxx |
||
Adjusted R2 |
xx.xxx |
xx.xxx |
||
Model F statistic |
xx.xxx |
xx.xxx |
||
Model p-value |
xx.xxx |
xx.xxx |
Q4. Compare the R2 values in your table. Which model do you prefer and why? [Hint: your answer should depend on whether you want the model that provides most predictive power, the most parsimonious model, or the model that makes most physical sense]
The alternative to calculating log(Elev + 1) when fitting the model, we could have created a new column in the DataFrame. and used that to fit the model:
logElev <- log(Jan.data$Elev + 1)
Jan.data <- cbind(Jan.data, logElev)
min.reduced <- lm(minTemp ~ logElev + Lat, data = Jan.data)
Stop here and move on to Section 3 – Multi-Model Inference of this week’s practical. Come back to finish Section 2.3 after completing Section 3. [This is to ensure you get through the section on MMI during the scheduled time].
2.3. Modelling Maximum Temperature
Now we will move to look at the maxTemp variable. Fit a multiple linear regression model using all three independent variables (lat, long & elevation) and check the diagnostic plots to check you are happy the assumptions of linear regression have been met. If, not try to reduce your model using different iterations of the independent variables. It may be that you produce more than one reduced model so add columns to the table below accordingly.
Model Statistic |
Full Model |
Reduced Model |
||
|
β |
p-value |
β |
p-value |
Constant |
|
|
|
|
Elevation |
|
|
|
|
Latitude |
|
|
|
|
Longitude |
|
|
|
|
Log(Elevation + 1) |
|
|
|
|
|
|
|
|
|
Model R2 |
|
|
||
Adjusted R2 |
|
|
||
Model F statistic |
|
|
||
Model p-value |
|
|
Q5. Why did you pick the final model you did?
3. Multi-Model Inference
3.1. Deer in Forests
In hardwood-conifer forests of North America, white-tailed deer have been found to drive changes in understory structure and species composition, cause species composition change of overstory trees and reduce stand timber value by slowing the recruitment of saplings to canopy positions. During winter in these forests, white-tailed deer generally shelter in mature conifer swamps, venturing out to browse in nearby stands, including northern hardwood stands. This behaviour is a response to the trade-off between conserving heat and energy in the shelter beneath the closed canopies of the (evergreen) conifer stands versus negotiating deeper snow and colder temperatures in the more open (deciduous) mixed hardwood stands to find adequate forage
To better understand influence on deer density in mixed hardwood-conifer forests, Millington et al. (2010) collected data on forest structure and estimates of deer density in 51 forest stand locations across the Upper Peninsula of Michigan. In this part of the practical we will use some of these data to explore models representing hypotheses about the influences of forest structure on deer density.
2.2 The data set
The data for this part of the practical are available in deer51.csv. The variables in the data set are:
· Site – id number for the site the data in this row are for
· DD – deer density (deer km-2) at the centre of the site
· DistLC – distance to nearest lowland conifer stand from the centre of the site (km)
· DBH – mean diameter-at-breast-height of trees at this site (cm), an indicator of tree size
· BA – mean basal area of trees at this site (m2 ha-1), an indicator of tree density
· SnowD – mean estimated snow depth for winter at this site (cm)
· LCProp – the proportion of lowland conifer tree at this site
Load this data set in R and familiarise yourself with the data:
- Plot histograms or box-plots of the individual variables to check data distribution and normality
- Check correlations between variables using a matrix scatterplot
In this multi-model inference practical we will use a linear regression.
Fit a simple linear regression to predict DD from DistLC and examine diagnostic plots (e.g. histograms, QQ plots and scatter plots of residuals) to establish is residuals for this model are normally distributed (see instructions for Practical 4 if you need a reminder how to do this).
Q6. Is this a useful model? i.e. Does it explain more variation than by chance? Does it meet the assumptions of simple linear regression?
From your results for this initial simple linear regression, you should be able to see that the residuals are not normally distributed (e.g. see the Normal Q-Q plot in the diagnostic plots), violating the assumptions of the model. To remedy this we will try transforming the deer density data.
Plot a histogram of the DD variable. The distribution of these data indicates that the data are log-normally distributed. Therefore, we will examine if a log transform. of these data will enable them to meet the assumptions of ordinary least squares regression. First, let’s create a new variable and append it to the original data set. For example:
logDD <- log(deer.data$DD, 10)
deer.data <- cbind(deer.data, logDD)
Check that new variable has been added to the existing DataFrame. and how the values in the new variable relate to the DD variable. Also, check what the distribution of the new variable looks like by plotting a histogram for it.
Now fit a simple linear regression to predict logDD from DistLC and examine diagnostic plots to establish of residuals for this model are normally distributed.
Q7. Is this a useful model? i.e. Does it explain more variation than by chance? Does it meet the assumptions of simple linear regression?
Hopefully you can see that the residuals of the regression for logDD suggest this transformed dependent variable is more likely to meet the assumptions of linear regression. We will use logDD for the remainder of the analyses.
3.2. Fitting simple linear regression models
In the next section we will be completing the MMI_Summary.xlsx Excel file with results of testing the utility of individual simple linear regression models. These simple linear regression models represent single-process hypotheses; the most basic models/hypotheses we can make. In the worksheet each row will provide details of the model, with fitted parameters in the columns B to G (some of these cells will remain empty where variables are not present in a model) and other information in the remaining column (review the comments in column headers of those other columns).
In R, fit simple linear regressions to predict logDD for each of the other variables (except DD of course).
As you fit the models complete the Excel worksheet with information from R in Columns A to M. Put the parameter estimates for the Intercept and the predictor variables in the appropriate columns.
- Values for columns A to J come from the output when fitting a linear regression model.
- The value for n is the total number of data points used to fit the regression. You can check this value in the original data or use: length(model$fitted.values)where ‘model’ is the name of your model in R.
- K is the number of parameters in the model (you might use the count() formula in Excel for columns B:G).
To obtain an AIC value in R you can use the AIC function built into R. For example:
AIC(model)
where ‘model’ is the name of your model in R.
Save your data! The figure below shows how the template should look with the information for the first model.
2.4 Fitting multivariate linear regression models
Now we have an idea about the variance explained by the individual predictor variables (via p values and r2) we will fit multivariate linear regressions to represent multi-process hypotheses. We will only fit multivariate models for the predictor variables which in simple linear regression models had p < 0.05 [Think back to week 4: what is the null hypothesis of the test this p value refers to]. If you fit models for all combinations of these models you will fit four multivariate models.
Fit these models now and add the information for them to the model comparison worksheet. Save your data!
2.5 Measures of model performance
Now we have fit what we think are the most appropriate models we will compare them using measures of model performance. The measures will be r2, Adjusted r2, AIC and AICc (see Eq. 1 and lecture slides for reasoning). You should already have for all models for the first three of these in columns I, J and M (if not, complete the information for all the models you have fit so far). In column N you now need to add a formula to calculate AICc using Eq. 1 as a guide:
Eq. (1)
Where:
n is the number of data points used to fit the regression model
K is the number of parameters in the regression model
Next, calculate deltaAICc for the models using Eq. 2 and enter values in column O [Hint: you may need to sort your data before calculating these values]:
Eq. (2)
You are now ready to compare your models (which represent hypotheses) and answer the following questions:
Q8. Which model is ‘best’ according to AICc?
Q9. According to AICc (and the rules of thumb offered by Millington and Perry 2011, p.452) which models in the set you have fit have strong support?
Q10. Which model is ‘best’ according to r2 (i.e. most variance explained)? Is this the same models as for the ‘best’ according to Adj r2 and AICc?
Q11. What is your ecological interpretation of the model comparison you have done?
If you want a further challenge calculate the AIC weights for the models using Eq 4. in Millington and Perry (2011).
References
Millington, J.D.A. and Perry, G.L.W. (2011) Multi-model inference in biogeography Geography Compass 5(7) 448–530
Millington, J.D.A., Walters, M.B., Matonis, M.S. and Liu, J. (2010) Effects of local and regional landscape characteristics on wildlife distribution across managed forests Forest Ecology and Management 259 1102–1110