Now it's time to practice what we have learned in class and learn even more! Note that from now on your homework should be written in R Markdown. Package your HTML file, your Rmd file, and any other relevant files in a tarball, then turn it in by uploading it to Canvas.
Data Structures (20 points)
1. (10 points) Learn more about the scan, readLines, and read_html functions and the readr and readxl packages for getting data into R. Use them to read data from files into a tibble. You can use your own data example or a dataset used in class. How are these different from the functions we learned in class? Report what you find and give some examples!
2. (10 points) Learn about the S3, S4, and R6 classes in R. When do you think you would use these? Describe what you learned and give some examples!
R Markdown (30 points; +20 points extra credit)
1. (20 points EXTRA CREDIT) Complete the markdown tutorial at https://www.markdowntutorial.com/. Confirm here that you completed it (on your honor!).
2. (30 points) R Markdown is a powerful tool for literate programming. To gain more practice, recreate the .Rmd file for the example file in the homework folder: “Rmarkdown_example.html.”
Tidyverse (50 points; 2 points each)
These exercises will give you some introductory experience with the tidyverse. Please complete the following:
1. Examine the built-in dataset co2. Which of the following is true:
a. co2 is tidy data: it has one year for each row.
b. co2 is not tidy: we need at least one column with a character vector.
c. co2 is not tidy: it is a matrix instead of a data frame.
d. co2 is not tidy: to be tidy we would have to wrangle it to have three columns (year, month and value), then each co2 observation would have a row.
2. Examine the built-in dataset ChickWeight. Which of the following is true:
a. ChickWeight is not tidy: each chick has more than one row.
b. ChickWeight is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
c. ChickWeight is not tidy: we are missing the year column.
d. ChickWeight is tidy: it is stored in a data frame.
3. Examine the built-in dataset BOD. Which of the following is true:
a. BOD is not tidy: it only has six rows.
b. BOD is not tidy: the first column is just an index.
c. BOD is tidy: each row is an observation with two values (time and demand).
d. BOD is tidy: all small datasets are tidy by definition.
4. Which of the following built-in datasets are tidy (you can pick more than one):
a. BJsales
b. EuStockMarkets
c. DNase
d. Formaldehyde
e. Orange
f. UCBAdmissions
5. Load the dplyr package and the murders dataset.
library(dplyr)
library(dslabs)
data(murders)
You can add columns using the dplyr function mutate. This function is aware of the column names and inside the function you can call them unquoted:
murders <- mutate(murders, population_in_millions = population / 10^6)
We can write population rather than murders$population. The function mutate knows we are grabbing columns from murders.
Use the function mutate to add a column named rate to murders containing the murder rate per 100,000 people. As in the example code above, make sure you redefine murders (murders <- [your code]) so we can keep using this variable.
6. If rank(x) gives you the ranks of x from lowest to highest, rank(-x) gives you the ranks from highest to lowest. Use the function mutate to add a column rank containing the rank, from highest to lowest murder rate. Make sure you redefine murders so we can keep using this variable.
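To see how negating the input reverses the ranking, here is a minimal sketch on a made-up vector x. The values are illustrative assumptions and are not taken from murders.

```r
# Hypothetical toy vector; not part of the murders data
x <- c(31, 4, 15, 92)

# rank(x): the smallest value gets rank 1
low_to_high <- rank(x)

# rank(-x): negating flips the order, so the largest value gets rank 1
high_to_low <- rank(-x)

print(low_to_high)
print(high_to_low)
```

The two results are mirror images of each other: an element ranked first in one is ranked last in the other.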
7. With dplyr, we can use select to show only certain columns. For example, with this code we would only show the states and population sizes:
select(murders, state, population) %>% head()
Use select to show the state names and abbreviations in murders. Do not redefine murders, just show the results.
8. The dplyr function filter is used to choose specific rows of the data frame to keep. Unlike select which is for columns, filter is for rows. For example, you can show just the New York row like this:
filter(murders, state == "New York")
You can use other logical vectors to filter rows.
Use filter to show the top 5 states with the highest murder rates. Now that murders has the rate and rank columns, do not change the murders dataset; just show the result. Remember that you can filter based on the rank column.
9. We can remove rows using the != operator. For example, to remove Florida, we would do this:
no_florida <- filter(murders, state != "Florida")
Create a new data frame called no_south that removes states from the South region. How many states are in this category? You can use the function nrow for this.
10. We can also use %in% to filter with dplyr. You can therefore see the data from New York and Texas like this:
filter(murders, state %in% c("New York", "Texas"))
Create a new data frame called murders_nw with only the states from the Northeast and the West. How many states are in this category?
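The condition passed to filter is an ordinary logical vector, which can be inspected on its own in base R. The following is a minimal sketch using a made-up character vector; the name fruit and its values are illustrative assumptions, not columns of murders.

```r
# Hypothetical toy vector of names; not part of the murders data
fruit <- c("apple", "kiwi", "pear", "plum")

# %in% returns TRUE for each element that appears in the right-hand set
keep <- fruit %in% c("kiwi", "plum")
print(keep)

# The same logical vector can subset the original vector directly
print(fruit[keep])
```

Inside filter, a condition of this kind is evaluated against a column and the rows where it is TRUE are kept.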
11. Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter. Here is an example in which we filter to keep only small states in the Northeast region.
filter(murders, population < 5000000 & region == "Northeast")
Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains the rows for states satisfying both conditions: the state is in the Northeast or West, and its murder rate is less than 1. Use select to show only the state name, the rate, and the rank.
12. The pipe %>% can be used to perform operations sequentially without having to define intermediate objects. Start by redefining murders to include rate and rank.
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
In the solution to the previous exercise, we did the following:
my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
select(my_states, state, rate, rank)
The pipe %>% permits us to perform both operations sequentially without having to define an intermediate variable my_states. We therefore could have mutated and selected in the same line like this:
mutate(murders, rate = total / population * 100000,
       rank = rank(-rate)) %>% select(state, rate, rank)
Notice that select no longer has a data frame as the first argument. The first argument is assumed to be the result of the operation conducted right before the %>%.
Repeat the previous exercise, but now instead of creating a new object, show the result and only include the state, rate, and rank columns. Use a pipe %>% to do this in just one line.
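As a rough illustration of how the pipe passes its left-hand result along as the first argument, here is a small sketch on a made-up data frame. The object toy and its columns id, hits, and tries are hypothetical and chosen only for this example; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data frame; not the murders data
toy <- data.frame(id = c("a", "b", "c"),
                  hits = c(2, 9, 4),
                  tries = c(10, 10, 8))

# Nested form: an intermediate object holds the mutated table
toy_with_share <- mutate(toy, share = hits / tries)
nested <- select(toy_with_share, id, share)

# Piped form: the result of mutate becomes the first argument of select
piped <- mutate(toy, share = hits / tries) %>% select(id, share)

# Both approaches should produce the same table
print(all.equal(nested, piped))
```

The piped form avoids naming toy_with_share at all, which is the main convenience when several steps are chained.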
13. Reset murders to the original table by using data(murders). Use a pipe to create a new data frame called my_states that considers only states in the Northeast or West which have a murder rate lower than 1, and contains only the state, rate and rank columns. The pipe should also have four components separated by three %>%. The code should look something like this:
my_states <- murders %>%
mutate SOMETHING %>%
filter SOMETHING %>%
select SOMETHING
For exercises 14-20, we will be using data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year, and they complete the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this: