SECOND SEMESTER, 2017
1. [5 marks]
Figure 1 shows the content of an HTML file, "2017-07-29.html". This is a simplified extract from the HOT 100 songs web site that was used for the web scraping lab.
Write five R expressions that use functions from the xml2 package (and XPath expressions) to perform the following steps:
• Read the HTML file into R.
• Extract the song title from the HTML (output shown below).
[1] "Despacito"
• Extract the artist name from the HTML (output shown below; note that white space has been removed).
[1] "Luis Fonsi & Daddy Yankee Featuring Justin Bieber"
• Extract the song rank from the HTML (output shown below).
[1] "1"
• Extract the rank from the previous week from the HTML (output shown below; note that the result is more than one character value).
[1] "Last Week" "1"
<!doctype html>
<html class="" lang="">
<body>
<article class="chart-row chart-row--1" data-songtitle="Despacito">
  <div class="chart-row__primary">
    <div class="chart-row__history chart-row__history--steady"></div>
    <div class="chart-row main-display">
      <div class="chart-row rank">
        <span class="chart-row current-week">1</span>
        <span class="chart-row last-week">Last Week: 1</span>
      </div>
      <div class="chart-row container">
        <div class="chart-row title">
          <h2 class="chart-row song">Despacito</h2>
          <a class="chart-row artist" data-tracklabel="Artist Name">
            Luis Fonsi & Daddy Yankee Featuring Justin Bieber
          </a>
        </div>
      </div>
    </div>
  </div>
  <div id="chart-row-1-secondary" class="chart-row secondary">
    <div class="chart-row stats">
      <div class="chart-row last-week">
        <span class="chart-row label">Last Week</span>
        <span class="chart-row value">1</span>
      </div>
      <div class="chart-row__top-spot">
        <span class="chart-row label">Peak Position</span>
        <span class="chart-row value">1</span>
      </div>
      <div class="chart-row weeks-on-chart">
        <span class="chart-row label">Wks on Chart</span>
        <span class="chart-row value">26</span>
      </div>
    </div>
  </div>
</article>
</body>
</html>
Figure 1: The HTML file "2017-07-29.html".
2. [5 marks]
Write a paragraph explaining the purpose of the flatten() function from the jsonlite package. You should provide at least one example of its use.
3. [10 marks]
Explain what each of the following shell commands is doing and, where there is output, what the output means. These commands were all run on one of the virtual machines that were used in the course.
pmur002@stats769prd01:~/$ mkdir exam
pmur002@stats769prd01:~/$ cd exam
pmur002@stats769prd01:~/exam$ ls -1 /course/AT/BUSDATA/ | wc -l
98973
pmur002@stats769prd01:~/exam$ ls -l /course/AT/BUSDATA/ | awk '{ print($5) }' > sizes.txt
pmur002@stats769prd01:~/exam$ head sizes.txt
343
345
345
345
436
437
438
531
438
pmur002@stats769prd01:~/exam$ grep --no-filename ',6215,' \
> /course/AT/BUSDATA/trip_updates_20170401*.csv > bus-6215-2017-04-01.csv
pmur002@stats769prd01:~/exam$ head bus-6215-2017-04-01.csv
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,252,6,7168,1490975922
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,293,9,8516,1490976233
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,299,NA,9,8516,1490976239
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,319,10,8524,1490976349
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,403,11,8532,1490976523
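The command patterns in this question can be tried on a small scratch directory. The following is a minimal sketch only: the directories, file names and file contents are hypothetical stand-ins created with mktemp, not the course data, and the output files are written to a separate directory so that they do not appear in the listing.

```shell
# Hypothetical scratch directories standing in for the course data
data_dir=$(mktemp -d)
out_dir=$(mktemp -d)
printf 'abc' > "$data_dir/a.csv"      # 3 bytes
printf 'abcde' > "$data_dir/b.csv"    # 5 bytes
printf 'abcdefg' > "$data_dir/c.csv"  # 7 bytes

# List one name per line and count the lines
# (tr strips the padding that some wc implementations add)
count=$(ls -1 "$data_dir" | wc -l | tr -d ' ')
echo "$count"

# Long listing, keeping only the fifth whitespace-separated field
ls -l "$data_dir" | awk '{ print($5) }' > "$out_dir/sizes.txt"
cat "$out_dir/sizes.txt"

# Keep only the lines that contain a fixed pattern
printf 'x,6215,1\ny,7000,2\n' > "$out_dir/trips.csv"
matches=$(grep ',6215,' "$out_dir/trips.csv")
echo "$matches"
```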
4. [10 marks]
The following R code was run on one of the virtual machines that were used in the course to investigate how much memory would be required to load a large CSV file with several million rows into R. The plot that this code produces is also shown.
Explain what the code is doing and discuss whether this will lead to a good estimate of the memory required to read the complete CSV into R. Is there another way to estimate the memory required (without reading the entire CSV file into R)?
numLines <- 10^(1:5)
samples <- lapply(numLines,
                  function(i) {
                      read.csv("/course/AT/alldata.csv",
                               nrows=i, stringsAsFactors=FALSE)
                  })
plot(numLines, sapply(samples, object.size), log="xy",
     xlab="number of lines", ylab="data frame size")
5. [10 marks]
The following R code was run on one of the virtual machines that were used in the course to measure how much time is required to read different subsets of a large CSV file into R.
sapply(numLines,
       function(i) {
           system.time(read.csv("/course/AT/alldata.csv",
                                nrows=i, stringsAsFactors=FALSE))[1]
       })
The result of running this code is shown below.
user.self user.self user.self user.self user.self
    0.001     0.001     0.005     0.036     0.417
The following code was run to perform profiling. The profiling result is shown below the code.
library(profvis)
p <- profvis({
    lapply(numLines,
           function(i) {
               read.csv("/course/AT/alldata.csv",
                        nrows=i, stringsAsFactors=FALSE)
           })
})
htmlwidgets::saveWidget(p, "profile.html")
Explain what the timing and profiling results mean. Suggest how you could make the code run faster.
6. [10 marks]
Write R code to perform a parallel version of the lapply() call from Question 4. Discuss the advantages and disadvantages of using the mclapply() (forking) approach compared to the makeCluster() (socket) approach for this task. Also discuss whether load balancing would make sense for this task.