
SECOND SEMESTER, 2017

1. [5 marks]

Figure 1 shows the content of an HTML file, "2017-07-29.html". This is a simplified extract from the HOT 100 songs web site that was used for the web scraping lab.

Write five R expressions that use functions from the xml2 package (and XPath expressions) to perform the following steps:

• Read the HTML file into R.

• Extract the song title from the HTML (output shown below).

[1] "Despacito"

• Extract the artist name from the HTML (output shown below; note that white space has been removed).

[1] "Luis Fonsi & Daddy Yankee Featuring Justin Bieber"

• Extract the song rank from the HTML (output shown below).

[1] "1"

• Extract the rank from the previous week from the HTML (output shown below; note that the result is more than one character value).

[1] "Last Week" "1"

<!doctype html>

<h2 class="chart-row song">Despacito</h2>

<a class="chart-row artist" data-tracklabel="Artist Name"> Luis Fonsi & Daddy Yankee Featuring Justin Bieber

</a> </div>

</div>

</div> </div>

<span class="chart-row label">Peak Position</span>

<span class="chart-row label">Wks on Chart</span>

</div>

</article> </body>

</html>

Figure 1: The HTML file "2017-07-29.html".
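As an illustration of the kind of xml2 calls involved, here is a minimal sketch. It assumes the xml2 package is installed and works on a hypothetical, cut-down HTML string rather than the real file; the class names are copied from Figure 1 as printed and may differ in the original page, and the artist name is shortened for the example.

```r
library(xml2)

# Hypothetical inline snippet standing in for the real file (assumption).
html <- paste0(
  '<html><body>',
  '<h2 class="chart-row song">Despacito</h2>',
  '<a class="chart-row artist"> Luis Fonsi </a>',
  '</body></html>')

# read_html() accepts a path, a URL, or literal markup.
doc <- read_html(html)

# Select a node by its full class attribute, then take its text content.
title <- xml_text(xml_find_first(doc, "//h2[@class='chart-row song']"))

# trim = TRUE removes the leading and trailing white space.
artist <- xml_text(xml_find_first(doc, "//a[@class='chart-row artist']"),
                   trim = TRUE)
```

For the real file, the argument to read_html() would be the file name, and xml_find_all() could be used where more than one node is expected to match.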

2. [5 marks]

Write a paragraph explaining the purpose of the flatten() function from the jsonlite package. You should provide at least one example of its use.
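A small sketch of the behaviour in question, assuming the jsonlite package is installed. The JSON string and its field names (name, stats, rank, weeks) are invented for illustration and are not taken from the course data.

```r
library(jsonlite)

# Hypothetical JSON with one nested object per record (assumption).
json <- '[{"name":"A","stats":{"rank":1,"weeks":10}},
          {"name":"B","stats":{"rank":2,"weeks":3}}]'

# Without flattening, "stats" becomes a data frame stored inside a column.
nested <- fromJSON(json, flatten = FALSE)

# flatten() promotes the nested columns to ordinary top-level columns,
# naming them with the parent column name as a prefix.
flat <- flatten(nested)
names(flat)
```

Here nested has two columns (one of which is itself a data frame), while flat has three ordinary columns.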

3. [10 marks]

Explain what each of the following shell commands is doing and, where there is output, what the output means. These commands were all run on one of the virtual machines that were used in the course.

pmur002@stats769prd01:~/$ mkdir exam
pmur002@stats769prd01:~/$ cd exam

pmur002@stats769prd01:~/exam$ ls -1 /course/AT/BUSDATA/ | wc -l
98973

pmur002@stats769prd01:~/exam$ ls -l /course/AT/BUSDATA/ | awk '{ print($5) }' > sizes.txt
pmur002@stats769prd01:~/exam$ head sizes.txt

343

345

436

437

438

531

438

pmur002@stats769prd01:~/exam$ grep --no-filename ',6215,' \
>     /course/AT/BUSDATA/trip_updates_20170401*.csv > bus-6215-2017-04-01.csv

pmur002@stats769prd01:~/exam$ head bus-6215-2017-04-01.csv
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,252,6,7168,1490975922
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,293,9,8516,1490976233
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,299,NA,9,8516,1490976239
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,319,10,8524,1490976349
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,403,11,8532,1490976523

4. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to investigate how much memory would be required to load a large CSV file with several million rows into R. The plot that this code produces is also shown.

Explain what the code is doing and discuss whether this will lead to a good estimate of the memory required to read the complete CSV into R. Is there another way to estimate the memory required (without reading the entire CSV file into R)?

numLines <- 10^(1:5)
samples <- lapply(numLines,
                  function(i) {
                      read.csv("/course/AT/alldata.csv",
                               nrows=i, stringsAsFactors=FALSE)
                  })
plot(numLines, sapply(samples, object.size), log="xy",
     xlab="number of lines", ylab="data frame size")
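One way to turn size measurements like these into a numeric estimate is to fit a straight line and extrapolate. The sketch below is only illustrative: it uses a synthetic two-column data frame built by a hypothetical helper, make_rows(), instead of reading the course CSV, and the target of 5 million rows is an assumed figure. Purely numeric columns grow almost exactly linearly; character columns, as produced by stringsAsFactors=FALSE, may not, so the same extrapolation can be less reliable for real data.

```r
# Hypothetical stand-in for reading the first n rows of a CSV (assumption):
# one integer column and one double column.
make_rows <- function(n) {
  data.frame(id = integer(n), value = numeric(n))
}

n <- 10^(1:5)

# Size in bytes of the data frame at each sample size.
sizes <- sapply(n, function(i) as.numeric(object.size(make_rows(i))))

# Fit size = intercept + slope * rows; the slope is the cost per row.
fit <- lm(sizes ~ n)
bytes_per_row <- unname(coef(fit)[2])  # about 12: 4 (integer) + 8 (double)

# Extrapolate to an assumed total of 5 million rows.
estimate <- predict(fit, data.frame(n = 5e6))
```

An alternative that avoids reading any rows is to multiply the number of rows (for example, from wc -l) by the known storage size of each column type.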

5. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to measure how much time is required to read different subsets of a large CSV file into R.

sapply(numLines,
       function(i) {
           system.time(read.csv("/course/AT/alldata.csv",
                                nrows=i, stringsAsFactors=FALSE))[1]
       })

The result of running this code is shown below.

user.self user.self user.self user.self user.self
    0.001     0.001     0.005     0.036     0.417

The following code was run to perform profiling. The profiling result is shown below the code.

library(profvis)
p <- profvis({
    lapply(numLines,
           function(i) {
               read.csv("/course/AT/alldata.csv",
                        nrows=i, stringsAsFactors=FALSE)
           })
})

htmlwidgets::saveWidget(p, "profile.html")

Explain what the timing and profiling results mean. Suggest how you could make the code run faster.
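One commonly suggested speed-up for read.csv() is to declare the column types up front so that R does not have to guess them. The sketch below illustrates this on a small temporary CSV file created for the purpose; the column names and types are assumptions and do not reflect the course data.

```r
# Build a small temporary CSV to stand in for the large file (assumption).
set.seed(1)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, delay = rnorm(1000)),
          tmp, row.names = FALSE)

# Baseline: read.csv() has to work out each column's type itself.
slow <- read.csv(tmp, stringsAsFactors = FALSE)

# Declaring colClasses (and an upper bound on nrows) avoids type guessing;
# comment.char = "" switches off comment scanning.
fast <- read.csv(tmp, colClasses = c("integer", "numeric"),
                 nrows = 1000, comment.char = "")
```

Both calls return the same data frame; only the work done while parsing differs. Packages such as data.table also provide a faster reader, fread(), which is another option if an extra dependency is acceptable.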

6. [10 marks]

Write R code to perform a parallel version of the lapply() call from Question 4. Discuss the advantages and disadvantages of using the mclapply() (forking) approach compared to the makeCluster() (socket) approach for this task. Also discuss whether load balancing would make sense for this task.
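The two approaches named above can be sketched side by side with the parallel package. This is a minimal illustration, not a model answer: it reads a small temporary CSV instead of the course file, the helper readRows() and the choice of 2 workers are assumptions, and mclapply() with mc.cores greater than 1 relies on forking, which is not available on Windows.

```r
library(parallel)

# Small temporary CSV standing in for the large file (assumption).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000, y = 1000:1), tmp, row.names = FALSE)

numLines <- 10^(1:3)

# Hypothetical helper: read the first i rows of the file at 'path'.
readRows <- function(i, path) {
  read.csv(path, nrows = i, stringsAsFactors = FALSE)
}

# Forking: child processes share the parent's memory, so no setup is needed.
forked <- mclapply(numLines, readRows, path = tmp, mc.cores = 2)

# Sockets: fresh R sessions; the function and its arguments are shipped
# to the workers, and the cluster must be stopped afterwards.
cl <- makeCluster(2)
socket <- parLapply(cl, numLines, readRows, path = tmp)
stopCluster(cl)
```

Because the tasks differ greatly in size, a load-balanced variant such as clusterApplyLB(), or mclapply() with mc.preschedule = FALSE, hands out tasks one at a time instead of dividing them up in advance.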