STATS 769
Data Science Practice

SECOND SEMESTER, 2017

1. [5 marks]

Figure 1 shows the content of an HTML file, "2017-07-29.html". This is a simplified extract from the HOT 100 songs web site that was used for the web scraping lab.

Write five R expressions that use functions from the xml2 package (and XPath expressions) to perform the following steps:

• Read the HTML file into R.

• Extract the song title from the HTML (output shown below).

[1] "Despacito"

• Extract the artist name from the HTML (output shown below; note that white space has been removed).

[1] "Luis Fonsi & Daddy Yankee Featuring Justin Bieber"

• Extract the song rank from the HTML (output shown below).

[1] "1"

• Extract the rank from the previous week from the HTML (output shown below; note that the result is more than one character value).

[1] "Last Week" "1"

<!doctype html>
<html class="" lang="">
<body>
<article class="chart-row chart-row--1" data-songtitle="Despacito">
  <div class="chart-row__primary">
    <div class="chart-row__history chart-row__history--steady"></div>
    <div class="chart-row__main-display">
      <div class="chart-row__rank">
        <span class="chart-row__current-week">1</span>
        <span class="chart-row__last-week">Last Week: 1</span>
      </div>
      <div class="chart-row__container">
        <div class="chart-row__title">
          <h2 class="chart-row__song">Despacito</h2>
          <a class="chart-row__artist" data-tracklabel="Artist Name">
            Luis Fonsi & Daddy Yankee Featuring Justin Bieber
          </a>
        </div>
      </div>
    </div>
  </div>
  <div id="chart-row-1-secondary" class="chart-row__secondary">
    <div class="chart-row__stats">
      <div class="chart-row__last-week">
        <span class="chart-row__label">Last Week</span>
        <span class="chart-row__value">1</span>
      </div>
      <div class="chart-row__top-spot">
        <span class="chart-row__label">Peak Position</span>
        <span class="chart-row__value">1</span>
      </div>
      <div class="chart-row__weeks-on-chart">
        <span class="chart-row__label">Wks on Chart</span>
        <span class="chart-row__value">26</span>
      </div>
    </div>
  </div>
</article>
</body>
</html>

Figure 1: The HTML file "2017-07-29.html".

2. [5 marks]

Write a paragraph explaining the purpose of the flatten() function from the jsonlite package. You should provide at least one example of its use.

3. [10 marks]

Explain what each of the following shell commands is doing and, where there is output, what the output means.  These commands were all run on one of the virtual machines that were used in the course.

pmur002@stats769prd01:~/$ mkdir exam
pmur002@stats769prd01:~/$ cd exam
pmur002@stats769prd01:~/exam$ ls -1 /course/AT/BUSDATA/ | wc -l
98973
pmur002@stats769prd01:~/exam$ ls -l /course/AT/BUSDATA/ | awk '{ print($5) }' > sizes.txt
pmur002@stats769prd01:~/exam$ head sizes.txt

343
345
345
345
436
437
438
531
438
pmur002@stats769prd01:~/exam$ grep --no-filename ',6215,' \
> /course/AT/BUSDATA/trip_updates_20170401*.csv > bus-6215-2017-04-01.csv
pmur002@stats769prd01:~/exam$ head bus-6215-2017-04-01.csv
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,252,6,7168,1490975922
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,275,NA,7,8502,1490976035
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,293,9,8516,1490976233
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,299,NA,9,8516,1490976239
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,319,10,8524,1490976349
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,388,NA,10,8524,1490976418
8300033770-20170322104732_v52.21,30002-20170322104732_v52.21,6215,NA,403,11,8532,1490976523

4. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to investigate how much memory would be required to load a large CSV file with several million rows into R. The plot that this code produces is also shown.

Explain what the code is doing and discuss whether this will lead to a good estimate of the memory required to read the complete CSV into R. Is there another way to estimate the memory required (without reading the entire CSV file into R)?

numLines <- 10^(1:5)
samples <- lapply(numLines,
                  function(i) {
                      read.csv("/course/AT/alldata.csv",
                               nrows=i, stringsAsFactors=FALSE)
                  })
plot(numLines, sapply(samples, object.size), log="xy",
     xlab="number of lines", ylab="data frame size")

5. [10 marks]

The following R code was run on one of the virtual machines that were used in the course to measure how much time is required to read different subsets of a large CSV file into R.

sapply(numLines,
       function(i) {
           system.time(read.csv("/course/AT/alldata.csv",
                                nrows=i, stringsAsFactors=FALSE))[1]
       })

The result of running this code is shown below.

user.self user.self user.self user.self user.self
    0.001     0.001     0.005     0.036     0.417

The following code was run to perform profiling.  The profiling result is shown below the code.

library(profvis)
p <- profvis({
    lapply(numLines,
           function(i) {
               read.csv("/course/AT/alldata.csv",
                        nrows=i, stringsAsFactors=FALSE)
           })
})
htmlwidgets::saveWidget(p, "profile.html")

Explain what the timing and profiling results mean.  Suggest how you could make the code run faster.

6. [10 marks]

Write R code to perform a parallel version of the lapply() call from Question 4. Discuss the advantages and disadvantages of using the mclapply() (forking) approach compared to the makeCluster() (socket) approach for this task. Also discuss whether load balancing would make sense for this task.
