Statistics (R) Assignment Help: Simple Web Page Information Extraction

Extract information from a given HTML file using tools such as regular expressions, and store the result as a data frame.

Goals: regular expressions, character functions in R, and web scraping.
In this assignment, we’re going to scrape the 2017-2018 Brooklyn Nets Regular Season Schedule (they’re a basketball team from Brooklyn that plays in the NBA). We will take the regular season schedule from http://www.espn.com/nba/ and reassemble the game listings in an R data frame for computational use.
To do this, perform the following tasks:
i. Use the readLines() command we studied in class to load the NetsSchedule.html file into a character vector in R. Call the vector nets1718.
a. How many lines are in the NetsSchedule.html file?
b. What is the total number of characters in the file?
c. What is the maximum number of characters in a single line of the file?
ii. Open NetsSchedule.html as a webpage. This should happen if you simply click on the file. You should see a table listing all the games scheduled for the 2017-2018 NBA season. There are a total of 82 regular season games scheduled. Who and when are they playing first? Who and when are they playing last?
iii. Now, open NetsSchedule.html using a text editor. To do this you may need to right-click on the file and tell your computer to use a text editor to open the file. What line in the file holds information about the first game of the regular season (date, time, opponent)? What line provides the date, time, and opponent for the final game? It may be helpful to use CTRL-F or COMMAND-F here and also work between the file in R and in the text editor.
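For task (i), the three counts can be sketched as below. The temporary file and its contents are hypothetical stand-ins so that the snippet runs on its own; with the real assignment file you would pass the path to NetsSchedule.html to readLines() instead. The names n_lines, n_chars, and max_chars are illustrative, not part of the assignment.

```r
# Hypothetical stand-in for NetsSchedule.html so the sketch runs anywhere.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>", "<p>Wed, Oct 18</p>", "</body>", "</html>"), tmp)

# Each element of the character vector is one line of the file.
nets1718 <- readLines(tmp, warn = FALSE)

n_lines   <- length(nets1718)      # (a) number of lines
n_chars   <- sum(nchar(nets1718))  # (b) total characters, line terminators excluded
max_chars <- max(nchar(nets1718))  # (c) length of the longest line

print(c(lines = n_lines, chars = n_chars, longest = max_chars))
```

Note that nchar() counts characters within each line only, so the total in (b) does not include newline characters.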
Using NetsSchedule.html we’d like to extract the following variables: the date, the game time (ET), the opponent, and whether the game is home or away. Looking at the file in the text editor, locate each of these variables. For the next part of the homework we use regular expressions to extract this information.
iv. Write a regular expression that will capture the date of the game. Then using the grep() function find the lines in the file that correspond to the games. Make sure that grep() finds 82 lines, and the first and last locations grep() finds match the first and last games you found in (ii).
v. Using the expression you wrote in (iv) along with the functions regexpr() and regmatches(), extract the dates from the text file. Store this information in a vector called date for use below. HINT: We did something like this in class.
vi. Use the same strategy as in (iv) and (v) to create a time vector that stores the time of the game.
vii. We would now like to gather information about whether the game is home or away. This information is indicated in the schedule by either an ‘@’ or a ‘vs’ in front of the opponent. If the Nets are playing ‘@’ their opponent’s court, the game is away. If the Nets are playing ‘vs’ the opponent, the game is at home.
Capture this information using a regular expression. You may want to use the HTML code around these values to guide your search. Then extract this information and use it to create a vector called home which takes the value 1 if the game is played at home or 0 if it is away.
HINT: In my solution, I use the fact that in each line, the string <li class="game-status"> appears before this information. So my regular expression searches for that string followed by ‘@’ or that string followed by ‘vs’. After I’ve extracted these strings, I use gsub() to finally extract just the ‘@’ or the ‘vs’.
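The approach in the hint might look roughly like the sketch below. The two lines in toy are hypothetical and only imitate the markup the hint describes; the real file may differ in spacing or attributes, so the pattern is an assumption to check against the text editor view. The names toy, status_regex, status_raw, and status are illustrative.

```r
# Hypothetical lines imitating the markup described in the hint.
toy <- c(
  '<td><ul><li class="game-status">@</li><li class="team-name">Team A</li></ul></td>',
  '<td><ul><li class="game-status">vs</li><li class="team-name">Team B</li></ul></td>'
)

# Match the marker string followed by either "@" or "vs".
status_regex <- '<li class="game-status">(@|vs)'
status_raw   <- regmatches(toy, regexpr(status_regex, toy))

# Strip the marker string so only "@" or "vs" remains.
status <- gsub('<li class="game-status">', "", status_raw, fixed = TRUE)

# 1 for a home game ("vs"), 0 for an away game ("@").
home <- as.integer(status == "vs")
print(home)
```

Because regmatches() returns only the matches it finds, this should be applied to the 82 game lines rather than the whole file, so that home lines up with the other vectors.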
viii. Finally, we would like to find the opponent. Again, capture this information using a regular expression. Extract these values and save them to a vector called opponent. To write your regular expression, you may again want to use the HTML code around the names to guide your search.
ix. Construct a data frame of the four variables in the following order: date, time, opponent, home. Print rows 1 to 10 of the data frame. Does the data match the first 10 games as seen from the web browser?
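Putting tasks (iv) through (ix) together, a minimal end-to-end sketch could look like this. Everything about the input is an assumption: the toy lines, the date format ("Wed, Oct 18"), the time format ("7:00 PM"), and the team-name markup are hypothetical and must be checked against the real file. The names toy, games, date_regex, time_regex, opp_raw, and schedule are illustrative.

```r
# Hypothetical lines; the real NetsSchedule.html will look different.
toy <- c(
  '<tr><td>Wed, Oct 18</td><td><ul><li class="game-status">@</li><li class="team-name"><a href="#">Team A</a></li></ul></td><td>7:00 PM</td></tr>',
  '<tr><td>Fri, Oct 20</td><td><ul><li class="game-status">vs</li><li class="team-name"><a href="#">Team B</a></li></ul></td><td>7:30 PM</td></tr>',
  '<p>Not a game row</p>'
)

# (iv) Find the lines holding a date such as "Wed, Oct 18" (assumed format).
date_regex <- "[A-Z][a-z]{2}, [A-Z][a-z]{2} [0-9]{1,2}"
game_lines <- grep(date_regex, toy)
games      <- toy[game_lines]

# (v) Extract the matched dates from those lines.
date <- regmatches(games, regexpr(date_regex, games))

# (vi) Same strategy for times such as "7:00 PM" (assumed format).
time_regex <- "[0-9]{1,2}:[0-9]{2} [AP]M"
time <- regmatches(games, regexpr(time_regex, games))

# (vii) Home is 1 when the status marker is followed by "vs", else 0.
home <- as.integer(grepl('game-status">vs', games, fixed = TRUE))

# (viii) Grab the link text after the assumed team-name marker,
# then drop everything up to the last ">".
opp_raw  <- regmatches(games, regexpr('team-name"><a href="[^"]*">[^<]+', games))
opponent <- gsub(".*>", "", opp_raw)

# (ix) Assemble the data frame in the requested column order.
schedule <- data.frame(date, time, opponent, home, stringsAsFactors = FALSE)
print(head(schedule, 10))
```

With the real file, length(game_lines) should come out to 82. If it does not, the date pattern is matching too much or too little and needs adjusting.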
