Webscraping with R

2020-11-09

BAG-OFSP-FOPH

library(rvest)

url <- "https://www.bag.admin.ch/bag/en/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/situation-schweiz-und-international.html"

read_html(url) %>%
  html_table() %>%
  .[[1]]
##               6.11.2020, 8am   New* Total since the start of the epidemic
## 1 Laboratory-confirmed cases  9,409                               211,913
## 2           Hospitalisations    231                                 8,669
## 3                     Deaths     70                                 2,407
## 4             COVID-19 tests 38,219                             2,157,721
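
The counts above are printed with thousands separators, so they come back as text rather than numbers. A minimal sketch of one way to convert them, assuming the comma is only ever a thousands separator; the `counts` vector is copied by hand from the output above instead of being scraped again.

```r
library(stringr)

# Values copied from the "New*" column above (assumption: they are character strings)
counts <- c("9,409", "231", "70", "38,219")

# Drop the thousands separators, then convert to numeric
counts_num <- as.numeric(str_remove_all(counts, ","))
counts_num
```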

http://roygava.com/aboutme

Slides: http://roygava.com/webscraping-lunch

Support Slides: http://roygava.com/webscraping-lunch/webscraping-lunch.R

library(rvest) # scraping
library(tibble) # data storage
library(dplyr) # data manipulation
library(stringr) # string manipulation
library(lubridate) # date manipulation
library(RSelenium) # browser automation

Webscraping with R

“Good scraping is always the last resort”

Alexandre Dumas

“Good scraping is always the last resort”

  • APIs
  • Download button
  • Contacting the data provider

APIs

  • Application Programming Interface
  • Gold standard for getting data from the web
  • Data requests from R
    • library(httr)
    • API wrappers
  • Dependence on data provider
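
As a rough illustration of a data request from R with `httr`, the sketch below builds a URL and wraps the request in a function. The endpoint path is written from memory of the Wikimedia pageviews REST API and should be checked against its documentation; `top_articles_url()`, `fetch_top_articles()` and the user-agent string are hypothetical names chosen for this example.

```r
library(httr)

# Hypothetical helper: build the URL for a "top articles" request
# (assumed endpoint layout: .../top/{project}/{access}/{year}/{month}/{day})
top_articles_url <- function(project, date) {
  sprintf(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/%s/all-access/%s",
    project, format(date, "%Y/%m/%d")
  )
}

# Hypothetical helper: send the request and return the parsed response body
fetch_top_articles <- function(project, date) {
  resp <- GET(top_articles_url(project, date),
              user_agent("webscraping-lunch example"))
  stop_for_status(resp)          # turn HTTP errors into R errors
  content(resp, as = "parsed")   # JSON body parsed into nested lists
}

# Building the URL needs no network access
top_articles_url("en.wikipedia", as.Date("2020-11-08"))

# The request itself needs network access:
# top <- fetch_top_articles("en.wikipedia", Sys.Date() - 1)
```

An API wrapper hides exactly these steps: URL construction, the request, error handling and parsing.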

APIs

  • R Wrappers: R interface to API
  • Hide complexity of APIs and return R objects
  • Find them through Google (e.g., CRAN [website], [website] R API)


Provider          Registration  Wrapper
Financial Times   Y             Y
newsapi           Y             Y
Swiss Parliament  N             Y
Twitter           Y             Y
Genius            Y             Y

APIs

library(pageviews)

top_articles("en.wikipedia",
             start = (Sys.Date()-1)) %>%
  select(article, views) %>%
  top_n(10, views)
##                                     article   views
## 1                                 Main_Page 7340538
## 2                             Kamala_Harris 3501236
## 3                                 Joe_Biden 3113384
## 4                            Special:Search 1271972
## 5  2020_United_States_presidential_election 1019999
## 6                              Donald_Trump  864800
## 7  2016_United_States_presidential_election  804620
## 8                                Jill_Biden  780011
## 9                            Douglas_Emhoff  618461
## 10                               Beau_Biden  610944

HTTP Request/Response Cycle

https://youtu.be/keo0dglCj7I

Developer tools


Firefox


Chrome

BAG-OFSP-FOPH

HTML

<!DOCTYPE html>
<html>    
  <body>
    <h1>Webscraping with R</h1>
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ol>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ol>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
  </body>
</html>

Let’s try this!

HTML

tag         meaning
p           Paragraph
h1          Top-level heading
h2, h3, …   Lower-level headings
ol          Ordered list
ul          Unordered list
li          List item
img         Image
a           Anchor (hyperlink)
div         Section wrapper (block-level)
span        Text wrapper (inline)

HTML elements

HTML

Let’s try this!

read_html(html_page) %>%
  xml2::xml_structure()
## <html>
##   <body>
##     {text}
##     <h1>
##       {text}
##     {text}
##     <p>
##       {text}
##       <a [href]>
##         {text}
##       {text}
##       <em>
##         {text}
##       {text}
##     {text}
##     <h2>
##       {text}
##     {text}
##     <ol>
##       <li>
##         {text}
##         <em>
##           {text}
##       {text}
##       <li>
##         {text}
##         <em>
##           {text}
##       {text}
##     <h2>
##       {text}
##     {text}
##     <ul>
##       <a [href]>
##         <li>
##           {text}
##       {text}
##     <p>
##       <strong>
##         {text}
##         <em>
##           {text}
##         {text}
##         <em>
##           {text}
##     {text}

CSS

<!DOCTYPE html>
<html>
<head>
<style>
body {
  background-color: lightblue;
}

h1 {
  color: white;
  text-align: center;
}

.content {
font-family: monospace;
font-size: 1.5em;
color: black;
}

#intro {
  background-color: lightgrey;
  border-style: solid;
  border-width: 5px;
  padding: 5px;
  margin: 5px;
  text-align: center;
}

</style>
</head>
  <body>
    <h1>Webscraping with R</h1>
    <div id="intro">
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
    </div>
    <div class="content">
        <h2>Technologies</h2>
        <ol>
        <li>HTML: <em>Hypertext Markup Language</em></li>
        <li>CSS: <em>Cascading Style Sheets</em></li>
        </ol>
    </div>
    <div class="content">
        <h2>Packages</h2>
        <ul>
        <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
        </ul>
    <p><strong>Note</strong>:
    <em>rvest</em> is included in the <em>tidyverse</em>.</p>
    </div>
  </body>
</html>

Let’s try this!

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("li")
## {xml_nodeset (3)}
## [1] <li>HTML: <em>Hypertext Markup Language</em>\n</li>
## [2] <li>CSS: <em>Cascading Style Sheets</em>\n</li>
## [3] <li>rvest</li>

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("li") %>%
  html_text()
## [1] "HTML: Hypertext Markup Language" "CSS: Cascading Style Sheets"    
## [3] "rvest"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("em") %>%
  html_text()
## [1] "Tidyverse"                 "Hypertext Markup Language"
## [3] "Cascading Style Sheets"    "rvest"                    
## [5] "tidyverse"

CSS selectors

selector        meaning
,               grouping
space           descendant
>               child
+               adjacent sibling
:first-child    first child of its parent
:nth-child(n)   nth child of its parent
:last-child     last child of its parent
.               class selector
#               id selector

CSS selectors
CSS Diner

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("li, em") %>%
  html_text()
## [1] "Tidyverse"                       "HTML: Hypertext Markup Language"
## [3] "Hypertext Markup Language"       "CSS: Cascading Style Sheets"    
## [5] "Cascading Style Sheets"          "rvest"                          
## [7] "rvest"                           "tidyverse"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("li em") %>%
  html_text()
## [1] "Hypertext Markup Language" "Cascading Style Sheets"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("em li") %>%
  html_text()
## character(0)

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("p em") %>%
  html_text()
## [1] "Tidyverse" "rvest"     "tidyverse"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("p > em") %>%
  html_text()
## [1] "Tidyverse"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("em + em") %>%
  html_text()
## [1] "tidyverse"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("li:first-child") %>%
  html_text()
## [1] "HTML: Hypertext Markup Language" "rvest"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("li:nth-child(2)") %>%
  html_text()
## [1] "CSS: Cascading Style Sheets"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("ol > li:last-child") %>%
  html_text()
## [1] "CSS: Cascading Style Sheets"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ol>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ol>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("a") %>%
  html_attr("href")
## [1] "www.r-project.org"                  "https://github.com/tidyverse/rvest"

...
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with <em>Tidyverse</em> is recommended.</p>
    <h2>Technologies</h2>
    <ul>
      <li>HTML: <em>Hypertext Markup Language</em></li>
      <li>CSS: <em>Cascading Style Sheets</em></li>
    </ul>
    <h2>Packages</h2>
    <ul>
      <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
    </ul>
    <p><strong>Note:
    <em>rvest</em> is included in the <em>tidyverse</em></strong></p>.
...
read_html(html_page) %>%
  html_nodes("ul a") %>%
  html_attr("href")
## [1] "https://github.com/tidyverse/rvest"

HTML tables

tag     meaning
table   Table section
tr      Table row
td      Table cell
th      Table header

...
     <table>
      <tr>
        <th>Country</th><th>Capital</th><th>Population</th>
      </tr>
      <tr>
        <td>UK</td><td>London</td><td>66.65</td>
      </tr>
      <tr>
        <td>Switzerland</td><td>Bern</td><td>8.57</td>
      </tr>
    </table>
...
Country Capital Population
UK London 66.65
Switzerland Bern 8.57

...
     <table>
      <tr>
        <th>Country</th><th>Capital</th><th>Population</th>
      </tr>
      <tr>
        <td>UK</td><td>London</td><td>66.65</td>
      </tr>
      <tr>
        <td>Switzerland</td><td>Bern</td><td>8.57</td>
      </tr>
    </table>
...
read_html(basic_table) %>%
  html_table()
## [[1]]
##       Country Capital Population
## 1          UK  London      66.65
## 2 Switzerland    Bern       8.57

...
    <div id="intro">
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
    </div>
    <div class="content">
        <h2>Technologies</h2>
        <ol>
        <li>HTML: <em>Hypertext Markup Language</em></li>
        <li>CSS: <em>Cascading Style Sheets</em></li>
        </ol>
    </div>
    <div class="content">
        <h2>Packages</h2>
        <ul>
        <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
        </ul>
    <p><strong>Note</strong>:
    <em>rvest</em> is included in the <em>tidyverse</em>.</p>
    </div>
...
read_html(html_page_css) %>%
  html_nodes(".content") %>%
  html_text()
## [1] "\n    \tTechnologies\n    \tHTML: Hypertext Markup Language\n      \tCSS: Cascading Style Sheets\n    \t"
## [2] "\n    \tPackages\n    \trvest\n    \tNote:\n    rvest is included in the tidyverse.\n    "
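
The text returned for a whole block keeps the line breaks and indentation of the HTML source. A small sketch of tidying it with `stringr::str_squish()`, using a hand-written fragment (not the full `html_page_css` object) so that it runs on its own:

```r
library(rvest)
library(stringr)

# Hand-written fragment modelled on the second .content block above
snippet <- '<div class="content">
    <h2>Packages</h2>
    <ul><li>rvest</li></ul>
</div>'

raw_text <- read_html(snippet) %>%
  html_nodes(".content") %>%
  html_text()

# str_squish() trims both ends and collapses inner runs of whitespace
clean_text <- str_squish(raw_text)
clean_text
```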

...
    <div id="intro">
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
    </div>
    <div class="content">
        <h2>Technologies</h2>
        <ol>
        <li>HTML: <em>Hypertext Markup Language</em></li>
        <li>CSS: <em>Cascading Style Sheets</em></li>
        </ol>
    </div>
    <div class="content">
        <h2>Packages</h2>
        <ul>
        <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
        </ul>
    <p><strong>Note</strong>:
    <em>rvest</em> is included in the <em>tidyverse</em>.</p>
    </div>
...
read_html(html_page_css) %>%
  html_nodes(".content a") %>%
  html_text()
## [1] "rvest"

...
    <div id="intro">
    <p> Basic experience with <a href="www.r-project.org">R</a> and
    familiarity with the <em>Tidyverse</em> is recommended.</p>
    </div>
    <div class="content">
        <h2>Technologies</h2>
        <ol>
        <li>HTML: <em>Hypertext Markup Language</em></li>
        <li>CSS: <em>Cascading Style Sheets</em></li>
        </ol>
    </div>
    <div class="content">
        <h2>Packages</h2>
        <ul>
        <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
        </ul>
    <p><strong>Note</strong>:
    <em>rvest</em> is included in the <em>tidyverse</em>.</p>
    </div>
...
read_html(html_page_css) %>%
  html_nodes("#intro") %>%
  html_text()
## [1] "\n     Basic experience with R and\n    familiarity with the Tidyverse is recommended.\n    "

library(rvest)

source: https://github.com/yusuzech/r-web-scraping-cheat-sheet/

wikipedia.org: Oscar winners

oscar_parsed <- read_html("https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films")

oscar_nominees <- oscar_parsed %>%
  html_table(fill = TRUE) %>%
  .[[1]] %>%
  as_tibble()
## # A tibble: 1,316 x 4
##    Film                                                 Year  Awards Nominations
##    <chr>                                                <chr> <chr>  <chr>      
##  1 Parasite                                             2019  4      6          
##  2 Ford v Ferrari                                       2019  2      4          
##  3 Learning to Skateboard in a Warzone (If You're a Gi… 2019  1      1          
##  4 The Neighbors' Window                                2019  1      2019       
##  5 Little Women                                         2019  2019   6          
##  6 Marriage Story                                       2019  2019   2019       
##  7 Jojo Rabbit                                          2019  2019   2019       
##  8 Toy Story 4                                          2019  2019   2          
##  9 Joker                                                2019  2      11         
## 10 Once Upon a Time in Hollywood                        2019  2019   10         
## # … with 1,306 more rows

wikipedia.org: Oscar winners

movie_title <- oscar_parsed %>%
  html_nodes("tr[style='background:#EEDD82'] > td:first-child") %>%
  html_text()
##  [1] "Parasite"                                     
##  [2] "Green Book"                                   
##  [3] "The Shape of Water"                           
##  [4] "Moonlight"                                    
##  [5] "Spotlight"                                    
##  [6] "Birdman"                                      
##  [7] "12 Years a Slave"                             
##  [8] "Argo"                                         
##  [9] "The Artist"                                   
## [10] "The King's Speech"                            
## [11] "The Hurt Locker"                              
## [12] "Slumdog Millionaire"                          
## [13] "No Country for Old Men"                       
## [14] "The Departed"                                 
## [15] "Crash"                                        
## [16] "Million Dollar Baby"                          
## [17] "The Lord of the Rings: The Return of the King"
## [18] "Chicago"                                      
## [19] "A Beautiful Mind"                             
## [20] "Gladiator"                                    
## [21] "American Beauty"                              
## [22] "Shakespeare in Love"                          
## [23] "Titanic"                                      
## [24] "The English Patient"                          
## [25] "Braveheart"                                   
## [26] "Forrest Gump"                                 
## [27] "Schindler's List"                             
## [28] "Unforgiven"                                   
## [29] "The Silence of the Lambs"                     
## [30] "Dances with Wolves"                           
## [31] "Driving Miss Daisy"                           
## [32] "Rain Man"                                     
## [33] "The Last Emperor"                             
## [34] "Platoon"                                      
## [35] "Out of Africa"                                
## [36] "Amadeus"                                      
## [37] "Terms of Endearment"                          
## [38] "Gandhi"                                       
## [39] "Chariots of Fire"                             
## [40] "Ordinary People"                              
## [41] "Kramer vs. Kramer"                            
## [42] "The Deer Hunter"                              
## [43] "Annie Hall"                                   
## [44] "Rocky"                                        
## [45] "One Flew over the Cuckoo's Nest"              
## [46] "The Godfather Part II"                        
## [47] "The Sting"                                    
## [48] "The Godfather"                                
## [49] "The French Connection"                        
## [50] "Patton"                                       
## [51] "Midnight Cowboy"                              
## [52] "Oliver!"                                      
## [53] "In the Heat of the Night"                     
## [54] "A Man for All Seasons"                        
## [55] "The Sound of Music"                           
## [56] "All About Eve"                                
## [57] "All Quiet on the Western Front"               
## [58] "All the King's Men"                           
## [59] "An American in Paris"                         
## [60] "The Apartment"                                
## [61] "Around the World in 80 Days"                  
## [62] "Ben-Hur"                                      
## [63] "The Best Years of Our Lives"                  
## [64] "The Bridge on the River Kwai"                 
## [65] "The Broadway Melody"                          
## [66] "Casablanca"                                   
## [67] "Cavalcade"                                    
## [68] "Cimarron"                                     
## [69] "From Here to Eternity"                        
## [70] "Gentleman's Agreement"                        
## [71] "Gigi"                                         
## [72] "Going My Way"                                 
## [73] "Gone with the Wind"                           
## [74] "Grand Hotel"                                  
## [75] "The Great Ziegfeld"                           
## [76] "The Greatest Show on Earth"                   
## [77] "Hamlet"                                       
## [78] "How Green Was My Valley"                      
## [79] "It Happened One Night"                        
## [80] "Lawrence of Arabia"                           
## [81] "The Life of Emile Zola"                       
## [82] "The Lost Weekend"                             
## [83] "Marty"                                        
## [84] "Mrs. Miniver"                                 
## [85] "Mutiny on the Bounty"                         
## [86] "My Fair Lady"                                 
## [87] "On the Waterfront"                            
## [88] "Rebecca"                                      
## [89] "Tom Jones"                                    
## [90] "West Side Story"                              
## [91] "Wings"                                        
## [92] "You Can't Take It with You"
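
The selector above uses an attribute selector, `tr[style='background:#EEDD82']`, which keeps only rows whose `style` attribute is exactly that string. A sketch on a hand-written table fragment (an assumption about how the highlighted rows look, not a copy of the Wikipedia markup):

```r
library(rvest)

# Hand-written fragment: two highlighted rows, one plain row
films <- '<table>
  <tr style="background:#EEDD82"><td>Parasite</td><td>2019</td></tr>
  <tr><td>Ford v Ferrari</td><td>2019</td></tr>
  <tr style="background:#EEDD82"><td>Green Book</td><td>2018</td></tr>
</table>'

# Rows with the matching style attribute, then the first cell of each row
highlighted <- read_html(films) %>%
  html_nodes("tr[style='background:#EEDD82'] > td:first-child") %>%
  html_text()
highlighted
```

Because the match is on the literal attribute value, a selector like this can stop working if the page changes its styling.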

wikipedia.org: Oscar winners

movie_year <- oscar_parsed %>%
  html_nodes("tr[style='background:#EEDD82'] > td:nth-child(2)") %>%
  html_text()
##  [1] "2019"    "2018"    "2017"    "2016"    "2015"    "2014"    "2013"   
##  [8] "2012"    "2011"    "2010"    "2009"    "2008"    "2007"    "2006"   
## [15] "2005"    "2004"    "2003"    "2002"    "2001"    "2000"    "1999"   
## [22] "1998"    "1997"    "1996"    "1995"    "1994"    "1993"    "1992"   
## [29] "1991"    "1990"    "1989"    "1988"    "1987"    "1986"    "1985"   
## [36] "1984"    "1983"    "1982"    "1981"    "1980"    "1979"    "1978"   
## [43] "1977"    "1976"    "1975"    "1974"    "1973"    "1972"    "1971"   
## [50] "1970"    "1969"    "1968"    "1967"    "1966"    "1965"    "1950"   
## [57] "1929/30" "1949"    "1951"    "1960"    "1956"    "1959"    "1946"   
## [64] "1957"    "1928/29" "1943"    "1932/33" "1930/31" "1953"    "1947"   
## [71] "1958"    "1944"    "1939"    "1931/32" "1936"    "1952"    "1948"   
## [78] "1941"    "1934"    "1962"    "1937"    "1945"    "1955"    "1942"   
## [85] "1935"    "1964"    "1954"    "1940"    "1963"    "1961"    "1927/28"
## [92] "1938"
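
Titles and years are scraped separately here, so the two vectors line up only if every highlighted row contains both cells. One possible safeguard, sketched on a hand-written fragment with a deliberately missing cell: select the rows first, then use `html_node()` (singular) within each row, which returns one result per row and `NA` where nothing matches.

```r
library(rvest)
library(tibble)

# Hand-written fragment; the second row deliberately lacks a year cell
films <- '<table>
  <tr style="background:#EEDD82"><td>Parasite</td><td>2019</td></tr>
  <tr style="background:#EEDD82"><td>Green Book</td></tr>
</table>'

rows <- read_html(films) %>%
  html_nodes("tr[style='background:#EEDD82']")

# html_node() yields exactly one result per row, so the columns stay aligned
winners <- tibble(
  title = rows %>% html_node("td:first-child") %>% html_text(),
  year  = rows %>% html_node("td:nth-child(2)") %>% html_text()
)
winners
```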

wikipedia.org: Oscar winners

oscar_winners <- tibble(title = movie_title,
                        year = movie_year)
## # A tibble: 92 x 2
##    title              year 
##    <chr>              <chr>
##  1 Parasite           2019 
##  2 Green Book         2018 
##  3 The Shape of Water 2017 
##  4 Moonlight          2016 
##  5 Spotlight          2015 
##  6 Birdman            2014 
##  7 12 Years a Slave   2013 
##  8 Argo               2012 
##  9 The Artist         2011 
## 10 The King's Speech  2010 
## # … with 82 more rows
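
The `year` column is still text, and early ceremonies use split years such as "1929/30". A minimal sketch of turning it into an integer column, assuming that keeping the first four digits is an acceptable simplification; the `years` vector holds a few values copied from the output above.

```r
library(stringr)

# A few values copied from the scraped years
years <- c("2019", "1929/30", "1927/28")

# Keep the leading four digits and convert to integer
year_start <- as.integer(str_extract(years, "^\\d{4}"))
year_start
```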

oecd.org: Media releases

url <- "https://www.oecd.org/newsroom/publicationsdocuments/bydate/"

oecd_parsed <- read_html(url)

oecd.org: Media releases

release_title <- oecd_parsed %>%
  html_nodes("h4 > a") %>%
  html_text()
##  [1] "\r\n                    OECD launches Global Outlook on Financing for Sustainable Development - Monday 9 November 2020 at 12:00 Paris time\r\n                "          
##  [2] "\r\n                    Climate finance for developing countries rose to USD 78.9 billion in 2018\r\n                "                                                   
##  [3] "\r\n                    The Netherlands has increased foreign bribery enforcement but there are concerns about the number of concluded cases to date\r\n                "
##  [4] "\r\n                    Consumer Prices, OECD - Updated: 4 November 2020\r\n                "                                                                            
##  [5] "\r\n                    OECD announces candidates for next Secretary-General\r\n                "                                                                        
##  [6] "\r\n                    2020 Ministerial Council Statement: A strong, resilient, inclusive and sustainable recovery from COVID-19\r\n\r\n                "               
##  [7] "\r\n                    O avanço da transformação digital no Brasil pode reforçar a recuperação econômica da crise da COVID-19\r\n                "                      
##  [8] "\r\n                    Stepping up digital transformation in Brazil could reinforce economic recovery from COVID-19 crisis\r\n\r\n\r\n                "                 
##  [9] "\r\n                    Big gender gap in students attitudes and engagement in global and multicultural issues, finds new OECD PISA report\r\n                "          
## [10] "\r\n                    OECD recognises Finland’s commitment to combat corruption, but is concerned about lack of foreign bribery enforcement\r\n                "

oecd.org: Media releases

release_title <- oecd_parsed %>%
  html_nodes("h4 > a") %>%
  html_text() %>%
  str_trim()
##  [1] "OECD launches Global Outlook on Financing for Sustainable Development - Monday 9 November 2020 at 12:00 Paris time"          
##  [2] "Climate finance for developing countries rose to USD 78.9 billion in 2018"                                                   
##  [3] "The Netherlands has increased foreign bribery enforcement but there are concerns about the number of concluded cases to date"
##  [4] "Consumer Prices, OECD - Updated: 4 November 2020"                                                                            
##  [5] "OECD announces candidates for next Secretary-General"                                                                        
##  [6] "2020 Ministerial Council Statement: A strong, resilient, inclusive and sustainable recovery from COVID-19"                   
##  [7] "O avanço da transformação digital no Brasil pode reforçar a recuperação econômica da crise da COVID-19"                      
##  [8] "Stepping up digital transformation in Brazil could reinforce economic recovery from COVID-19 crisis"                         
##  [9] "Big gender gap in students attitudes and engagement in global and multicultural issues, finds new OECD PISA report"          
## [10] "OECD recognises Finland’s commitment to combat corruption, but is concerned about lack of foreign bribery enforcement"

oecd.org: Media releases

release_url <- oecd_parsed %>%
  html_nodes("h4 > a") %>%
  html_attr("href")
##  [1] "/newsroom/oecd-launches-global-outlook-on-financing-for-sustainable-development-monday-9-november-2020-at-12-00-paris-time.htm"            
##  [2] "/newsroom/climate-finance-for-developing-countries-rose-to-usd-78-9-billion-in-2018oecd.htm"                                               
##  [3] "/newsroom/the-netherlands-has-increased-foreign-bribery-enforcement-but-there-are-concerns-about-the-number-of-concluded-cases-to-date.htm"
##  [4] "/newsroom/consumer-prices-oecd-updated-4-november-2020.htm"                                                                                
##  [5] "/newsroom/oecd-announces-candidates-for-next-secretary-general.htm"                                                                        
##  [6] "/newsroom/2020-ministerial-council-statement-a-strong-resilient-inclusive-and-sustainable-recovery-from-covid-19.htm"                      
##  [7] "/newsroom/o-avano-da-transformaao-digital-no-brasil-pode-reforar-a-recuperaao-economica-da-crise-da-covid-19.htm"                          
##  [8] "/newsroom/stepping-up-digital-transformation-in-brazil-could-reinforce-economic-recovery-from-covid-19-crisis.htm"                         
##  [9] "/newsroom/big-gender-gap-in-students-attitudes-and-engagement-in-global-and-multicultural-issues-finds-new-oecd-pisa-report.htm"           
## [10] "/newsroom/oecd-recognises-finland-s-commitment-to-combat-corruption-but-is-concerned-about-lack-of-foreign-bribery-enforcement.htm"

oecd.org: Media releases

release_url <- oecd_parsed %>%
  html_nodes("h4 > a") %>%
  html_attr("href") %>%
  str_c("https://www.oecd.org", .)
##  [1] "https://www.oecd.org/newsroom/oecd-launches-global-outlook-on-financing-for-sustainable-development-monday-9-november-2020-at-12-00-paris-time.htm"            
##  [2] "https://www.oecd.org/newsroom/climate-finance-for-developing-countries-rose-to-usd-78-9-billion-in-2018oecd.htm"                                               
##  [3] "https://www.oecd.org/newsroom/the-netherlands-has-increased-foreign-bribery-enforcement-but-there-are-concerns-about-the-number-of-concluded-cases-to-date.htm"
##  [4] "https://www.oecd.org/newsroom/consumer-prices-oecd-updated-4-november-2020.htm"                                                                                
##  [5] "https://www.oecd.org/newsroom/oecd-announces-candidates-for-next-secretary-general.htm"                                                                        
##  [6] "https://www.oecd.org/newsroom/2020-ministerial-council-statement-a-strong-resilient-inclusive-and-sustainable-recovery-from-covid-19.htm"                      
##  [7] "https://www.oecd.org/newsroom/o-avano-da-transformaao-digital-no-brasil-pode-reforar-a-recuperaao-economica-da-crise-da-covid-19.htm"                          
##  [8] "https://www.oecd.org/newsroom/stepping-up-digital-transformation-in-brazil-could-reinforce-economic-recovery-from-covid-19-crisis.htm"                         
##  [9] "https://www.oecd.org/newsroom/big-gender-gap-in-students-attitudes-and-engagement-in-global-and-multicultural-issues-finds-new-oecd-pisa-report.htm"           
## [10] "https://www.oecd.org/newsroom/oecd-recognises-finland-s-commitment-to-combat-corruption-but-is-concerned-about-lack-of-foreign-bribery-enforcement.htm"
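
As an alternative to pasting the domain in front of each link, relative links can be resolved with `xml2::url_absolute()`, which should also leave links that are already absolute untouched. A minimal sketch, assuming the xml2 package (installed alongside rvest) is available; `rel_links` is a hypothetical stand-in for the scraped `href` values:

```r
library(xml2)

# Hypothetical stand-in for the scraped href values:
# one relative link and one link that is already absolute
rel_links <- c(
  "/newsroom/consumer-prices-oecd-updated-4-november-2020.htm",
  "https://www.oecd.org/newsroom/oecd-announces-candidates-for-next-secretary-general.htm"
)

# Resolve every link against the site root
abs_links <- url_absolute(rel_links, base = "https://www.oecd.org/")
abs_links
```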

oecd.org: Media releases

release_date <- oecd_parsed %>%
  html_nodes(".date") %>%
  html_text()
##  [1] "6-November-2020" "6-November-2020" "5-November-2020" "4-November-2020"
##  [5] "2-November-2020" "29-October-2020" "26-October-2020" "26-October-2020"
##  [9] "22-October-2020" "22-October-2020"

oecd.org: Media releases

release_date <- oecd_parsed %>%
  html_nodes(".date") %>%
  html_text() %>%
  dmy()
##  [1] "2020-11-06" "2020-11-06" "2020-11-05" "2020-11-04" "2020-11-02"
##  [6] "2020-10-29" "2020-10-26" "2020-10-26" "2020-10-22" "2020-10-22"

oecd.org: Media releases

release_language <- oecd_parsed %>%
  html_nodes(".infos > em") %>%
  html_text()
##  [1] "English"    "English"    "English"    "English"    "English"   
##  [6] "English"    "Portuguese" "English"    "English"    "English"

oecd.org: Media releases

oecd <- tibble(
  "date" = release_date,
  "title" = release_title,
  "language" = release_language,
  "url" = release_url
)
## # A tibble: 10 x 4
##    date       title                        language  url                        
##    <date>     <chr>                        <chr>     <chr>                      
##  1 2020-11-06 OECD launches Global Outloo… English   https://www.oecd.org/newsr…
##  2 2020-11-06 Climate finance for develop… English   https://www.oecd.org/newsr…
##  3 2020-11-05 The Netherlands has increas… English   https://www.oecd.org/newsr…
##  4 2020-11-04 Consumer Prices, OECD - Upd… English   https://www.oecd.org/newsr…
##  5 2020-11-02 OECD announces candidates f… English   https://www.oecd.org/newsr…
##  6 2020-10-29 2020 Ministerial Council St… English   https://www.oecd.org/newsr…
##  7 2020-10-26 O avanço da transformação d… Portugue… https://www.oecd.org/newsr…
##  8 2020-10-26 Stepping up digital transfo… English   https://www.oecd.org/newsr…
##  9 2020-10-22 Big gender gap in students … English   https://www.oecd.org/newsr…
## 10 2020-10-22 OECD recognises Finland’s c… English   https://www.oecd.org/newsr…

oecd.org: Media releases

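# Pre-allocate a character column for the full text
oecd$text <- NA_character_

for (i in seq_along(oecd$url)){
  
  print(str_c("Loading... ", oecd$url[i]))
  
  # Download the release page and keep the trimmed text of the main content
  oecd$text[i] <- read_html(oecd$url[i]) %>%
    html_nodes("#webEditContent") %>%
    html_text() %>%
    str_trim()
  
  # Pause between requests to avoid overloading the server
  Sys.sleep(10)

}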
## [1] "Loading... https://www.oecd.org/newsroom/oecd-launches-global-outlook-on-financing-for-sustainable-development-monday-9-november-2020-at-12-00-paris-time.htm"
## [1] "Loading... https://www.oecd.org/newsroom/climate-finance-for-developing-countries-rose-to-usd-78-9-billion-in-2018oecd.htm"
## [1] "Loading... https://www.oecd.org/newsroom/the-netherlands-has-increased-foreign-bribery-enforcement-but-there-are-concerns-about-the-number-of-concluded-cases-to-date.htm"
## [1] "Loading... https://www.oecd.org/newsroom/consumer-prices-oecd-updated-4-november-2020.htm"
## [1] "Loading... https://www.oecd.org/newsroom/oecd-announces-candidates-for-next-secretary-general.htm"
## [1] "Loading... https://www.oecd.org/newsroom/2020-ministerial-council-statement-a-strong-resilient-inclusive-and-sustainable-recovery-from-covid-19.htm"
## [1] "Loading... https://www.oecd.org/newsroom/o-avano-da-transformaao-digital-no-brasil-pode-reforar-a-recuperaao-economica-da-crise-da-covid-19.htm"
## [1] "Loading... https://www.oecd.org/newsroom/stepping-up-digital-transformation-in-brazil-could-reinforce-economic-recovery-from-covid-19-crisis.htm"
## [1] "Loading... https://www.oecd.org/newsroom/big-gender-gap-in-students-attitudes-and-engagement-in-global-and-multicultural-issues-finds-new-oecd-pisa-report.htm"
## [1] "Loading... https://www.oecd.org/newsroom/oecd-recognises-finland-s-commitment-to-combat-corruption-but-is-concerned-about-lack-of-foreign-bribery-enforcement.htm"
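
The loop above stops at the first page that fails to load. One possible way to keep going is to wrap the download in `tryCatch()` and store `NA` for failed pages. This is only a sketch: `scrape_text()` and `fake_fetch()` are hypothetical helpers, and `fake_fetch()` stands in for the `read_html()` chain so that the example runs offline.

```r
# Hypothetical helper: returns the trimmed text, or NA if the download fails
scrape_text <- function(url, fetch) {
  tryCatch(
    trimws(fetch(url)),
    error = function(e) {
      message("Failed: ", url)
      NA_character_
    }
  )
}

# Stand-in fetcher so the sketch runs offline; in a real scraper this would be
# the read_html() %>% html_nodes() %>% html_text() chain
fake_fetch <- function(url) {
  if (grepl("broken", url)) stop("HTTP error")
  "  some release text  "
}

urls <- c("https://example.org/ok", "https://example.org/broken")

# One result per URL; failed pages yield NA instead of aborting the run
texts <- vapply(urls, scrape_text, character(1),
                fetch = fake_fetch, USE.NAMES = FALSE)
texts
```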

oecd.org: Media releases

## # A tibble: 10 x 5
##    date       title               language url                text              
##    <date>     <chr>               <chr>    <chr>              <chr>             
##  1 2020-11-06 OECD launches Glob… English  https://www.oecd.… "06/11/2020 - The…
##  2 2020-11-06 Climate finance fo… English  https://www.oecd.… "06/11/2020 - Cli…
##  3 2020-11-05 The Netherlands ha… English  https://www.oecd.… "05/11/2020 – For…
##  4 2020-11-04 Consumer Prices, O… English  https://www.oecd.… "OECD annual infl…
##  5 2020-11-02 OECD announces can… English  https://www.oecd.… "02/11/2020 - The…
##  6 2020-10-29 2020 Ministerial C… English  https://www.oecd.… "29/10/2020 - OEC…
##  7 2020-10-26 O avanço da transf… Portugu… https://www.oecd.… "26/10/2020 - O B…
##  8 2020-10-26 Stepping up digita… English  https://www.oecd.… "26/10/2020 - Bra…
##  9 2020-10-22 Big gender gap in … English  https://www.oecd.… "22/10/2020 - Sch…
## 10 2020-10-22 OECD recognises Fi… English  https://www.oecd.… "20/10/2020 – The…
oecd$text[1]
## [1] "06/11/2020 - The OECD will launch its latest Global Outlook on Financing for Sustainable Development on Monday 9 November with estimates of the looming shortfall in SDG financing for 2020 due to the economic impact of the COVID-19 crisis on public finances, foreign direct investment, portfolio investments and remittances, as well as ongoing tax evasion and illicit financial flows.\nThe report analyses the shortcomings in international and domestic financial and taxation systems that hold back investment in sustainable development and mean that the financial system continues to fuel inequalities and unsustainable investments.\nOECD Secretary-General Angel Gurría will present the report during a virtual discussion from 12:00 to 13:30 Paris time on the sidelines of a high-level meeting of the OECD’s Development Assistance Committee with participants including:\nH.E. Erna Solberg, Prime Minister, Norway\nProfessor Klaus Schwab, Executive Chairman, WEF\nH.E. Sri Mulyani Indrawati, Minister of Finance, Indonesia\nManuel Muñiz, Secretary of State, Spain\nDaniel Zelikow, Chairman of the Governing Board of JP Morgan’s Development Finance Institution\nBertrand Piccard, Chairman and CEO, Solar Impulse Foundation\nCatherine Howarth, CEO, ShareAction\nBertrand Badré, CEO, Blue Like an Orange\nProfessor Jeffrey Sachs, University Professor, Columbia University\nSee more details.\nFollow the discussion live.\nJournalists can request a copy of the Global Outlook under embargo, thereby undertaking to respect the OECD’s embargo procedures, by emailing embargo@oecd.org. For any further information, please contact Catherine Bremer in the OECD Media Office (+33 1 45 24 80 97).\n \nWorking with over 100 countries, the OECD is a global policy forum that promotes policies to improve the economic and social well-being of people around the world.\r\n             \r\n\r\n            \r\n            Related Documents"
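
The scraped text still carries a leading date, line breaks and a trailing "Related Documents" label. A possible clean-up with stringr is sketched below; `raw_text` is a shortened, hypothetical stand-in for one element of `oecd$text`, and the patterns are assumptions based on the output above (not every release starts with a date).

```r
library(stringr)

# Shortened, hypothetical stand-in for one scraped release text
raw_text <- "06/11/2020 - The OECD will launch its latest report.\nSee more details.\r\n             \r\n\r\n            \r\n            Related Documents"

# Drop the trailing label, if present
clean_text <- str_remove(raw_text, "Related Documents\\s*$")

# Drop a leading dd/mm/yyyy date followed by a hyphen or en dash, if present
clean_text <- str_remove(clean_text, "^\\d{2}/\\d{2}/\\d{4}\\s*[-\u2013]\\s*")

# Collapse line breaks and repeated spaces into single spaces
clean_text <- str_squish(clean_text)
clean_text
```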

oecd.org: Media releases

next_page <- oecd_parsed %>%
  html_nodes(".currentpage + a") %>%
  html_attr("href")
## [1] "/newsroom/publicationsdocuments/bydate/2/"

oecd.org: Media releases

url_list <- oecd_parsed %>%
  html_nodes(".paginate > div:nth-child(1) > a") %>%
  html_attr("href")
##   [1] "/newsroom/publicationsdocuments/bydate/2/"  
##   [2] "/newsroom/publicationsdocuments/bydate/3/"  
##   [3] "/newsroom/publicationsdocuments/bydate/4/"  
##   [4] "/newsroom/publicationsdocuments/bydate/5/"  
##   [5] "/newsroom/publicationsdocuments/bydate/6/"  
##   [6] "/newsroom/publicationsdocuments/bydate/7/"  
##   [7] "/newsroom/publicationsdocuments/bydate/8/"  
##   [8] "/newsroom/publicationsdocuments/bydate/9/"  
##   [9] "/newsroom/publicationsdocuments/bydate/10/" 
##  [10] "/newsroom/publicationsdocuments/bydate/11/" 
##  [11] "/newsroom/publicationsdocuments/bydate/12/" 
##  [12] "/newsroom/publicationsdocuments/bydate/13/" 
##  [13] "/newsroom/publicationsdocuments/bydate/14/" 
##  [14] "/newsroom/publicationsdocuments/bydate/15/" 
##  [15] "/newsroom/publicationsdocuments/bydate/16/" 
##  [16] "/newsroom/publicationsdocuments/bydate/17/" 
##  [17] "/newsroom/publicationsdocuments/bydate/18/" 
##  [18] "/newsroom/publicationsdocuments/bydate/19/" 
##  [19] "/newsroom/publicationsdocuments/bydate/20/" 
##  [20] "/newsroom/publicationsdocuments/bydate/21/" 
##  [21] "/newsroom/publicationsdocuments/bydate/22/" 
##  [22] "/newsroom/publicationsdocuments/bydate/23/" 
##  [23] "/newsroom/publicationsdocuments/bydate/24/" 
##  [24] "/newsroom/publicationsdocuments/bydate/25/" 
##  [25] "/newsroom/publicationsdocuments/bydate/26/" 
##  [26] "/newsroom/publicationsdocuments/bydate/27/" 
##  [27] "/newsroom/publicationsdocuments/bydate/28/" 
##  [28] "/newsroom/publicationsdocuments/bydate/29/" 
##  [29] "/newsroom/publicationsdocuments/bydate/30/" 
##  [30] "/newsroom/publicationsdocuments/bydate/31/" 
##  [31] "/newsroom/publicationsdocuments/bydate/32/" 
##  [32] "/newsroom/publicationsdocuments/bydate/33/" 
##  [33] "/newsroom/publicationsdocuments/bydate/34/" 
##  [34] "/newsroom/publicationsdocuments/bydate/35/" 
##  [35] "/newsroom/publicationsdocuments/bydate/36/" 
##  [36] "/newsroom/publicationsdocuments/bydate/37/" 
##  [37] "/newsroom/publicationsdocuments/bydate/38/" 
##  [38] "/newsroom/publicationsdocuments/bydate/39/" 
##  [39] "/newsroom/publicationsdocuments/bydate/40/" 
##  [40] "/newsroom/publicationsdocuments/bydate/41/" 
##  [41] "/newsroom/publicationsdocuments/bydate/42/" 
##  [42] "/newsroom/publicationsdocuments/bydate/43/" 
##  [43] "/newsroom/publicationsdocuments/bydate/44/" 
##  [44] "/newsroom/publicationsdocuments/bydate/45/" 
##  [45] "/newsroom/publicationsdocuments/bydate/46/" 
##  [46] "/newsroom/publicationsdocuments/bydate/47/" 
##  [47] "/newsroom/publicationsdocuments/bydate/48/" 
##  [48] "/newsroom/publicationsdocuments/bydate/49/" 
##  [49] "/newsroom/publicationsdocuments/bydate/50/" 
##  [50] "/newsroom/publicationsdocuments/bydate/51/" 
##  [51] "/newsroom/publicationsdocuments/bydate/52/" 
##  [52] "/newsroom/publicationsdocuments/bydate/53/" 
##  [53] "/newsroom/publicationsdocuments/bydate/54/" 
##  [54] "/newsroom/publicationsdocuments/bydate/55/" 
##  [55] "/newsroom/publicationsdocuments/bydate/56/" 
##  [56] "/newsroom/publicationsdocuments/bydate/57/" 
##  [57] "/newsroom/publicationsdocuments/bydate/58/" 
##  [58] "/newsroom/publicationsdocuments/bydate/59/" 
##  [59] "/newsroom/publicationsdocuments/bydate/60/" 
##  [60] "/newsroom/publicationsdocuments/bydate/61/" 
##  [61] "/newsroom/publicationsdocuments/bydate/62/" 
##  [62] "/newsroom/publicationsdocuments/bydate/63/" 
##  [63] "/newsroom/publicationsdocuments/bydate/64/" 
##  [64] "/newsroom/publicationsdocuments/bydate/65/" 
##  [65] "/newsroom/publicationsdocuments/bydate/66/" 
##  [66] "/newsroom/publicationsdocuments/bydate/67/" 
##  [67] "/newsroom/publicationsdocuments/bydate/68/" 
##  [68] "/newsroom/publicationsdocuments/bydate/69/" 
##  [69] "/newsroom/publicationsdocuments/bydate/70/" 
##  [70] "/newsroom/publicationsdocuments/bydate/71/" 
##  [71] "/newsroom/publicationsdocuments/bydate/72/" 
##  [72] "/newsroom/publicationsdocuments/bydate/73/" 
##  [73] "/newsroom/publicationsdocuments/bydate/74/" 
##  [74] "/newsroom/publicationsdocuments/bydate/75/" 
##  [75] "/newsroom/publicationsdocuments/bydate/76/" 
##  [76] "/newsroom/publicationsdocuments/bydate/77/" 
##  [77] "/newsroom/publicationsdocuments/bydate/78/" 
##  [78] "/newsroom/publicationsdocuments/bydate/79/" 
##  [79] "/newsroom/publicationsdocuments/bydate/80/" 
##  [80] "/newsroom/publicationsdocuments/bydate/81/" 
##  [81] "/newsroom/publicationsdocuments/bydate/82/" 
##  [82] "/newsroom/publicationsdocuments/bydate/83/" 
##  [83] "/newsroom/publicationsdocuments/bydate/84/" 
##  [84] "/newsroom/publicationsdocuments/bydate/85/" 
##  [85] "/newsroom/publicationsdocuments/bydate/86/" 
##  [86] "/newsroom/publicationsdocuments/bydate/87/" 
##  [87] "/newsroom/publicationsdocuments/bydate/88/" 
##  [88] "/newsroom/publicationsdocuments/bydate/89/" 
##  [89] "/newsroom/publicationsdocuments/bydate/90/" 
##  [90] "/newsroom/publicationsdocuments/bydate/91/" 
##  [91] "/newsroom/publicationsdocuments/bydate/92/" 
##  [92] "/newsroom/publicationsdocuments/bydate/93/" 
##  [93] "/newsroom/publicationsdocuments/bydate/94/" 
##  [94] "/newsroom/publicationsdocuments/bydate/95/" 
##  [95] "/newsroom/publicationsdocuments/bydate/96/" 
##  [96] "/newsroom/publicationsdocuments/bydate/97/" 
##  [97] "/newsroom/publicationsdocuments/bydate/98/" 
##  [98] "/newsroom/publicationsdocuments/bydate/99/" 
##  [99] "/newsroom/publicationsdocuments/bydate/100/"
## [100] "/newsroom/publicationsdocuments/bydate/101/"
## [101] "/newsroom/publicationsdocuments/bydate/102/"
## [102] "/newsroom/publicationsdocuments/bydate/103/"
## [103] "/newsroom/publicationsdocuments/bydate/104/"
## [104] "/newsroom/publicationsdocuments/bydate/105/"
## [105] "/newsroom/publicationsdocuments/bydate/106/"
## [106] "/newsroom/publicationsdocuments/bydate/107/"
## [107] "/newsroom/publicationsdocuments/bydate/108/"
## [108] "/newsroom/publicationsdocuments/bydate/109/"
## [109] "/newsroom/publicationsdocuments/bydate/110/"
## [110] "/newsroom/publicationsdocuments/bydate/111/"
## [111] "/newsroom/publicationsdocuments/bydate/112/"
## [112] "/newsroom/publicationsdocuments/bydate/113/"
## [113] "/newsroom/publicationsdocuments/bydate/114/"
## [114] "/newsroom/publicationsdocuments/bydate/115/"
## [115] "/newsroom/publicationsdocuments/bydate/116/"
## [116] "/newsroom/publicationsdocuments/bydate/117/"
## [117] "/newsroom/publicationsdocuments/bydate/118/"
## [118] "/newsroom/publicationsdocuments/bydate/119/"
## [119] "/newsroom/publicationsdocuments/bydate/120/"
## [120] "/newsroom/publicationsdocuments/bydate/121/"
## [121] "/newsroom/publicationsdocuments/bydate/122/"
## [122] "/newsroom/publicationsdocuments/bydate/123/"
## [123] "/newsroom/publicationsdocuments/bydate/124/"
## [124] "/newsroom/publicationsdocuments/bydate/125/"
## [125] "/newsroom/publicationsdocuments/bydate/126/"
## [126] "/newsroom/publicationsdocuments/bydate/127/"
## [127] "/newsroom/publicationsdocuments/bydate/128/"
## [128] "/newsroom/publicationsdocuments/bydate/129/"
## [129] "/newsroom/publicationsdocuments/bydate/130/"
## [130] "/newsroom/publicationsdocuments/bydate/131/"
## [131] "/newsroom/publicationsdocuments/bydate/132/"
## [132] "/newsroom/publicationsdocuments/bydate/133/"
## [133] "/newsroom/publicationsdocuments/bydate/134/"
## [134] "/newsroom/publicationsdocuments/bydate/135/"
## [135] "/newsroom/publicationsdocuments/bydate/136/"
## [136] "/newsroom/publicationsdocuments/bydate/137/"
## [137] "/newsroom/publicationsdocuments/bydate/138/"
## [138] "/newsroom/publicationsdocuments/bydate/139/"
## [139] "/newsroom/publicationsdocuments/bydate/140/"
## [140] "/newsroom/publicationsdocuments/bydate/141/"
## [141] "/newsroom/publicationsdocuments/bydate/142/"
## [142] "/newsroom/publicationsdocuments/bydate/143/"
## [143] "/newsroom/publicationsdocuments/bydate/144/"
## [144] "/newsroom/publicationsdocuments/bydate/145/"
## [145] "/newsroom/publicationsdocuments/bydate/146/"
## [146] "/newsroom/publicationsdocuments/bydate/147/"
## [147] "/newsroom/publicationsdocuments/bydate/148/"
## [148] "/newsroom/publicationsdocuments/bydate/149/"
## [149] "/newsroom/publicationsdocuments/bydate/150/"
## [150] "/newsroom/publicationsdocuments/bydate/151/"
## [151] "/newsroom/publicationsdocuments/bydate/152/"
## [152] "/newsroom/publicationsdocuments/bydate/153/"
## [153] "/newsroom/publicationsdocuments/bydate/154/"
## [154] "/newsroom/publicationsdocuments/bydate/155/"
## [155] "/newsroom/publicationsdocuments/bydate/156/"
## [156] "/newsroom/publicationsdocuments/bydate/157/"
## [157] "/newsroom/publicationsdocuments/bydate/158/"
## [158] "/newsroom/publicationsdocuments/bydate/159/"
## [159] "/newsroom/publicationsdocuments/bydate/160/"
## [160] "/newsroom/publicationsdocuments/bydate/161/"
## [161] "/newsroom/publicationsdocuments/bydate/162/"
## [162] "/newsroom/publicationsdocuments/bydate/163/"
## [163] "/newsroom/publicationsdocuments/bydate/164/"
## [164] "/newsroom/publicationsdocuments/bydate/165/"
## [165] "/newsroom/publicationsdocuments/bydate/166/"
## [166] "/newsroom/publicationsdocuments/bydate/167/"
## [167] "/newsroom/publicationsdocuments/bydate/168/"
## [168] "/newsroom/publicationsdocuments/bydate/169/"
## [169] "/newsroom/publicationsdocuments/bydate/170/"
## [170] "/newsroom/publicationsdocuments/bydate/171/"
## [171] "/newsroom/publicationsdocuments/bydate/172/"
## [172] "/newsroom/publicationsdocuments/bydate/173/"
## [173] "/newsroom/publicationsdocuments/bydate/174/"
## [174] "/newsroom/publicationsdocuments/bydate/175/"
## [175] "/newsroom/publicationsdocuments/bydate/176/"
## [176] "/newsroom/publicationsdocuments/bydate/177/"
## [177] "/newsroom/publicationsdocuments/bydate/178/"
## [178] "/newsroom/publicationsdocuments/bydate/179/"
## [179] "/newsroom/publicationsdocuments/bydate/180/"
## [180] "/newsroom/publicationsdocuments/bydate/181/"
## [181] "/newsroom/publicationsdocuments/bydate/182/"
## [182] "/newsroom/publicationsdocuments/bydate/183/"
## [183] "/newsroom/publicationsdocuments/bydate/184/"
## [184] "/newsroom/publicationsdocuments/bydate/185/"
## [185] "/newsroom/publicationsdocuments/bydate/186/"
## [186] "/newsroom/publicationsdocuments/bydate/187/"
## [187] "/newsroom/publicationsdocuments/bydate/188/"
## [188] "/newsroom/publicationsdocuments/bydate/189/"
## [189] "/newsroom/publicationsdocuments/bydate/190/"
## [190] "/newsroom/publicationsdocuments/bydate/191/"
## [191] "/newsroom/publicationsdocuments/bydate/192/"
## [192] "/newsroom/publicationsdocuments/bydate/193/"
## [193] "/newsroom/publicationsdocuments/bydate/194/"
## [194] "/newsroom/publicationsdocuments/bydate/195/"
## [195] "/newsroom/publicationsdocuments/bydate/196/"
## [196] "/newsroom/publicationsdocuments/bydate/197/"
## [197] "/newsroom/publicationsdocuments/bydate/198/"
## [198] "/newsroom/publicationsdocuments/bydate/199/"
## [199] "/newsroom/publicationsdocuments/bydate/200/"
## [200] "/newsroom/publicationsdocuments/bydate/201/"
## [201] "/newsroom/publicationsdocuments/bydate/202/"
## [202] "/newsroom/publicationsdocuments/bydate/203/"
## [203] "/newsroom/publicationsdocuments/bydate/204/"
## [204] "/newsroom/publicationsdocuments/bydate/205/"
## [205] "/newsroom/publicationsdocuments/bydate/206/"
## [206] "/newsroom/publicationsdocuments/bydate/207/"
## [207] "/newsroom/publicationsdocuments/bydate/208/"
## [208] "/newsroom/publicationsdocuments/bydate/209/"
## [209] "/newsroom/publicationsdocuments/bydate/210/"
## [210] "/newsroom/publicationsdocuments/bydate/211/"
## [211] "/newsroom/publicationsdocuments/bydate/212/"
## [212] "/newsroom/publicationsdocuments/bydate/213/"
## [213] "/newsroom/publicationsdocuments/bydate/214/"
## [214] "/newsroom/publicationsdocuments/bydate/215/"
## [215] "/newsroom/publicationsdocuments/bydate/216/"
## [216] "/newsroom/publicationsdocuments/bydate/217/"
## [217] "/newsroom/publicationsdocuments/bydate/218/"
## [218] "/newsroom/publicationsdocuments/bydate/219/"
## [219] "/newsroom/publicationsdocuments/bydate/220/"
## [220] "/newsroom/publicationsdocuments/bydate/221/"
## [221] "/newsroom/publicationsdocuments/bydate/222/"
## [222] "/newsroom/publicationsdocuments/bydate/2/"  
## [223] "/newsroom/publicationsdocuments/bydate/222/"
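
The last two entries of the output repeat pages 2 and 222 (presumably the "next" and "last" links). Before looping over the pages, the list can be deduplicated and turned into absolute URLs. A minimal sketch, in which `url_list` is a shortened, hypothetical stand-in for the scraped links:

```r
library(stringr)

# Shortened, hypothetical stand-in for the scraped pagination links,
# including repeated entries like those at the end of the real output
url_list <- c(
  "/newsroom/publicationsdocuments/bydate/2/",
  "/newsroom/publicationsdocuments/bydate/3/",
  "/newsroom/publicationsdocuments/bydate/2/",
  "/newsroom/publicationsdocuments/bydate/3/"
)

# Keep each page once and prepend the domain
page_urls <- str_c("https://www.oecd.org", unique(url_list))
page_urls
```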

bis.org: Media releases

url <- "https://www.bis.org/press/pressrels.htm?r=1"

read_html(url) %>%
  html_nodes("td div.title") %>%
  html_text()
## character(0)

bis.org: Media releases

library(RSelenium)

# Load RSelenium server and client
rd <- rsDriver(browser = "firefox")
remDr <- rd[["client"]]

# Go to URL
remDr$navigate(url)

bis.org: Media releases

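# Parse the page source rendered by the browser (assumed definition of parsed_bis)
parsed_bis <- read_html(remDr$getPageSource()[[1]])

parsed_bis %>%
  html_nodes("td div.title") %>%
  html_text() %>%
  str_trim()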
##  [1] "Basel Committee reports to G20 Leaders on Basel III implementation"                                                     
##  [2] "CPMI-IOSCO assessment concludes that Brazil has implemented the PFMI, but recommends further improvements in some areas"
##  [3] "BIS Innovation Hub and the Hong Kong Monetary Authority announce TechChallenge winners"                                 
##  [4] "FX execution algorithms contribute to market functioning but bring new challenges"                                      
##  [5] "Focus on the future of banking supervision in a changing world"                                                         
##  [6] "Central banks and BIS publish first central bank digital currency (CBDC) report laying out key requirements"            
##  [7] "BIS Innovation Hub and Saudi G20 Presidency announce TechSprint winners"                                                
##  [8] "Payment aspects of financial inclusion - tools to facilitate the application of the guidance and measure progress"      
##  [9] "Basel Committee approves annual G-SIBs assessment,  updates workplan to evaluate post-crisis reforms"                   
## [10] "Markets rose despite subdued economic recovery: BIS Quarterly Review"                                                   
## [11] "Basel Committee and Canada reimagine the 2020 International Conference of Banking Supervisors"                          
## [12] "Saudi G20 Presidency and BIS Innovation Hub update on the progress made in the G20 TechSprint initiative"               
## [13] "Basel Committee releases consultative documents on principles for operational risk and operational resilience"          
## [14] "BIS Innovation Hub and HKMA invite global innovators to participate in a trade finance digitisation TechChallenge"      
## [15] "CPMI report identifies steps to enhance cross-border payments"                                                          
## [16] "FSB and Basel Committee set out supervisory recommendations for benchmark transition"                                   
## [17] "Basel Committee publishes final revisions to the credit valuation adjustment risk framework"                            
## [18] "Basel Committee reports on Basel III implementation progress"                                                           
## [19] "Basel Committee finalises AML/CFT guidelines on supervisory cooperation"                                                
## [20] "BIS Innovation Hub to expand to new locations in Europe and North America"

bis.org: Media releases

read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".pageof") %>%
  html_text()
## [1] "Page 1 of 58"
# Select and click "Next"
next_btn <- remDr$findElement(using = "css",
                              ".listbottom span.icon.icon-chevron-right")
next_btn$clickElement()
read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".pageof") %>%
  html_text()
## [1] "Page 2 of 58"

bis.org: Media releases

# Stop RSelenium client and server
remDr$closeall()
## [[1]]
## NULL
rd$server$stop()
## [1] TRUE

admin.ch: Media releases

# Building target URL

url_base <- "https://www.admin.ch"
url_releases <- "/gov/en/start/documentation/media-releases.html"
url_start_date <- "?dyn_startDate="
url_end_date <- "&dyn_endDate="
url_page_number <- "&dyn_pageIndex="
url_organization <- "&dyn_organization="

start_date <- "01.10.2020"
end_date <- "31.10.2020"
page_number <- "0"
organization_number <- "1"

url <- str_c(url_base,
             url_releases,
             url_start_date,
             start_date,
             url_end_date,
             end_date,
             url_organization,
             organization_number,
             url_page_number,
             page_number)
## [1] "https://www.admin.ch/gov/en/start/documentation/media-releases.html?dyn_startDate=01.10.2020&dyn_endDate=31.10.2020&dyn_organization=1&dyn_pageIndex=0"
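
Because the page index is the last query parameter, the URLs for several result pages can be built in one vectorised call. The sketch below assumes, purely for illustration, that the query returns three pages (indices 0 to 2); the real number of pages would have to be read from the site.

```r
library(stringr)

# Everything up to the page index, using the same parameters as above
url_query <- str_c(
  "https://www.admin.ch",
  "/gov/en/start/documentation/media-releases.html",
  "?dyn_startDate=", "01.10.2020",
  "&dyn_endDate=", "31.10.2020",
  "&dyn_organization=", "1",
  "&dyn_pageIndex="
)

# Hypothetical: assume the results span three pages
page_numbers <- 0:2

# One URL per page index
urls <- str_c(url_query, as.character(page_numbers))
urls
```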

Wrap up

  • Static or dynamic webpage?
    • RSelenium
    • httr
  • Extract the text elements you are interested in
    • CSS
    • XPath
  • Clean the data and construct a dataframe
  • Loop if necessary:
    • visit further pages (Pagination? Links?)
    • download files
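
The steps above can be sketched end to end on a small static page. The HTML, the CSS selectors and the `example.org` domain below are invented for illustration; a real site will need its own selectors.

```r
library(rvest)
library(tibble)

# Hypothetical static page, parsed from a literal string so the sketch runs offline
page <- read_html('
<html><body>
  <div class="release"><span class="date">2020-11-06</span>
    <h4><a href="/newsroom/a.htm">First release</a></h4></div>
  <div class="release"><span class="date">2020-11-05</span>
    <h4><a href="/newsroom/b.htm">Second release</a></h4></div>
</body></html>')

# Extract the elements of interest with CSS selectors
links <- html_nodes(page, "h4 > a")
dates <- html_nodes(page, ".date")

# Clean the pieces and assemble them into a data frame
releases <- tibble(
  date  = as.Date(html_text(dates)),
  title = html_text(links),
  url   = paste0("https://www.example.org", html_attr(links, "href"))
)
releases
```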

More resources