SMI 2018 :: web scraping
part I
Augusto Fadel
November 6, 2018
part I
· R and RStudio
· Web data and web scraping
· Collecting data on the web
  - download
  - API
  - web scraping
suggestions
· GitHub repository: web-scraping-smi2018
· IDE: RStudio
· Packages: tidyverse | rvest | httr | xml2 | jsonlite | DBI
· Other resources:
  - R for Data Science
  - Happy Git and GitHub for the useR
  - rstudio::conf 2018 | webinars | cheat sheets
  - DataCamp | edX | Coursera | Udacity
  - Stack Overflow
Web Scraping
web data
· Download
· APIs (Application Programming Interfaces)
· Web scraping
· Crawler
Download
file types
· csv (comma-separated values)
· tsv (tab-separated values)
· MS Excel (xls, xlsx)
· zip
· JSON (JavaScript Object Notation)
· XML (Extensible Markup Language)
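As a rough illustration of how some of the file types listed above might be read in R, the sketch below writes tiny sample files to a temporary directory and reads them back with readr (part of the tidyverse), jsonlite and xml2. The file names and contents are made up for the example. Excel files and zip archives are not shown; they could be handled with readxl::read_excel() and utils::unzip(), respectively.

```r
library(readr)     # read_csv(), read_tsv()
library(jsonlite)  # fromJSON()
library(xml2)      # read_xml(), xml_find_all(), xml_text()

# Create small sample files (illustrative data only).
tmp <- tempfile("filetypes_")
dir.create(tmp)
csv_path  <- file.path(tmp, "sample.csv")
tsv_path  <- file.path(tmp, "sample.tsv")
json_path <- file.path(tmp, "sample.json")
xml_path  <- file.path(tmp, "sample.xml")

writeLines(c("id,name", "1,alpha", "2,beta"), csv_path)
writeLines(c("id\tname", "1\talpha", "2\tbeta"), tsv_path)
writeLines('[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]', json_path)
writeLines('<items><item id="1">alpha</item><item id="2">beta</item></items>', xml_path)

# Delimited text: col_types = cols() lets readr guess the column types quietly.
csv_data <- read_csv(csv_path, col_types = cols())
tsv_data <- read_tsv(tsv_path, col_types = cols())

# JSON: an array of objects is simplified to a data frame.
json_data <- fromJSON(json_path)

# XML: select nodes with XPath, then extract their text.
xml_doc   <- read_xml(xml_path)
xml_names <- xml_text(xml_find_all(xml_doc, "//item"))
```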
download
· Portal Brasileiro de Dados Abertos (Brazilian Open Data Portal): dados.gov.br
· Fuel sales prices: anp.gov.br
· IPCA (consumer price index): ftp.ibge.gov.br
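A direct download from sources like these could be scripted with base R's download.file(). The helper below is a minimal sketch: fetch_file() is a hypothetical name, and the URL in the usage comment is a placeholder, not a real dataset address.

```r
# Download a file to disk and return the local path.
# `fetch_file` is an illustrative helper, not part of any package.
fetch_file <- function(url, dest = tempfile(fileext = ".csv")) {
  # mode = "wb" keeps binary files (zip, xlsx) intact on Windows.
  status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed: ", url)
  dest
}

# Hypothetical usage (placeholder URL):
# path   <- fetch_file("https://example.org/data/prices.csv")
# prices <- read.csv(path)
```

Saving to disk first, rather than reading straight from the URL, means the raw file can be kept and re-read without hitting the server again.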
APIs
APIs
HTTP requests
· GET
· POST
· DELETE
· HEAD
· others
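With the httr package, each of these request methods has a function of the same name. The sketch below builds a request URL and wraps a GET call in a small helper; the endpoint and the name get_text() are assumptions made for illustration only.

```r
library(httr)

# Hypothetical endpoint, used only for illustration.
base_url <- "https://api.example.org/v1/items"

# Add query parameters to the URL.
request_url <- modify_url(base_url, query = list(page = 1, format = "json"))

# Send a GET request and return the response body as text.
get_text <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # turn 4xx/5xx responses into R errors
  content(resp, as = "text", encoding = "UTF-8")
}

# The other verbs follow the same pattern, for example:
# POST(url, body = list(key = "value")), DELETE(url), HEAD(url)
```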
APIs
Responses
HTTP response codes
· 2xx (e.g. 200): success!
· 3xx: redirection
· 4xx: client error
· 5xx: server error
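The first digit of the code determines the class of the response. A small base-R sketch of that mapping follows; status_class() is a hypothetical helper, and the 1xx (informational) class is included for completeness even though it is not listed above.

```r
# Map an HTTP status code to its response class using the hundreds digit.
status_class <- function(code) {
  stopifnot(is.numeric(code), length(code) == 1, code >= 100, code <= 599)
  switch(
    as.character(code %/% 100),
    "1" = "informational",
    "2" = "success",
    "3" = "redirection",
    "4" = "client error",
    "5" = "server error"
  )
}

status_class(200)  # "success"
status_class(404)  # "client error"
```

When working with httr, the numeric code of a response can be obtained with httr::status_code(resp) and passed to a helper like this one.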
APIs
· The Star Wars API: swapi.co (rwars package)
· Sistema IBGE de Recuperação Automática (IBGE automatic retrieval system): SIDRA
· Banco Central do Brasil (Central Bank of Brazil): Portal de Dados Abertos (open data portal)
Web Scraping
web scraping
HTML structure (tags)
<a href="http://www.ibge.gov.br/">IBGE</a>
· title: <title> ... </title>
· text paragraph: <p> ... </p>
· blocks: <div> ... </div>
· table: <table> ... </table>
· hyperlink (anchor): <a> ... </a>
web scraping
Extracting with rvest:
<a href="http://www.ibge.gov.br/">IBGE</a>
· html_name()
· html_attr()
· html_text()
· html_table()
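Applied to the anchor tag above, the four functions could be used roughly as follows. This is a sketch that parses HTML from a string instead of a live page; the small table is made up so that html_table() has something to work on.

```r
library(rvest)

# Parse the example anchor from a string (no network access needed).
page <- read_html('<a href="http://www.ibge.gov.br/">IBGE</a>')
link <- html_element(page, "a")

tag_name  <- html_name(link)          # "a"
link_href <- html_attr(link, "href")  # "http://www.ibge.gov.br/"
link_text <- html_text(link)          # "IBGE"

# html_table() converts a <table> element into a data frame.
# The table below is illustrative data, not taken from a real site.
table_page <- read_html(
  "<table>
     <tr><th>code</th><th>name</th></tr>
     <tr><td>1</td><td>alpha</td></tr>
     <tr><td>2</td><td>beta</td></tr>
   </table>"
)
tbl <- html_table(html_element(table_page, "table"))
```

For a real page, read_html() would receive the page URL, and html_element() a CSS selector that identifies the node of interest.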
web scraping
Selenium
· Selenium 2: WebDriver
· Standalone server (v3.9.1)
· RSelenium package
Running the server
· Using Docker
· Using RSelenium::rsDriver()
· Manually
web scraping
RSelenium functions:
· navigate()
· goBack()
· goForward()
· refresh()
· findElement()
· highlightElement()
· clickElement()
· mouseMoveToLocation()
· click()
· sendKeysToElement()
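A few of these methods might be combined as in the sketch below. It assumes `remDr` is an RSelenium remote driver client that is already connected to a running Selenium server; search_page() and all of its arguments are hypothetical names chosen for the example.

```r
# `remDr` is assumed to be an RSelenium remoteDriver client, obtained for
# example with:
#   rD    <- RSelenium::rsDriver(browser = "firefox")
#   remDr <- rD$client

# Open a page, type a query into an input box and return the page title.
# `search_page` is an illustrative helper, not part of RSelenium.
search_page <- function(remDr, url, css, query) {
  remDr$navigate(url)
  # Locate the input element with a CSS selector.
  box <- remDr$findElement(using = "css selector", value = css)
  # Type the query, then press Enter.
  box$sendKeysToElement(list(query, key = "enter"))
  # getTitle() returns a list; take its first element.
  remDr$getTitle()[[1]]
}

# Hypothetical usage (placeholder URL and selector):
# search_page(remDr, "https://example.org", "input[name='q']", "IPCA")
```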
Best practices
best practices
· Check and respect robots.txt.
· Identify yourself.
· Use httr::user_agent() with a contact e-mail.
· Respect the request limit (rate limiting).
· If there is no explicit limit, use common sense.
· Rule of thumb: wait one second between requests, using Sys.sleep(1).
· Prefer off-peak hours.
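The identification and rate-limiting advice above could be combined in a small helper such as the one sketched here. The user agent string, the e-mail address and the name polite_fetch() are placeholders, not values from the original slides.

```r
library(httr)

# Identify the scraper; replace the placeholder contact with your own.
ua <- user_agent("smi2018-scraper (contact: your.name@example.org)")

# Fetch a vector of URLs one by one, pausing between consecutive requests.
# `fetch` can be swapped out, which makes the helper easy to try offline.
polite_fetch <- function(urls, delay = 1, fetch = function(u) GET(u, ua)) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)  # wait before every request except the first
    results[[i]] <- fetch(urls[[i]])
  }
  results
}

# Hypothetical usage (placeholder URLs):
# pages <- polite_fetch(c("https://example.org/a", "https://example.org/b"))
```

The default delay of one second follows the rule of thumb given above; a site with an explicit limit would call for a larger value.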
Thank you!
Augusto Fadel
DPE/CEEC/GCAD
21 2142-0452
augustofadel