Using Web Data for Finance
Scraping the Web with Scrapinghub for Finance
We turn web content into useful data
About Scrapinghub
Scrapinghub specializes in data extraction. Our platform is used to scrape over 4 billion web pages a month.
We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle free
● A cloud-based platform that makes scraping a breeze
Founded in 2010, largest 100% remote company based outside of the US
We’re 134 teammates in 48 countries
“Getting information off the Internet is like taking a drink from a fire hydrant.”
– Mitchell Kapor
Scrapy
Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
Portia
Portia is a Visual Scraping tool that lets you get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● JavaScript dynamic content generation
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
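To make the idea of link-based URL prioritization concrete, here is a minimal sketch in plain Python. It is not the Frontera or Aduana API: the function names, the link graph and the example URLs are all hypothetical, and it only illustrates how PageRank scores could be used to order a crawl frontier.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {url: [outgoing urls]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def prioritize(frontier, rank):
    """Order uncrawled URLs so the highest-ranked are fetched first."""
    return sorted(frontier, key=lambda url: rank.get(url, 0.0), reverse=True)


# Hypothetical link graph discovered so far (illustrative URLs only).
LINKS = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": ["a.example/"],
    "d.example/": ["c.example/"],
}

scores = pagerank(LINKS)
queue = prioritize(["b.example/", "c.example/", "d.example/"], scores)
```

In this toy graph the page with the most incoming links ends up at the front of the queue, and the page nobody links to ends up last. A production frontier would also need persistence, politeness rules and distribution across servers.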
Web Scraping Use Cases
Competitive Pricing
Companies use web scraping to monitor the pricing and the ratings of competitors:
● Scrape online retailers
● Structure the data in a search engine or DB
● Create an interface to search for products
● Sentiment analysis for product rankings
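As a rough illustration of the first two steps (scrape a retailer page, then structure the data), the sketch below uses only Python's standard `html.parser`. The markup, the CSS class names, the product names and the prices are all made up for the example. A real project would more likely use Scrapy selectors against the actual site layout.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect {"name", "price"} rows from a hypothetical listing markup."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # which field the next text node belongs to
        self._current = {}   # row being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is None:
            return
        text = data.strip()
        if self._field == "price":
            # Assumes prices look like "$1,299.00".
            self._current["price"] = float(text.lstrip("$").replace(",", ""))
        else:
            self._current["name"] = text
        self._field = None
        if len(self._current) == 2:
            self.rows.append(self._current)
            self._current = {}


def undercut(rows, our_prices):
    """Return competitor rows priced below our own price for the same product."""
    return [
        row for row in rows
        if row["name"] in our_prices and row["price"] < our_prices[row["name"]]
    ]


# Hypothetical competitor page and our own (made-up) price list.
PAGE = """
<ul>
  <li><span class="name">Laptop X1</span> <span class="price">$1,299.00</span></li>
  <li><span class="name">Laptop Z5</span> <span class="price">$949.50</span></li>
</ul>
"""
OUR_PRICES = {"Laptop X1": 1249.00, "Laptop Z5": 999.00}

parser = PriceParser()
parser.feed(PAGE)
cheaper = undercut(parser.rows, OUR_PRICES)
```

Here `parser.rows` holds the structured records, ready to load into a database or search index, and `cheaper` lists the products where the competitor is below our price.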
Monitor Resellers
We help a leading IT manufacturer monitor the activities of their resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses on complaints
● Product line quality checks
Lead Generation
Mine scraped data to identify who to target in a company for your outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them
Real Estate
Crawl property websites and use the data obtained in order to:
● Estimate house prices
● Estimate rental values
● Track housing stock movements
● Give insight into real estate agents and homeowners
Fraud Detection
Monitor for sellers that offer products violating the ToS of credit card companies, including:
● Drugs
● Weapons
● Gambling
Identify stolen cards and IDs on the Dark Web:
● Forums where hackers share ID numbers / PINs
Company Reputation
Sentiment analysis of a company or product through newsletters, social networks and other natural language data sources.
● Use NLP to create an associated sentiment indicator.
● Tracking the relevant news supporting the indicator can lead to market insights on long-term trends.
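A sentiment indicator of this kind can be sketched very simply. The example below swaps real NLP for a tiny hand-written word list, so treat it as an illustration of the pipeline shape only. The lexicon, the headlines and the dates are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon; a real system would use a trained NLP model.
POSITIVE = {"growth", "profit", "strong", "beat", "upgrade"}
NEGATIVE = {"loss", "lawsuit", "weak", "miss", "downgrade"}


def score(text):
    """Return (positive - negative) / matched words, in [-1, 1]; 0.0 if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def daily_indicator(items):
    """Average the per-item scores for each date."""
    by_day = defaultdict(list)
    for date, text in items:
        by_day[date].append(score(text))
    return {date: sum(s) / len(s) for date, s in sorted(by_day.items())}


# Made-up scraped headlines about one company.
NEWS = [
    ("2016-05-02", "Strong quarter: profit beat expectations"),
    ("2016-05-02", "Analysts upgrade the stock on growth"),
    ("2016-05-03", "Lawsuit filed after weak guidance and a loss"),
    ("2016-05-03", "Profit growth offsets one downgrade"),
]

indicator = daily_indicator(NEWS)
```

The result maps each date to a value between -1 (all negative) and 1 (all positive). Keeping the underlying headlines alongside each daily value is what lets an analyst trace a move in the indicator back to the news behind it.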
Consumer Behavior
Extract data from forums and websites like Reddit to evaluate consumer reviews and commentary:
● Volume of comments across brands
● Topics of discussion
● Comparisons with other brands and products
● Evaluate product launches and marketing tactics
Tracking Legislation
Monitor bills and regulations that are being discussed in Congress. Access court judgments and opinions in order to:
● Follow discussions
● Try to forecast legislative outcomes
● Track regulations that impact different economic sectors
Hiring
Crawl and extract data from job boards and other sources in order to:
● Understand hiring trends in different sectors or regions
● Find candidates for jobs, or future leaders
● Spot and rescue employees that are shopping for a new job
Monitoring Corruption
Journalists and analysts can create Open Data by extracting information from difficult-to-access government websites:
● Track the activities of lobbyists
● Patterns in the behavior of government officials
● Disruptions in the economy due to corruption allegations
Thank you!
scrapinghub.com