Using Web Data for Finance
Post on 13-Feb-2017
Scraping the Web with Scrapinghub for Finance
We turn web content into useful data
About Scrapinghub
Scrapinghub specializes in data extraction. Our platform is used to scrape over 4 billion web pages a month.
We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle free
● A cloud-based platform that makes scraping a breeze
Founded in 2010, largest 100% remote company based outside of the US
We’re 134 teammates in 48 countries
“Getting information off the Internet is like taking a drink from a fire hydrant.”
– Mitchell Kapor
Scrapy
Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
Portia
Portia is a Visual Scraping tool that lets you get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● JavaScript dynamic content generation
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
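To make the idea of link-analysis prioritization concrete, here is a minimal sketch in plain Python. It is not the Frontera or Aduana API: the function names, the `LINKS` graph and the URLs are hypothetical, and the scoring is a textbook power-iteration PageRank used only to decide which URL to crawl first.

```python
import heapq

def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over {url: [outlinked urls]}."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank

def crawl_order(links):
    """Return URLs sorted so the highest-ranked page is crawled first."""
    scores = pagerank(links)
    # heapq is a min-heap, so negate the score to pop the best URL first.
    heap = [(-score, url) for url, score in scores.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical link graph discovered during a crawl.
LINKS = {
    "a.example/home": ["a.example/products", "a.example/about"],
    "a.example/products": ["a.example/home"],
    "a.example/about": ["a.example/home"],
    "a.example/orphan": ["a.example/home"],
}

if __name__ == "__main__":
    for url in crawl_order(LINKS):
        print(url)
```

In this toy graph the home page is linked from every other page, so it is scheduled first, while the page nobody links to is scheduled last. A real crawl frontier would update scores incrementally as new links are discovered.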
Web Scraping Use Cases
Competitive Pricing
Companies use web scraping to monitor the pricing and the ratings of competitors:
● Scrape online retailers
● Structure the data in a search engine or DB
● Create an interface to search for products
● Sentiment analysis for product rankings
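As a rough illustration of the "structure the data in a DB, then search it" steps, the sketch below loads a few scraped offers into an in-memory SQLite database and queries them by product name. The records, the `offers` table layout and the function names are all made up for the example; a production setup might use a search engine instead.

```python
import sqlite3

# Hypothetical records, shaped like the output of a retailer scraper.
SCRAPED = [
    {"retailer": "shop-a", "product": "USB-C cable 1m", "price": 7.99, "rating": 4.4},
    {"retailer": "shop-b", "product": "USB-C cable 1m", "price": 6.49, "rating": 4.1},
    {"retailer": "shop-a", "product": "Laptop stand", "price": 29.00, "rating": 4.7},
]

def build_db(records):
    """Load scraped offers into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE offers (retailer TEXT, product TEXT, price REAL, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO offers VALUES (:retailer, :product, :price, :rating)",
        records,
    )
    conn.commit()
    return conn

def cheapest(conn, term):
    """Return (retailer, product, price) rows matching term, cheapest first."""
    cur = conn.execute(
        "SELECT retailer, product, price FROM offers "
        "WHERE product LIKE ? ORDER BY price ASC",
        (f"%{term}%",),
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = build_db(SCRAPED)
    for row in cheapest(conn, "cable"):
        print(row)
```

Keeping one row per retailer and product makes it straightforward to compare a competitor's price against your own with a single query.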
Monitor Resellers
We help a leading IT manufacturer monitor the activities of their resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses on complaints
● Product line quality checks
Lead Generation
Mine scraped data to identify who to target in a company for your outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them
Real Estate
Crawl property websites and use the data obtained in order to:
● Estimate house prices
● Estimate rental values
● Track housing stock movements
● Give insight into real estate agents and homeowners
Fraud Detection
Monitor for sellers that offer products violating the ToS of credit card companies, including:
● Drugs
● Weapons
● Gambling
Identify stolen cards and IDs on the Dark Web:
● Forums where hackers share ID numbers / PINs
Company Reputation
Sentiment analysis of a company or product through newsletters, social networks and other natural language data sources.
● NLP to create an associated sentiment indicator.
● Tracking the relevant news supporting the indicator can lead to market insights into long-term trends.
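A sentiment indicator can be sketched very simply. The example below is a deliberately naive, lexicon-based scorer and not a description of any particular NLP pipeline: the word lists, function names and headlines are invented for illustration, and a real system would use a trained model.

```python
import re

# Tiny hypothetical lexicons; a real system would use a trained model.
POSITIVE = {"beat", "growth", "strong", "upgrade", "record"}
NEGATIVE = {"miss", "lawsuit", "weak", "downgrade", "recall"}

def sentiment_score(text):
    """Score one text in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def sentiment_indicator(headlines):
    """Average the per-headline scores into a single indicator."""
    if not headlines:
        return 0.0
    return sum(sentiment_score(h) for h in headlines) / len(headlines)

if __name__ == "__main__":
    news = [
        "Strong growth, record quarter",
        "Recall triggers lawsuit",
        "Company holds annual meeting",
    ]
    print(sentiment_indicator(news))
```

Computing the indicator per day or per week, and keeping the headlines behind each value, gives both the trend line and the news that supports it.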
Consumer Behavior
Extract data from forums and websites like Reddit to evaluate consumer reviews and commentary:
● Volume of comments across brands
● Topics of discussion
● Comparisons with other brands and products
● Evaluate product launches and marketing tactics
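Measuring comment volume across brands can start as a simple count of mentions. This is a minimal sketch under obvious assumptions: the brand names and comments are fictional, `brand_mentions` is a hypothetical helper, and matching whole words case-insensitively is only a first approximation of brand detection.

```python
import re
from collections import Counter

def brand_mentions(comments, brands):
    """Count how many comments mention each brand (whole word, any case)."""
    counts = Counter({brand: 0 for brand in brands})
    for comment in comments:
        text = comment.lower()
        for brand in brands:
            pattern = r"\b" + re.escape(brand.lower()) + r"\b"
            if re.search(pattern, text):
                counts[brand] += 1
    return counts

if __name__ == "__main__":
    comments = [
        "I switched from Acme to Globex last year",
        "Acme support was slow",
        "No brand mentioned here",
    ]
    print(brand_mentions(comments, ["Acme", "Globex", "Initech"]))
```

Each comment is counted at most once per brand, so the result reflects how many comments discuss a brand rather than how often its name is repeated.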
Tracking Legislation
Monitor bills and regulations that are being discussed in Congress. Access court judgments and opinions in order to:
● Follow discussions
● Try to forecast legislative outcomes
● Track regulations that impact different economic sectors
Hiring
Crawl and extract data from job boards and other sources in order to:
● Understand hiring trends in different sectors or regions
● Find candidates for jobs, or future leaders
● Spot and rescue employees who are shopping for a new job
Monitoring Corruption
Journalists and analysts can create Open Data by extracting information from difficult-to-access government websites:
● Track the activities of lobbyists
● Patterns in the behavior of government officials
● Disruptions in the economy due to corruption allegations
Thank you!
scrapinghub.com