storm-crawler in the wild
TRANSCRIPT
Who We Are
ONTOPIC
• Ontopic is an early-stage FinTech startup located in Los Angeles, CA
• We’re building an engine that empowers qualitative financial research by taming information overload
Our Requirements
• Need to discover news as soon as it appears on the web
• Involves monitoring several hundred thousand content sources
• This is better described as web monitoring than as web crawling
What We Tried
Apache Nutch
• + Perhaps the gold standard for open-source web crawling
• + Capable of handling millions of pages per day
• - We decided that we were trying to force Nutch to do something for which it wasn’t designed—specifically, real-time monitoring
Scrapy
• + Open-source Python web-scraping framework
• + Incredibly simple to get started
• + Processing pipelines are dead simple to develop
• - No built-in distributed mode; building an in-house distributed, continuous-crawl framework for Scrapy seemed like a fragile solution
• - Designed, and primarily used, as a web scraper (again, not precisely the same as web monitoring)
Storm Crawler at Ontopic
• The storm-crawler project is our workhorse for web monitoring
• Integrated with Apache Kafka, Redis, and several other technologies
• Running on a cluster managed by Hortonworks HDP 2.2
High-Level Architecture
• Redis holds the seed list, domain locks, outlink list, and Logstash events
• A URL Manager (Ruby app) manages Redis and publishes seeds and outlinks to Kafka
• Kafka: one topic with two partitions
• storm-crawler: one topology, with a seed stream and an outlink stream; the Kafka spout runs two executors (one for each topic partition)
• Logstash indexes events into Elasticsearch
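The deck lists "domain locks" in Redis but doesn't show how they work. A common pattern for such locks—an assumption here, not necessarily Ontopic's actual code—is a per-host lock with a TTL, so only one fetcher hits a given site at a time (Redis's `SET key value NX EX ttl` provides exactly this). A minimal in-memory Python sketch of the pattern:

```python
import time


class DomainLocks:
    """Per-host locks with expiry, mimicking Redis `SET key val NX EX ttl`.

    In-memory stand-in for illustration only; a real deployment would
    use Redis so the locks are shared across all fetcher workers.
    """

    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._locks = {}  # host -> expiry timestamp

    def acquire(self, host, now=None):
        now = time.monotonic() if now is None else now
        expiry = self._locks.get(host)
        if expiry is not None and expiry > now:
            return False  # another fetcher currently holds the lock
        self._locks[host] = now + self.ttl
        return True

    def release(self, host):
        self._locks.pop(host, None)


locks = DomainLocks(ttl_seconds=5.0)
assert locks.acquire("example.com", now=0.0)      # first fetcher wins
assert not locks.acquire("example.com", now=1.0)  # others are locked out
assert locks.acquire("example.com", now=6.0)      # lock expired, reusable
```

The TTL doubles as crash protection: if a fetcher dies without releasing, the lock simply expires.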
R&D Environment (AWS)
• Same high-level architecture as above, deployed on AWS:
• 1 x m1.small (Redis and the Ruby URL Manager)
• 1 x r3.large
• 1 x c3.large
• Storm — Nimbus: 1 x r3.large; Supervisors: 3 x c3.large (in a placement group)
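With one topic split into two partitions and one spout executor per partition, a natural way to route seeds and outlinks—an assumption here; the deck doesn't say how the Ruby app keys its messages—is to hash each URL's host, so every URL for a given site lands on the same partition and is consumed by the same executor. A sketch:

```python
import zlib
from urllib.parse import urlparse

NUM_PARTITIONS = 2  # matches the "one topic, two partitions" setup


def partition_for(url, num_partitions=NUM_PARTITIONS):
    """Route a URL to a Kafka partition by hashing its host.

    Keying on the host rather than the full URL keeps all of a site's
    URLs on one partition, so a single spout executor sees all traffic
    for that site. crc32 is used because, unlike Python's built-in
    hash(), it is stable across processes.
    """
    host = urlparse(url).netloc
    return zlib.crc32(host.encode("utf-8")) % num_partitions


# All URLs from the same host map to the same partition:
p1 = partition_for("http://example.com/news/1")
p2 = partition_for("http://example.com/news/2")
assert p1 == p2
```

Co-locating a site's URLs on one executor also pairs well with the per-domain locking above: each domain's fetch pressure is concentrated in one place.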
Eye Candy: Kibana Dashboard
Metrics from storm-crawler sent to Logstash enable easy real-time monitoring
~800,000 URLs per day
Conclusion
• storm-crawler has enabled us to build a reliable, distributed, web-scale news-monitoring solution
• Screen grabs are from our R&D environment, in which we’ve been able to monitor ~2,000 sources with a revisit time of one minute, at 10% utilization on a small cluster
• We have zero scalability concerns with storm-crawler: upping the number of tasks and nodes has demonstrated the ability to fetch hundreds of thousands of pages per minute
• Ontopic is committed to open-sourcing our work on top of storm-crawler and to being a core contributor to the project
• We’re working to generalize our integration points with Redis, Kafka, and Logstash, and to provide tutorials so that users can easily leverage these technologies (or their equivalents) in their own storm-crawler projects
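Monitoring ~2,000 sources with a one-minute revisit time is at heart a continuous scheduling problem: each source becomes due again a fixed interval after its last fetch. A minimal min-heap sketch of that loop (illustrative only; in storm-crawler the equivalent logic lives in its spouts and status storage, not in code like this):

```python
import heapq


def revisit_order(sources, interval, rounds):
    """Yield (due_time, source) pairs in fetch order.

    Each source is refetched `interval` seconds after its previous
    fetch; a min-heap always surfaces the next-due source first.
    """
    heap = [(0.0, s) for s in sources]  # everything due immediately
    heapq.heapify(heap)
    for _ in range(rounds):
        due, source = heapq.heappop(heap)
        yield due, source
        heapq.heappush(heap, (due + interval, source))


schedule = list(revisit_order(["a.com", "b.com"], interval=60.0, rounds=4))
# Each source reappears exactly once per 60-second interval.
assert schedule == [(0.0, "a.com"), (0.0, "b.com"),
                    (60.0, "a.com"), (60.0, "b.com")]
```

The heap never grows beyond the number of sources, which is why scaling from 2,000 sources to several hundred thousand is a matter of fetch throughput rather than scheduling overhead.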