Storm-crawler in the wild

Jake K. Dodd ([email protected])
http://www.ontopic.io

Uploaded by julien-nioche on 17 Jul 2015



Who We Are


•  Ontopic is an early-stage FinTech startup located in Los Angeles, CA

•  We’re building an engine that empowers qualitative financial research by taming information overload

Our Requirements

•  Need to discover news as soon as it appears on the web

•  Involves monitoring several hundred thousand content sources

•  This is better described as web monitoring than web crawling
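The monitoring-versus-crawling distinction above can be sketched as a revisit scheduler: instead of expanding a frontier of pages visited once, each known source is refetched on its own interval. This is an illustrative sketch, not storm-crawler's actual implementation; the class and intervals are hypothetical.

```python
import heapq
import time

class RevisitScheduler:
    """Toy revisit scheduler: a min-heap of sources keyed by next fetch time."""

    def __init__(self):
        self._heap = []  # entries: (next_fetch_time, url, interval_seconds)

    def add_source(self, url, interval_seconds):
        # New sources are due immediately.
        heapq.heappush(self._heap, (0.0, url, interval_seconds))

    def next_due(self, now=None):
        """Pop the next source that is due for a fetch, or None if nothing is due."""
        now = time.time() if now is None else now
        if self._heap and self._heap[0][0] <= now:
            _, url, interval = heapq.heappop(self._heap)
            # Reschedule the source for its next revisit before returning it.
            heapq.heappush(self._heap, (now + interval, url, interval))
            return url
        return None

sched = RevisitScheduler()
sched.add_source("http://example.com/feed", 60)   # revisit every minute
sched.add_source("http://example.org/news", 300)  # revisit every 5 minutes
print(sched.next_due(now=0.0))  # → http://example.com/feed
```

With a one-minute interval per source, as in the R&D environment described later, this loop never terminates: a source that has just been fetched simply goes back on the heap.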

What We Tried

•  + Perhaps the gold-standard for open source web crawling •  + Capable of handling millions of pages per day •  - We decided that we were trying to force Nutch to do

something for which it wasn’t designed—specifically, real-time monitoring

•  + Open source python web scraping framework •  + Incredibly simple to get started •  + Processing pipelines are dead-simple to develop •  - No built-in distributed mode. Building an in-house

distributed and continuous-crawl framework for Scrapy seemed like a fragile solution

•  - Designed, and primarily used, as a web scraper (again, not precisely the same as web monitoring)

Storm Crawler at Ontopic

•  The storm-crawler project is our workhorse for web monitoring

•  Integrated with Apache Kafka, Redis, and several other technologies

•  Running on a cluster managed by Hortonworks HDP 2.2

High-Level Architecture

•  URL Manager (Ruby app): manages Redis; publishes seeds and outlinks to Kafka
•  Redis: holds the seed list, domain locks, outlink list, and Logstash events
•  Kafka: one topic, two partitions
•  storm-crawler: one topology, with a seed stream and an outlink stream; Kafka spout with two executors (one for each topic partition)
•  Elasticsearch: document indexing and Logstash events
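The flow above can be sketched in miniature. In this hypothetical sketch, plain dicts stand in for Redis and the two-partition Kafka topic (a real deployment would use the redis and Kafka client libraries), and partitioning by domain is our illustrative choice, not necessarily what the Ontopic pipeline does.

```python
import time
import zlib
from urllib.parse import urlparse

NUM_PARTITIONS = 2
topic = {p: [] for p in range(NUM_PARTITIONS)}  # stand-in for the Kafka topic
domain_locks = {}                               # stand-in for Redis domain locks

def acquire_domain_lock(url, ttl=5.0, now=None):
    """Politeness: allow one in-flight publish per domain until the lock expires."""
    now = time.time() if now is None else now
    domain = urlparse(url).netloc
    if domain_locks.get(domain, 0.0) > now:
        return False  # another fetch for this domain is still in flight
    domain_locks[domain] = now + ttl
    return True

def publish(url, now=None):
    """URL manager role: take the domain lock, then publish to one partition."""
    if not acquire_domain_lock(url, now=now):
        return False
    # Stable hash of the domain picks the partition, so each of the two
    # spout executors sees a consistent slice of the domains.
    partition = zlib.crc32(urlparse(url).netloc.encode()) % NUM_PARTITIONS
    topic[partition].append(url)
    return True

publish("http://example.com/a", now=0.0)  # accepted
publish("http://example.com/b", now=1.0)  # rejected: domain still locked
publish("http://example.org/x", now=1.0)  # accepted: different domain
```

On the consuming side, one Kafka spout executor per partition drains its slice of the topic into the storm-crawler topology.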

R&D Environment (AWS)

The same architecture as above, deployed on the following instances:

•  Redis and the Ruby URL Manager: 1 x m1.small
•  1 x r3.large
•  1 x c3.large
•  Storm Nimbus: 1 x r3.large
•  Storm Supervisors: 3 x c3.large (in a placement group)

Eye Candy (Ambari)

[Ambari dashboard screenshot: cluster utilization only ~7%]

Eye Candy (Storm)

[Storm UI screenshot: ~800,000 URLs per day]

Eye Candy (Kibana)

[Kibana dashboard screenshot: ~800,000 URLs per day]

Metrics from storm-crawler sent to Logstash enable easy real-time monitoring
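A metric shipped to Logstash for a dashboard like this is, at bottom, a timestamped JSON event. The field names below are illustrative, not the exact schema storm-crawler emits.

```python
import json
from datetime import datetime, timezone

def metric_event(metric, value, component):
    """Serialize one crawler metric as a JSON line a Logstash input could ingest."""
    return json.dumps({
        "@timestamp": datetime.now(timezone.utc).isoformat(),  # Logstash's standard timestamp field
        "metric": metric,          # e.g. pages fetched by a bolt (hypothetical name)
        "value": value,
        "component": component,    # which topology component reported it
    }, sort_keys=True)

print(metric_event("fetcher.fetched", 812345, "fetcher-bolt"))
```

Once such events land in Elasticsearch via Logstash, Kibana can chart them in near real time, which is where the throughput figure above comes from.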

Conclusion

•  Storm-crawler has enabled us to build a reliable, distributed, web-scale news monitoring solution
•  Screen grabs are from our R&D environment, in which we've been able to monitor ~2,000 sources with a revisit time of one minute, at 10% utilization on a small cluster
•  We have zero scalability concerns with storm-crawler: upping the number of tasks and nodes has demonstrated the ability to fetch 100,000s of pages per minute
•  Ontopic is committed to open-sourcing our work on top of storm-crawler and to being a core contributor to the project
•  We're working to generalize our integration points with Redis, Kafka, and Logstash, and to provide tutorials so that storm-crawler users can easily leverage these technologies (or their equivalents) in their own projects