oncrawl elasticsearch meetup france #12
Elasticsearch + Oncrawl = <3
A SaaS SEO Monitoring solution by
Presentation by Tanguy Moal (@tuxnco)
Meetup Elasticsearch Paris #12
2015/01/22
22/01/15 Oncrawl · Elasticsearch Meetup France #12 2
[tuxnco@hal]:/opt$ whoami
- age: 0x20
- kids: 0x02
- hobbies:
  - tech founder & cto at cogniteev
  - search, natural language processing, data mining
  - misc.
- history:
  - r&d engineer @ exalead
  - r&d engineer @ jobijoba
Presentation plan
Introduction to Oncrawl
Oncrawl technical overview
hadoop-elasticsearch within Oncrawl
Oncrawl API
Scaling Oncrawl infrastructure with Saltstack.
Conclusion / Questions
Introduction
Oncrawl: SEO Monitoring
- SEO Game has changed:
- Websites are getting bigger, harder to maintain
- Several indicators to monitor
- SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)
Oncrawl: SEO Monitoring
- Analysis performed through crawl reports
- SEO monitoring follows 5 axes:
  - Performance
  - HTML quality
  - Inlinks
  - Outlinks
  - Content
- Interactive Analysis (URL explorer)
- Planned: crawl-over-crawl trend spotting
Oncrawl: Pricing
Oncrawl: technical overview
Oncrawl: application architecture
Boom.
Boom2.
Application scenario
- User has a plan and configured projects
- Plan grants privileges
- Privileges are used to allow project creation and crawl triggering
- Each project may have associated crawls
- Each crawl contains a report
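The scenario above can be sketched as a small object model. This is only an illustration: every class, field and method name below (`max_projects`, `can_trigger_crawls`, `index_name`, the `crawl-<id>` naming, ...) is hypothetical and not taken from Oncrawl's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Report:
    # Hypothetical: where the report data of one crawl lives.
    index_name: str


@dataclass
class Crawl:
    crawl_id: str
    report: Report  # each crawl contains a report


@dataclass
class Project:
    name: str
    start_url: str
    crawls: List[Crawl] = field(default_factory=list)


@dataclass
class Plan:
    # Hypothetical privileges granted by the plan.
    max_projects: int
    can_trigger_crawls: bool = True


@dataclass
class User:
    email: str
    plan: Plan
    projects: List[Project] = field(default_factory=list)

    def create_project(self, name: str, start_url: str) -> Project:
        # The plan decides whether one more project is allowed.
        if len(self.projects) >= self.plan.max_projects:
            raise PermissionError("plan does not allow more projects")
        project = Project(name, start_url)
        self.projects.append(project)
        return project

    def trigger_crawl(self, project: Project, crawl_id: str) -> Crawl:
        # The plan also decides whether crawls may be triggered.
        if not self.plan.can_trigger_crawls:
            raise PermissionError("plan does not allow triggering crawls")
        crawl = Crawl(crawl_id, Report(index_name="crawl-" + crawl_id))
        project.crawls.append(crawl)
        return crawl
```

Keeping the privilege checks on the user's plan means a project or crawl can only come into existence through a call that has consulted the plan first.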
What data are involved in a crawl report?
Links
- Important piece in serious SEO campaigns
- Key fields:
  - origin, origin_domain, origin_depth
  - target, target_domain, target_depth
  - context:
    - position in origin page
    - anchor text
    - wraps significant tags (hn, img, …)
- Use cases:
  - list outlinks (resp. inlinks) of a given page
  - distinguish links used to go up (resp. down) the site’s tree
  - anchor text analysis, …
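A minimal sketch of the first two use cases, built on the key fields listed above. It assumes `origin` and `target` are indexed as exact (not analyzed) strings so that a `term` query matches a full URL; the helper names are hypothetical.

```python
def outlinks_query(page_url):
    """Search body for links whose origin is the given page."""
    return {"query": {"term": {"origin": page_url}}}


def inlinks_query(page_url):
    """Search body for links whose target is the given page."""
    return {"query": {"term": {"target": page_url}}}


def link_direction(link):
    """Classify a link as going up, down, or staying level in the site's tree."""
    if link["target_depth"] < link["origin_depth"]:
        return "up"
    if link["target_depth"] > link["origin_depth"]:
        return "down"
    return "same"
```

For example, a link from a page at depth 2 back to the home page at depth 0 is classified as "up".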
Page model
- The central piece of the puzzle: wraps all metadata relating to a given URL
- Key fields
  - url
  - domain
  - hash
  - fetch
    - date, size, time
    - HTTP headers
    - HTTP status code, or ignored (by robots.txt or by settings)
  - parse
    - title, hn, metas
    - canonical
  - seo
    - depth, popularity, total inlinks
    - outlinks breakdown (internal vs external, follow vs nofollow)
    - word count, text to code ratio, duplicated fields, simhash
- Use cases
  - stats on size/fetch time/status code, by depth or for pages matching any combination of criteria
  - find pages with highest similarity to a given one
  - find pages with duplicated properties (title, hn, …)
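The slides do not say how similarity is computed. One common way to compare simhash fingerprints is the Hamming distance (number of differing bits); the sketch below assumes that approach, assumes the simhash is available as an integer, and ranks pages client-side. It is not necessarily how Oncrawl does it.

```python
def hamming_distance(simhash_a: int, simhash_b: int) -> int:
    """Number of bit positions in which two simhash fingerprints differ."""
    return bin(simhash_a ^ simhash_b).count("1")


def most_similar(reference: dict, pages: list, limit: int = 3) -> list:
    """Rank the other pages by increasing simhash distance to the reference page."""
    others = [p for p in pages if p["url"] != reference["url"]]
    others.sort(key=lambda p: hamming_distance(reference["simhash"], p["simhash"]))
    return others[:limit]
```

A small distance means the two pages have near-identical content, which is what makes simhash useful for spotting near-duplicates.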
Hadoop & Elasticsearch.
Elasticsearch for Hadoop
- references
  - overview http://www.elasticsearch.org/overview/hadoop/
  - online documentation http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
- github
  - repo https://github.com/elasticsearch/elasticsearch-hadoop
  - author https://github.com/costin
- features
  - compatibility
  - simplicity
  - low footprint
  - flexible
Oncrawl: hadoop-elasticsearch
- Apache Nutch (v1.x) uses HDFS (v2.x supports several storages through Apache Gora -- including elasticsearch -- but…)
- Stacked different custom hadoop jobs to compute Oncrawl’s custom attributes (duplicates, …)
- What about Apache Nutch’s ESIndexer?
- hadoop-elasticsearch does the job pretty well
- Relies on job’s configuration:
  - es.resource(.read|.write)?: « index/type » (supports “late” type routing from fields in collected output, e.g. « my_index/{some_field} »)
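To make the « my_index/{some_field} » pattern concrete, here is a plain-Python illustration of what resolving such a resource against one output document amounts to. It only mimics the behaviour described above; it is not the library's code, and `resolve_resource` is a hypothetical helper.

```python
import re

# Matches a {field} placeholder inside a resource pattern.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve_resource(pattern: str, document: dict) -> str:
    """Replace each {field} placeholder with the document's value for that field."""
    def substitute(match):
        name = match.group(1)
        if name not in document:
            raise KeyError("document has no field %r" % name)
        return str(document[name])

    return _PLACEHOLDER.sub(substitute, pattern)
```

With the pattern « my_index/{some_field} », a document whose `some_field` is `page` would be written to `my_index/page`, so one job can feed several types at once.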
Oncrawl: hadoop-elasticsearch
• Reading from elasticsearch
  – job.setInputFormat(EsInputFormat.class);
• Writing to elasticsearch
  – job.setOutputFormat(EsOutputFormat.class);
  – Map<Object, Object> value = new LinkedHashMap<Object, Object>();
  – collector.collect(key, WritableUtils.toWritable(value));
Read \ Write     HDFS      Elasticsearch
HDFS             builtin   yes
Elasticsearch    yes       yes
Elasticsearch & Python
Oncrawl API
• Python / Flask:
  – Lightweight
  – Easy to deploy / mirror
  – Clean syntax
• elasticsearch python client:
  – simple API
  – allows for fine tuning of the client (HTTP connection parameters, …)
• API’s mission: populate the graphs of the application’s reports
Oncrawl API
- Each graph on the app has a dedicated API endpoint
- Each endpoint binds the graph’s semantics to an elasticsearch query and returns JSON data ready for rendering (d3.js, …)
- Example: Summary of page load times
- 4 buckets:
  - perfect (under 500ms)
  - medium (between 500ms and 1000ms)
  - slow (between 1000ms and 2000ms)
  - too slow (beyond 2000ms)
- Expected output by plotting library:
Oncrawl API
- Queries are easy to compose using python
- Write & test it in Marvel
- Integrate in Flask API
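As an illustration of composing such a query in Python, the page load time summary could be expressed as an Elasticsearch range aggregation. This is a sketch under assumptions: the field name `fetch.time`, the aggregation name `load_time` and the output shape (`label` / `count`) are hypothetical, since the transcript does not show the actual query or the format expected by the plotting library.

```python
# Bucket boundaries of the page load time summary, in milliseconds.
# In a range aggregation, "from" is inclusive and "to" is exclusive.
LOAD_TIME_RANGES = [
    {"key": "perfect", "to": 500},
    {"key": "medium", "from": 500, "to": 1000},
    {"key": "slow", "from": 1000, "to": 2000},
    {"key": "too slow", "from": 2000},
]


def load_time_query(field="fetch.time"):
    """Build a search body that only computes the load time buckets."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "load_time": {"range": {"field": field, "ranges": LOAD_TIME_RANGES}}
        },
    }


def to_graph_data(response):
    """Turn the aggregation part of a search response into plot-ready rows."""
    buckets = response["aggregations"]["load_time"]["buckets"]
    return [{"label": b["key"], "count": b["doc_count"]} for b in buckets]
```

The body returned by `load_time_query()` can be pasted as JSON into Marvel for testing, then sent through the Python client's `search` call from the Flask endpoint, which would return `to_graph_data(...)` serialized as JSON.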
Elastic: Scale it
May I have the salt, please?
Oncrawl scalability constraints
- 1 index per crawl
- size of indices? S, M, L, XL
- sharding policy:
  - S: 1 shard
  - M: 3 shards
  - L: 5 shards
  - XL: 10 shards
- Hadoop cluster management
  - Provisioned for a given number of concurrent crawl cycles
  - HDFS grows with total clients
- Elasticsearch cluster management
  - Build: same provision as hadoop cluster
  - Storage / service:
    - provisioned for 3 months of subscription
    - Old indices:
      - close & snapshot
      - reopen on demand
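The sharding policy maps directly onto index creation settings. A minimal sketch, assuming the size class of a crawl is already known (the slides do not give the thresholds) and with a hypothetical default of one replica:

```python
# Shard count per crawl size class, as listed in the sharding policy.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def index_settings(size_class, replicas=1):
    """Index creation body for the per-crawl index of the given size class."""
    try:
        shards = SHARDS_BY_SIZE[size_class]
    except KeyError:
        raise ValueError("unknown size class: %r" % (size_class,))
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
```

Because there is one index per crawl and the number of shards is fixed at creation time, the size class has to be chosen before the index is created.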
Saltstack
• Cluster with members having roles: master vs minions
• Each minion can be fully administered through the master
• Minions ask master for enrollment
• Administrator on master can either accept or decline minions
• Once a minion is accepted, it can be fully operated remotely
Saltstack
• A set of « recipes » defines what states are made of, and how to get there
• Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
• Minions can have their role defined by several means:
  – grains defined on the minion
  – deployment specific rules, defined in « the pillar »
• Within Oncrawl, saltstack is used:
  – To maintain index templates (config/templates/*json)
  – To maintain elasticsearch clusters, nodes and shard allocation (config/settings.yml)
  – To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
• Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
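To illustrate the idea of rendering the variable parts of a configuration file at deployment time, the sketch below uses Python's `string.Template` as a stand-in for jinja (it is not Salt code). The pillar key `cluster_name`, the custom grain `role` and its value `data` are hypothetical; `cluster.name`, `node.name` and `node.data` are settings of an `elasticsearch.yml` of that era.

```python
from string import Template

# Stand-in for a jinja template of elasticsearch.yml; $name marks a variable part.
ES_CONFIG_TEMPLATE = Template(
    "cluster.name: $cluster_name\n"
    "node.name: $node_name\n"
    "node.data: $is_data\n"
)


def render_es_config(pillar, grains):
    """Render one minion's config from deployment-wide and per-minion values."""
    return ES_CONFIG_TEMPLATE.substitute(
        cluster_name=pillar["cluster_name"],              # deployment specific
        node_name=grains["id"],                           # the minion's identifier
        is_data=str(grains["role"] == "data").lower(),    # YAML-style boolean
    )
```

The same template then serves every node: only the pillar and grain values differ between staging and prod, or between providers.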
Thank you!
Follow us:
@tuxnco (me)
@cogniteev (company)
@oncrawl (product)
Part of the gang
Any questions?