oncrawl elasticsearch meetup france #12
Post on 20-Jan-2017
Elasticsearch + Oncrawl = <3
A SaaS SEO Monitoring solution by Cogniteev
Presentation by Tanguy Moal (@tuxnco)
Meetup Elasticsearch Paris #12
2015/01/22
Oncrawl · Elasticsearch Meetup France #12 · 22/01/15
[tuxnco@hal]:/opt$ whoami
- age: 0x20
- kids: 0x02
- hobbies:
  - tech founder & cto at cogniteev
  - search, natural language processing, data mining
  - misc.
- history:
  - r&d engineer @ exalead
  - r&d engineer @ jobijoba
Presentation plan
Introduction to Oncrawl
Oncrawl technical overview
elasticsearch-hadoop within Oncrawl
Oncrawl API
Scaling Oncrawl infrastructure with Saltstack.
Conclusion / Questions
Introduction
Oncrawl: SEO Monitoring
- SEO game has changed:
  - Websites are getting bigger, harder to maintain
  - Several indicators to monitor
  - SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)
Oncrawl: SEO Monitoring
- Analysis performed through crawl reports
- SEO monitoring follows 5 axes:
  - Performance
  - HTML quality
  - Inlinks
  - Outlinks
  - Content
- Interactive analysis (URL explorer)
- Planned: spotting crawl-over-crawl trends
Oncrawl: Pricing
Oncrawl: technical overview
Oncrawl: application architecture
Boom.
Boom2.
Application scenario
- User has a plan and configured projects
- Plan grants privileges
  - Used to allow project creation and triggering of crawls
- Each project may have associated crawls
- Each crawl contains a report
What data are involved in a crawl report?
Links
- Important piece in serious SEO campaigns
- Key fields:
  - origin, origin_domain, origin_depth
  - target, target_domain, target_depth
  - context:
    - position in origin page
    - anchor text
    - wraps significant tags (hn, img, …)
- Use cases:
  - list outlinks (resp. inlinks) of a given page
  - distinguish links used to go up (resp. down) the site’s tree
  - anchor text analysis, …
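The "up vs down" use case can be sketched by comparing the two depth fields of a link. This is only an illustration: the slides do not say how Oncrawl makes the distinction, and the sample documents and the `link_direction` helper are hypothetical.

```python
from collections import Counter


def link_direction(link):
    """Classify a link by comparing the depth of its two endpoints."""
    if link["target_depth"] < link["origin_depth"]:
        return "up"    # target is closer to the site root
    if link["target_depth"] > link["origin_depth"]:
        return "down"  # target is deeper in the site's tree
    return "same"      # both pages sit at the same depth


# Hypothetical link documents using the key fields listed above.
links = [
    {"origin": "http://example.com/a/b", "origin_depth": 2,
     "target": "http://example.com/", "target_depth": 0},
    {"origin": "http://example.com/", "origin_depth": 0,
     "target": "http://example.com/a", "target_depth": 1},
    {"origin": "http://example.com/a", "origin_depth": 1,
     "target": "http://example.com/c", "target_depth": 1},
]

# Count how many links go up, down, or stay at the same level.
direction_counts = Counter(link_direction(link) for link in links)
```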
Page model
- Key fields
  - url
  - domain
  - hash
  - fetch
    - date, size, time
    - HTTP headers
    - HTTP status code | ignored (robots.txt|settings)
  - parse
    - title, hn, metas
    - canonical
  - seo
    - depth, popularity, total inlinks
    - outlinks breakdown (internal vs external, follow vs nofollow)
    - word count, text to code ratio, duplicated fields, simhash
- Use cases
  - stats on size/fetch time/status code, by depth or for pages matching any combination of criteria
  - find pages with highest similarity to a given one
  - find pages with duplicated properties (title, hn, …)
- The central piece of the puzzle. Wraps all metadata relating to a given URL
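One common way to use a simhash for the "highest similarity" use case is to compare fingerprints by Hamming distance. The sketch below assumes fingerprints are stored as integers; the slides do not describe how Oncrawl actually compares them, and the helper names and sample values are made up for illustration.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def most_similar(target, candidates):
    """Return (url, distance) for the candidate fingerprint closest to target."""
    return min(
        ((url, hamming_distance(target, fp)) for url, fp in candidates.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical 8-bit fingerprints, kept short for readability.
pages = {
    "/a": 0b10110100,
    "/b": 0b10110110,
    "/c": 0b01001011,
}

closest = most_similar(0b10110111, pages)
```

A smaller distance means the two pages share more of their content features.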
Hadoop & Elasticsearch.
Elasticsearch for Hadoop - references
- overview: http://www.elasticsearch.org/overview/hadoop/
- online documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
- github
  - repo: https://github.com/elasticsearch/elasticsearch-hadoop
  - author: https://github.com/costin
- features
  - compatibility
  - simplicity
  - low footprint
  - flexible
Oncrawl: elasticsearch-hadoop
- Apache Nutch (v1.x) uses HDFS (v2.x supports several storages through Apache Gora -- including elasticsearch -- but…)
- Stacked different custom hadoop jobs to compute Oncrawl’s custom attributes (duplicates, …)
- What about Apache Nutch’s ESIndexer?
- elasticsearch-hadoop does the job pretty well
- Relies on job’s configuration:
  - es.resource(.read|.write)? : « index/type » (supports “late” type routing from fields in collected output, e.g. « my_index/{some_field} »)
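To make the "late" routing idea concrete, here is a small Python imitation of how a pattern such as « my_index/{some_field} » turns into a concrete « index/type » for each document. It is not the library's code, only a sketch of the behaviour; `resolve_resource` and the sample document are hypothetical.

```python
import re

# Matches a {field} placeholder inside a resource pattern.
FIELD_PATTERN = re.compile(r"\{([^{}]+)\}")


def resolve_resource(pattern, document):
    """Replace each {field} placeholder with the document's value for it."""
    return FIELD_PATTERN.sub(lambda m: str(document[m.group(1)]), pattern)


# Hypothetical output record of a job; "some_field" decides the target type.
doc = {"url": "http://example.com/", "some_field": "page"}

resource = resolve_resource("my_index/{some_field}", doc)
```

With such a pattern, a single job can write pages and links to different types without a second pass.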
Oncrawl: elasticsearch-hadoop
• Reading from elasticsearch
  – job.setInputFormat(EsInputFormat.class);
• Writing to elasticsearch
  – job.setOutputFormat(EsOutputFormat.class);
  – Map<Object, Object> value = new LinkedHashMap<Object, Object>();
  – collector.collect(key, WritableUtils.toWritable(value));
Read \ Write     HDFS      Elasticsearch
HDFS             builtin   yes
Elasticsearch    yes       yes
Elasticsearch & Python
Oncrawl API
• Python / Flask:
  – Lightweight
  – Easy to deploy / mirror
  – Clean syntax
• elasticsearch python client:
  – simple API
  – allows for fine tuning of the client (HTTP connection parameters, …)
• API’s mission: populate the graphs of the application’s reports
Oncrawl API
- Each graph on the app has a dedicated API endpoint
- Binds graph semantics to an elasticsearch query. Returns json data ready for the rendering (d3.js, …)
- Example: summary of page load times
  - 4 buckets:
    - perfect (under 500ms)
    - medium (between 500ms and 1000ms)
    - slow (between 1000ms and 2000ms)
    - too slow (beyond 2000ms)
  - Expected output by plotting library: [figure]
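The four load-time buckets map naturally onto an Elasticsearch range aggregation. The sketch below builds such a request body and flattens a response into label/value pairs. The actual Oncrawl query and output format are not shown in the slides, so the field name `fetch.time`, the aggregation name, the output shape and the sample counts are all assumptions.

```python
# Bucket boundaries in milliseconds, taken from the slide.
LOAD_TIME_RANGES = [
    {"key": "perfect", "to": 500},
    {"key": "medium", "from": 500, "to": 1000},
    {"key": "slow", "from": 1000, "to": 2000},
    {"key": "too slow", "from": 2000},
]


def load_time_query(field="fetch.time"):
    """Build a range aggregation body; the field name is hypothetical."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "load_times": {"range": {"field": field, "ranges": LOAD_TIME_RANGES}}
        },
    }


def to_graph_data(response):
    """Flatten the aggregation response into label/value pairs for plotting."""
    buckets = response["aggregations"]["load_times"]["buckets"]
    return [{"label": b["key"], "value": b["doc_count"]} for b in buckets]


# Hand-written response with invented counts, shaped like a range aggregation result.
sample_response = {"aggregations": {"load_times": {"buckets": [
    {"key": "perfect", "to": 500, "doc_count": 120},
    {"key": "medium", "from": 500, "to": 1000, "doc_count": 45},
    {"key": "slow", "from": 1000, "to": 2000, "doc_count": 12},
    {"key": "too slow", "from": 2000, "doc_count": 3},
]}}}

graph_data = to_graph_data(sample_response)
```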
Oncrawl API
- Queries are easy to compose using python
- Write & test it in Marvel
- Integrate in Flask API
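Since queries are plain dictionaries, composing them in Python can look like the following sketch. The helper names and field names are hypothetical. The `bool`/`filter` form shown here is the syntax of Elasticsearch 2.0 and later; on the 1.x releases available at the time of the talk the same idea would be written as a `filtered` query.

```python
def term_filters(criteria):
    """Turn a {field: value} mapping into a list of term clauses."""
    return [{"term": {field: value}} for field, value in sorted(criteria.items())]


def pages_query(criteria):
    """Combine the clauses into one bool query where every criterion must match."""
    return {"query": {"bool": {"filter": term_filters(criteria)}}}


# Hypothetical criteria: pages at depth 2 that answered with HTTP 200.
query = pages_query({"depth": 2, "status_code": 200})
```

The resulting dictionary can be pasted into Marvel for testing, then passed as the request body by the API endpoint.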
Elastic: Scale it
May I have the salt, please?
Oncrawl scalability constraints
- 1 index per crawl
- size of indices? S-M-L-XL
- sharding policy:
  - S: 1 shard
  - M: 3 shards
  - L: 5 shards
  - XL: 10 shards
- Hadoop cluster management
  - Provisioned for a given number of concurrent crawl cycles
  - HDFS grows with total clients
- Elasticsearch cluster management
  - Build: same provision as hadoop cluster
  - Storage / service:
    - provisioned for 3 months of subscription
    - Old indices:
      - close & snapshot
      - reopen on demand
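The sharding policy can be expressed as a lookup that produces the settings used when creating a crawl's index. This is a minimal sketch: the shard counts come from the slide, while the replica count, the index naming scheme and the helper names are assumptions.

```python
# Shard counts per size class, as listed on the slide.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def index_settings(size_class, replicas=1):
    """Index creation body for a crawl of the given size class."""
    return {
        "settings": {
            "number_of_shards": SHARDS_BY_SIZE[size_class],
            "number_of_replicas": replicas,  # assumption: not stated in the slides
        }
    }


def crawl_index_name(crawl_id):
    """One index per crawl; the naming scheme itself is hypothetical."""
    return "crawl-{}".format(crawl_id)


body = index_settings("L")
name = crawl_index_name(42)
```

Because the number of shards is fixed at index creation, the size class has to be chosen before the crawl is indexed.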
Saltstack
• Cluster with members having roles: master vs minions
• Each minion can be fully administered through the master
• Minions ask master for enrollment
• Administrator on master can either accept or decline minions
• Once a minion is accepted, it can be fully operated remotely
Saltstack
• A set of « recipes » defines what states are made of, and how to get there
• Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
• Minions can have their role defined by several means:
  – grains defined on the minion
  – deployment specific rules, defined in « the pillar »
• Within Oncrawl, saltstack is used:
  – To maintain indices templates (config/templates/*json)
  – To maintain elasticsearch clusters, nodes and shards allocation (config/settings.yml)
  – To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
• Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
Thank you!
Follow us:
@tuxnco (me)
@cogniteev (company)
@oncrawl (product)
Part of the gang
Any questions?