oncrawl elasticsearch meetup france #12

Elasticsearch + Oncrawl = <3 A SaaS SEO Monitoring solution by Presentation by Tanguy Moal @tuxnco Meetup Elasticsearch Paris #12 2015/01/22

Upload: tanguy-moal

Post on 16-Jul-2015


Page 1: Oncrawl elasticsearch meetup france #12

Elasticsearch + Oncrawl =

<3

A SaaS SEO Monitoring solution by

Presentation by Tanguy Moal @tuxnco

Meetup Elasticsearch Paris #12

2015/01/22

Page 2: Oncrawl elasticsearch meetup france #12

22/01/15 Oncrawl · Elasticsearch Meetup France #12 2

[tuxnco@hal]:/opt$ whoami

- age: 0x20
- kids: 0x02
- hobbies:
  - tech founder & cto at cogniteev
  - search, natural language processing, datamining
  - misc.
- history:
  - r&d engineer @ exalead
  - r&d engineer @ jobijoba

Page 3: Oncrawl elasticsearch meetup france #12


Presentation plan

Introduction to Oncrawl

Oncrawl technical overview

hadoop-elasticsearch within Oncrawl

Oncrawl API

Scaling Oncrawl infrastructure with Saltstack.

Conclusion / Questions

Page 4: Oncrawl elasticsearch meetup france #12

Introduction

Page 5: Oncrawl elasticsearch meetup france #12


Oncrawl: SEO Monitoring

- SEO Game has changed:

- Websites are getting bigger, harder to maintain

- Several indicators to monitor

- SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)

Page 6: Oncrawl elasticsearch meetup france #12


Oncrawl: SEO Monitoring

- Analysis performed through crawl reports

- SEO monitoring follows 5 axes:
  - Performance
  - HTML quality
  - Inlinks
  - Outlinks
  - Content

- Interactive Analysis (URL explorer)

- Planned: crawl-over-crawl trend spotting

Page 7: Oncrawl elasticsearch meetup france #12


Oncrawl: Pricing

Page 8: Oncrawl elasticsearch meetup france #12

Oncrawl: technical overview

Page 9: Oncrawl elasticsearch meetup france #12

Oncrawl: application architecture


Page 10: Oncrawl elasticsearch meetup france #12


Boom.

Boom2.

Page 11: Oncrawl elasticsearch meetup france #12

Application scenario

- User has a plan and configured projects

- Plan grants privileges
  - Used to allow project creation and triggering of crawls

- Each project may have associated crawls

- Each crawl contains a report

What data are involved in a crawl report?


Page 12: Oncrawl elasticsearch meetup france #12

Links


- Important piece in serious SEO campaigns
- Key fields:
  - origin, origin_domain, origin_depth
  - target, target_domain, target_depth
  - context:
    - position in origin page
    - anchor text
    - wraps significant tags (hn, img, …)
- Use cases:
  - list outlinks (resp. inlinks) of a given page
  - distinguish links used to go up (resp. down) the site’s tree
  - anchor text analysis, …
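
The first two use cases can be sketched as search bodies built in Python. This is only an illustration: it assumes `origin` and `target` are indexed as exact (non-analyzed) values so that a `term` query matches a full URL, and the function names are hypothetical. The depth comparison assumes that a smaller depth means closer to the site root.

```python
from typing import Any, Dict


def outlinks_query(page_url: str, size: int = 100) -> Dict[str, Any]:
    """Search body listing the links that start from the given page."""
    return {"size": size, "query": {"term": {"origin": page_url}}}


def inlinks_query(page_url: str, size: int = 100) -> Dict[str, Any]:
    """Search body listing the links that point to the given page."""
    return {"size": size, "query": {"term": {"target": page_url}}}


def link_direction(link: Dict[str, Any]) -> str:
    """Classify a link document by comparing origin and target depths.

    Assumption: depth 0 is the site root, so a link whose target is
    shallower than its origin goes "up" the site's tree.
    """
    delta = link["target_depth"] - link["origin_depth"]
    if delta < 0:
        return "up"
    if delta > 0:
        return "down"
    return "level"
```

Either body could be passed to the search call of an elasticsearch client against the index holding the link documents.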

Page 13: Oncrawl elasticsearch meetup france #12

Page model


- Key fields
  - url
  - domain
  - hash
  - fetch
    - date, size, time
    - HTTP headers
    - HTTP status code | ignored (robots.txt|settings)
  - parse
    - title, hn, metas
    - canonical
  - seo
    - depth, popularity, total inlinks
    - outlinks breakdown (internal vs external, follow vs nofollow)
    - word count, text to code ratio, duplicated fields, simhash
- Use cases
  - stats on size/fetch time/status code, by depth or for pages having any combination of criteria
  - find pages with highest similarity to a given one
  - find pages with duplicated properties (title, hn, …)
- The central piece of the puzzle. Wraps all metadata relating to a given URL
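
The "highest similarity" use case relies on the simhash field. A minimal in-memory sketch of the idea, assuming the simhash is stored as an integer fingerprint and that similarity is measured by Hamming distance (fewer differing bits means more similar). The slides do not say how Oncrawl runs this lookup, so the function names and the ranking approach are illustrative only.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def most_similar(reference: int, candidates: Dict[str, int], top: int = 3) -> List[str]:
    """Return the URLs whose simhash is closest to the reference fingerprint.

    `candidates` maps a URL to its simhash. Ties are broken by URL so the
    result is deterministic.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: (hamming_distance(reference, item[1]), item[0]),
    )
    return [url for url, _ in ranked[:top]]
```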

Page 14: Oncrawl elasticsearch meetup france #12

Hadoop & Elasticsearch.

Page 15: Oncrawl elasticsearch meetup france #12

Elasticsearch for Hadoop

- references
  - overview: http://www.elasticsearch.org/overview/hadoop/
  - online documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
- github
  - repo: https://github.com/elasticsearch/elasticsearch-hadoop
  - author: https://github.com/costin
- features
  - compatibility
  - simplicity
  - low footprint
  - flexible


Page 16: Oncrawl elasticsearch meetup france #12

Oncrawl: hadoop-elasticsearch

- Apache Nutch (v1.x) uses HDFS (v2.x supports several storages through Apache Gora, including elasticsearch, but…)
- Stacked different custom hadoop jobs to compute Oncrawl’s custom attributes (duplicates, …)
- What about Apache Nutch’s ESIndexer?
- hadoop-elasticsearch does the job pretty well
- Relies on job’s configuration:
  - es.resource(.read|.write)?: « index/type » (supports “late” type routing from fields in collected output, e.g. « my_index/{some_field} »)
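
To illustrate what “late” routing means, here is a small Python sketch of how a pattern such as « my_index/{some_field} » resolves against each collected document. It only mimics the behaviour described above; it is not the connector's code, and `resolve_resource` is a hypothetical helper.

```python
import re
from typing import Any, Dict

# Matches a {field_name} placeholder inside a resource pattern.
_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve_resource(pattern: str, document: Dict[str, Any]) -> str:
    """Replace each {field} placeholder with the document's value for that field."""

    def substitute(match: "re.Match[str]") -> str:
        field = match.group(1)
        if field not in document:
            raise KeyError(f"document has no field {field!r} for pattern {pattern!r}")
        return str(document[field])

    return _PLACEHOLDER.sub(substitute, pattern)
```

With this reading, two documents written by the same job can land in different types of the same index, depending on the value each one carries in `some_field`.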


Page 17: Oncrawl elasticsearch meetup france #12

Oncrawl: hadoop-elasticsearch

• Reading from elasticsearch
  – job.setInputFormat(EsInputFormat.class);
• Writing to elasticsearch
  – job.setOutputFormat(EsOutputFormat.class);
  – Map<Object, Object> value = new LinkedHashMap<Object, Object>();
  – collector.collect(key, WritableUtils.toWritable(value));


Read \ Write     HDFS      Elasticsearch
HDFS             builtin   yes
Elasticsearch    yes       yes

Page 18: Oncrawl elasticsearch meetup france #12

Elasticsearch & Python

Page 19: Oncrawl elasticsearch meetup france #12

Oncrawl API

• Python / Flask:
  – Lightweight
  – Easy to deploy / mirror
  – Clean syntax
• elasticsearch python client:
  – simple API
  – allows for fine tuning of the client (HTTP connection parameters, …)
• API’s mission: populate the application’s report graphs


Page 20: Oncrawl elasticsearch meetup france #12

Oncrawl API

- Each graph on the app has a dedicated API endpoint
- Binds graph semantics to an elasticsearch query. Returns json data ready for rendering (d3.js, …)
- Example: Summary of page load times
  - 4 buckets:
    - perfect (under 500ms)
    - medium (between 500ms and 1000ms)
    - slow (between 1000ms and 2000ms)
    - too slow (beyond 2000ms)

- Expected output by plotting library:
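
A minimal sketch of how the four buckets could map to an elasticsearch range aggregation, together with one possible output shape for a plotting library. The field name `fetch_time_ms`, the aggregation name `load_time` and the label/value output format are assumptions for illustration; they are not taken from the Oncrawl code base.

```python
from typing import Any, Dict, List


def load_time_summary_query(field: str = "fetch_time_ms") -> Dict[str, Any]:
    """Search body counting pages per load-time bucket (no hits returned).

    In a range aggregation, "from" is inclusive and "to" is exclusive.
    """
    return {
        "size": 0,
        "aggs": {
            "load_time": {
                "range": {
                    "field": field,
                    "ranges": [
                        {"key": "perfect", "to": 500},
                        {"key": "medium", "from": 500, "to": 1000},
                        {"key": "slow", "from": 1000, "to": 2000},
                        {"key": "too slow", "from": 2000},
                    ],
                }
            }
        },
    }


def to_plot_data(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the aggregation response into label/value pairs (hypothetical format)."""
    buckets = response["aggregations"]["load_time"]["buckets"]
    return [{"label": b["key"], "value": b["doc_count"]} for b in buckets]
```

An endpoint would send the body to the crawl's index and return the result of `to_plot_data` as json.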

Page 21: Oncrawl elasticsearch meetup france #12

Oncrawl API

- Queries are easy to compose using python
- Write & test them in Marvel
- Integrate in Flask API
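
Since queries are plain dictionaries, composing them in Python amounts to combining small building blocks. The sketch below is one way to do it, under assumptions: the helper names and the `depth` and `status_code` field names are hypothetical, and the `bool`/`filter` form shown is the syntax of more recent Elasticsearch versions, which may differ from what was in use at the time of the talk.

```python
from typing import Any, Dict, Iterable


def depth_filter(depth: int) -> Dict[str, Any]:
    """Restrict results to pages at a given depth (hypothetical field name)."""
    return {"term": {"depth": depth}}


def status_filter(code: int) -> Dict[str, Any]:
    """Restrict results to pages fetched with a given HTTP status code."""
    return {"term": {"status_code": code}}


def compose(aggs: Dict[str, Any], filters: Iterable[Dict[str, Any]] = ()) -> Dict[str, Any]:
    """Build a search body that runs `aggs` over the pages matching all filters."""
    body: Dict[str, Any] = {"size": 0, "aggs": aggs}
    filter_list = list(filters)
    if filter_list:
        body["query"] = {"bool": {"filter": filter_list}}
    return body
```

A Flask view would then only need to read its parameters, call `compose`, run the search and return the json.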


Page 22: Oncrawl elasticsearch meetup france #12

Elastic: Scale it

May I have the salt, please ?

Page 23: Oncrawl elasticsearch meetup france #12

Oncrawl scalability constraints

- 1 index per crawl
- size of indices? S-M-L-XL
- sharding policy:
  - S: 1 shard
  - M: 3 shards
  - L: 5 shards
  - XL: 10 shards
- Hadoop cluster management
  - Provisioned for a given number of concurrent crawl cycles
  - HDFS grows with total clients
- Elasticsearch cluster management
  - Build: same provision as hadoop cluster
  - Storage / service:
    - provisioned for 3 months of subscription
    - Old indices:
      - close & snapshot
      - reopen on demand
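
The sharding policy can be expressed as a lookup used when creating the index for a new crawl. A minimal sketch: the shard counts come from the list above, while the replica count, the index naming scheme and the function names are assumptions. The slides do not give the thresholds that separate S, M, L and XL, so the size class is taken as an input.

```python
from typing import Any, Dict

# Shard count per index size class, as listed in the sharding policy.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def crawl_index_name(crawl_id: str) -> str:
    """One index per crawl (hypothetical naming scheme)."""
    return f"crawl-{crawl_id}"


def index_settings(size_class: str, replicas: int = 1) -> Dict[str, Any]:
    """Index creation body for a crawl of the given size class.

    The default replica count is an assumption, not taken from the slides.
    """
    if size_class not in SHARDS_BY_SIZE:
        raise ValueError(f"unknown size class: {size_class!r}")
    return {
        "settings": {
            "number_of_shards": SHARDS_BY_SIZE[size_class],
            "number_of_replicas": replicas,
        }
    }
```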


Page 24: Oncrawl elasticsearch meetup france #12

Saltstack

• Cluster with members having roles: master vs minions
• Each minion can be fully administered through the master
• Minions ask master for enrollment
• Administrator on master can either accept or decline minions
• Once a minion is accepted, it can be fully operated remotely


Page 25: Oncrawl elasticsearch meetup france #12

Saltstack

• A set of « recipes » defines what states are made of, and how to get there
• Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
• Minions can have their role defined by several means:
  – grains defined on the minion
  – deployment specific rules, defined in « the pillar »
• Within Oncrawl, saltstack is used:
  – To maintain index templates (config/templates/*json)
  – To maintain elasticsearch clusters, nodes and shard allocation (config/settings.yml)
  – To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
• Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
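
The idea of rendering the variable parts of a configuration file at deployment time can be illustrated in a few lines of Python. This sketch deliberately uses the standard library's `string.Template` as a stand-in for jinja, and the pillar key `cluster_name` is hypothetical. `cluster.name` and `node.name` are regular elasticsearch settings, and `id` is the grain holding the minion's identifier.

```python
from string import Template
from typing import Any, Dict

# Fragment of an elasticsearch configuration file with two variable parts.
ELASTICSEARCH_YML = Template(
    "cluster.name: $cluster_name\n"
    "node.name: $node_name\n"
)


def render_config(pillar: Dict[str, Any], grains: Dict[str, Any]) -> str:
    """Fill the template from deployment-wide data (pillar) and per-minion data (grains)."""
    return ELASTICSEARCH_YML.substitute(
        cluster_name=pillar["cluster_name"],
        node_name=grains["id"],
    )
```

The same template then serves every node: only the data handed to it changes from one minion to the next.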


Page 26: Oncrawl elasticsearch meetup france #12

Thank you!

Follow us:
- @tuxnco (me)
- @cogniteev (company)
- @oncrawl (product)

Part of the gang

Any questions?