Elasticsearch for Data Analytics
Introduction, examples and tips

Felipe Almeida
http://queirozf.com

Rio de Janeiro Elastic Meetup, November 2016


Post on 16-Apr-2017


Page 2: Elasticsearch for Data Analytics

Structure

● Introduction
● Aggregations
● Mappings
● General tips

● Note: all examples are based on Elasticsearch version 2.x

Introduction

● Elasticsearch was born to be a cluster-first search engine, built on top of Apache Lucene
  ○ Unlike Solr, which had cluster capabilities added only in later versions

● It’s generally used as an index for another database
  ○ I.e. the actual data is stored somewhere else; the index only points to it

● Although its main use is as a search engine for textual documents, it is also very useful for storing and querying other types of documents

● For example, Elasticsearch is a good place to store
  ○ arbitrary key-value dictionaries (JSON objects)
  ○ time series data

● You can get very interesting information by running aggregations on such data

Aggregations

● The idea is that you obtain aggregate information about your data

● Elasticsearch Aggregations are somewhat similar to GROUP BY clauses in regular SQL

    SQL         ELASTICSEARCH
    select      query
    group by    aggregations
    rows        JSON objects

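● An Elasticsearch search request usually has two main parts, the query and the aggregations:

    {
      "query": {
        // matchers and filters: select which documents are considered
      },
      "aggregations": {
        // aggregations: computed over the documents the query matched
      }
    }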

We’ll use the following sample database for the examples:

    { "name": "john", "city": "ny",     "age": 40 }
    { "name": "john", "city": "sf",     "age": 45 }
    { "name": "mary", "city": "ny",     "age": 22 }
    { "name": "pam",  "city": "dc",     "age": 41 }
    { "name": "mary", "city": "london", "age": 20 }
    { "name": "pete", "city": "ny",     "age": 31 }

Aggregations - terms

● One of the most useful aggregations is the terms aggregation

● It tells you how many entries there are for each value a given attribute can take.

● For questions like:
  ○ How many documents are there per city?

● Query:

    {
      "query": {
        "match_all": {}
      },
      "aggregations": {
        "per_city": {
          "terms": {
            "field": "city"
          }
        }
      }
    }

● Results:

    [
      { "key": "ny",     "doc_count": 3 },
      { "key": "dc",     "doc_count": 1 },
      { "key": "london", "doc_count": 1 },
      { "key": "sf",     "doc_count": 1 }
    ]
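To make the semantics concrete, here is a minimal pure-Python sketch of what the terms aggregation computes over the sample database. It does not talk to Elasticsearch; the function name terms_agg is made up for illustration, and the tie-breaking by key is an assumption chosen to mirror the result listing above.

```python
from collections import Counter

# Sample documents copied from the slides.
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_agg(docs, field):
    """Count documents per distinct value of `field` (hypothetical helper)."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Largest buckets first; ties are broken by key (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


per_city = terms_agg(DOCS, "city")
print(per_city)
```

In SQL terms this is roughly SELECT city, COUNT(*) ... GROUP BY city.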

Aggregations - min, max, avg, sum

● These aggregations calculate statistics for numeric fields

● For questions like:
  ○ What is the maximum age over all query results?

● Query:

    {
      "query": {
        "match_all": {}
      },
      "aggregations": {
        "max_age": {
          "max": {
            "field": "age"
          }
        }
      }
    }

● Result:

    {
      "max_age": {
        "value": 45
      }
    }
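As a rough cross-check, the four metric aggregations can be reproduced in plain Python over the sample database. This is only a sketch of the arithmetic, not the Elasticsearch implementation; metric_aggs is a hypothetical helper and it assumes at least one document carries the field.

```python
# Sample documents copied from the slides.
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def metric_aggs(docs, field):
    """Return min, max, sum and avg of a numeric field (hypothetical helper)."""
    values = [doc[field] for doc in docs if field in doc]
    if not values:
        raise ValueError("no document has field %r" % field)
    total = sum(values)
    return {
        "min": min(values),
        "max": max(values),
        "sum": total,
        "avg": total / len(values),
    }


age_stats = metric_aggs(DOCS, "age")
print(age_stats["max"])
```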

Aggregations - cardinality

● This aggregation calculates the (approximate) number of distinct values for a given attribute.

● For questions like:
  ○ How many different cities are there in the database?

● Query:

    {
      "query": {
        "match_all": {}
      },
      "aggregations": {
        "cities": {
          "cardinality": {
            "field": "city"
          }
        }
      }
    }

● Result:

    {
      "cities": {
        "value": 4
      }
    }
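For a data set this small the distinct count can be checked exactly with a set. The sketch below is pure Python with a hypothetical helper name; Elasticsearch itself uses an approximate algorithm, so on large data its answer may differ slightly from an exact count like this one.

```python
# City values of the six sample documents, in slide order.
CITIES = ["ny", "sf", "ny", "dc", "london", "ny"]


def exact_cardinality(values):
    """Exact number of distinct values (hypothetical helper)."""
    return len(set(values))


distinct_cities = {"value": exact_cardinality(CITIES)}
print(distinct_cities)
```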

Aggregations - histogram

● This aggregation gives you information about the distribution of values for a given numeric attribute

● For questions like:
  ○ How are people’s ages distributed?
  ○ How many are in their teens, how many are in their twenties, and so on?

● Query:

    {
      "query": {
        "match_all": {}
      },
      "aggregations": {
        "distrib": {
          "histogram": {
            "field": "age",
            "interval": 10
          }
        }
      }
    }

● Result:

    [
      { "key": 20, "doc_count": 2 },
      { "key": 30, "doc_count": 1 },
      { "key": 40, "doc_count": 3 }
    ]
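The bucketing rule behind this result is simple: each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. A minimal Python sketch over the sample ages (histogram_agg is a made-up name, and empty buckets are simply not emitted here):

```python
from collections import Counter

# Ages of the six sample documents, in slide order.
AGES = [40, 45, 22, 41, 20, 31]


def histogram_agg(values, interval):
    """Group numeric values into fixed-width buckets (hypothetical helper)."""
    # Floor division rounds each value down to the start of its bucket.
    counts = Counter((value // interval) * interval for value in values)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


distrib = histogram_agg(AGES, 10)
print(distrib)
```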

Aggregations - date_histogram

● This aggregation is similar to the previous one (histogram), but it works on date fields, and you can specify intervals and bounds using date macros

● This is one of the most useful aggregations if your data is a time series
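The sample database has no date field, so the sketch below uses a few invented events (the "ts" field and its values are assumptions) to show the idea: each timestamp is truncated to the start of its hour or day, and documents are counted per truncated value. It is plain Python, not the Elasticsearch implementation.

```python
from collections import Counter
from datetime import datetime

# Hypothetical time series; not part of the sample database.
EVENTS = [
    {"ts": "2016-11-01T09:15:00"},
    {"ts": "2016-11-01T09:47:00"},
    {"ts": "2016-11-01T10:05:00"},
    {"ts": "2016-11-02T09:30:00"},
]


def date_histogram_agg(docs, field, interval):
    """Count documents per hour or per day (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        moment = datetime.fromisoformat(doc[field])
        if interval == "hour":
            bucket = moment.replace(minute=0, second=0, microsecond=0)
        elif interval == "day":
            bucket = moment.replace(hour=0, minute=0, second=0, microsecond=0)
        else:
            raise ValueError("unsupported interval: %r" % interval)
        counts[bucket] += 1
    return [
        {"key_as_string": bucket.isoformat(), "doc_count": counts[bucket]}
        for bucket in sorted(counts)
    ]


per_day = date_histogram_agg(EVENTS, "ts", "day")
per_hour = date_histogram_agg(EVENTS, "ts", "hour")
print(per_day)
```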

Nested aggregations

● Some aggregations allow extra aggregations to be performed on their results.

● For instance, you can run a histogram aggregation inside each bucket produced by a terms aggregation

● For questions like:
  ○ For each city, how many people are there in each age group?

● Query:

    {
      "query": {
        "match_all": {}
      },
      "aggregations": {
        "cities": {
          "terms": {
            "field": "city"
          },
          "aggregations": {
            "distrib": {
              "histogram": {
                "field": "age",
                "interval": 10
              }
            }
          }
        }
      }
    }

● Result (partial):

    [
      {
        "key": "ny",
        "doc_count": 3,
        "distrib": {
          "buckets": [
            { "key": 20, "doc_count": 1 },
            { "key": 30, "doc_count": 1 },
            { "key": 40, "doc_count": 1 }
          ]
        }
      },
      {
        "key": "dc",
        "doc_count": 1,
        "distrib": {
          "buckets": [
            { "key": 40, "doc_count": 1 }
          ]
        }
      },
      ...
    ]
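Conceptually, a nested aggregation first splits the documents into the outer buckets and then runs the inner aggregation separately inside each one. A pure-Python sketch of the terms-then-histogram case over the sample database (terms_then_histogram is a hypothetical helper; the bucket ordering is an assumption chosen to match the listing above):

```python
from collections import Counter, defaultdict

# Sample documents copied from the slides.
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_then_histogram(docs, term_field, num_field, interval):
    """Histogram of `num_field` inside each `term_field` bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[term_field]].append(doc)
    # Outer buckets: largest first, ties broken by key (an assumption).
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    result = []
    for key, members in ordered:
        counts = Counter((d[num_field] // interval) * interval for d in members)
        inner = [{"key": k, "doc_count": counts[k]} for k in sorted(counts)]
        result.append(
            {"key": key, "doc_count": len(members), "distrib": {"buckets": inner}}
        )
    return result


cities = terms_then_histogram(DOCS, "city", "age", 10)
print(cities[0])
```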

Mappings

● Elasticsearch is schemaless by default, which means you can add fields to your documents as you wish

● However, some aggregations behave differently depending on how attributes are indexed. In particular:
  ○ not_analyzed strings
  ○ dates

● So you may want to control how your documents are indexed using appropriate mappings
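As an illustration, an explicit mapping for the sample documents might look like the body below, written here as a Python dict in Elasticsearch 2.x syntax. The type name "person" and the "created_at" date field are assumptions added for the example; treat it as a sketch to adapt, not a recommended schema.

```python
import json

# Hypothetical mapping for a "person" type (Elasticsearch 2.x syntax).
# "created_at" is an assumed field; the sample documents do not have it.
mapping_body = {
    "mappings": {
        "person": {
            "properties": {
                # Keep whole values so terms aggregations see "ny", not tokens.
                "name": {"type": "string", "index": "not_analyzed"},
                "city": {"type": "string", "index": "not_analyzed"},
                "age": {"type": "integer"},
                # Map explicitly as a date so date_histogram can be used.
                "created_at": {"type": "date"},
            }
        }
    }
}

# Serialized request body, e.g. for use when creating the index.
mapping_json = json.dumps(mapping_body, indent=2)
print(mapping_json)
```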

Dynamic Templates

● Dynamic templates let you predefine the mappings that new attributes or indices will be stored with.

● They can help you if:
  ○ You want to disable the analyzer for every new string field you add to your documents
  ○ You want to index numeric attributes whose names end in "timestamp" as dates
  ○ You want to refuse extra attributes (i.e. force a hard schema)
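The three cases above could be sketched as the following request bodies, written as Python dicts in what is believed to be Elasticsearch 2.x syntax. The template names, the "person" type and the choice of the _default_ mapping are assumptions for illustration; check them against the documentation for your version before use.

```python
import json

# Cases 1 and 2: dynamic templates applied to newly seen fields.
dynamic_template_body = {
    "mappings": {
        "_default_": {
            "dynamic_templates": [
                {
                    # Every new string field is stored without analysis.
                    "strings_not_analyzed": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                },
                {
                    # Integer fields named "...timestamp" become dates.
                    "timestamps_as_dates": {
                        "match": "*timestamp",
                        "match_mapping_type": "long",
                        "mapping": {"type": "date"},
                    }
                },
            ]
        }
    }
}

# Case 3: reject documents that contain fields missing from the mapping.
strict_body = {
    "mappings": {
        "person": {
            "dynamic": "strict",
            "properties": {"name": {"type": "string", "index": "not_analyzed"}},
        }
    }
}

print(json.dumps(dynamic_template_body, indent=2))
```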

General tips

● All text fields are analyzed (split into tokens) by default.
  ○ To aggregate on whole string values, you need to change their mapping to not_analyzed

● Use filters whenever you can. Filters and aggregation results are aggressively cached by Elasticsearch

● Any filters in the query area also affect the output of aggregations
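Why the analyzer matters for the first tip: a terms aggregation counts indexed tokens, not original strings. The sketch below imitates that with invented city names, using lowercase-and-split-on-whitespace as a crude stand-in for the default analyzer (an assumption; the real analyzer does more).

```python
from collections import Counter

# Hypothetical multi-word values; the sample database only has single words.
CITY_VALUES = ["New York", "New Jersey", "New York"]


def analyzed_terms(values):
    """Document counts per token, as seen on an analyzed field (sketch)."""
    counts = Counter()
    for value in values:
        # A document counts once per distinct token it contains.
        counts.update(set(value.lower().split()))
    return dict(counts)


def not_analyzed_terms(values):
    """Document counts per whole value, as seen on a not_analyzed field."""
    return dict(Counter(values))


print(analyzed_terms(CITY_VALUES))
print(not_analyzed_terms(CITY_VALUES))
```

With the analyzed field, "new" becomes the biggest bucket, which is rarely what a per-city report wants.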

● Always use bulk inserting rather than individual inserts to save bandwidth

● Use TTL (time to live) to expire documents that don’t need to be kept for long
  ○ Changes in the TTL settings only affect new documents. Documents that are already indexed are not affected.
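For the bulk tip above, the request body is newline-delimited JSON: one action line followed by one document line per insert, with a trailing newline at the end. A minimal sketch that only builds the body (bulk_body is a hypothetical helper, and the index and type names "people" and "person" are assumptions):

```python
import json

# First two sample documents from the slides.
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
]


def bulk_body(docs, index, doc_type):
    """Build a newline-delimited body for the bulk endpoint (sketch)."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index and type.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body(DOCS, "people", "person")
print(body)
```

One such request replaces as many HTTP round trips as there are documents.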

● By default, ranges and histograms do not return buckets with zero documents. Use min_doc_count and, optionally, extended_bounds to include them.

● Dynamic templates can be created at the index level (for attributes that match some criteria) or at the cluster level (for indices that match some criteria)

● In general, date-based aggregations work like the others, but they accept date macros (e.g. now-1h) when defining aggregation options

● You can reference nested attributes when defining an aggregation

● Scripts are useful, but many hosted Elasticsearch setups (e.g. AWS) do not support them due to security concerns

● Use "size":0 to suppress regular query results and return only aggregation results, thus saving processing and bandwidth

● The value_count aggregation does not count unique values for the attributes! The cardinality aggregation does that!

● By default, the terms aggregation does not return all possible values for the selected field. Tune the size attribute to control the result size (use 0 to bring all results)
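The difference between value_count and cardinality is easy to see on the name field of the sample database. This is a plain-Python sketch with hypothetical helper names, and the cardinality here is exact, whereas Elasticsearch computes an approximation.

```python
# Names of the six sample documents, in slide order.
NAMES = ["john", "john", "mary", "pam", "mary", "pete"]


def value_count(values):
    """How many values were seen, duplicates included."""
    return len(values)


def cardinality(values):
    """How many distinct values were seen."""
    return len(set(values))


print(value_count(NAMES), cardinality(NAMES))
```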
