Elasticsearch for Data Analytics

Felipe Almeida http://queirozf.com

Introduction, examples and tips

Rio de Janeiro Elastic Meetup, November 2016


http://www.meetup.com/Rio-de-Janeiro-Elastic-Fantastics/events/234498642/




Structure

● Introduction
● Aggregations
● Mappings
● General tips

● Note: all examples are based on Elasticsearch version 2.x


Introduction

● Elasticsearch was born to be a cluster-first search engine, built on top of Apache Lucene

  ○ Unlike Solr, which only had cluster capabilities added in later versions

● It’s generally used as an index for another database
  ○ I.e. the actual data is stored somewhere else; the index only points to it

● Although its main use is as a search engine for textual documents, it is also very useful for storing and querying other types of documents

● For example, Elasticsearch is a good place to store:
  ○ arbitrary key-value dictionaries (JSON objects)
  ○ time series data

● You can get very interesting information by running aggregations on such data


Aggregations

● The idea is that you obtain aggregate information about your data



● Elasticsearch Aggregations are somewhat similar to GROUP BY clauses in regular SQL

SQL         ELASTICSEARCH
select      query
group by    aggregations
rows        JSON objects

● An Elasticsearch search request typically has two parts, a query and a set of aggregations:

{
  "query": {
    // matchers and filters
  },
  "aggregations": {
    // aggregations
  }
}


We’ll use the following sample database for the examples:


{ "name": "john", "city": "ny", "age": 40 }
{ "name": "john", "city": "sf", "age": 45 }
{ "name": "mary", "city": "ny", "age": 22 }
{ "name": "pam", "city": "dc", "age": 41 }
{ "name": "mary", "city": "london", "age": 20 }
{ "name": "pete", "city": "ny", "age": 31 }

Aggregations - terms

● One of the most useful aggregations is the terms aggregation

● It tells you how many entries there are for each value a given attribute can take.
● For questions like:
  ○ How many documents are there per city?


● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "per_city": {
      "terms": {
        "field": "city"
      }
    }
  }
}

● Results:

[
  { "key": "ny", "doc_count": 3 },
  { "key": "dc", "doc_count": 1 },
  { "key": "london", "doc_count": 1 },
  { "key": "sf", "doc_count": 1 }
]
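To make the semantics of the terms aggregation concrete, the counts above can be reproduced in plain Python. This is only a sketch of what the aggregation reports for the sample documents, not how Elasticsearch computes it; the helper name terms_agg is made up for this illustration.

```python
from collections import Counter

# The sample database used throughout the examples.
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_agg(docs, field):
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Sort by descending count, then by key, so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered]


for bucket in terms_agg(DOCS, "city"):
    print(bucket)
```

Each returned dictionary plays the role of one bucket in the Elasticsearch response.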

Aggregations - min, max, avg, sum

● These aggregations calculate statistics for numeric fields

● For questions like:
  ○ What is the maximum age over all query results?


● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "max_age": {
      "max": {
        "field": "age"
      }
    }
  }
}

● Result:

{
  "max_age": {
    "value": 45
  }
}
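As a rough cross-check, the four metric aggregations can be emulated over the sample documents in plain Python. The function name metric_aggs is an assumption for this sketch; in Elasticsearch each metric is requested as its own aggregation (min, max, avg or sum).

```python
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def metric_aggs(docs, field):
    """Return min, max, sum and avg of a numeric field, or None if no values."""
    values = [doc[field] for doc in docs if field in doc]
    if not values:
        return None
    total = sum(values)
    return {
        "min": min(values),
        "max": max(values),
        "sum": total,
        "avg": total / len(values),
    }


stats = metric_aggs(DOCS, "age")
print(stats["max"])  # 45, matching the max_age result above
```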

Aggregations - cardinality

● This aggregation calculates the number of distinct values for a given attribute.

● For questions like:
  ○ How many different cities are there in the database?


● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "cities": {
      "cardinality": {
        "field": "city"
      }
    }
  }
}

● Result:

{
  "cities": {
    "value": 4
  }
}
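For a small dataset, the distinct count can be emulated with a Python set. Note the difference: this sketch is exact, whereas the Elasticsearch cardinality aggregation is approximate on high-cardinality fields. The helper name cardinality_agg is made up for the illustration.

```python
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def cardinality_agg(docs, field):
    """Exact number of distinct values of `field` across the documents."""
    distinct = {doc[field] for doc in docs if field in doc}
    return {"value": len(distinct)}


print(cardinality_agg(DOCS, "city"))  # {'value': 4}
```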

Aggregations - histogram

● This aggregation gives you information about the distribution of values for a given numeric attribute

● For questions like:
  ○ How are people’s ages distributed?
  ○ How many are in their teens, how many are in their twenties, and so on?


● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "distrib": {
      "histogram": {
        "field": "age",
        "interval": 10
      }
    }
  }
}

● Result:

[
  { "key": 20, "doc_count": 2 },
  { "key": 30, "doc_count": 1 },
  { "key": 40, "doc_count": 3 }
]
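The bucketing rule behind these numbers is that each value is rounded down to the nearest multiple of the interval. A minimal Python sketch of that rule follows; histogram_agg is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter

DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def histogram_agg(docs, field, interval):
    """Group values into fixed-width buckets keyed by the bucket's lower bound."""
    counts = Counter(
        (doc[field] // interval) * interval for doc in docs if field in doc
    )
    # Only non-empty buckets are returned, in ascending key order.
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


print(histogram_agg(DOCS, "age", 10))
```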

Aggregations - date_histogram

● This aggregation is similar to the previous one (histogram), but you can specify intervals and bounds using date macros, for date fields


● This is one of the most useful aggregations if your data follow a time series
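The sample database has no date field, so the sketch below assumes a hypothetical created_at field and three made-up events. It shows a date_histogram request body in 2.x syntax as a Python dictionary, plus a plain-Python emulation of per-day bucketing. The per_day helper and its output shape are illustrative only and do not mirror the exact Elasticsearch response format.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical events; "created_at" is an assumed field name.
EVENTS = [
    {"created_at": "2016-11-01T09:15:00"},
    {"created_at": "2016-11-01T17:40:00"},
    {"created_at": "2016-11-02T08:05:00"},
]

# Request body: one bucket per calendar day.
REQUEST = {
    "size": 0,
    "aggregations": {
        "per_day": {
            "date_histogram": {"field": "created_at", "interval": "day"}
        }
    },
}


def per_day(events, field):
    """Emulate daily bucketing by truncating each timestamp to its date."""
    counts = Counter(
        datetime.strptime(e[field], "%Y-%m-%dT%H:%M:%S").date().isoformat()
        for e in events
    )
    return [{"day": day, "doc_count": counts[day]} for day in sorted(counts)]


print(json.dumps(REQUEST, indent=2))
print(per_day(EVENTS, "created_at"))
```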


Nested aggregations

● Some aggregations allow extra aggregations to be performed on their results.

● For instance, you can run a histogram aggregation inside each bucket of a terms aggregation
● For questions like:
  ○ For each city, how many people are there in each age group?

● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "cities": {
      "terms": {
        "field": "city"
      },
      "aggregations": {
        "distrib": {
          "histogram": {
            "field": "age",
            "interval": 10
          }
        }
      }
    }
  }
}

● Result (partial):

[
  {
    "key": "ny",
    "doc_count": 3,
    "distrib": {
      "buckets": [
        { "key": 20, "doc_count": 1 },
        { "key": 30, "doc_count": 1 },
        { "key": 40, "doc_count": 1 }
      ]
    }
  },
  {
    "key": "dc",
    "doc_count": 1,
    "distrib": {
      "buckets": [
        { "key": 40, "doc_count": 1 }
      ]
    }
  },
  ...
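The nesting can be read as "group by city first, then build an age histogram inside each group". The following plain-Python sketch emulates that for the sample documents; the function name terms_then_histogram is hypothetical, and the output only imitates the shape of the response above.

```python
from collections import Counter, defaultdict

DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_then_histogram(docs, terms_field, hist_field, interval):
    """Bucket by `terms_field`, then histogram `hist_field` inside each bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[terms_field]].append(doc)

    result = []
    # Largest groups first, ties broken alphabetically.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        counts = Counter(
            (d[hist_field] // interval) * interval for d in groups[key]
        )
        result.append({
            "key": key,
            "doc_count": len(groups[key]),
            "distrib": {
                "buckets": [
                    {"key": k, "doc_count": counts[k]} for k in sorted(counts)
                ]
            },
        })
    return result


for bucket in terms_then_histogram(DOCS, "city", "age", 10):
    print(bucket)
```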

Mappings

● Elasticsearch is schemaless, which means you can add fields to your documents as you wish

● However, some aggregations behave differently depending on how attributes are indexed. In particular:
  ○ not_analyzed strings
  ○ dates

● So you may want to control how your documents are indexed using appropriate mappings


Dynamic Templates

● Dynamic templates let you predefine the mappings with which new attributes or indices will be stored.

● They can help you if:
  ○ You want to disable the analyzer for every new string field you add to your documents
  ○ You want to index numeric attributes whose names end in "timestamp" as dates
  ○ You want to refuse to allow extra attributes to be indexed (i.e. enforce a hard schema)
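The three cases above might look as follows in 2.x mapping syntax, written here as Python dictionaries that can be serialized to JSON and sent when creating an index. The type name person and the template names are assumptions chosen for the sketch. The strict schema is shown as a separate mapping because it rejects unknown fields outright, so dynamic templates would never apply to it.

```python
import json

# Dynamic templates: applied to fields that are not explicitly mapped.
# "person" is a hypothetical type name; template names are arbitrary labels.
TEMPLATE_MAPPINGS = {
    "mappings": {
        "person": {
            "dynamic_templates": [
                {
                    # Case 1: every new string field is stored not_analyzed.
                    "strings_not_analyzed": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                },
                {
                    # Case 2: numeric fields named "...timestamp" become dates.
                    "timestamps_as_dates": {
                        "match": "*timestamp",
                        "match_mapping_type": "long",
                        "mapping": {"type": "date"},
                    }
                },
            ]
        }
    }
}

# Case 3: a hard schema. Documents with unmapped fields are rejected.
STRICT_MAPPINGS = {
    "mappings": {
        "person": {
            "dynamic": "strict",
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"},
                "city": {"type": "string", "index": "not_analyzed"},
                "age": {"type": "integer"},
            },
        }
    }
}

print(json.dumps(TEMPLATE_MAPPINGS, indent=2))
print(json.dumps(STRICT_MAPPINGS, indent=2))
```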

General tips

● All text fields are analyzed (split into tokens) by default.
  ○ To aggregate on whole string values, you need to change their mapping to not_analyzed
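The effect is easiest to see with multi-word values. The sketch below uses made-up city names and a deliberately crude stand-in for an analyzer (lowercase, split on non-alphanumeric characters); real analyzers are more sophisticated, so treat this only as an illustration of why term counts differ between the two mappings.

```python
import re
from collections import Counter

# Hypothetical multi-word values, one per document.
CITIES = ["New York", "New York", "San Francisco"]


def simple_analyze(text):
    """Crude analyzer stand-in: lowercase and split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


# Analyzed field: buckets are individual tokens (counted once per document).
analyzed = Counter()
for city in CITIES:
    analyzed.update(set(simple_analyze(city)))

# not_analyzed field: buckets are the original, untouched values.
not_analyzed = Counter(CITIES)

print(dict(analyzed))
print(dict(not_analyzed))
```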




● Use filters whenever you can. Filters and aggregation results are aggressively cached by Elasticsearch


● Any filters in the query area also affect the output of aggregations
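As an illustration of filters narrowing what the aggregations see, the sketch below pairs a 2.x request body (a bool query with a filter clause plus a max aggregation) with a plain-Python emulation over the sample documents. The max_age helper is hypothetical and only mimics the outcome.

```python
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]

# Request body: the filter restricts the documents seen by the aggregation.
REQUEST = {
    "size": 0,
    "query": {"bool": {"filter": {"term": {"city": "ny"}}}},
    "aggregations": {"max_age": {"max": {"field": "age"}}},
}


def max_age(docs, city=None):
    """Maximum age, optionally restricted to one city (None if no match)."""
    ages = [d["age"] for d in docs if city is None or d["city"] == city]
    return max(ages) if ages else None


print(max_age(DOCS))        # 45 over all documents
print(max_age(DOCS, "ny"))  # 40 once the filter is applied
```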


● Prefer bulk indexing (the bulk API) over individual inserts to save bandwidth
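A bulk request body is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. The sketch below builds such a body for the sample documents; the index name people and type name person are assumptions, and actually sending the body to the _bulk endpoint is left out.

```python
import json

DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def bulk_body(docs, index, doc_type):
    """Build the newline-delimited body for a bulk index request."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body(DOCS, "people", "person")  # hypothetical index and type names
print(body)
```

All six documents travel in a single HTTP request instead of six.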

● Use TTL (time to live) to expire documents that don’t need to be kept for long
  ○ Changes in the TTL settings only affect new documents. Documents that are already indexed are not affected.


● By default, ranges and histograms do not return buckets with zero documents. Use min_doc_count and, optionally, extended_bounds to include them.
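For the age histogram used earlier, asking for empty buckets might look like the request fragment below, followed by a plain-Python emulation of the result over the sample documents. The bounds 0 and 50 and the helper name histogram_with_empty are choices made for this sketch.

```python
from collections import Counter

DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]

# Aggregation fragment: keep empty buckets and extend the range to 0..50.
HISTOGRAM = {
    "histogram": {
        "field": "age",
        "interval": 10,
        "min_doc_count": 0,
        "extended_bounds": {"min": 0, "max": 50},
    }
}


def histogram_with_empty(docs, field, interval, bounds_min, bounds_max):
    """Histogram that also emits zero-count buckets within the given bounds."""
    counts = Counter(
        (d[field] // interval) * interval for d in docs if field in d
    )
    # The bounds only ever widen the range; they never drop populated buckets.
    keys = list(counts) + [
        (bounds_min // interval) * interval,
        (bounds_max // interval) * interval,
    ]
    first, last = min(keys), max(keys)
    return [
        {"key": key, "doc_count": counts[key]}
        for key in range(first, last + interval, interval)
    ]


print(histogram_with_empty(DOCS, "age", 10, 0, 50))
```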


● Dynamic templates can be created at the index level (for attributes that match some criteria) or at the cluster level (for indices that match some criteria)


● In general, date-based aggregations are like others but they accept date macros (e.g. now-1h) when defining aggregation options



● You can reference nested attributes when defining an aggregation




● Scripts are useful but many Elasticsearch setups (e.g. AWS) do not support them due to security concerns


● Use "size":0 to suppress regular query results and return only aggregation results, thus saving processing and bandwidth


● The value_count aggregation does not count unique values for the attributes! The cardinality aggregation does that!
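The difference shows up immediately on the sample documents: six of them have a city value, but there are only four distinct cities. The two helpers below are plain-Python emulations written for this sketch, not Elasticsearch calls.

```python
DOCS = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def value_count(docs, field):
    """How many values of `field` exist in total (duplicates included)."""
    return sum(1 for d in docs if field in d)


def cardinality(docs, field):
    """How many distinct values of `field` exist."""
    return len({d[field] for d in docs if field in d})


print(value_count(DOCS, "city"))  # 6
print(cardinality(DOCS, "city"))  # 4
```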


● By default, the terms aggregation does not return all possible values for the selected field. Tune the size attribute to control the result size (use 0 to bring all results)
