using apache spark for generating elasticsearch indices offline · 2017-12-14 · elasticsearch...

23
Using Apache Spark for generating ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe 2016

Upload: others

Post on 04-Jul-2020

23 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Using Apache Spark for generating ElasticSearch indices offline

Andrej BabolčaiESET Database systems engineer

Apache: Big Data Europe 2016

Page 2: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Who am I

• Software engineer in database systems team

• Responsible for collecting, moving and providing access to data

Page 3: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Context

Apache Kafka

Apache Hive/Impala

ElasticSearch

Page 4: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Agenda

• Approaches we tried and why they failed

• Solution used, Spark + ES

• Benchmark, summary and possible improvements

Page 5: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Agenda

• Approaches we tried and why they failed

• Solution used, Spark + ES

• Benchmark, summary and possible improvements

Page 6: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Indexing data to live cluster

• Failed because of

• Slowed search and near real-time (NRT) import

• Reduce ingestion speed - too slow

Page 7: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Spark job with Lucene library

• Approach

• Generate indices with Lucene and “import” them to ES

• Indexing with Lucene is fast, hundreds of GB/hour

• Failed because of

• ES types in Lucene

• ES translog and checksum

Page 8: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Agenda

• Approaches tried and why they failed

• Solution used, Spark + ES

• Benchmark, summary and possible improvements

Page 9: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Goal

• Offload ES cluster and generate indices on Spark cluster

• We want indices to be “ready to use”

• When appropriate copy them to ES

Page 10: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Spark + local ES

• Based on https://github.com/MyPureCloud/elasticsearch-

lambda

• Similar approach to Cloudera Solr MapReduceIndexerTool

Page 11: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

How do we generate indices offline

S

p

a

r

k

0 partition

1 partition

n partition

Start ESn shards/1 with data

Start ES

Start ES

HDFS repository

Index/0

Index/1

Index/n

InputData

1 partition 1. shard (data)

0. shard (empty)

n. shard (empty)

SnapshotConstant shard routing

0. sh. snapshot(empty)

1. sh. snapshot

n. sh. snapshot(empty)

Page 12: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

HDFS snapshot repository layout

Dest dir

indices

idxname-2015

idxname-2016

0

__r

__z

1meta-idxname-

2016.dat

snap-idxname-2016.dat

Page 13: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Creating local ES node

val nodeSettings: Settings = Settings.builder.put("http.enabled", false).put("processors", 1).put("index.merge.scheduler.max_thread_count", 1)

….build

val node: Node = nodeBuilder().settings(nodeSettings).local(true).node()

node.startval client: Client = node.clientclient.admin.indices. … .setSource(mapping).get

HTTP unnecessary, use transport interface

Only JVM local node discovery

Same json mapping as Index API (http)

Page 14: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

RDD export like saveAsTextFile

rddToIndex.repartition(config.numShards).saveToESSnapshot(config,…

Page 15: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

We use implicit conversions

package object spark {

implicit class

DBSysSparkRDDFunctions

[T <: Map[String,Object]] //Row

(val rdd: RDD[T]) extends AnyVal {

def saveToESSnapshot(config:String,…):Unit = {

rdd.sparkContext.runJob(rdd,esWriter.processPartition _)

Input RDD type bound

Indexing method

Infiltrate spark namespace

Page 16: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Useful ES commands

• Create HDFS snapshot repository:

curl -s -XPUT 'localhost:9200/_snapshot/<Repo name>' -d '{

"type": "hdfs",

"settings": {

"uri": "hdfs://namenode:8020/",

"path": "/user/<username>/<Snapshot repo hdfs path>",

"load_defaults": "false"

}

}’

Page 17: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Useful ES commands

• Start restore process:

curl -XPOST 'localhost:9200/_snapshot/<Repo name>/<snapshot name>/_restore’

• Monitor restore progress:curl -s –XGET 'localhost:9200/_cat/recovery?v' |

awk '{print $1 " " $11}' |

fgrep -v " 0.0%" |

fgrep -v "100.0%"

Page 18: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Agenda

• Approaches tried and why they failed

• Solution used, Spark + ES

• Benchmark, summary and possible improvements

Page 19: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

ES cluster configurationProperty Value

Number of nodes 24

ES heap size 29GB

CPUs 8 (/proc/cpuinfo)

HDD 2x3.5TB / node

No. of indices 130

No. of shards ~3900

Data size 16 TB

No. of docs > 60 billion

Indexing speed (what we can handle…) ~1000 docs/s

Page 20: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Offline indexing environmentProperty Value

Input size 135GB compr. parquet

Number of docs 470M

CPUs (indexing) 15 spark workers

Memory 4GB /worker

Output index layout 20 string fields, 15 part.

Job duration ~3.5h

Restore duration ~20m

Duration total ~4h

Indexing speed >30k docs/s

Page 21: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Future work

• Shard routing

• Indexing on local FS, use directly HDFS

• Speed up indexing

• Use for stream indexing

Page 22: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Summary

• Hard to directly compare RT with our offline approach

• What we wanted was to make historical data available for

users, without influencing production systems

https://github.com/andybab/OfflineESIndexGenerator

Page 23: Using Apache Spark for generating ElasticSearch indices offline · 2017-12-14 · ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe

Thank youQuestions?

[email protected]