Solr Distributed Indexing in WalmartLabs: Presented by Shenghua Wan, WalmartLabs

OCTOBER 13-16, 2016 AUSTIN, TX

Upload: lucidworks

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Solr Distributed Indexing in WalmartLabs: Presented by Shenghua Wan, WalmartLabs

OCTOBER 13-16, 2016 • AUSTIN, TX

Page 2:

Shenghua Wan

Sr Software Engineer, @WalmartLabs

Solr Distributed Indexing in WalmartLabs

Page 3:

Background

•  Search big data, part of the Polaris Search Team in WalmartLabs
•  Audience management, Acxiom Inc.
•  HPC computational scientist, UTSW Medical Center

Page 4:

Our perspective:

•  To help make Solr indexing more scalable
•  From a big data engineer's perspective
•  Solr/Lucene internals are not covered in this talk

Page 5:

Problem definition
•  Input: 96 gzipped XML files
•  Output: 3 shards of binary indexes, one for every 32 XML files
•  Dedicated indexing servers are not scalable
•  Indexing takes at least 4 hours in the dev environment, which slows down development iteration
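The input-to-shard mapping above can be sketched in a few lines. This is only an illustration: the file names are hypothetical, and grouping the files contiguously is an assumption, since the slide says only that each shard is built from 32 files.

```python
from typing import Dict, List


def assign_files_to_shards(files: List[str], num_shards: int) -> Dict[int, List[str]]:
    """Split an ordered file list into contiguous, equally sized groups, one per shard."""
    if len(files) % num_shards != 0:
        raise ValueError("file count must divide evenly across shards")
    per_shard = len(files) // num_shards
    return {
        shard: files[shard * per_shard:(shard + 1) * per_shard]
        for shard in range(num_shards)
    }


# Hypothetical file names; the real naming scheme is not given in the talk.
input_files = [f"part-{i:03d}.xml.gz" for i in range(96)]
shards = assign_files_to_shards(input_files, num_shards=3)

for shard_id, members in shards.items():
    print(shard_id, len(members), members[0], members[-1])
```

Each shard's file group can then be indexed independently, which is what makes the generation step easy to parallelize.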

Page 6:

Existing “Wheels” for Solr Distributed Indexing
•  “Indexing Files via Solr and Java MapReduce” (Adam Smieszny since 2012)
•  LuceneIndexOutputFormat (Twitter’s Elephant-Bird since 2013)
•  MapReduceIndexerTool (Mark Miller since late 2013)

Page 7:

Existing “Wheels” for Solr Distributed Indexing
☐ “Indexing Files via Solr and Java MapReduce” (Adam Smieszny since 2012)
☐ LuceneIndexOutputFormat (Twitter’s Elephant-Bird since 2013)
✓ MapReduceIndexerTool (Mark Miller since late 2013)
This tool is closest to our use case.

Page 8:

Start from MapReduceIndexerTool
Anatomy of this tool
•  MorphlineMapper: uses Morphlines to convert a document to a SolrInputDocument
•  SolrRecordWriter: creates an embedded Solr instance to index the document
•  TreeMergeRecordWriter: merges multiple binary indexes into one
References:
1.  https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce
2.  https://github.com/markrmiller/solr-map-reduce-example

Page 9:

Our Challenges
•  Not using SolrCloud
•  Not using ZooKeeper
•  Solr version 4.0 (when we did experiments)
•  Environment
   o  Hadoop version 1
   o  MapR File System
•  XML input format
•  Easy to maintain and debug
•  Documentation
A runnable example with source code is the best. Thanks to https://github.com/markrmiller/solr-map-reduce-example.

Page 10:

Customize Design to Our Use Case
Breaking down into two fundamental utilities
•  Index Generator: replace Morphlines with XmlInputFormat from Apache Mahout and reuse SolrOutputFormat
•  Index Merger: reuse TreeMergeOutputFormat
References:
1.  https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
2.  https://github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce

Page 11:

Customize Design to Our Use Case – cont.
Breaking down into two fundamental utilities
•  Index Generator
•  Index Merger
More complicated logic can be built on top of these two simple map-only jobs.
Where is reduce? Our use case does not need it. We want it lean and fast. But you may need it.

Page 12:

Experiments and Observations

•  Index Generation
   ✓ CPU-bound
   ✓ Easy to scale and parallelize
   ✓ Map-only beats Map-Reduce by 12~15% in our experiments
   ✓ ~5GB of decompressed XML documents indexed within 10 minutes using 7x3 mappers
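A quick back-of-the-envelope calculation puts the last figure in perspective. It assumes that "7x3 mappers" means 21 mappers running concurrently and that 1 GB is 1024 MB; neither is stated on the slide. Because the job finished "within" 10 minutes, the results are lower bounds.

```python
def indexing_throughput_mb_s(total_gb: float, minutes: float, mappers: int):
    """Return (aggregate, per-mapper) indexing throughput in MB/s."""
    total_mb = total_gb * 1024  # assumption: binary gigabytes
    seconds = minutes * 60
    aggregate = total_mb / seconds
    return aggregate, aggregate / mappers


# Assumption: 7x3 is read as 21 concurrent mappers.
aggregate, per_mapper = indexing_throughput_mb_s(total_gb=5, minutes=10, mappers=7 * 3)
print(f"aggregate: {aggregate:.2f} MB/s, per mapper: {per_mapper:.2f} MB/s")
```

Under these assumptions each mapper consumes well under 1 MB/s of input, which is at least consistent with the observation that index generation is CPU-bound rather than IO-bound.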

Page 13:

Experiments and Observations – cont.

•  Index Merging
   ✓ IO-bound: disk and network, but the network was our pain point
   ✓ Two stages: logical merge and optimize
      o  Logical merge: file movement
      o  Optimize: reduce the number of index segments

Page 14:

Experiments and Observations – cont.
n-Way Merge: merging n shards of roughly the same size into 1

Nothing suspicious

Page 15:

Experiments and Observations – cont.
n-Way Merge: merging n shards of roughly the same size into 1

Why does the merge time rise sharply?
•  Too many shards
•  Resource contention

Page 16:

Experiments and Observations – cont.
n-Way Merge: merging n shards of roughly the same size into 1

Optimize time >> logical merge time: 5x to 8x longer (the 64-way merge is an exception, considered an outlier because of the shared environment)

Page 17:

New Challenges

After contacting the cluster owner team, we were told that the cluster, which consists of almost five dozen nodes, is connected by 1Gb/s Ethernet.
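To see why a 1Gb/s link hurts an IO-bound merge, a rough transfer-time estimate helps. The 10 GB index size below is purely hypothetical (the talk does not give index sizes), and the model ignores protocol overhead and the fact that the link is shared with other jobs.

```python
def transfer_seconds(size_gb: float, link_gbit_s: float, efficiency: float = 1.0) -> float:
    """Ideal time to move size_gb (decimal gigabytes) over a link of link_gbit_s."""
    size_bits = size_gb * 8 * 1e9
    return size_bits / (link_gbit_s * 1e9 * efficiency)


HYPOTHETICAL_INDEX_GB = 10  # assumption, not a number from the talk

for link in (1, 10):
    seconds = transfer_seconds(HYPOTHETICAL_INDEX_GB, link)
    print(f"{link} Gb/s: {seconds:.0f} s per full copy of the index")
```

A 1Gb/s link tops out at 125 MB/s in theory, and every additional pass that moves the index across the network pays that cost again.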

Page 18:

Experiments and Observations – cont.
How about “tree” structure merge?

Seems to be attractive

Page 19:

Experiments and Observations – cont.
Comparing hierarchical merge and n-way merge total time

Kind of unexpected

Page 20:

Experiments and Observations
Comparing hierarchical merge and n-way merge total time

Relatively isolated environment: no network, but disk IO (4 cores x 2 threads)

4 small reads + 2 large reads

4 small reads
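One way to reason about the read counts above is a simple IO-volume model. This is a simplified sketch, not the analysis from the talk: it counts only the data read and written, assumes equally sized shards, and ignores any parallelism between merges at the same tree level.

```python
def nway_merge_io(num_shards: int, shard_size: float) -> float:
    """Data read plus written when all shards are merged in a single pass."""
    total = num_shards * shard_size
    return 2 * total  # read every shard once, write the merged index once


def hierarchical_merge_io(num_shards: int, shard_size: float, fan_in: int = 2) -> float:
    """Data read plus written when shards are merged level by level."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    io = 0.0
    sizes = [shard_size] * num_shards
    while len(sizes) > 1:
        next_sizes = []
        for i in range(0, len(sizes), fan_in):
            group = sizes[i:i + fan_in]
            if len(group) == 1:
                next_sizes.append(group[0])  # odd one out moves up unmerged
                continue
            merged = sum(group)
            io += 2 * merged  # read the group, write its merged result
            next_sizes.append(merged)
        sizes = next_sizes
    return io


for n in (4, 8, 16):
    print(n, nway_merge_io(n, 1.0), hierarchical_merge_io(n, 1.0))
```

For 4 shards the model gives twice the IO volume for the binary tree merge, matching "4 small reads + 2 large reads" against "4 small reads": every intermediate index is written out and then read back in. When IO is the bottleneck, that extra volume can outweigh the parallelism the tree offers.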

Page 21:

Lessons Learnt

•  Index generation in parallel is easy

•  Merging is not

•  N-way merging of all shards is better than hierarchical merging

•  Data locality is key

Page 22:

Our Solutions

•  Plan A: “Hey, Sir/Madam, could you please get us a 48Gb/s InfiniBand network ASAP? Or 10Gb/s is also fine.”
•  Plan B: A small dedicated indexing Hadoop cluster (starting from one node)

Page 23:

Our Solutions
A small dedicated indexing Hadoop cluster (starting from one node)

Environment      Disk IO (MB/s)
Shared           ~44
Mac Pro (SSD)    ~250
Dedicated        ~202

Dedicated cluster:
•  1 node
•  32 cores
•  128GB mem
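The disk IO figures can be turned into rough time estimates. The rates are the approximate values from the table; the 10 GB index size is a hypothetical number chosen only for illustration.

```python
# Approximate disk IO rates (MB/s) as reported in the table above.
DISK_IO_MB_S = {"shared": 44, "Mac Pro (SSD)": 250, "dedicated": 202}


def io_seconds(size_mb: float, environment: str) -> float:
    """Time to read or write size_mb at the environment's measured rate."""
    return size_mb / DISK_IO_MB_S[environment]


HYPOTHETICAL_INDEX_MB = 10 * 1024  # assumption, not a number from the talk
estimates = {env: io_seconds(HYPOTHETICAL_INDEX_MB, env) for env in DISK_IO_MB_S}
speedup = DISK_IO_MB_S["dedicated"] / DISK_IO_MB_S["shared"]

for env, seconds in estimates.items():
    print(f"{env}: {seconds:.0f} s")
print(f"dedicated vs shared: {speedup:.1f}x")
```

On these numbers the single dedicated node moves data roughly 4.6 times faster than the shared cluster, which matters most for the IO-bound merge step.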

Page 24:

Tips

Tunable Parameters
•  Split Size (Map-Reduce)
•  Batch Size (Solr Index)
•  RAM Buffer Size (Solr Index)
•  Max number of Segments (Solr Index)
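As an illustration of the batch size knob, the sketch below groups documents into fixed-size batches before they are handed to an indexer. It assumes that "Batch Size" means the number of documents submitted per add call, which the slide does not spell out; the document dictionaries are placeholders and the indexing call itself is left out.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


# Placeholder documents; a real job would build these from the parsed XML.
docs = [{"id": str(i)} for i in range(10)]
batches = list(batched(docs, batch_size=4))
print([len(b) for b in batches])
```

Larger batches mean fewer calls into the indexer at the cost of more memory held per mapper, so the value is best tuned together with the RAM buffer size.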

Page 25:

Opportunities

Some parts are missing from our tool because our use case does not need them, but you may want them:
1.  Reduce functions (deduplication, other processing logic)
2.  Try Spark or equivalent (the bottleneck is the embedded Solr instance when merging)

Page 26:

Thanks! We are hiring!

Questions?