SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business

Upload: lucidworks-archived

Post on 11-May-2015

DESCRIPTION

"Box + Solr = Content Search for Business" - Wei Zhao, Box

TRANSCRIPT

Page 1: SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business

June 2014

Box + Solr = Content Search for Business

Page 2:

Wei Zhao

Box backend [email protected]

Page 3:

Box mission

To make organizations more productive, competitive and collaborative by connecting people and their most important information

Page 4:

25MM+ Users

225K+ Businesses

99% Fortune 500

Page 5:

Box's search mission is to make user content easy to discover.

Page 6:

Box uses Solr for search

10 Billion+ Documents

10 TB+ Index size

100M+ Daily requests

Page 7:

Quick Search

Page 8:

Quick Search

Page 9:

Full Search

Page 10:

Agenda

1 Sharding – splitting the index

2 Highly available search

3 A few more things

4 Currently working on

5 Q&A

Page 11:

We shard things

Page 12:

Shard ID = File ID % Total Shards
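
The routing rule on this slide can be sketched in a few lines. This is only an illustration of the formula: the function name, the file ID 12345, and the shard count of 8 are assumptions made for the example, not values from the talk.

```python
def shard_id(file_id: int, total_shards: int) -> int:
    """Return the shard that holds a file, using modulo routing."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return file_id % total_shards


# Hypothetical file ID routed across a hypothetical 8 shards.
print(shard_id(12345, 8))  # 12345 % 8 == 1
```

One consequence of plain modulo routing is that changing the total number of shards sends most file IDs to a different shard, so the shard count is effectively fixed once the index is built.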

Page 13:

Multi-tenant – One big logical index for all users

Solr index: Shard1, Shard2, Shard3, ShardN

Page 14:

Search scope

Page 15:

A typical Solr Document

File ID: 12345

Owner ID: user1

Parent Folder IDs: folder1, folder2

File Name: Solr.ppt

File Content: blah......

Page 16:

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

Page 17:

User1 with no shared folder

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

Filter: User1

File 1, File 2, File 3, File 4

Page 18:

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

Page 19:

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

Filter: User1 + Folder2

File 1, File 2, File 3, File 4

Page 20:

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder5

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

Removed from Folder2

Page 21:

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder5

Owner: User1, Parent: Folder1, Folder4

Filter: User1 + Folder2

File 1, File 2, File 3, File 4

Removed from Folder2
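
The search-scope slides can be read as a filter of the form "owned by the user, or inside a folder shared with the user". The sketch below illustrates that reading under stated assumptions: the field names `owner_id` and `parent_folder_ids` are hypothetical, the four rows are the owner/parent pairs in the order the slides list them (no mapping to File 1 through File 4 is assumed), and the in-memory `visible` check only mimics what the filter would select.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedFile:
    label: str
    owner: str
    parents: frozenset


def build_filter(user: str, shared_folders: list) -> str:
    """Build a Solr-style filter: owned by the user OR under a shared folder."""
    clauses = [f"owner_id:{user}"]  # hypothetical field name
    if shared_folders:
        clauses.append("parent_folder_ids:(" + " OR ".join(shared_folders) + ")")
    return " OR ".join(clauses)


def visible(doc: IndexedFile, user: str, shared_folders: list) -> bool:
    """In-memory equivalent of the filter, for illustration only."""
    return doc.owner == user or bool(doc.parents & set(shared_folders))


rows = [
    IndexedFile("row1", "User1", frozenset({"Folder1"})),
    IndexedFile("row2", "User2", frozenset({"Folder3"})),
    IndexedFile("row3", "User2", frozenset({"Folder2"})),
    IndexedFile("row4", "User1", frozenset({"Folder1", "Folder4"})),
]

# User1 with no shared folder: only the files User1 owns match.
print([d.label for d in rows if visible(d, "User1", [])])

# User2 shares Folder2 with User1: the file under Folder2 now matches too.
print(build_filter("User1", ["Folder2"]))
print([d.label for d in rows if visible(d, "User1", ["Folder2"])])

# The file leaves Folder2 (its parent becomes Folder5): the same filter no longer selects it.
moved = [IndexedFile("row3", "User2", frozenset({"Folder5"})) if d.label == "row3" else d
         for d in rows]
print([d.label for d in moved if visible(d, "User1", ["Folder2"])])
```

In this reading, sharing a folder changes only the filter built at query time, while moving a file changes only that file's parent list in the index.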

Page 22:

Highly Available Search

Page 23:

• Index is highly available

• Search functionality is highly available

Page 24:

Index workflow

Page 25:

Box Front End

Upload

Index Queue

Queue 1, Queue 2, Queue 3

Indexer 1, Indexer 2, Indexer 3

MySQL

Index1, Index2, Index2
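
The diagram shows one queue and one indexer in front of each index. The slide does not say how updates reach the queues, so the sketch below assumes one plausible reading: every update is published to every queue, and each indexer drains its own queue into its own index copy. The class and method names are hypothetical, and the in-process `queue.Queue` and `dict` stand in for the real queueing system and Solr index.

```python
import queue


class IndexFanOut:
    """Assumed fan-out: one queue and one index copy per indexer."""

    def __init__(self, copies: int):
        self.queues = [queue.Queue() for _ in range(copies)]
        self.indexes = [dict() for _ in range(copies)]

    def publish(self, doc: dict) -> None:
        """Enqueue the same update for every index copy."""
        for q in self.queues:
            q.put(doc)

    def drain(self, i: int) -> int:
        """Indexer i applies all pending updates to index copy i."""
        applied = 0
        while True:
            try:
                doc = self.queues[i].get_nowait()
            except queue.Empty:
                break
            self.indexes[i][doc["id"]] = doc
            applied += 1
        return applied


fan = IndexFanOut(copies=3)
fan.publish({"id": 12345, "name": "Solr.ppt"})

# Indexers 0 and 1 run; indexer 2 is down and its update stays queued.
fan.drain(0)
fan.drain(1)
print(len(fan.indexes[0]), len(fan.indexes[1]), len(fan.indexes[2]))
print(fan.queues[2].qsize())
```

Under this assumption a slow or failed indexer delays only its own index copy, and it catches up from its queue once it resumes.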

Page 26:

Search workflow

Page 27:

Box Front End -> query -> HA Proxy -> Head node -> HA Proxy -> Shards 1, 2, 3, N

Box Front End -> query -> HA Proxy -> Head node -> HA Proxy -> Shards 1, 2, 3, N

Data center boundary
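
In a sharded setup like this, a head node typically sends the query to every shard and merges the per-shard hits into one ranked list. The slide does not describe the merge itself, so the following is a generic scatter-gather sketch rather than Box's implementation; the function name and the (score, document ID) tuple shape are assumptions.

```python
import heapq
from itertools import islice


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists into the overall top k.

    Each inner list holds (score, doc_id) tuples already sorted by
    descending score, as a shard would return them.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))


shard1 = [(0.9, "a"), (0.4, "b")]
shard2 = [(0.8, "c"), (0.1, "d")]
print(merge_top_k([shard1, shard2], 3))  # [(0.9, 'a'), (0.8, 'c'), (0.4, 'b')]
```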

Page 28:

A few more things

Page 29:

File Content Search

Page 30:

Box Front End

Upload

MySQL

Box File Storage

Indexer

Solr Index

Text Extraction

Extracted Text

Page 31:

Multi-language support

Page 32:

Raw file content

Language detector

English tokenizer -> file_content_en

Spanish tokenizer -> file_content_es {hola}

Japanese tokenizer -> file_content_ja ....

German tokenizer -> file_content_de
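
The routing on this slide (detect the language, then store the text in a per-language field) can be sketched as follows. The field naming `file_content_<lang>` comes from the slide; everything else is an assumption for illustration. In particular, the toy hint-word detector stands in for whatever language detector is actually used, and Japanese is left out because it cannot be split on whitespace.

```python
# Toy hint words per language; a real detector would be far more robust.
HINT_WORDS = {
    "en": {"the", "and", "is", "of"},
    "es": {"el", "la", "y", "hola"},
    "de": {"der", "die", "und", "ist"},
}


def detect_language(text: str, default: str = "en") -> str:
    """Pick the language whose hint words occur most often in the text."""
    words = text.lower().split()
    scores = {lang: sum(w in hints for w in words)
              for lang, hints in HINT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def to_solr_fields(text: str) -> dict:
    """Place the text in the field whose analyzer matches its language."""
    return {f"file_content_{detect_language(text)}": text}


print(to_solr_fields("hola y la casa"))
print(to_solr_fields("the index is split"))
```

Because each document lands in exactly one language field, this sketch shares the limitation listed in the To Dos: a document with mixed languages is analyzed as a single language.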

Page 33:

To Dos

• Scale language support

• Support documents with mixed languages

Page 34:

Search Warm-up

Page 35:

• Front end informs backend to warm up on keyboard focus

• Backend prepares the search filter and caches it in a search session

• Backend sends a warm-up query to Solr
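
The three warm-up steps above can be sketched as a small backend class. This is a minimal illustration under assumptions, not the actual Box backend: `SearchBackend`, the `solr_query(q, fq)` callable, the `shared_folders_for` lookup, and the field names `owner_id` and `parent_folder_ids` are all hypothetical, and the match-all query `*:*` is just one possible warm-up query.

```python
class SearchBackend:
    def __init__(self, solr_query, shared_folders_for):
        self._solr_query = solr_query            # hypothetical callable: (q, fq) -> hits
        self._shared_folders_for = shared_folders_for
        self._sessions = {}                      # user -> cached filter string

    def _build_filter(self, user):
        clauses = [f"owner_id:{user}"]
        folders = self._shared_folders_for(user)
        if folders:
            clauses.append("parent_folder_ids:(" + " OR ".join(folders) + ")")
        return " OR ".join(clauses)

    def warm_up(self, user):
        """Called when the front end reports keyboard focus on the search box."""
        fq = self._build_filter(user)
        self._sessions[user] = fq                # cache the filter in the search session
        self._solr_query("*:*", fq)              # warm-up query; the hits are discarded

    def search(self, user, text):
        if user not in self._sessions:           # no warm-up happened, build the filter now
            self.warm_up(user)
        return self._solr_query(text, self._sessions[user])


calls = []


def fake_solr(q, fq):
    """Stand-in for Solr that records each request."""
    calls.append((q, fq))
    return []


backend = SearchBackend(fake_solr, lambda user: ["Folder2"])
backend.warm_up("User1")          # user focuses the search box
backend.search("User1", "solr")   # user submits a query
print(calls)
```

The point of the pattern is that the expensive filter is computed once before the user finishes typing, and the real query reuses the identical filter string, which gives Solr the chance to serve it from its filter cache.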

Page 36:

What we are working on

Page 37:

Things we are working on

• Search suggestions

• Search operators

• Use machine learning to influence ranking

• Logical sharding

Page 38:

Questions?

Page 39:

Contact: [email protected]

We are hiring!