box + solr = content search for business
1
June 2014
Box + Solr = Content Search for Business
3
Box mission
to make organizations more productive, competitive and collaborative by connecting people and their most important information
4
25MM+ Users
225K+ Businesses
99% of Fortune 500
5
Box search mission is to make user content
easy to discover.
6
Box uses Solr for search
10 Billion+ Documents
10TB+ Index size
100M+ Daily requests
7
Quick Search
8
Quick Search
9
Full Search
10
Agenda
1 Sharding – splitting the index
2 Highly available search
3 A few more things
4 Currently working on
5 Q&A
11
We shard things
12
Shard ID = File ID % Total Shards
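The formula above can be sketched in a few lines of Python. The function name and the shard count of 8 are assumptions for illustration only; the slides do not state how many shards Box runs.

```python
def shard_id(file_id: int, total_shards: int) -> int:
    """Return the shard that holds a file, using modulo sharding."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return file_id % total_shards


# Hypothetical shard count, chosen only for this example.
TOTAL_SHARDS = 8

# File ID 12345 is the example ID used later in the deck.
print(shard_id(12345, TOTAL_SHARDS))  # 12345 % 8 == 1
```

Because the shard depends only on the file ID, all users' files spread evenly across shards, and a query has to fan out to every shard.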
13
Multi-tenant – One big logical index for all users
Solr index
Shard1 Shard2 Shard3 ShardN
14
Search scope
15
A typical Solr Document
File ID: 12345
Owner ID: user1
Parent Folder IDs: folder1, folder2
File Name: Solr.ppt
File Content: blah......
16
Owner: User1, Parent: Folder1
Owner: User2, Parent: Folder3
Owner: User2, Parent: Folder2
Owner: User1, Parent: Folder1, Folder4
File 1 File 2
File 3 File 4
17
User1 with no shared folders
Owner: User1, Parent: Folder1
Owner: User2, Parent: Folder3
Owner: User2, Parent: Folder2
Owner: User1, Parent: Folder1, Folder4
File 1 File 2
File 3 File 4
18
User2 shares Folder2 with User1
Owner: User1, Parent: Folder1
Owner: User2, Parent: Folder3
Owner: User2, Parent: Folder2
Owner: User1, Parent: Folder1, Folder4
File 1 File 2
File 3 File 4
19
User2 shares Folder2 with User1
Owner: User1, Parent: Folder1
Owner: User2, Parent: Folder3
Owner: User2, Parent: Folder2
Owner: User1, Parent: Folder1, Folder4
File 1 File 2
File 3 File 4
20
User2 shares Folder2 with User1
Owner: User1, Parent: Folder1
Owner: User2, Parent: Folder3
Owner: User2, Parent: Folder5
Owner: User1, Parent: Folder1, Folder4
File 1 File 2
File 3 File 4
Removed out of Folder2
21
User2 shares Folder2 with User1
Owner: User1, Parent: Folder1
Owner: User2, Parent: Folder3
Owner: User2, Parent: Folder5
Owner: User1, Parent: Folder1, Folder4
File 1 File 2
File 3 File 4
Removed out of Folder2
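The search scope walkthrough above can be read as a simple rule: a file is searchable by a user if the user owns it or if one of its parent folders is shared with that user. The sketch below replays the four-file example in memory. The class and function names are hypothetical, and it assumes the four Owner/Parent rows map to File 1 through File 4 in order, which the flattened slides do not confirm.

```python
from dataclasses import dataclass


@dataclass
class FileDoc:
    name: str
    owner: str
    parents: set


def in_scope(doc, user, shared_folders):
    """A file is in scope if the user owns it or a parent folder is shared."""
    return doc.owner == user or bool(doc.parents & shared_folders)


def search_scope(docs, user, shared_folders):
    """Names of all files the user is allowed to search."""
    return sorted(d.name for d in docs if in_scope(d, user, shared_folders))


docs = [
    FileDoc("File 1", "User1", {"Folder1"}),
    FileDoc("File 2", "User2", {"Folder3"}),
    FileDoc("File 3", "User2", {"Folder2"}),
    FileDoc("File 4", "User1", {"Folder1", "Folder4"}),
]

# User1 with no shared folders: only the files User1 owns.
no_share = search_scope(docs, "User1", set())

# User2 shares Folder2 with User1: File 3 comes into scope.
with_share = search_scope(docs, "User1", {"Folder2"})

# File 3 is moved out of Folder2 into Folder5: it drops out of scope again.
docs[2].parents = {"Folder5"}
after_move = search_scope(docs, "User1", {"Folder2"})

print(no_share, with_share, after_move)
```

In a real deployment the same rule would be expressed as a filter on the owner and parent-folder fields of each Solr document rather than evaluated in application code.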
22
Highly Available Search
23
• Index is highly available
• Search functionality is highly available
24
Index workflow
25
Box Front End
Upload
Index Queue
Queue 1
Queue 2
Queue 3
Indexer 1
Indexer 3
Indexer 2
MySQL
Index1
Index2
Index2
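One reading of the index workflow diagram is that each upload event is copied into a separate queue per index copy, and each queue is drained by its own indexer. The sketch below illustrates that reading only; the function names are hypothetical, plain dicts stand in for Solr indexes, and the slides do not confirm the exact fan-out mechanism.

```python
from queue import Queue


def fan_out(event, queues):
    """Copy one upload event into every per-index queue."""
    for q in queues:
        q.put(event)


def run_indexer(q, index):
    """Drain a queue into its index copy (a dict stands in for a Solr index)."""
    while not q.empty():
        event = q.get()
        index[event["file_id"]] = event["name"]


queues = [Queue() for _ in range(3)]
indexes = [{} for _ in range(3)]

fan_out({"file_id": 12345, "name": "Solr.ppt"}, queues)

# Indexers 1 and 2 run now; indexer 3 is assumed to be down for a while.
run_indexer(queues[0], indexes[0])
run_indexer(queues[1], indexes[1])
pending_for_third = queues[2].qsize()

# When indexer 3 comes back, it catches up from its own queue.
run_indexer(queues[2], indexes[2])
```

Under this assumption, a slow or failed indexer delays only its own index copy, while the other copies stay current and can keep serving searches.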
26
Search workflow
27
Box Front End
query HA Proxy
Head node
HA Proxy
1 2 3 N
Box Front End
query HA Proxy
Head node
HA Proxy
1 2 3 N
Data center boundary
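The search workflow diagram shows the same stack on both sides of the data center boundary. A minimal sketch of how a client could use two such stacks is to try one and fall back to the other on a connection failure. This failover policy is an assumption, not something the slides state, and the backends here are hypothetical callables rather than real HA Proxy endpoints.

```python
def search_with_failover(query, backends):
    """Try each search backend in order and return the first successful result."""
    last_error = None
    for backend in backends:
        try:
            return backend(query)
        except ConnectionError as exc:
            # Remember the failure and move on to the next data center.
            last_error = exc
    raise RuntimeError("all search backends failed") from last_error


def primary_dc(query):
    """Stand-in for the first data center, simulated as unreachable."""
    raise ConnectionError("primary data center unreachable")


def secondary_dc(query):
    """Stand-in for the second data center, simulated as healthy."""
    return {"served_by": "secondary", "query": query}


result = search_with_failover("solr", [primary_dc, secondary_dc])
print(result["served_by"])
```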
28
A few more things
29
File Content Search
30
Box Front End
Upload
MySQL
Box File Storage
Indexer
Solr Index
Text Extraction
Extracted Text
31
Multi-language support
32
Raw file content
Language detector
English tokenizer
Spanish tokenizer
Japanese tokenizer
German tokenizer
file_content_en
file_content_es {hola}
file_content_ja ....
file_content_de
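The routing in the diagram above can be sketched as: detect the language of the raw content, then store the text in the matching per-language field so that field's tokenizer is applied. The detector below is a toy stand-in based on a few marker words and is purely an assumption; the slides do not say which language detector Box uses.

```python
def detect_language(text: str) -> str:
    """Toy stand-in for a real language detector (illustration only)."""
    # Hiragana and katakana characters indicate Japanese.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "ja"
    words = set(text.lower().split())
    if words & {"hola", "gracias", "el", "la"}:
        return "es"
    if words & {"und", "der", "das", "nicht"}:
        return "de"
    return "en"


def to_solr_fields(text: str) -> dict:
    """Place the text in the field named after its detected language."""
    lang = detect_language(text)
    return {f"file_content_{lang}": text}


print(to_solr_fields("hola"))
```

Each `file_content_<lang>` field would be configured with its own analyzer in the Solr schema. A document that mixes languages lands in a single field under this scheme, which matches the limitation listed in the to-dos.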
33
To Dos
• Scale language support
• Support documents with mixed languages
34
Search Warm-up
35
• Front end informs backend to warm up on keyboard focus
• Backend prepares the search filter and caches it in a search session
• Backend sends a warm-up query to Solr
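The three warm-up steps above can be sketched as a small session object. All names here are hypothetical, the Solr call is replaced by a stub, and the use of a match-all query (`*:*`) as the warm-up query is an assumption; the slides do not say what the warm-up query contains.

```python
class SearchSession:
    """Caches a user's search filter so real queries can skip rebuilding it."""

    def __init__(self, build_filter, solr_query):
        self.build_filter = build_filter  # expensive: resolves the user's scope
        self.solr_query = solr_query      # sends a query plus filter to Solr
        self.cached_filter = None

    def warm_up(self, user):
        """Called when the search box gains keyboard focus."""
        self.cached_filter = self.build_filter(user)
        # Assumed warm-up query: match-all, restricted by the cached filter.
        self.solr_query("*:*", self.cached_filter)

    def search(self, user, text):
        """Run a real query, reusing the cached filter when available."""
        if self.cached_filter is None:
            self.cached_filter = self.build_filter(user)
        return self.solr_query(text, self.cached_filter)


# Stubs that record calls, standing in for the real backend pieces.
filter_builds = []
queries = []


def build_filter(user):
    filter_builds.append(user)
    return f"owner_id:{user}"


def solr_query(text, fq):
    queries.append((text, fq))
    return []


session = SearchSession(build_filter, solr_query)
session.warm_up("user1")
session.search("user1", "solr")
```

The filter is built once during warm-up, so the first real query pays only for the search itself.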
36
What we are working on
37
Things we are working on
• Search suggestions
• Search operators
• Use machine learning to influence ranking
• Logical sharding
38
Questions?