Solr search engine: optimize is (not) bad for you
TRANSCRIPT
Agenda
• Segments – where, what & how
• Writing segments
• Modifying segments
• Segment merging – what, where, how, why
• Force merging
• Force merging & SolrCloud
• Performance considerations
• Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize
Solr Collection Architecture
[Diagram: ZooKeeper alongside four Solr nodes, each hosting two shards]
Lucene Segment
Segment Info
Field Names
Stored Field Values
Point Values
Term Dictionary
Term Frequency
Term Proximity
Normalization
Per-Document Values
Live Documents
Inside the Segment – Term Dictionary
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
TERM        DOCID
lucene      <1>, <2>
revolution  <1>, <2>
washington  <1>
boston      <2>
Files: _1.tim, _1.tip
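The term dictionary on this slide can be pictured as an inverted index that maps each term to the documents containing it. A minimal Python sketch, assuming whitespace tokenization and lowercasing of the Title field only; Lucene's real analysis chain and the on-disk .tim/.tip encoding are far more involved, and `build_term_dictionary` is a name made up for this illustration:

```python
from collections import defaultdict

def build_term_dictionary(docs):
    """Map each lowercased title term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, title in docs.items():
        for term in title.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "Lucene Revolution Washington",
    2: "Lucene Revolution Boston",
}
term_dictionary = build_term_dictionary(docs)
for term, doc_ids in term_dictionary.items():
    print(term, doc_ids)
```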
Inside the Segment – Doc Values
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID  FIELD  VALUE
1      Title  Lucene Revolution Washington
1      City   Washington D.C.
2      Title  Lucene Revolution Boston
2      City   Boston
Files: _1.dvd, _1.dvm
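Doc values keep a field's values column by column, keyed by document ID, which is the access pattern that sorting and faceting need. A rough Python sketch of that layout, using plain dictionaries as a stand-in for the .dvd/.dvm files (an assumption made purely for illustration; `sort_doc_ids_by` is a hypothetical helper):

```python
# Column-oriented layout: one mapping per field, keyed by doc ID.
doc_values = {
    "Title": {1: "Lucene Revolution Washington", 2: "Lucene Revolution Boston"},
    "City": {1: "Washington D.C.", 2: "Boston"},
}

def sort_doc_ids_by(field, doc_ids):
    """Sort doc IDs by one field without touching any other field's column."""
    column = doc_values[field]
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])

sorted_by_city = sort_doc_ids_by("City", [1, 2])
print(sorted_by_city)
```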
Inside the Segment – Stored Fields
Doc1 Title: Lucene Revolution Washington, City: Washington D.C
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID  VALUE
1      Title: Lucene Revolution Washington
       City: Washington D.C
2      Title: Lucene Revolution Boston
       City: Boston
Files: _1.fdx, _1.fdt
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
_2.cfs
_2.cfe
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
  { "id" : "3", "tags" : { "add" : [ "solr" ] } }
]'
Retrieve document: {"id" : 3, "tags" : [ "lucene" ], "awesome" : true}
Apply changes: {"id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true}
Delete old document
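The atomic update flow on these slides amounts to a read-modify-write cycle. A toy Python model of that cycle, assuming an in-memory dictionary in place of the index and a hypothetical `atomic_add` helper; the final re-indexing step is an assumption, since the slides do not caption it:

```python
import copy

def atomic_add(index, doc_id, field, values):
    """Toy model of an atomic 'add': read, modify, delete the old copy, write back."""
    updated = copy.deepcopy(index[doc_id])        # retrieve document
    updated.setdefault(field, []).extend(values)  # apply changes
    del index[doc_id]                             # delete old document
    index[doc_id] = updated                       # assumed: index the new version
    return updated

index = {"3": {"id": "3", "tags": ["lucene"], "awesome": True}}
result = atomic_add(index, "3", "tags", ["solr"])
print(result)
```

The point of the model is that the whole document is rewritten even though only one field changed.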
Atomic Updates – In Place
Works on numeric, doc values based fields
Fields must be non-indexed and non-stored
Doesn’t require delete/index
Supports only the inc and set modifiers
$ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{ "id" : "3", "views" : {
"inc" : 100}
} ]'
Retrieve document: {"id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true}
Apply changes: {"id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100}
Update doc values
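The conditions listed for in-place updates can be written down as a simple check. This sketch assumes a hypothetical field description dictionary with `type`, `docValues`, `indexed`, and `stored` keys; the type names are example placeholders, not a complete list of Solr numeric field types:

```python
NUMERIC_TYPES = {"pint", "plong", "pfloat", "pdouble"}  # assumed example type names
IN_PLACE_MODIFIERS = {"inc", "set"}

def can_update_in_place(field, modifier):
    """Return True when the slide's conditions for an in-place update all hold."""
    return (
        field["type"] in NUMERIC_TYPES
        and field["docValues"]
        and not field["indexed"]
        and not field["stored"]
        and modifier in IN_PLACE_MODIFIERS
    )

views = {"type": "plong", "docValues": True, "indexed": False, "stored": False}
tags = {"type": "string", "docValues": False, "indexed": True, "stored": True}
print(can_update_in_place(views, "inc"))
print(can_update_in_place(tags, "add"))
```

When the check fails, the update falls back to the regular atomic update path described earlier.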
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
Rapid segment changes – worse I/O cache usage
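One way to see why more segments mean slower searches: a query has to consult every segment and combine the partial results. A simplified Python model, assuming each segment is just a term-to-doc-ID dictionary (real segments are far richer, and `search` is a name made up for this sketch):

```python
def search(segments, term):
    """Look the term up in every segment and combine the hits."""
    hits = []
    lookups = 0
    for segment in segments:
        lookups += 1  # one term-dictionary lookup per segment
        hits.extend(segment.get(term, []))
    return sorted(hits), lookups

many_segments = [{"lucene": [1]}, {"lucene": [2]}, {"solr": [3]}, {"lucene": [4]}]
one_segment = [{"lucene": [1, 2, 4], "solr": [3]}]

hits_many, lookups_many = search(many_segments, "lucene")
hits_one, lookups_one = search(one_segment, "lucene")
print(hits_many, lookups_many)
print(hits_one, lookups_one)
```

Both layouts return the same hits; only the number of per-segment lookups differs.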
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
Segment Warmer
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
Taking Control – Default Indexing Throughput
Merge Policy Factory: same settings as in the configuration above
Taking Control – Max Merged Segment Size
Merge Policy Factory
<int name="maxMergedSegmentMB">5120</int>
Lower value: higher indexing throughput, smaller segments
Higher value: better search latency (depends), more merges
Taking Control – Lowering Max Merged Size
Merge Policy Factory
<int name="maxMergedSegmentMB">512</int>
Taking Control – Lowering Max Segment Size
throughput < 5k/sec @ ~15.5GB
11% throughput increase
Taking Control – Merge At Once
Merge Policy Factory
<int name="maxMergeAtOnce">10</int>
Lower value: better search latency (depends)
Higher value: higher indexing throughput
Taking Control – Lowering Merge At Once
Merge Policy Factory
<int name="maxMergeAtOnce">2</int>
Taking Control – Merge At Once Explicit
Merge Policy Factory
<int name="maxMergeAtOnceExplicit">30</int>
Controls the number of segments merged at once during a force merge
Taking Control – Segments Per Tier
Merge Policy Factory
<int name="segmentsPerTier">10</int>
Lower value means more merging, but fewer segments
Together with maxMergeAtOnce, can smooth out I/O spikes
For better indexing throughput set maxMergeAtOnce < segmentsPerTier
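The interplay of segmentsPerTier and maxMergeAtOnce can be explored with a toy simulation. This is a deliberately simplified model and not TieredMergePolicy's actual selection algorithm: it ignores size tiers, deletes, and maxMergedSegmentMB, and simply merges the smallest segments whenever the segment count exceeds the budget. The `simulate` function and its parameters are made up for this sketch.

```python
def simulate(flushes, segments_per_tier, max_merge_at_once):
    """Flush equal-sized segments one by one; merge the smallest when over budget."""
    if max_merge_at_once < 2:
        raise ValueError("a merge needs at least two segments")
    segments = []
    merges = 0
    for _ in range(flushes):
        segments.append(1)  # a freshly flushed segment of size 1
        while len(segments) > segments_per_tier:
            segments.sort()
            merged = sum(segments[:max_merge_at_once])
            segments = segments[max_merge_at_once:] + [merged]
            merges += 1
    return len(segments), merges

low_count, low_merges = simulate(100, segments_per_tier=5, max_merge_at_once=5)
high_count, high_merges = simulate(100, segments_per_tier=20, max_merge_at_once=5)
print("segmentsPerTier=5: ", low_count, "segments,", low_merges, "merges")
print("segmentsPerTier=20:", high_count, "segments,", high_merges, "merges")
```

Even in this crude model, the lower segmentsPerTier run ends with fewer segments at the cost of more merge operations.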
Taking Control – Combined Together
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">30</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">30</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">512</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Taking Control – Reclaim Deletes Weight
Merge Policy Factory
<double name="reclaimDeletesWeight">2.0</double>
Controls how much weight is given to merging segments with deleted documents
Increase it to prioritize merging segments with deleted documents
Taking Control – No CFS Ratio
Merge Policy Factory
<double name="noCFSRatio">0.1</double>
Controls the ratio of compound file system (CFS) segments
To completely disable CFS, set it to 0.0
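How noCFSRatio and maxCFSSegmentSizeMB interact can be sketched as a single decision function. The logic below is an approximation of how Lucene's merge policy is generally described (a merged segment is written as a compound file only if it is small relative to the whole index and below the size cap); it is not a copy of the actual implementation, and `use_compound_file` is a hypothetical name.

```python
def use_compound_file(segment_mb, total_index_mb,
                      no_cfs_ratio=0.1, max_cfs_segment_mb=2048):
    """Decide whether a newly merged segment would be stored as a compound file."""
    if no_cfs_ratio == 0.0:
        return False  # CFS completely disabled
    if segment_mb > max_cfs_segment_mb:
        return False  # too large for a compound file
    if no_cfs_ratio >= 1.0:
        return True   # every segment under the size cap uses CFS
    return segment_mb <= no_cfs_ratio * total_index_mb

print(use_compound_file(50, 1000))
print(use_compound_file(500, 1000))
print(use_compound_file(50, 1000, no_cfs_ratio=0.0))
```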
Taking Control – Merge Scheduler
maxMergeCount controls the maximum number of concurrent merges
maxThreadCount controls the number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
For SSDs set maxThreadCount to min(4, #CPUs / 2)
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">4</int>
  <int name="maxThreadCount">4</int>
</mergeScheduler>
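The thread count rule of thumb from this slide is easy to turn into a helper. A small Python sketch; the function name is hypothetical, and the `max(1, ...)` guard is an addition so that a single-core machine still gets one merge thread:

```python
import os

def suggested_max_thread_count(cpu_count, spinning_disk):
    """Apply the slide's rule of thumb for maxThreadCount."""
    if spinning_disk:
        return 1
    # max(1, ...) is an added guard, not part of the slide's formula
    return max(1, min(4, cpu_count // 2))

cpus = os.cpu_count() or 1
print("SSD:", suggested_max_thread_count(cpus, spinning_disk=False))
print("Spinning disk:", suggested_max_thread_count(cpus, spinning_disk=True))
```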
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
$ curl 'http://solr:8983/solr/lr/update?optimize=true&maxSegments=1&waitFlush=false'
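When force merges are triggered from a script, it can help to build the request URL programmatically. The sketch below only constructs the URL and sends nothing. It uses `maxSegments`, the parameter name the Solr reference guide gives for capping the segment count; the host, collection name, and the `optimize_url` helper are assumptions for illustration.

```python
from urllib.parse import urlencode

def optimize_url(base_url, collection, max_segments=1):
    """Build the update URL that asks Solr to force merge a collection."""
    params = urlencode({
        "optimize": "true",
        "maxSegments": max_segments,
    })
    return f"{base_url}/solr/{collection}/update?{params}"

url = optimize_url("http://solr:8983", "lr")
print(url)
```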
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
Reduces number of used files
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
Will cause temporary increase of disk usage (up to 3x)
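The "up to 3x" figure translates into a simple pre-flight check before running a force merge. A back-of-the-envelope Python sketch; the function names are hypothetical and the 3x peak factor is taken from the slide as a worst case, not measured:

```python
def free_space_needed_gb(index_size_gb, peak_factor=3.0):
    """Extra free disk space needed if usage peaks at peak_factor times the index size."""
    return index_size_gb * (peak_factor - 1.0)

def can_force_merge(index_size_gb, free_disk_gb, peak_factor=3.0):
    """Return True when the free disk space covers the assumed worst-case peak."""
    return free_disk_gb >= free_space_needed_gb(index_size_gb, peak_factor)

print(free_space_needed_gb(100))
print(can_force_merge(100, free_disk_gb=250))
print(can_force_merge(100, free_disk_gb=150))
```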
Force Merge – Legacy
Index on the master server
Force merge on the master server
Replicate after optimize is done
[Diagram: documents are indexed on the Solr master, the master is force merged, and three Solr slaves pull the index after optimize]
Force Merge – SolrCloud (Solr 7 – pull replicas)
Create collection
Force merge
Solr will do the rest
[Diagram: four Solr nodes hosting Primary 1, Primary 2, Pull Replica 1, and Pull Replica 2]
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Create replicas
[Diagram: four Solr nodes; Primary 1 and Primary 2 are created, indexed, and optimized first, then Replica 1 and Replica 2 are added]
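The pre-7.0 procedure can be sketched as a sequence of HTTP calls against the Collections API and the update handler. The Python below only builds the URLs and sends nothing. The node names and the `force_merge_plan` helper are hypothetical, and the shard names `shard1`, `shard2`, ... are assumed to follow the default naming.

```python
from urllib.parse import urlencode

def force_merge_plan(base_url, collection, primary_nodes, replica_nodes):
    """Return the URLs for: create on some nodes, force merge, then add replicas."""
    admin = f"{base_url}/solr/admin/collections"
    steps = [
        # 1. Create the collection on a subset of nodes, one replica per shard.
        admin + "?" + urlencode({
            "action": "CREATE",
            "name": collection,
            "numShards": len(primary_nodes),
            "replicationFactor": 1,
            "createNodeSet": ",".join(primary_nodes),
        }),
        # 2. Indexing happens here (not modelled).
        # 3. Force merge while only the primaries exist.
        f"{base_url}/solr/{collection}/update?" + urlencode({"optimize": "true"}),
    ]
    # 4. Add replicas on the remaining nodes; they copy the already merged index.
    for shard_number, node in enumerate(replica_nodes, start=1):
        steps.append(admin + "?" + urlencode({
            "action": "ADDREPLICA",
            "collection": collection,
            "shard": f"shard{shard_number}",
            "node": node,
        }))
    return steps

plan = force_merge_plan(
    "http://solr:8983", "lr",
    primary_nodes=["node1:8983_solr", "node2:8983_solr"],  # hypothetical node names
    replica_nodes=["node3:8983_solr", "node4:8983_solr"],
)
for step in plan:
    print(step)
```

Creating the replicas last means the expensive rewrite happens once per shard instead of once per replica.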
Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapper.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
  <double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
Pre-sorts data during merge, which gives:
- faster range queries
- faster data retrieval
- possibility of early query termination
Convenient for time-based data
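Early query termination is the least obvious of these benefits, so here is a small Python illustration. It assumes each segment is a list of timestamps already sorted in descending order, as the "timestamp desc" sort would leave them; the `newest` helper is made up for this sketch and is not how Solr implements it.

```python
import heapq
from itertools import islice

def newest(segments, n):
    """Return the n largest timestamps, reading at most n entries per segment."""
    # Each segment is already sorted by timestamp descending, so its first n
    # entries are the only possible candidates; the rest is never read.
    candidates = [list(islice(segment, n)) for segment in segments]
    return list(islice(heapq.merge(*candidates, reverse=True), n))

segments = [
    [50, 40, 30, 20, 10],
    [45, 35, 25],
]
top3 = newest(segments, 3)
print(top3)
```

Without the pre-sorted order, every entry of every segment would have to be examined to find the newest documents.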
Get in touch
Rafał
@kucrafal
http://sematext.com
@sematext
http://sematext.com/jobs
Come talk to us at the booth