sphinx tech overview - archive.fosdem.org...meet sphinx •created in early 200x as an alternative...

26
Sphinx search technical overview Vladimir Fedorkov Open Source Search Devroom FOSDEM’15

Upload: others

Post on 02-Oct-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Sphinx search technical overview

Vladimir Fedorkov

Open Source Search Devroom

FOSDEM’15

Page 2: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

About me

• Performance geek

– blog http://astellar.com

– Twitter @vfedorkov

• Enjoy LAMP stack tuning

– Especially database backend

• Love to speak on the conferences

• Use Sphinx in production from 2006

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 3: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Meet Sphinx

• Created in early 200x as an alternative to MySQL full-text search

• Written on C++

• Working as separate daemon

• Running on various platforms *nix, win*, etc

– Seen on iPhones and WiFi routers

• Now serving installations with billions or documents.

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 4: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Architecture sample: querying

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 5: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Agenda

• Loading data

• Current storage types

• Querying Sphinx

• Full text vs non-full-text

• Getting results

• Life after the search

• Grow Sphinx from node to cluster

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 6: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Loading data into Sphinx

• Sphinx is talking to databases to pull data

– MySQL, PostgreSQL, MSSQL and any ODBC source

• Loading structured data in XML format

– Useful to load data from NoSQL storages

• like Mongo, etc

– Can be used for document pre-processing

• SQL-style updates

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 7: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Storage types

• Real-time indexes – Push mode

• Application pushes data to Sphinx

– Ideal for frequently updated data

• On-disk (plain) indexes – Data pull mode

• Sphinx handling indexing on itself

– Ideal for static data

• Or else:

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 8: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

On disk vs Real-time indexes

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 9: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Querying

• SphinxQL: – Uses MySQL client lib to connect to sphinx – Available in most programming languages

• Legacy API – PHP, Python, Java, Ruby, C is included in distro – .NET, Rails (via Thinking Sphinx) via third party libs

mysql> SELECT * FROM sphinx_index

-> WHERE MATCH('I love Sphinx')

-> AND news_channel = 285

-> LIMIT 5;

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 10: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

How does it work?

• Query pre processing

• Full-text search stage

• Non-full text filtering

• Ranking / Grouping / Ordering

• Applying limit

• Sending results back

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 11: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Query & text pre-processing

• Removing stop words

• Transforming text

– Applying morphology, blended chars, filters, replacements

• Prefix/infix indexing

• Other “magic”

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 12: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Full-Text support

• And, Or – hello | world, hello & world

• Not – hello -world

• Per-field search – @title hello @body world

• Field combination – @(title, body) hello world

• Search within first N – @body[50] hello

• Phrase search – “hello world”

• Per-field weights

• Proximity search – “hello world”~10

• Distance support – hello NEAR/10 world

• Quorum matching – "the world is a wonderful

place"/3

• Exact form modifier – “raining =cats and =dogs”

• Strict order • Sentence / Zone / Paragraph • Custom documents weighting

& ranking, etc

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 13: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Non text filters

• in SphinxQL terms, WHERE conditions

– a = 5, a < 5, a > 5, a BETWEEN 3 AND 5

• Integers, floating point, strings are supported

• JSON

– SELECT ALL(x>3 AND x<7 FOR x IN j.intarray)

– SELECT j.users[3].address[2].streetname

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 14: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Special integers: MVAs

• Built in “one–to–many” attributes

• Set of integers in a single value

• Useful for

– Page tag IDs

– Multi category items

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 15: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

GEO-Distance support

• Bumping up and/or filtering local results

– Just add float latitude, longitude attributes, and..

• GEODIST(Lat, Long, Lat2, Long2) in Sphinx

• Has syntax for mi/km/m, deg/rad etc

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 16: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Relevance tuning

• Weighting

– Per field

– Per index

• Expression based ranking

– 15+ of text signals, N of yours non-text

• OPTION ranker=expr(‘1000*sum(lcs)+bm25’)

• OPTION ranker=expr(‘700*sum(lcs)+bm25f(1.4, 0.8, {title=3, content=1}’)

– Several built-in rankers available

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 17: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Reading results

mysql> SELECT * FROM idx

-> WHERE MATCH('I love Sphinx') LIMIT 5

-> OPTION field_weights=(title=100, content=1);

+---------+--------+------------+------------+

| id | weight | channel_id | ts |

+---------+--------+------------+------------+

| 7637682 | 101652 | 358842 | 1112905663 |

| 6598265 | 101612 | 454928 | 1102858275 |

| 6941386 | 101612 | 424983 | 1076253605 |

| 6913297 | 101584 | 419235 | 1087685912 |

| 7139957 | 1667 | 403287 | 1078242789 |

+---------+--------+------------+------------+

5 rows in set (0.00 sec)

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 18: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Life after search

• CALL SNIPPETS, making excerpts

• Building facets (Brands, price ranges)

• Showing related items

• Performing misspells corrections

• “Did you mean” service

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 19: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Combining indexes

• On the single box

– Main + Delta

– Main + Delta + RT

• On the cluster

– Local and distributed

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 20: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Distributed search

• Yet static nodes configuration

• Weighted round-robin querying

• Load-based distribution

• Failover node

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 21: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Sphinx search cluster architecture

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 22: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Sphinx cluster data flow

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 23: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

News from the Lab

• New index format in Sphinx 3.0

– Faster indexing and search

• No legacy 4/16Gb attribute limits per index

• Data replication between nodes

• HTTP/REST interface

• Even faster snippets

• Some secret projects I can’t talk about

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 24: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

Find more about Sphinx

• Official website: http://sphinxsearch.com

• My blog http://astellar.com

– Some information you may find useful

– Slides will be there

• Twitter: @vfedorkov

– Mainly Sphinx and MySQL performance

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 25: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

QUESTIONS!

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM

Page 26: Sphinx tech overview - archive.fosdem.org...Meet Sphinx •Created in early 200x as an alternative to MySQL full-text search •Written on C++ •Working as separate daemon •Running

THANK YOU!

Open Source Search Devroom, FOSDEM’15 ASTELLAR.COM