Search Intelligence @ elo7.com Fernando Meyer, Felipe Besson March 9, 2013



Outline

• Some data about our data

• Some history

• Apache Solr

– How Lucene Works

– Examples: Terms, Inverted index

• How a result is scored against a query in Lucene

– Lucene conceptual Scoring formula [?]

• How have we optimized our index

• How to declare a solr index

• Infrastructure Upgrade

– version 2 - single node

– version 3 - current infrastructure

• Frenzy API

• Example of product operation

• Content recommendation

• Architecture

• Current Scenario

• Future Work

– Content Tracker

– BigData Analytics


About

Fernando Meyer - Undergraduate degree in Applied Mathematics from the University of São Paulo. He has more than 12 years of experience in R&D, deploying cool systems for companies like RedHat (JBoss), Globo and Locaweb. He is currently focusing his research and interests on machine learning, information retrieval and statistics.

Felipe Besson - B.S. in Information Systems and Master's in Computer Science from the University of São Paulo, Brazil. His research focused on automated testing of web service compositions. Now he is expanding his horizons by working with search, data mining, machine learning and other geek stuff.


Some data about our data

• 3000 (avg.) queries per second

• from 3500 to 4200 users on site per minute

• 15000 requests per minute on AppServer

• 160000 (avg.) bot/requests per day

• 1200000 indexed products

• 20000 active sellers


Some history

• Search v0.0 - select * from product where text like '%query%'

• Search v0.1 - Sphinx

– No delta index

– Poor indexing and query performance on large-scale datasets

• Search v1.0 - Apache Solr


Apache Solr

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.
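To illustrate the HTTP API, here is a minimal sketch that builds a query URL for Solr's select handler and asks for JSON output. The host, port and path are assumptions for a local single-core install and are not taken from the slides; the example query reuses the `title` field declared later in the deck.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL of a local single-core Solr; adjust host, port and path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10, start=0, base_url=SOLR_SELECT_URL):
    """Return a Solr select URL that requests a JSON response."""
    params = {
        "q": query,      # query in Lucene query syntax
        "wt": "json",    # response writer: JSON instead of the default XML
        "rows": rows,    # page size
        "start": start,  # offset of the first result
    }
    return f"{base_url}?{urlencode(params)}"


url = build_select_url("title:abajur", rows=5)
print(url)
# Decode the query string again to show what Solr would receive.
print(parse_qs(urlsplit(url).query))
```

Any HTTP client can then fetch this URL, which is what makes Solr usable from virtually any language.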


How Lucene Works

Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast.


Examples

Terms

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

Inverted index

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
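The index above can be reproduced with a few lines of Python. This is only a sketch of the idea (whitespace tokenization, postings stored as (document, position) pairs), not how Lucene stores its index; the function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a list of (document id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)


def search_all(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [{doc_id for doc_id, _ in index.get(term, [])}
                for term in query.split()]
    return set.intersection(*postings) if postings else set()


T = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(T)
for term in sorted(index):
    print(repr(term), index[term])
print(search_all(index, "what is"))
```

A query only touches the postings of its own terms, which is why lookups stay fast as the collection grows.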


How a result is scored against a query in Lucene

A.K.A. the answer to the dollar question: why isn't this product appearing when searching for "bleh"?

Lucene conceptual Scoring formula [?]

\[
\mathrm{score}(q,d) = \text{coord-factor}(q,d) \cdot \text{query-boost}(q) \cdot \frac{A \cdot B}{\|A\|\,\|B\|} \cdot \text{doc-len-norm}(d) \cdot \mathrm{score}(d)
\]
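The central factor of the formula is a cosine similarity. The sketch below assumes that A and B are the term vectors of the query and of the document, and uses raw term counts as weights for simplicity; Lucene's real weighting and the other factors (coord, boosts, length norm) are left out.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    a = Counter(query.lower().split())
    b = Counter(document.lower().split())
    dot = sum(a[term] * b[term] for term in a)  # missing terms count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["it is what it is", "what is it", "it is a banana"]
for doc in docs:
    print(doc, "->", round(cosine_similarity("banana", doc), 3))
```

A product that shares no terms with the query gets a similarity of zero, which is the most common reason it does not appear for that query.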


How have we optimized our index

<fieldType name="text_pt_br" class="solr.TextField" positionIncrementGap="100">
  <!-- Analysis chain applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="com.elo7.solr.analysis.OrengoStemmerFilterFactory"
            exceptionList="stemmerignore.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- Analysis chain applied to user queries; synonyms are expanded here -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="com.elo7.solr.analysis.OrengoStemmerFilterFactory"
            exceptionList="stemmerignore.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
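To get a feeling for what this chain does to a piece of text, here is a rough Python imitation of four of its steps: whitespace tokenization, stop-word removal, lowercasing and ASCII folding. It is a simplified sketch, not Solr's implementation: the word delimiter, synonym, stemmer and duplicate-removal filters are left out, and the stop-word set and sample text are assumptions.

```python
import unicodedata


def fold_to_ascii(token):
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def analyze(text, stopwords):
    """Apply a simplified version of the text_pt_br analysis chain."""
    tokens = text.split()                                       # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in stopwords]  # stop filter, ignoring case
    tokens = [t.lower() for t in tokens]                        # lowercase filter
    tokens = [fold_to_ascii(t) for t in tokens]                 # ASCII folding filter
    return [t for t in tokens if t]                             # drop tokens folded to nothing


STOPWORDS = {"de", "para"}  # assumed sample; the real list lives in stopwords.txt
print(analyze("Abajur de Mesa para São Paulo", STOPWORDS))
```

Because the same steps run at index time and at query time, "SÃO" in a query and "são" in a product title end up as the same term.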


How to declare a solr index

<field name="id" type="int" indexed="true" stored="true" required="true"/>
<field name="title" type="text_pt_br" indexed="true" stored="true"/>
<field name="description" type="text_pt_br" indexed="true" stored="false"/>
<field name="tags" type="text_pt_br" indexed="true" stored="true" multiValued="true"/>
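A document for this schema can be sent to Solr as an XML update message of the form `<add><doc><field name="...">...</field></doc></add>`. The sketch below only builds that message; the product values are hypothetical sample data, and posting the message to the update handler is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(product):
    """Serialize a dict as a Solr <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in product.items():
        # A multiValued field (such as tags) becomes one <field> per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical product, shaped after the four fields declared above.
product = {
    "id": 201209,
    "title": "Abajur de mesa",
    "description": "Abajur artesanal",
    "tags": ["abajur", "decoração"],
}
message = to_solr_add_xml(product)
print(message)
```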


Infrastructure Upgrade

version 2 - single node

• Scaling issues

• m1.xlarge => m2.2xlarge => c1.xlarge (90% CPU)

• Solr 3.6

• Full index with Ruby scripts (a full index takes 3.5 hours)


version 3 - current infrastructure

• 3 m1.xlarge (20% CPU usage) behind an Amazon ELB

• 1 m1.xlarge Search API (staging, serving 50% of logged-in users)

• Solr Data Importer (a full index takes 15 minutes)


Frenzy API

Solr environment evolution

• Operations: Searching, indexing and deleting

• Resources: Products, stores, auto-complete suggestions and categories

• Recommendations

Advantages

• Removing search and indexing logic from marketplace

• Providing a search service to other applications (e.g., mobile)


Example of product operation

Searching

• input (GET): query term

– filters: city, min. price and max. price

– sort: featured, organic, oldest, newest, ...

• output (json)

– metadata (query status, response time and hits)

– list of products

– references (previous and next page urls)
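The slides do not show the actual JSON, so the following is only a sketch of how a client might read such a response. Every key name, the sample payload and the page URL are hypothetical; only the three parts (metadata, list of products, references) come from the description above.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchResponse:
    status: str
    response_time_ms: int
    hits: int
    products: List[dict]
    previous_page: Optional[str]
    next_page: Optional[str]


def parse_search_response(payload: str) -> SearchResponse:
    """Parse a (hypothetical) search response into its three parts."""
    data = json.loads(payload)
    meta = data["metadata"]
    refs = data.get("references", {})
    return SearchResponse(
        status=meta["status"],
        response_time_ms=meta["response_time_ms"],
        hits=meta["hits"],
        products=data.get("products", []),
        previous_page=refs.get("previous"),
        next_page=refs.get("next"),
    )


# Hypothetical payload; the real field names are not given in the slides.
SAMPLE = """
{
  "metadata": {"status": "ok", "response_time_ms": 12, "hits": 2},
  "products": [{"id": 201209, "title": "Abajur de mesa"},
               {"id": 204439, "title": "Abajur infantil"}],
  "references": {"previous": null, "next": "frenzy/products?q=abajur&page=2"}
}
"""
response = parse_search_response(SAMPLE)
print(response.hits, [p["id"] for p in response.products], response.next_page)
```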


Content recommendation

• Collaborative filtering (user similarity)

• Based on user favorited products

Input (GET)

• frenzy/users/:id/recommendations

Output: (similar to search output)
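A minimal sketch of user-based collaborative filtering over favorited products follows. The slides only say "user similarity", so the Jaccard measure, the scoring rule and the sample data are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, favorites, top_n=5):
    """Rank products favorited by similar users that `user` has not favorited yet."""
    mine = favorites.get(user, set())
    scores = {}
    for other, theirs in favorites.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for product in theirs - mine:
            scores[product] = scores.get(product, 0.0) + similarity
    # Highest score first; ties broken by product id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


# Hypothetical favorites: user id -> set of favorited product ids.
favorites = {
    "u1": {1, 2, 3},
    "u2": {2, 3, 4},
    "u3": {3, 5},
    "u4": {9},
}
print(recommend("u1", favorites))
```

Here product 4 ranks above product 5 because u2 shares more favorites with u1 than u3 does, and product 9 is never suggested because u4 has nothing in common with u1.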


Architecture


Current Scenario

• Experimental stage

• Search operations are being integrated

• 50% of searches by logged-in users go through the API

• The recommendation API is still evolving


Future Work

Content Tracker

We need to understand, track, analyse and take advantage of our users' navigation patterns.

• Every user receives a unique ID

• This ID follows all of the user's interactions with the website

• Whenever a user interacts with a product (views it, adds it to favorites, shares it socially, adds it to the cart, or buys it), we trigger a conversion action.


SearchID  UserID  Term      pgN  Filters
A376AC    e00c59  "abajur"  1    Nil
A376AD    e00c59  "abajur"  1    "pr:[10.0,15.0]"
A376AE    e00c59  "abajur"  1    "pr:[10.0,15.0] city:curitiba"

Table 1: Search Action logger


ViewID  SearchID  PRDID   PPP
000001  A376AE    201209  1
000002  A37FED    204439  5
000003  EDA342    202234  1
000004  EFDBC1    231324  5
000005  EDA563    214512  2
000006  EFA564    264553  13

Table 2: Product View logger


ActionID  ViewID  type
000001    000001  cart
000002    000002  fav
000003    000005  cart
000004    000004  social
000005    000003  ship
000006    000006  contact

Table 3: Product Action logger


ActionID  convert
000001    true
000002    true
000003    false
000004    false
000005    false
000006    true

Table 4: Action to convert
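The logs chain together through their IDs: an action points to a view, and a view points to a search and a product. As a sketch of how they could be joined, the code below copies the rows of Tables 2, 3 and 4 into dictionaries and follows ActionID -> ViewID -> PRDID. The in-memory layout and the function names are assumptions; the slides do not say how the logs are stored or queried.

```python
# ViewID -> (SearchID, PRDID, PPP), rows from Table 2.
views = {
    "000001": ("A376AE", "201209", 1),
    "000002": ("A37FED", "204439", 5),
    "000003": ("EDA342", "202234", 1),
    "000004": ("EFDBC1", "231324", 5),
    "000005": ("EDA563", "214512", 2),
    "000006": ("EFA564", "264553", 13),
}
# ActionID -> (ViewID, type), rows from Table 3.
actions = {
    "000001": ("000001", "cart"),
    "000002": ("000002", "fav"),
    "000003": ("000005", "cart"),
    "000004": ("000004", "social"),
    "000005": ("000003", "ship"),
    "000006": ("000006", "contact"),
}
# ActionID -> convert, rows from Table 4.
converted = {
    "000001": True, "000002": True, "000003": False,
    "000004": False, "000005": False, "000006": True,
}


def converted_products(views, actions, converted):
    """Follow ActionID -> ViewID -> PRDID for every action that converted."""
    products = set()
    for action_id, (view_id, _action_type) in actions.items():
        if converted.get(action_id):
            _search_id, product_id, _ppp = views[view_id]
            products.add(product_id)
    return products


def conversion_rate(actions, converted):
    """Share of logged actions that ended in a conversion."""
    if not actions:
        return 0.0
    return sum(1 for action_id in actions if converted.get(action_id)) / len(actions)


print(sorted(converted_products(views, actions, converted)))
print(conversion_rate(actions, converted))
```

The same join can be extended one step further, from SearchID to the search term and filters in Table 1, to see which searches lead to conversions.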


BigData Analytics

• Product conversion per channel

• Consumer behaviour

• Trends

• Better recommendation (including new users)

• Better email marketing (attractiveness)

• Per product stats (Clicks/Impressions/CTR)
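As an illustration of the per-product stats, CTR is simply clicks divided by impressions. The sketch below assumes two hypothetical event lists of product IDs, one entry per impression and one per click; the slides do not describe the real event format.

```python
from collections import Counter


def product_stats(impressions, clicks):
    """Compute impressions, clicks and CTR for every product that was shown."""
    shown = Counter(impressions)
    clicked = Counter(clicks)
    return {
        product_id: {
            "impressions": count,
            "clicks": clicked[product_id],
            "ctr": clicked[product_id] / count,
        }
        for product_id, count in shown.items()
    }


# Hypothetical events: one product id per impression or click.
impressions = ["201209", "201209", "201209", "201209", "204439", "204439"]
clicks = ["201209", "204439"]
stats = product_stats(impressions, clicks)
for product_id, row in stats.items():
    print(product_id, row)
```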
