fibep wmic 2015 - how infomedia upgraded their closed-source search engine to a fast, scalable and...


Upload: charlie-hull

Post on 15-Apr-2017


TRANSCRIPT

Page 1: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

FIBEP World Media Intelligence Congress, 17-20 November 2015, Vienna
www.wmicongress.com

Session Title: How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform
2015-11-19
Speaker: Kristian Schou, Infomedia & Charlie Hull, Flax
Twitter: @InfomediaDK @Flaxsearch
Web: www.flax.co.uk

Page 2: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

About Infomedia

• Founded in 2003
• The leading Danish provider of media monitoring and media analysis
• Largest and oldest Danish media archive, with access to approximately 75 million searchable articles

@_FIBEP #_FIBEP #WMIC15 2015-11-19

Page 3: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

About Flax

• Founded in 2001 in Cambridge, U.K.
• Independent, honest advice and analysis
• Expert design & development, Apache Solr committers
• Test-driven relevancy and performance tuning
• Custom training & mentoring for your staff
• Flexible support up to 24/7/365 with SLAs
• Some of our clients:

Page 4: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

The situation at Infomedia in 2013

• Very old media monitoring system based on Verity
  • Verity was put into production in 2001 at the company that would later become Infomedia!
• Slightly less old installation of Autonomy IDOL used for Infomedia’s Media Archive
  • Put into production at Infomedia in 2009/10
• Drawbacks:
  – Verity at almost max capacity, needing constant attention
  – Old and complex workflow for receiving and processing articles
  – Different platforms for monitoring and archive searches meant we were ‘bi-lingual’, using two different query languages in-house
  – Verity no longer supported by the owning company (HP)
  – Verity not scalable!

Page 5: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

What to do?

• Different upgrading options explored throughout 2011-2012
  • Upgrade everything to Autonomy IDOL?
  • Switch to other commercial search engine?
  • Go open-source?
• Recommendations and internal testing drew us to Apache Solr, an open source enterprise search platform
• Advantages:
  – Transparency (going from commercial to open-source)
  – Rapid maturity of Solr: development moving very fast
  – Large and active Solr community
  – Customizability
  – Solr is known to be fast and highly scalable
  – No license fees

Page 6: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Defining the project with Flax

• Infomedia searched for Solr expertise in Denmark/Scandinavia
  – Could not find an option that we were comfortable with
• Introduced to Flax through networking and recommendations
  – Experience from similar upgrade projects with Gorkana and AAP
  – Very impressed with Flax’s insight, knowledge and credentials
  – Actual committer to Apache Solr
• Project began in autumn of 2013 with the goals of:
  – Building a completely new search architecture to replace Verity and IDOL
  – Defining Infomedia's own query language, IQL, owned and controlled by Infomedia
  – Translating old monitoring queries (approx. 8,000) to this new IQL syntax

Page 7: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Replacing Verity

• Verity replaced by Flax Monitor
  – Parses IQL to Lucene queries
  – Runs on 2 servers
  – Uses Luwak, Flax's 'stored search' library:
    • Built on Apache Lucene (as is Solr)
    • Also used by Bloomberg, Booz Allen Hamilton & others
    • In use for 1m stored searches (some 250k characters), 1m stories/day
    • 40x faster than Elasticsearch Percolator
    • Open source at https://github.com/flaxsearch/luwak

Page 8: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Turning search upside down

[Diagram labels: Docs, Stored Queries (Query, Query), Result, "$$$"]

Page 9: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Turning search upside down

[Diagram labels: Docs, Stored Queries (Query, Query), Result, "$$$"]
• 1 million queries, some 250k long, complex rules
• 1 million new documents a day
• Within 5-100ms

Page 10: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Turning search upside down

[Diagram labels: Docs, Stored Queries (Query, Query), Result, "$$$$$$"]
• 1 million queries, some 250k long, complex rules
• 1 million new documents a day
• Within 5-100ms

Page 11: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Turning search upside down

[Diagram labels: Docs, Stored Queries (Query, Query), Result, "$$$$$$"]
• 1 million queries, some 250k long, complex rules
• 1 million new documents a day
• Within 5-100ms

Page 12: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Turning search upside down

[Diagram labels: Docs, Doc, "1. Pre", Stored Queries (Query, Query), Query Subset (~200)]
• 1 million queries, some 250k long, complex rules
• 1 million new documents a day

Page 13: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Turning search upside down

[Diagram labels: Docs, Doc, "1. Pre", Stored Queries (Query, Query), Query Subset (~200), "2. Search", Result]
• 1 million queries, some 250k long, complex rules
• 1 million new documents a day
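The two numbered steps on the diagram (narrow the stored queries to a small subset, then search only that subset against the document) can be illustrated with a minimal Python sketch. This is not Luwak's API or implementation: the `StoredQuery` and `Monitor` classes, the "all required terms, no excluded terms" query model and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


class StoredQuery:
    """Hypothetical stored search: every required term present, no excluded term present."""

    def __init__(self, query_id, required, excluded=()):
        if not required:
            raise ValueError("a stored query needs at least one required term")
        self.query_id = query_id
        self.required = frozenset(required)
        self.excluded = frozenset(excluded)

    def matches(self, doc_terms):
        return self.required <= doc_terms and not (self.excluded & doc_terms)


class Monitor:
    """Hypothetical monitor: indexes the queries, then runs each document against them."""

    def __init__(self):
        self.queries = {}
        self.term_index = defaultdict(set)  # term -> ids of queries filed under it

    def register(self, query):
        self.queries[query.query_id] = query
        # File the query under one of its required terms. A document that
        # lacks this term cannot match, so the query can be skipped for it.
        self.term_index[min(query.required)].add(query.query_id)

    def candidates(self, doc_terms):
        # Step 1: select the subset of queries that could possibly match.
        ids = set()
        for term in doc_terms:
            ids |= self.term_index.get(term, set())
        return ids

    def match(self, text):
        doc_terms = frozenset(text.lower().split())
        # Step 2: fully evaluate only the candidate subset.
        subset = self.candidates(doc_terms)
        return sorted(q for q in subset if self.queries[q].matches(doc_terms))


monitor = Monitor()
monitor.register(StoredQuery("q1", ["solr", "verity"]))
monitor.register(StoredQuery("q2", ["solr"], excluded=["verity"]))
monitor.register(StoredQuery("q3", ["elasticsearch"]))

hits = monitor.match("Solr replaces the old Verity engine")
print(hits)
```

If the "~200" on the slide is read as the typical subset size per document (an interpretation, not something the slide states), the saving is easy to estimate: evaluating 1 million queries against each of 1 million daily documents would be 10^12 evaluations a day, whereas 200 per document is 2 x 10^8.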

Page 14: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Replacing Autonomy IDOL

• Autonomy IDOL replaced by Apache Solr
  – Parses IQL to Lucene queries
  – SolrCloud distributes the index & queries across several servers
  – Setup: 75 million documents hosted on 8 servers, with 6 cores/24 GB memory and 125 GB storage per server
  – This setup is doubled to have full redundancy
  – Features added to standard Solr by Flax:
    • Custom highlighting
    • Framework to handle multiple languages
    • Extended error logging
    • Cluster management
    • Performance enhancements for complex wildcard queries
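The setup figures above imply some simple totals, worked through in the short Python sketch below. It assumes the documents are spread evenly across the 8 servers and that "doubled" means a second, identical set of 8 servers; the slides do not state how the index is actually sharded, so the per-server document count is only an estimate.

```python
# Figures taken from the slide.
DOCUMENTS = 75_000_000
SERVERS_PER_SET = 8
CORES_PER_SERVER = 6
MEMORY_GB_PER_SERVER = 24
STORAGE_GB_PER_SERVER = 125

# Assumption: "doubled" means two identical sets of servers.
SETS = 2

# Assumption: documents are distributed evenly across one set.
docs_per_server = DOCUMENTS // SERVERS_PER_SET

total_servers = SERVERS_PER_SET * SETS
total_cores = total_servers * CORES_PER_SERVER
total_memory_gb = total_servers * MEMORY_GB_PER_SERVER
total_storage_gb = total_servers * STORAGE_GB_PER_SERVER

print(f"documents per server: {docs_per_server:,}")
print(f"servers: {total_servers}, cores: {total_cores}")
print(f"memory: {total_memory_gb} GB, storage: {total_storage_gb} GB")
```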

Page 15: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Benefits of the project

• Articles indexed and searchable within minutes of receiving them
• New, much smarter tools for constructing and comparing monitoring queries
• The Flax Monitor is an extremely smart and performant monitoring solution
• Huge benefits from defining the Infomedia Query Language, IQL
  – Extremely enlightening and empowering process to analyze what we actually need from a query language
  – We fully understand and have documented how IQL works
  – IQL is designed to match Infomedia’s demands and preferences
  – We can revise and expand IQL as new needs and opportunities arrive
  – Not bound to any search platform. We can take it with us

Page 16: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Learnings/Where are we now?

• A challenging, complex, time-consuming but ultimately rewarding project
  • The ripple effect: we have had to revisit and update a lot of legacy systems
  • Customization is great, but can also mean more specification
  • Open Source prevents lock-in but demands investment in education; otherwise it is still just a magic box
  • Flax's expert knowledge has been invaluable
• A successful migration
  • More than 90% of Infomedia’s monitoring queries have been migrated to IQL with practically no negative change in precision or recall
• The collaboration with Flax continues
  • As Infomedia develops, so do new ideas and feature requests
  • A customized open source platform also means continuous improvement
• Currently updating to Solr 5.3
  • Still experimenting with different ways to scale our Solr installation
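One way to check a migrated query for a change in precision or recall is to compare the articles it returns with a reference set for the same query. The Python sketch below is illustrative only: the slides do not describe how Infomedia measured this, and treating the legacy query's results as the reference set, as well as the article ids, are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of `retrieved` measured against `relevant`."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Convention chosen here: an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(retrieved) if retrieved else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical article ids returned by the same monitoring query on each system.
legacy_hits = {"a1", "a2", "a3", "a4"}
migrated_hits = {"a2", "a3", "a4", "a5"}

precision, recall = precision_recall(migrated_hits, legacy_hits)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the legacy results may themselves be wrong, a drop in either number is a prompt to inspect the differing articles by hand rather than proof that the migrated query is worse.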

Page 17: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Other lessons

• You can also keep your old query language
  - Flax have written dtSearch & Verity parsers for Lucene
• Some of your old queries might not be working
  - e.g. Verity doesn't always tell you when queries are broken!
• Open source can help future-proof your search
  - and you have control of the software
• Engage with the open source community:
  - User groups
  - Mailing lists
  - Contribute back if you can
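Since a legacy engine may accept a broken query without complaint, it can help to run a basic structural check over stored queries before migrating them. The Python sketch below is a generic illustration, not Verity, dtSearch or IQL syntax: it assumes only that queries use parentheses for grouping and double quotes for phrases, and the `find_problems` helper is hypothetical.

```python
def find_problems(query):
    """Return a list of structural problems found in a boolean query string."""
    problems = []
    depth = 0
    in_quote = False
    for pos, ch in enumerate(query):
        if ch == '"':
            in_quote = not in_quote
        elif in_quote:
            continue  # parentheses inside a quoted phrase are plain text
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at position {pos}")
                depth = 0
    if in_quote:
        problems.append("unterminated quoted phrase")
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    return problems


for q in ['(solr AND "open source") OR lucene',
          '(solr AND "open source',
          'solr) OR verity']:
    print(q, "->", find_problems(q) or "ok")
```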

Page 18: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform

Thanks for listening - any questions?

Kristian Schou, Infomedia & Charlie Hull, Flax
@InfomediaDK @Flaxsearch
Web: www.flax.co.uk

Page 19: FIBEP WMIC 2015 - How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform


Something else you might like

Think outside the search box!

2DSearch is a patent pending, radical alternative to traditional keyword search. Instead of a one-dimensional search box, concepts are expressed and manipulated as objects on a two-dimensional canvas. So you spend less time worrying about Boolean strings, and more time creating semantically transparent queries and effective search strategies.

Sign up to gain early access at www.2dsearch.com