What's the story with open source? Searching and monitoring news media with open source technology

Charlie Hull, Flax
BCS IRSG Search Solutions 2010

Photo source: http://www.flickr.com/photos/shironekoeuro/

www.flax.co.uk 2

What is Flax?

- Search engine specialists
- Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
- Based in Cambridge UK
- Contributors to and users of Xapian
- Recently selected as UK Authorized Partner by Lucid Imagination
- Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen

Apache Lucene and Solr are trademarks of The Apache Software Foundation

The challenges

- Content is created for publication, not for search
- Content isn't published consistently or available to all
- Ranking is never simple
- “We just want something like Google”
- Every system will have to scale beyond its originally planned size
- Every project is different

So how do we build news search?

Indexing
- Historical, daily & updates (i.e. later editions)
- Must cope with high volume, quickly
- Essential metadata – byline, title, source
- File format translation not always necessary
- BUT pre-processing sometimes required
- Content restriction & embargo data

Solution
- Lightweight, customisable index scripts using powerful open source libraries

So how do we build news search?

    from datetime import datetime

    import xapian
    import flax.core

    # Create the database and describe its fields
    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'            # stem for English
    fm.setfield('mytext', False)  # freetext field
    fm.setfield('mydate', True)   # filter field
    fm.save(db)

    # Index a single document
    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

So how do we build news search?

Searching
- Free text with Boolean operators
- Filters for metadata & date ranges
- Combine date and relevance ranking
- Faceted search where appropriate
- Saved searches & alerting
- 'More like this'
- Content restriction & embargo filters

Solution
- Template-based user interface scripts, again using open source libraries
- Beware JavaScript & older browsers!
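
One way to picture "combine date and relevance ranking" together with an embargo filter is the sketch below. It is not how Flax or Xapian does it: the `Hit` fields, the exponential half-life decay and the `half_life_days` parameter are all assumptions chosen for illustration, and the relevance score is taken as already supplied by the search engine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Hit:
    """A search hit; every field name here is illustrative."""
    title: str
    relevance: float                        # engine score, higher is better
    published: datetime
    embargo_until: Optional[datetime] = None


def blended_score(hit, now, half_life_days=7.0):
    """Scale relevance by an exponential recency decay (assumed formula)."""
    age_days = max((now - hit.published).total_seconds() / 86400.0, 0.0)
    return hit.relevance * 0.5 ** (age_days / half_life_days)


def rank(hits, now):
    """Drop hits still under embargo, then sort by blended score, best first."""
    visible = [h for h in hits
               if h.embargo_until is None or h.embargo_until <= now]
    return sorted(visible, key=lambda h: blended_score(h, now), reverse=True)


now = datetime(2010, 11, 1, 12, 0)
hits = [
    Hit("old but relevant", 10.0, now - timedelta(days=28)),
    Hit("fresh", 2.0, now - timedelta(hours=1)),
    Hit("embargoed", 50.0, now, embargo_until=now + timedelta(days=1)),
]
for hit in rank(hits, now):
    print(hit.title)
```

With a seven-day half-life, a four-week-old story keeps one sixteenth of its relevance, so the fresh story outranks it here. In practice the decay rate would need tuning per project.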

So how do we build news search?

Administration
- Indexing failures common
- Logging is essential
- Log to text as a first pass, reports later

Scalability
- Content is always growing
- Both indexing & searching must scale
- Open source search libraries provide distributed indexing, replication, remote indexes
- Not simple to get this right!
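
As a rough sketch of "log to text as a first pass", the snippet below uses Python's standard `logging` module to record each indexing success or failure in a plain text file without aborting the batch. The function name `index_batch`, the `index_one` callback and the log file name are hypothetical, not part of any Flax tool.

```python
import logging


def index_batch(articles, index_one, log_path="indexing.log"):
    """Index each (id, text) pair, logging failures instead of aborting."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    indexed, failed = 0, 0
    try:
        for article_id, text in articles:
            try:
                index_one(article_id, text)
                indexed += 1
                logger.info("indexed %s", article_id)
            except Exception:
                # Record the traceback and carry on with the next article.
                failed += 1
                logger.exception("failed to index %s", article_id)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return indexed, failed
```

A later reporting pass can then summarise the text log, for example by counting ERROR lines per day.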

So how do we build news search?

Available open source technologies
- Languages – C/C++, Java, Python, JavaScript
- Search libraries – Xapian, Lucene
- Search bindings/servers – Xappy, Flax.core, Solr
- External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
- Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...

We can use whatever works!

Some examples

Newspaper Licensing Agency – NLA Clipshare
- 20 million newspaper stories
- 6500 users
- Content from every major newspaper (and most regionals)
- Used by journalists, clippings agencies, media monitors
- Replacing internal systems at major newspapers
- One of very few ways to search content from all the papers within hours of publication

http://www.nla-clipshare.com

Some examples

Financial Times – press cuttings
- Web Service for easy integration
- XML source data
- Faceted search
- Area filters (whole article, body, headline, byline or any combination)
- Synonyms, spelling suggestions
- Built from scratch in a fortnight
- Designed as a prototype, scaled to production use without significant change

http://presscuttings.ft.com

A different task – news monitoring

- Non-traditional use of search
- Many automated searches on incoming content
- Searches reflect complex client needs
- False positives require human checking
- False negatives should never occur!
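
The "many automated searches on incoming content" idea turns ordinary search around: the queries are stored and each new article is run against them. A minimal sketch, assuming each client profile is just a list of terms that must all appear (real profiles are far richer), could look like this. The profile names and the function `matching_profiles` are made up for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_profiles(article_text, profiles):
    """Return the names of profiles whose terms all occur in the article.

    `profiles` maps a hypothetical profile name to its required terms.
    """
    tokens = tokenize(article_text)
    return sorted(name for name, terms in profiles.items()
                  if all(term.lower() in tokens for term in terms))


profiles = {
    "acme-energy": ["acme", "wind"],
    "retail-watch": ["supermarket"],
}
print(matching_profiles("Acme opens a new wind farm in Norfolk", profiles))
```

Because a missed article is worse than an extra one, a profile matcher like this would normally err towards over-matching and leave the final decision to a human checker.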

A different task – news monitoring
An example

Durrants Ltd.
- Thousands of client search profiles
- Hundreds of thousands of articles per day
- Complex publication hierarchy
- Established pipeline

Solution
- Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
- Supports features of previous engine
- Scalable master-slave architecture

Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
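
To give a feel for what tolerating OCR errors can mean, here is a small sketch using `difflib` from the Python standard library. It is a stand-in technique, not the query language built for Durrants: the function name `fuzzy_contains` and the 0.8 similarity cutoff are assumptions for illustration only.

```python
import difflib
import re


def fuzzy_contains(text, term, cutoff=0.8):
    """True if any word in `text` is similar enough to `term`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    close = difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
    return bool(close)


# "Financia1" is a typical OCR misreading of "Financial" (digit 1 for letter l).
print(fuzzy_contains("The Financia1 Times reported a rise", "financial"))
print(fuzzy_contains("The weather was fine", "financial"))
```

Comparing every query term against every word does not scale to hundreds of thousands of articles per day, so a production system would push this kind of matching into the search engine's own term handling.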

Why open source?

- Flexible, extendable
- Powerful & scalable
- Lower cost
- Commercial support available as necessary
- Freedom to innovate

Looking to the future

- More and more content including social media
- Multiple delivery platforms
- Search-powered websites & applications
- 'No-SQL'
- Cloud

Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice

Thank you!

Questions?

[email protected]/blog
Twitter: @FlaxSearch

Photo source: http://www.flickr.com/photos/katerha/4259440136/