Building a Lightweight Discovery Interface for China's Patents @ NYC Solr/Lucene Meetup

Upload: opensource-connections

Post on 11-May-2015

Category:

Technology


DESCRIPTION

War stories from building GPSN, a US Federal site for searching China's patents.

TRANSCRIPT

Page 1: Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup

BUILDING A LIGHTWEIGHT DISCOVERY INTERFACE FOR CHINESE PATENTS

New York Solr/Lucene Meetup

ERIC PUGH | [email protected] | @dep4b

Page 2:

Who am I?

• Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary

• Member of Apache Software Foundation

• SOLR-284 UpdateRichDocuments (July 07)

• Fascinated by the art of software development

Page 3:

Co-Author

Next Edition June!

Page 4:

Congrats to Trey and Tim!

Page 5:

Agilista

Page 6:

Selected Customers

Page 7:

Telling some war stories

Page 8:
Page 9:

• First USPTO application in “the cloud”

• Simple, and discoverable

• Expresses our philosophy of “Cloud meets Ocean”

• Check it out at http://gpsn.uspto.gov

Page 10:

Telling some stories

➡How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 11:

Flow of understanding

Data → Information → Understanding

Page 12:

Building “Discovery”

[Diagram: Engine, UX, Data; label: Tension]

Page 13:

Data

• Grok data at gut level

• Look for outliers

UX

• User Interviews

• Surveys

• Card Sorting

• Scenarios/Personas

Between UX and Data

• Brainstorm

• Mockups

• Proof of concept

Page 14:

Where to spend time?

            UX     Engine   Data
            40%    20%      40%
We spent    40%    40%      20%

Page 15:

Telling some stories

• How to inject “Discovery” into your app

➡The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 16:

Boy meets Girl Story

Page 17:

Boy meets Girl Story

Metadata

Ingest Pipeline

Discovery UX

Content Files

Page 18:

Nothing but JS and Solr!

• Updates are quarterly

• User state in browser

• Solr is the “RESTful” API ;-)

• KISS: EmberJS + Solr
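
A minimal sketch of what "Solr is the API" can look like from the client side: the search request is just a GET URL against the /select handler. The parameters q, wt, rows, facet, and facet.field are standard Solr query parameters; the host name and the "applicant" field are assumptions made up for illustration, not GPSN's real endpoint or schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SolrSelectUrl {

    // Hypothetical base URL; the talk does not give the real endpoint path.
    static final String BASE = "http://solr.example.org/solr/select";

    /** Builds a GET URL for Solr's /select handler from plain parameters. */
    static String build(String userQuery, String facetField, int rows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);                  // the user's search text
        params.put("wt", "json");                    // ask Solr for a JSON response
        params.put("rows", Integer.toString(rows));  // page size
        params.put("facet", "true");                 // turn faceting on
        params.put("facet.field", facetField);       // field to facet on

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return BASE + "?" + query;
    }

    public static void main(String[] args) {
        // "applicant" is a made-up field name used only for illustration.
        System.out.println(build("golf glove", "applicant", 10));
    }
}
```

Because the whole request lives in the URL, the browser app needs no server-side session, which fits the "user state in browser" point above.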

Page 19:

How we built it

EmberJS Single Page Search App

HTML

XML

JSON

Server Dashboard

GPSN UI (Bootstrap CSS)

Browsers

Mobile/Tablet

Third Party Application

Servers

S3 Bucket

Solr

Page 20:

Yes, Solr is hanging out there on the Net…

• Using Jetty container security to lock down everything but the /select handler.

• Yes, the /admin interface appears to load, but no panels load.

• Go ahead, do a delete query! I dare you. Actually, please don’t. ;-)
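
The lock-down rule above amounts to an allow-list: only read requests to the search handler get through. The sketch below illustrates that rule in plain Java. It is not Jetty configuration and not the project's actual setup; the class name and the "/solr/select" path are assumptions for illustration, and it does not cover filtering of request parameters.

```java
import java.util.Set;

class SelectOnlyGate {

    // Assumed policy for illustration: read-only GETs to one handler path.
    static final Set<String> ALLOWED_PATHS = Set.of("/solr/select");

    /** Returns true only for GET requests to an allow-listed path. */
    static boolean isAllowed(String method, String path) {
        return "GET".equals(method) && ALLOWED_PATHS.contains(path);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("GET", "/solr/select"));       // search request
        System.out.println(isAllowed("POST", "/solr/update"));      // e.g. a delete query
        System.out.println(isAllowed("GET", "/solr/admin/cores"));  // admin panel data
    }
}
```

Denying by default and listing the one permitted handler keeps the rule short enough to audit at a glance.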

Page 21:

Single 550 GB index

• Solr + Index are in an Amazon AMI.

• Currently running two independent Solrs.

• Optimize works! Still.

• Elastic Load Balancer + AutoScale spins up more Solrs if needed.

• Threw lots of “provisioned IOPS” at the VM

Page 22:

A better security proxy from Alex?

https://github.com/dergachev/solr-security-proxy

Page 23:

Spyglass

• EmberJS based Widget framework

• List of Results

• Facets

• Autocomplete

• “Deploy” is just .html + .js. S3 bucket!

• Tooling is a pain. EmberJS is complex!

Better than AjaxSolr!

Page 24:

Daniel Beach’s project

https://github.com/o19s/spyglass

Page 25:

Key scaling concept behind GPSN:

Cloud meets Ocean

Page 26:
Page 27:

More prosaically…

[Diagram: one Database, three Servers, three Clients; four $ cost markers]

Page 28:

Lessons Learned

Page 29:

Don’t Move Files

• Copying 5 TB data up to S3 was very painful.

• We used S3Funnel which is “rsync like”

• We bought more network bandwidth for our office
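
To see why the upload hurt, a back-of-the-envelope calculation helps. The sketch below estimates how long 5 TB takes over a link of a given speed, ignoring protocol overhead and retries. The link speeds are assumptions chosen for illustration; the talk does not say what the office connection was.

```java
class TransferTime {

    /** Hours needed to move `bytes` over a link of `bitsPerSecond`, ignoring overhead. */
    static double hours(double bytes, double bitsPerSecond) {
        double seconds = (bytes * 8) / bitsPerSecond;
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        double fiveTerabytes = 5e12; // 5 TB from the slide, taken as decimal terabytes
        // Link speeds below are assumptions, not figures from the talk.
        double[] mbps = {10, 100, 1000};
        for (double speed : mbps) {
            System.out.printf("%.0f Mbit/s: %.1f hours%n", speed, hours(fiveTerabytes, speed * 1e6));
        }
    }
}
```

At an assumed sustained 100 Mbit/s, 5 TB works out to roughly 111 hours, which is more than four and a half days of uninterrupted upload.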

Page 30:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

–Andrew Tanenbaum, 1981

Page 31:

Data Size

[Chart: Patent Count by year, 1985 to 2011; y-axis 0 to 1,000,000; labeled value 277871]

Page 32:

Think about Data Volume

• Started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.

• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)

• 8 shards dropped time from 12 hours to 2 hours. Merging took 5!

• We had too many steps in our pipeline
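
Sharding for indexing means each document is routed to exactly one of N smaller indexes that can be built in parallel. The sketch below shows one common way to do that routing, a hash of the document id modulo the shard count. The talk does not say how GPSN split its documents, so the hashing scheme and the sample ids are assumptions.

```java
class ShardRouter {

    /** Maps a document id to a shard number in [0, shardCount). */
    static int shardFor(String docId, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        int shards = 8; // the shard count mentioned on the slide
        // Made-up ids for illustration only.
        String[] ids = {"CN100000001A", "CN100000002A", "CN100000003A"};
        for (String id : ids) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

The routing is deterministic, so re-running an ingest sends each document to the same shard it landed in before.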

Page 33:

Building a Patents Index

Machine Count    Time
1                5 days
5                3 days
300              30 minutes
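
Reading the chart as 1 machine for 5 days, 5 machines for 3 days, and 300 machines for 30 minutes, a quick way to compare the runs is total machine-hours. The sketch below does that arithmetic; it ignores startup time and billing granularity, which the talk does not cover.

```java
class MachineTime {

    /** Total machine-hours consumed by `machines` running for `hours` each. */
    static double machineHours(int machines, double hours) {
        return machines * hours;
    }

    public static void main(String[] args) {
        System.out.println(machineHours(1, 5 * 24));   // 120.0
        System.out.println(machineHours(5, 3 * 24));   // 360.0
        System.out.println(machineHours(300, 0.5));    // 150.0
    }
}
```

Under those assumptions the 300-machine run uses about as much total machine time as the single-machine run, while finishing in half an hour instead of five days.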

Page 34:

Telling some stories

• How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

➡Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Page 35:

Why so many pipelines?

Morphlines

Page 36:

Tika as a pipeline?

Page 37:

Lots of File Types

• Sometimes in ZIP archives, sometimes not!

• Multiple XML formats as well as CSV and EDI

• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…

Page 38:

Tika as a pipeline!

• Auto detects content type

• Metadata structure has all the key/value pairs needed for Solr

• Allows us to scale up with Behemoth project (and others!).

Page 39:

Lots of files!

    HHHHHT APS1 ISSUE - 760106
    PATN
    WKU 039302717
    SRC 5
    APN 5328756
    APT 1
    ART 353
    APD 19741216
    TTL Golf glove
    ISD 19760106
    NCL 4
    ECL 1

    <PatentGrant>
      <BibliographicData>
        <GrantIdentification>
          <DocumentKindCode>B1</DocumentKindCode>
          <GrantNumber>06644224</GrantNumber>
          <CountryCode>US</CountryCode>
          <IssueDateText>2003-11-11</IssueDateText>

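Page 40:

Detector to pick File

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files carry a "PATN" record marker near the top.
    private static final Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {

        // Default to a generic type when the marker is not found.
        MediaType type = MediaType.OCTET_STREAM;
        // Look at the first 1024 bytes only; closing the lookahead resets the stream.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        String extract = IOUtils.toString(lookahead, "UTF-8");

        Matcher matcher = pattern.matcher(extract);

        if (matcher.find()) {
            type = GreenbookParser.MEDIA_TYPE;
        }

        lookahead.close();

        return type;
    }
}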
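
The same look-ahead idea can be sketched with only the JDK, which makes the mechanics easier to see: mark the stream, read a small prefix, reset, then decide on a format from markers in that prefix. This is an illustration and not the project's code. The markers "PATN" and "<PatentGrant>" come from the two samples shown earlier in the deck; the format labels and the class name are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class FormatSniffer {

    /** Reads up to `limit` bytes, then rewinds so a parser can start from byte 0. */
    static String peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit);
            byte[] head = in.readNBytes(limit);
            in.reset();
            return new String(head, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a made-up format label based on markers seen in the prefix. */
    static String sniff(InputStream in) {
        String head = peek(in, 1024);
        if (head.contains("<PatentGrant>")) {
            return "patent-grant-xml";
        }
        if (head.contains("PATN")) {
            return "greenbook";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] sample = "HHHHHT APS1 ISSUE - 760106\nPATN\nWKU 039302717\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```

Checking the more specific XML marker first avoids misclassifying an XML file that happens to contain the four letters "PATN" somewhere in its first kilobyte.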

Page 41:

Telling some stories

• How to inject “Discovery” into your app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

➡Don’t be Afraid to Share!

Page 42:

Your Search solution isn’t perfect

• Allow users to export data

• Most business users want to work in Excel! Accept it!

• Allow other applications to build on top of it.
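
One low-effort way to meet users in Excel is a CSV export. The sketch below shows the quoting rule that keeps commas and quotes inside a field from breaking the columns (wrap each field in double quotes and double any embedded quote, as in RFC 4180). The column names are placeholders and the values are borrowed from the sample record in this deck; how GPSN actually exports data is not described in the talk.

```java
import java.util.List;
import java.util.stream.Collectors;

class CsvExport {

    /** Quotes a field: wrap in double quotes and double any embedded quotes. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins one row of fields into a single CSV line. */
    static String row(List<String> fields) {
        return fields.stream().map(CsvExport::quote).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Column names are placeholders, not GPSN's actual schema.
        System.out.println(row(List.of("patent_number", "title")));
        System.out.println(row(List.of("039302717", "Golf glove")));
    }
}
```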

Page 43:

GPSN has

• Lots of easy “Print to PDF” options.

• Data stored in S3 as:

• individual patent files

• chunky downloads.

• Filtering to expand or select specific data sets.

• Permalinks: simple, very sharable URLs.

• Underlying Solr service is exposed to the public via proxy. You can query Solr yourself.

• Need advanced querying? Use Lucene syntax in the search bar.

Page 44:

One more thought...

Page 45:

Measuring the impact of our algorithm changes is just getting harder as we get smarter.

Page 46:

www.quepid.com

Quepid: Give your Queries some Love

Project SolrPanl

We need beta users!

Page 47:

Thank you!

Questions?

[email protected]

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!