Using Lucene/Solr to Build CiteSeerX and Friends
DESCRIPTION
Presented by C. Lee Giles, Pennsylvania State University. See complete conference videos: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Cyberinfrastructure, or e-science, has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building an integrated search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formula search, table indexing, etc. We propose the open source SeerSuite architecture, a modular, extensible system built on successful OS projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built, all using Solr/Lucene: computer science (CiteSeerX), chemistry (ChemXSeer), archaeology (ArchSeer), acknowledgements (AckSeer), reference recommendation (RefSeer), collaboration recommendation (CollabSeer), and others. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, and sequence mining, are critical for performance.

TRANSCRIPT
Using Lucene/Solr to Build CiteSeerX and Friends
Dr. C. Lee Giles Information Sciences and Technology Computer Science and Engineering The Pennsylvania State University
University Park, PA, USA [email protected]
http://clgiles.ist.psu.edu
Prof. C. Lee Giles
• Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
– Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
– Large heterogeneous data and information systems
– Specialty search engines and portals for knowledge integration
• CiteSeerX (computer and information science)
• ChemXSeer (e-chemistry portal)
• GrantSeer (grant search)
• RefSeer (recommendation of paper references)
• Scalable intelligent tools/agents/methods/algorithms
– Information, knowledge and data integration
– Information and metadata extraction; entity disambiguation
– Unique search, knowledge discovery, information integration, data mining algorithms
– Web 2.0 methods
• Automated tagging for search and information retrieval
• Social network analysis
SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
• P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, others.
• Current funding: NSF, Dow Chemical
Outline
• Motivation
– Data science; cyberinfrastructure
– Vast growth in domain science data and documents
• SeerSuite
– Tool for creating Seers
– Specialized data and document search and recommendations
• Tables, formulae, figures, references …
– Use of Solr/Lucene
• Disciplinary sciences, indexes & information extraction (the Seers)
– Computer science
– Chemistry
– Briefly other Seers
• Opportunities for Research
• Conclusions and Directions
The Evolution of Science - the 4th Paradigm
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Driven Science
– Data captured from the web, by instruments, or from documents
– Data generated by simulation
– Placed in data structures / files
– Scientist(s) analyze(s) data
– Access & search crucial
Jim Gray’s paradigm
Data Access Varies with Discipline or Small vs Big Science
• Small vs Big science
– “Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time, Small Science will generate 2-3 times more data than Big Science.”
• ‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
– Data is local – Data will not be shared
• At some point there will be needed – indices to control search – parallel data search and analysis
• Cyberinfrastructure can help
– If you can’t move the data around, take the analysis to the data!
– (Consider the bandwidth of a van loaded with disks.)
– Do all data manipulations locally
• Build custom procedures and functions locally
SeerSuite
• Open source search engine and digital library toolkit used to build search engines and digital libraries
– CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
• Supports research in
– Indexing and search
– Digital libraries
– Data mining & structures
– Information and knowledge extraction
– Social networks
– Scientometrics/infometrics
– Systems engineering, user design
– Software engineering and management
– Web crawling
• Trains students in search and software systems
– Educational tool for search engine creation
– Students highly sought in industry and government
SeerSuite - properties
• Modular, scalable, extensible, robust design
– Extensible to many problems and disciplines
• Integrated features
– Focused crawler - Heritrix
– Indexer - Solr/Lucene
– Metadata extraction - modular
– Ranked results
• Builds on experience with other domain engines and OS tools
– Lucene and Solr
– The MySQL Database and InnoDB Storage Engine
– Apache Tomcat
– Spring Framework
– Acegi Security
– ActiveMQ
– ActiveBPEL Open Source Engine
– Apache Commons Libraries
– SVMlight support vector machine package
– CRF++ conditional random field package
• Hardware independent; Linux
• Reuse not reinvent
Data Mining & Information Extraction in Seers
• Data acquisition
• SeerSuite systems often crawl the public web for new data
• Many data types available
• Richness of data offers unique data mining features • CiteSeerX as testbed/sandbox
• Large scale data resources • Millions of documents, authors, etc. • Some common features/metadata
• Commercial grade indexer (Solr/Lucene)
• Scalable to G’s of documents and M’s of users • “Watson”
• Modular design • Cloudable
• State of the art algorithms (machine learning) for large scale unique metadata (information) extraction & mining
• Unique parsers and indexing • Quality of extraction • Precision/recall • Ranking • Architecture/integration
Seer Friends • In various stages of the system lifecycle with various data resources
and indexes: – Mature and developing, code released
• CiteSeer, now CiteSeerX • ChemXSeer • TableSeer • YouSeer
– New, future TBD, not all aspects public • ArchSeer • AlgoSeer • CollabSeer • RefSeer • SeerSeer • GrantSeer
– Dead or limping by (could be revived) • AckSeer (acknowledgement indexing) (revived!) • BizSeer • BotSeer
– Proposed, but do not exist • BrainSeer • CensorSeer • ArXivSeer
Why Solr/Lucene?
• Only open source considered – cost
• Competitors:
– Indri
– Wumpus
– Terrier
– Others?
• Must scale for both number of documents and users
• Easily integrable and customizable
– Other indexes, crawlers, ingestion, metadata extractors
• Well used (Watson)
• Active community of support
– Enterprise platform a plus
• Easy to transition to government/industry/academia
– Apache license
http://citeseerx.ist.psu.edu
Next Generation CiteSeer, CiteSeerX
• 2 M documents
• 40 M citations
• 2 to 5 M authors
• 2 to 4 M hits/day
• 800K individual users
• entire data shared
• Index - 50 G
History: CiteSeer (aka ResearchIndex)
C. Lee Giles
Kurt Bollacker
Steve Lawrence
Project at NEC Research Institute, Princeton. First academic document search engine. Very popular with computer science.
Hosted at NEC from 1997 – 2004. Moved to Penn State as collaborators left.
Provided a broad range of unique services including automatic citation indexing, reference linking, full text indexing, similar document listing, automated metadata extraction and several other pioneering features.
Refactored and redesigned as CiteSeerX. Released 2008. Lucene based indexing.
CiteSeer continuously running for 15 years!
SeerSuite/CiteSeerX Architecture
• Web Application
• Focused Crawler
• Document Conversion and Extraction
• Document Ingestion
• Data Storage
• Maintenance Services
• Federated Services
Teregowda, USENIX ‘10
4 systems:
• Production • Crawling • Staging • Research
All or some can be cloudized
CiteSeerX Services
CiteSeerX is a highly automated system:
Full OAI metadata if available
Full text indexing (many different indexes)
- Documents
- Citations
- Tables
- More forthcoming (algorithms, figures, acknowledgements)
Citation graph. Ranking based on citations. Linking documents
- Co-citations
- Citing documents
Author disambiguation. Distinguish between authors with similar names. Profiles and publication information for each author.
Automatic crawling from lists and submissions
Personalization
- Login based access to features on CiteSeerX.
- Corrections to metadata.
- Storage of queries.
- Collection of papers.
- Follows document metadata changes.
Focused Crawling
• Maintain a list of parent URLs where documents were previously found
– Parent URLs are usually academic homepages.
• 300,000 unique parent URLs, as of summer 2011
– Parent URLs are stored in a database table with two additional fields for scheduling:
• Last time changed, to get new documents from the page.
• Estimated change rate according to previous crawls of this page.
• The crawling process starts with the scheduler selecting the 1000 parent URLs with the highest probability of having new documents available.
– Assume a Poisson process for the change behavior of a parent page.
• Suppose a parent page P’s last observed change occurred at time t1, and its estimated change rate is R; then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 − exp(−R·Δ)
• Larger R or larger Δ gives a larger probability.
• After each crawl, the change rate of the scheduled parent URL is recalculated.
• Crawling runs incrementally daily (invoked by a Linux cron job at 12 am)
– Most discovered documents have been crawled before.
• Use hash table comparison for detection of new documents
• Normally retrieve a few thousand NEW documents per day, sometimes less than 1K.
• Moved to a whitelist vs. blacklist
Zheng, CIKM’09
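The Poisson change-probability scheduling above can be sketched in a few lines. The parent-URL record layout and field names here are hypothetical, not CiteSeerX's actual schema:

```python
import math

def change_probability(rate, elapsed):
    """P(page changed since last visit) under a Poisson change model:
    1 - exp(-R * delta), where R is the estimated change rate."""
    return 1.0 - math.exp(-rate * elapsed)

def schedule(parents, now, k=1000):
    """Pick the k parent URLs most likely to have new documents."""
    scored = [(change_probability(p["rate"], now - p["last_change"]), p["url"])
              for p in parents]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]

parents = [
    {"url": "http://example.edu/~alice/pubs", "rate": 0.5, "last_change": 0.0},
    {"url": "http://example.edu/~bob/papers", "rate": 0.1, "last_change": 0.0},
]
# A larger estimated rate R or a longer elapsed time delta gives a larger probability.
print(schedule(parents, now=2.0, k=1))
```

After each crawl the scheduler would update each page's estimated rate R from the observed change history, as the slide describes.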
Documents from crawled URLs:
- 90% of all citations come from the first 550 sites
- 90% of all documents come from the first 1250 sites
How will we get metadata for fields?
Metadata Extraction
• Documents are converted from PDF/PS to text using converters.
– Converters include TET, pdfbox, pdftotext, gs.
• Documents are filtered by checking for the existence of references and for duplication (checksum).
• Use tools or build your own
– The metadata extraction system uses machine learning methods like SVM (Header Parser) and CRF (ParsCit) to extract various entities from the document.
• Rule based templates are applied before extraction.
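A minimal sketch of the filtering step above, assuming a simple marker-based reference check and a SHA-1 checksum for duplicate detection; the real converters and filters are more involved:

```python
import hashlib

def sha1_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def accept_document(text: str, raw: bytes, seen_checksums: set) -> bool:
    """Filter sketch: keep a converted document only if it looks like a
    scholarly paper (has a reference section) and is not a duplicate."""
    has_references = any(marker in text.lower()
                         for marker in ("references", "bibliography"))
    checksum = sha1_of(raw)
    if not has_references or checksum in seen_checksums:
        return False
    seen_checksums.add(checksum)
    return True

seen = set()
paper = b"...paper body..."
print(accept_document("Intro ... References [1] ...", paper, seen))  # True
print(accept_document("Intro ... References [1] ...", paper, seen))  # duplicate -> False
```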
Automatically Created DB Record of a Paper in CSX

Field        | Value
id           | 10.1.1.130.782
version      | 2
cluster      | 9248987
title        | Tensor Decompositions and Applications
abstract     | This ..
year         | 2009
venue        | SIAM REVIEW
venueType    | JOURNAL
pages        | 455-500
publisher    | SIAM
public       | True
n-cites      | 34
selfCites    | 6
crawldate    | 12/30/2008
repositoryID | 10

Assigned by: System, Extractor, User, or Inference, depending on the field.

Source record: “Tensor Decompositions and Applications”, SIAM REVIEW, 2009, pp 455-500. Abstract: This …. Cited 34 times, 6 times by author.
3 Tier Architecture
(Diagram: user requests and queries pass through a load balancer to the web application tier (Web 1, Web 2); the web application queries the indexes (full text and tables), the repository, and the database; extraction, ingestion, and the crawler feed the storage tier.)
CiteSeerX Software Overview
• Ingestion process: responsible for obtaining and preparing a document and the related metadata.
– Process the document
• Submitted by the user or crawler
– Extract metadata
• Header
• Citations
• Acknowledgements
– Store the metadata and documents.
• Citation matching
– Identifying the underlying graph structure – documents citing this document and the relationship between documents and citations
• Inference matching and graph generation
– User corrections (version maintenance) – determine and accept valid user corrections
– Regular notification mechanisms – ensure that the user is notified when new documents are added to the collection
• Linked to MyCiteSeer.
• Update and maintenance
– Update and validate the full text index and various statistics.
– Statistics
– Index updates
CiteSeerX Search
Enabling search
Fulltext
Fields created
- Title
- Authors
- Citations
- Venue
- Keywords
- Abstract
- Range (publication)
- Citations
Field Schema

Field               | Type    | Indexed/Stored
DOI                 | String  | Y/Y (unique)
Citation/Document   | String  | Y/Y
Title               | Text    | Y/Y
Author              | A Text* | Y/Y
Authors Normalized  | A Text* | Y/N
ncites (# cited by) | Integer | Y/Y
URL                 | String  | Y/Y
cites               | Tokens^ | Y/N
citedby             | Tokens^ | Y/N
Timestamp           | Date    | Y/Y

* A Text is a Text field without a stopword filter or stemming
^ Tokens is a Text field with only duplicate removal and a whitespace tokenizer
CiteSeerX Search Results
Results sorting
Relevance (default)
- Based on dismax query handling with boosting.
Citations
- Citations received by the document in the collection, plus a default.
Year
- Publication date.
Recency
- Date of acquisition.
CiteSeerX Citation Graph
Relationships
Citation graph
- Store Cited by and Cites in the index
Build
- Build the document graph by querying the index for relationships.
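The Cites / Cited-by fields make graph construction a matter of index lookups. A sketch with an in-memory dict standing in for the Solr index; the record layout and document IDs are illustrative:

```python
# In-memory stand-in for the index: each record carries "cites" and
# "citedby" token fields, as in the schema described earlier.
index = {
    "A": {"cites": ["C", "D"], "citedby": []},
    "B": {"cites": ["C"], "citedby": []},
    "C": {"cites": ["E"], "citedby": ["A", "B"]},
    "D": {"cites": [], "citedby": ["A"]},
    "E": {"cites": [], "citedby": ["C"]},
}

def neighborhood(doc_id, depth=1):
    """Build a local citation graph by repeatedly querying the index
    for each document's Cites / Cited-by relationships."""
    edges, frontier = set(), {doc_id}
    for _ in range(depth):
        nxt = set()
        for d in frontier:
            rec = index.get(d, {"cites": [], "citedby": []})
            for tgt in rec["cites"]:
                edges.add((d, tgt))
                nxt.add(tgt)
            for src in rec["citedby"]:
                edges.add((src, d))
                nxt.add(src)
        frontier = nxt
    return edges

print(sorted(neighborhood("C")))
```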
(Diagram: a small citation graph over documents A-E, with directed Cites and Cited-by edges.)
Adding documents
Ingest documents from new crawls
- Add metadata to collection
- Add full text to system
- Link metadata in collection
Run maintenance scripts
- Poll updates and post to Solr.
Fulltext
Metadata
Relationships
Challenge: maintain data freshness.
Web Interface
• Query forwarded to Solr from the presentation layer (JSP)
• Solr generates a ranked response in JSON
• Each record is built in XML with fields added from the database (e.g., Abstract)
• Presentation layer (JSP) formats records based on ranking.
Ranking with Boosting (Relevance)
Use of boost function, minimum match, query fields
Boost function – the effect of citations
- Map number of citations > 1 to 500
Minimum match – 2
Query fields
- Text (1)
- Title (4)
- Abstract (2)
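These settings map naturally onto Solr's dismax request parameters. The parameter set below is illustrative, not CiteSeerX's production configuration; in particular the boost function on `ncites` is an assumption:

```python
# Hypothetical dismax parameter set mirroring the slide: text (1), title (4),
# abstract (2) query fields, a minimum match of 2, and a citation-count boost.
dismax_params = {
    "defType": "dismax",
    "qf": "text^1 title^4 abstract^2",  # query fields with boosts
    "mm": "2",                          # minimum match
    "bf": "ncites",                     # boost function: more citations, higher score
}

def build_query(user_query: str) -> dict:
    """Assemble the full parameter map sent to Solr for one search."""
    params = dict(dismax_params)
    params["q"] = user_query
    return params

print(build_query("tensor decomposition")["qf"])
```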
Query Response
- Query at interface (JSP)
- Hand over to web application (Java/Spring)
- Hand over to Solr
- Ranked response from Solr (JSON)
- Response unwrapped and more details included with information from the DB
- Present response at interface (JSP)
(Diagram: a query Q flows from the web interface through the web application to the index and the DB; the ranked response R returns as JSON, is unwrapped into HashMaps, and is presented as text.)
Name Disambiguation
• Name disambiguation (NER)
– A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
• Three types of name ambiguities:
– Aliases - one person with multiple aliases, name variations, or a name change, e.g. CL Giles & Lee Giles, Superman & Clark Kent
– Common names - more than one person shares a common name, e.g. Jian Huang – 103 papers in DBLP
– Typography errors - resulting from human input or automatic extraction
• Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
• Entity disambiguation problem
– Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, information from crawling such as host server, etc.
– Entity normalization
• Motivation
– Enhance search functionalities for digital repositories
• Fielded search by author name
– Improve metadata quality
– Improved social network analysis
– Government and business intelligence
• E.g. census data and credit records
• Challenges
– Accuracy
– Scalability
– Expandability
Efficient Large Scale Entity Disambiguation. Testbed: CiteSeerX and PubMedSeer
(Diagram: disambiguation pipeline. A metadata extraction module feeds a blocking module; similarity functions (Jaccard similarity, Soft-TFIDF) and an online SVM distance function trained with active learning score candidate author/paper pairs; a DBSCAN clustering module groups them into candidate classes.)
• Key features
– LASVM distance function
• Active learning
– Simpler and more accurate model
– Better generalization power
• Online learning
– Expandable to new training data
– DBSCAN clustering
• Ameliorates labeling inconsistency (transitivity problem)
• Efficient solution to find name clusters
• N log N scaling
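A minimal DBSCAN over an arbitrary pairwise distance, standing in for the learned LASVM distance described above. This naive version scans all pairs; the N log N behavior quoted above would need an indexed neighborhood query. The sample "author records" are plain numbers for illustration:

```python
def dbscan(points, dist, eps=0.5, min_pts=2):
    """Minimal DBSCAN over an arbitrary (possibly learned) distance function.
    Returns one cluster label per point; -1 marks noise."""
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)
    cluster = 0

    def region(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:      # border point: absorb, don't expand
                labels[j] = cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            nn = region(j)
            if len(nn) >= min_pts:      # core point: expand the cluster
                queue.extend(nn)
        cluster += 1
    return labels

# Two tight groups of "author records" on a line, plus one outlier.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
print(dbscan(pts, dist=lambda a, b: abs(a - b), eps=0.3, min_pts=2))
```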
Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009
Author Disambiguation Field
• Currently uses author fields
– For author search (both for author mentions and for disambiguated authors)
• Future direction
– Use the Lucene index for blocking in author disambiguation – creating the candidate set of author mentions that could belong to the same cluster
Author Disambiguation
• Random Forest (RF)
– Uses random feature selection + bootstrap sampling to construct multiple decision trees from one training set
– Aggregates the votes of a collection of decision trees as the final decision
– The more independent each tree is, the better the improvement over a single decision tree
• Author disambiguation with Random Forest
– Various metadata is used as features in the Random Forest to determine whether two author names from two papers refer to the same person
• E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
– Multiple distance functions are used for each type of metadata
• E.g. TFIDF, Jaccard distance, for comparing affiliations
• Compared with the previous SVM-based approach
– Shown to provide higher accuracy than SVM in the pairwise author disambiguation task
– Easy parameterization in the training phase (only the number of trees and the randomness at each node, no choice of kernel function needed), and performance is not sensitive to the parameters chosen
– Provides a measure of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for an SVM with a non-linear kernel
– Training time & classification time are linear in the number of trees and the data size
• Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)
Treeratpituk, Giles, JCDL09
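A sketch of the pairwise feature vector such a classifier would consume. The feature set, record fields, and sample values below are illustrative, not the published system's:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(rec1, rec2):
    """Feature vector for one (author mention, author mention) pair,
    fed to a pairwise same-person classifier."""
    return [
        1.0 if rec1["last"] == rec2["last"] else 0.0,            # same surname
        1.0 if rec1["first"][:1] == rec2["first"][:1] else 0.0,  # first initial
        jaccard(rec1["coauthors"], rec2["coauthors"]),           # shared coauthors
        jaccard(rec1["affil"].lower().split(),
                rec2["affil"].lower().split()),                  # affiliation overlap
    ]

a = {"first": "C", "last": "Giles", "coauthors": {"Mitra", "Councill"},
     "affil": "Pennsylvania State University"}
b = {"first": "Lee", "last": "Giles", "coauthors": {"Mitra", "Treeratpituk"},
     "affil": "Penn State University"}
print(pair_features(a, b))
```

A Random Forest trained on many labeled pairs of this shape then votes on whether the two mentions refer to the same person.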
Data and Publications in the Field of Chemistry
Chemistry
• Not physics (no arXiv) or computer science (no CiteSeer)
• Legacy of early information access - Chem Abstracts
• Cheminformatics is not bioinformatics
Chemistry has until recently been a data-poor field
- Data sharing tradition just being established
- Data creation is exploding - local (small science)
Journals and societies sensitive to their IP issues dominate the field
- Unsubstantiated IP claims, such as that data in the paper belongs to the publisher
- Discourage online versions of publications - ACS
Large powerful international companies have a vested interest in research
- Chemical information extraction tools are easily monetized
- Standards exist - CML, InChI
“Fixing the past so we can fix the future.” – Jeremy Frey
Chemistry is an old discipline with publications going back 100 years
Chemistry is compound centric, not algorithm centric
- Search is about the compound!
- Compounds have a rich data environment
- 3D graph structure, energies, etc.
ChemXSeer Architecture
Integrate and implement well-used open source tools
- Use CiteSeerX tools when possible
- Integrate into SeerSuite
Search
- Chemical formulae unique search
- Table search
- Figure search
- More data (grey literature) than documents
• Automated information extraction modules based on machine learning methods
• Lucene/Solr indices for extracted fields
• Relational databases for datasets
Work closely with chemists to understand their needs
- Tools for data conversion
Provide a public portal and repository for easy use
- User access controls
- Integrated visualization tools like JMOL for Gaussian data residing in our repository
- APIs for users for extracted data
Data and document standards de facto: XML, PDF, etc.
chemxseer.ist.psu.edu
ChemXSeer Formula Search
• Extraction and search of chemical formulae in scientific documents has been shown to be very useful.
• Intersection of two research areas: • Information retrieval • Chemoinformatics
• Formulae cannot be treated as text. • Domain knowledge (formula identification) • Structural knowledge (substructure finding and search)
B. Sun, WWW’07, WWW’08, TOIS’11 D. Yuan, ICDE’12
Challenges in Formula Search
How to identify a formula in scientific documents?
Non-Formula “… This work was funded under NIH grants …” “ … YSI 5301, Yellow Springs, OH, USA …” “… action and disease. He has published over …”
Formula “… such as hydroxyl radical OH, superoxide O2- …” “ and the other He emissions scarcely changed …”
Machine learning algorithms (SVM + CRF) yield high accuracies for correct formula identification.
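The "OH" ambiguity above is a token-classification problem; a toy extractor of the kind of context features an SVM/CRF could be trained on. The cue word lists are invented for illustration, and the real systems use far richer feature sets:

```python
def token_features(tokens, i):
    """Context features for deciding whether token i is a chemical formula
    (e.g. 'OH' the hydroxyl radical) or ordinary text (e.g. 'OH' the US state)."""
    tok = tokens[i]
    prev_w = tokens[i - 1].lower() if i > 0 else ""
    next_w = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "all_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        # chemistry-context cue in the preceding word
        "prev_chemical_cue": prev_w in {"radical", "superoxide", "acid", "compound"},
        # address-context cue in the following word
        "next_is_state_cue": next_w in {"USA", "US"},
    }

sent = "such as hydroxyl radical OH , superoxide".split()
print(token_features(sent, sent.index("OH")))
```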
Segmenting chemical names
• Goal: to discover semantically meaningful sub-terms in chemical names
– Methylethyl alcohol
– methionylglutaminylarginyltyrosylglutamylserylleucyl
phenylalanylalanylglutaminylleucyllysylglutamylarginyl lysylglutamylglycylalanylphenylalanylvalylprolylphenyl alanylvalylthreonylleucylglycylaspartylprolylglycylisol eucylglutamylglutaminylserylleucyllysylisoleucylaspartyl threonylleucylisoleucylglutamylalanylglycylalanylaspartyl alanylleucylglutamylleucylglycylisoleucylprolylphenyl alanylserylaspartylprolylleucylalanylaspartylglycylprolyl threonylisoleucylglutaminylasparaginylalanylthreonylleucyl arginylalanylphenylalanylalanylalanylglycylvalylthreonyl prolylalanylglutaminylcysteinylphenylalanylglutamyl methionylleucylalanylleucylisoleucylarginylglutaminyllysyl histidylprolylthreonylisoleucylprolylisoleucylglycylleucyl leucylmethionyltyrosylalanylasparaginylleucylvalylphenyl alanylasparaginyllysylglycylisoleucylaspartylglutamylphenyl alanyltyrosylalanylglutaminylcysteinylglutamyllysylvalyl glycylvalylaspartylserylvalylleucylvalylalanylaspartylvalyl prolylvalylglutaminylglutamylserylalanylprolylphenylalanyl arginylglutaminylalanylalanylleucylarginylhistidylasparaginyl valylalanylprolylisoleucylphenylalanylisoleucylcysteinyl prolylprolylaspartylalanylaspartylaspartylaspartylleucyl leucylarginylglutaminylisoleucylalanylseryltyrosylglycyl arginylglycyltyrosylthreonyltyrosylleucylleucylserylarginyl alanylglycylvalylthreonylglycylalanylglutamylasparaginyl
Chemical Search Aspects
• Parsing • Extraction and tagging • Indexing • Ranking
Chemical Entity Extraction and Tagging
• Name tagging
– Each chemical name can be a phrase
– Example
• "... Determination of lactic acid and ..."
• "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
• Formula tagging
– Each formula is a single term
– Example
• "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
• "... YSI 5301, Yellow Springs, OH, USA ..."
• Tagging examples
– Name tagging:
"... of <name-type>lactic acid</name-type> and ..."
– Formula tagging:
"... radical <formula-type>OH</formula-type>, superoxide ..."
Textual Chemical Molecule Information Indexing and Search
• Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically and then index the substrings at each node
(Diagram: segmentation tree splitting methylethyl / ethylmethyl into methyl and ethyl, then meth + yl and eth + yl, then me + th.)
• Frequency-and-discrimination-based index scheme
– Used for indexing chemical formulas
– Sequentially select frequent and discriminative subsequences of a formula, from the shortest to the longest
• Index schemes:
– Which tokens to index?
– Indexing all subsequences generates a large index
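A sketch of the segmentation-based scheme: split a name into known sub-terms recursively and index the substring at each node, rather than every possible substring. The sub-term lexicon here is a hand-made stand-in; how the real system finds split points is not shown:

```python
def segment_index_tokens(name, lexicon):
    """Collect the index tokens for one chemical name: the name itself plus
    the substring at every node of a (greedy) hierarchical segmentation."""
    tokens = {name}  # root node

    def split(term):
        for sub in lexicon:
            if term.startswith(sub) and len(sub) < len(term):
                rest = term[len(sub):]
                tokens.add(sub)
                tokens.add(rest)
                split(sub)
                split(rest)
                return

    split(name)
    return tokens

lexicon = ["methyl", "ethyl", "meth", "eth", "yl"]
print(sorted(segment_index_tokens("methylethyl", lexicon)))
```

Far fewer tokens are indexed than the full set of substrings, which is the point of the scheme.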
Features for Formula Indexing
• Formula
– A sequence of chemical elements or partial formulas with corresponding frequencies
– E.g. CH3(CH2)2OH
• Partial formula
– Partial formula: a subsequence of a formula
– E.g. C, H, O, CH3, CH2, OH, CH3(CH2)2, H3(CH2)2, CH3(CH2)2O, etc.
• Index construction
– Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
– Too many partial formulas, need feature selection
Criteria of Feature Selection
• Criteria of feature selection
– Frequent features (Freq_s ≥ Freq_min)
– Discriminative features (α_s ≥ α_min)
• If a sequence’s selected subsequences are enough to distinguish the formulas containing them from other formulas, this sequence is redundant.
• Discrimination score
α_s = | ⋂_{s' ∈ F, s' ⊑ s} D_{s'} | / | D_s |
where F is the selected feature set, and D_s is the set of formulas containing s.
An Example for Formula Indexing
• Data set:
– 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
• Parameters:
– Freq_min = 2, α_min = 1.1
• Steps:
– Length=1: Candidates={C,H,O}, F={C,H,O}
– Length=2: Candidates={CH3,H3C,CO,OO,OH,CH2}, Frequent Candidates={CH3,CO,OO,OH,CH2}
Frequent & Discriminative Candidates={CO,OO,CH2}, F={C,H,O,CO,OO,CH2}
– Length=3, …
• E.g. α_CH3 = |D_C ∩ D_H3| / |D_CH3| = |{1,2,3} ∩ {1,2,3}| / |{1,2,3}| = 1 (redundant, < α_min)
α_CO = |D_C ∩ D_O| / |D_CO| = |{1,2,3} ∩ {1,2,3}| / |{1,3}| = 1.5 (discriminative, ≥ α_min)
Formula Search
• SF.IEF: Subsequence Frequency & Inverse Entity Frequency
• Exact formula search
– Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.
• Frequency formula search
– Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
– Partial frequency search: similar but allows unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
– Ranking function
SF(s,e) = Freq(s,e) / |e|,    IEF(s) = log( |C| / |{e : s ⊑ e}| )
score(q,e) = Σ_{s∈q} SF(s,e) · IEF(s) / ( |e| · √( Σ_{s∈q} IEF(s)² ) )
where C is the formula collection, |e| is the length of formula e, and s ⊑ e means partial formula s occurs in e.
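The SF and IEF components of the SF.IEF ranking described above can be computed over a toy collection. This treats formulas as plain strings and s ⊑ e as substring occurrence, a simplification of partial-formula matching:

```python
import math

def SF(s, formula):
    """Subsequence frequency of partial formula s in formula e,
    normalized by the formula length |e|: SF(s,e) = Freq(s,e) / |e|."""
    count = sum(1 for i in range(len(formula) - len(s) + 1)
                if formula[i:i + len(s)] == s)
    return count / len(formula)

def IEF(s, collection):
    """Inverse entity frequency: log(|C| / |{e : s occurs in e}|).
    Partial formulas appearing in few formulas score higher."""
    containing = sum(1 for e in collection if s in e)
    return math.log(len(collection) / containing) if containing else 0.0

C = ["CH3COOH", "CH3CH2OH", "CH4"]
print(round(SF("OH", "CH3COOH"), 3))   # one occurrence in a length-7 formula
print(round(IEF("COOH", C), 3))        # appears in 1 of 3 formulas
```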
Formula Search - substructure
• Substructure formula search
– Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
– Ranking function: score(q,e) = W_match(q,e) · SF(q,e) · IEF(q) / |e|, where W_match(q,e) is the weight for exact match, reverse match, or parsed match
• Similarity formula search
– Search for formulas with a similar structure to the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
– Ranking function: score(q,e) = Σ_{s⊑q} W_match(q,e) · W(s) · SF(s,q) · SF(s,e) · IEF(s) / |e|
• Conjunctive search of the basic types of formula searches
– E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.
• Document query rewriting
– E.g. the document query atom formula:=CH4 is rewritten to atom (CH4 OR CD4), if formula search of =CH4 matches CH4 and CD4.
Formula Search - Query Models
Many models are possible, from exact to semantic
Models are discriminated by their matching algorithms
• Exact search
– Search for exact representations
– E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2
• Frequency searches
– Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements
– E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
– Partial frequency search: similar but allows unspecified elements
– E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well
• Substructure search
– Search for formulae that may have a substructure
– E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
• Similarity search
– Search for formulae with a similar structure to the query formula. Feature-based approach using partial formula matching.
– E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
Ranking formulae
• Ranking formulae has to depend on need and importance
• Focus on structural methods and frequency
• Importance can be introduced by citation rank or PageRank or others
• SF.IFF
– Substructure frequency and inverse formula frequency
• Frequency searches
– score(q,f) = Σ_{e∈q} SF(e,f) · IFF(e) / ( |f| · √( Σ_{e∈q} IFF(e)² ) )
– where |f| is the total frequency of elements
• Substructure search
– score(q,f) = W_match(q,f) · SF(q,f) · IFF(q) / |f|
– where W_match(q,f) is the weight for exact match, reverse match, and parsed match
• Similarity search
– score(q,f) = Σ_{s⊑q} W_match(q,f) · W(s) · SF(s,q) · SF(s,f) · IFF(s) / |f|
Chemical compounds as graphs
• Chemical compound modeled as a semantic graph with properties
Above figures are copied from eMolecules.com
Atom: vertex/node in the graph. Bond: edge in the graph. Dimensions: 3 or 4.
What’s Chemical Structure Search
• Substructure Search
– Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
• Superstructure Search
– Given an input chemical structure sketch, find all the important descriptors (substructures/functional groups) contained in the input.
• Similarity Search
– Given an input chemical structure sketch, find all the chemical compounds “similar” to the input.
Table Search
Tables are widely used to present experimental results or statistical data in scientific documents; some data only exists in these tables.
Current search engines treat tabular data as regular text • Structural information and semantics not preserved.
Goal: automatically identify tables, extract table metadata from pdf documents into xml and rank data
Table Metadata Representation: • Environment metadata: (document specifics: type, title,…) • Frame metadata: (border left, right, top, bottom, …) • Affiliated metadata: (Caption, footnote, …) • Layout metadata: (number of rows, columns, headers,…) • Cell content metadata: (values in cells) • Type metadata: (numeric, symbolic, hybrid, …)
Y. Liu AAA’07, JCDL’07.
Tables • A history that pre-dates that of sentential text
– Cuneiform clay tablets • Not received the same level of formal characterization
enjoyed by sentential text • Varying and irregular formats • Different intuitive understanding of what a “table” is.
– Is the Periodic Table of the Elements a table? – Tables vs. Lists? – Tables vs. Forms? – Tables vs. Figures? – Genuine table vs. non-genuine table? [12]
• Our definition: scientific genuine table – Caption + tabular structure – Ruling lines are not required
TableSeer Beta design of a table search engine
TableSeer System
Architecture
Page Box-Cutting Algorithm
• Improves the table detection performance by excluding more than 93.6% of document content at the start
Sample Table Metadata Extracted File
• <Table>
• <DocumentOrigin>Analyst</DocumentOrigin>
• <DocumentName>b006011i.pdf</DocumentName>
• <Year>2001</Year>
• <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors</DocumentTitle>
• <Author>Sang Hyun Park, a Young-Chan Son, a Brenda R. Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, CT 06269-3060</Author>
• <TheNumOfCiters></TheNumOfCiters>
• <Citers></Citers>
• <TableCaption>Table 1 Temperature effect on resistance change (ΔR) and response time of tin oxide thin film with 1% CCl4</TableCaption>
• <TableColumnHeading>Temperature/°C ΔR a /Ω ΔR (R,O2) (%) Response time Reproducibility</TableColumnHeading>
• <TableContent>100 223 5 ~22 min Yes 200 270 9 ~7-8 min Yes 300 1027 21 < 20 s Yes 400 993 31 ~10 s No</TableContent>
• <TableFootnote> a ΔR = (R, CCl4) − (R, O2).</TableFootnote>
• <ColumnNum>5</ColumnNum>
• <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
• <PageNumOfTable>3</PageNumOfTable>
• <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
• </Table>
TableRank
• Ranks tables by scoring <query, table> pairs instead of <query, document> pairs, preventing many of the false-positive hits for table search that frequently occur in current web search engines
• The similarity of a <query, table> pair is the cosine of the angle between their vectors
• A tailored term vector space yields table vectors: query vectors and table vectors, instead of document vectors
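As a minimal sketch (not the authors' exact implementation), the cosine similarity between a query vector and a table vector over a shared term space looks like this:

```python
import math

def cosine_similarity(query_vec, table_vec):
    """Cosine of the angle between a query vector and a table vector.

    Vectors are dicts mapping term -> weight (e.g., TTF-ITTF weights).
    """
    shared = set(query_vec) & set(table_vec)
    dot = sum(query_vec[t] * table_vec[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    t_norm = math.sqrt(sum(w * w for w in table_vec.values()))
    if q_norm == 0 or t_norm == 0:
        return 0.0
    return dot / (q_norm * t_norm)
```

Identical vectors score 1.0; vectors with no shared terms score 0.0, which is what lets TableRank rate a table rather than its whole host document.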
Table Index
Index captions, footnotes, reference text
Boosting Captions
• Function: inversely (reciprocally) proportional to #cites
Term Weighting for Tables
– TTF-ITTF (Table Term Frequency-Inverse Table Term Frequency)
– TLB: Table-Level Boost factors (e.g., table frequency)
– DLB: Document-Level Boost factors (e.g., journal/proceedings order, document citation)
Table term ranking
• A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables
• Like document abstracts, table metadata and table queries should be treated as semi-structured text: not complete sentences, expressing a summary
• P = 0.5 (G. Salton 1988)
• b is the total number of tables
• IDF(ijk): the number of tables in which term t(i) occurs in metadata field m(k)
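A hedged sketch of a TF-IDF analogue at the table level (the paper's exact TTF-ITTF formula, with P and per-metadata-field counts, may differ):

```python
import math

def ttf_ittf(term_count, table_len, num_tables, tables_with_term):
    """Table term frequency times inverse table term frequency.

    term_count:       occurrences of the term in this table's metadata
    table_len:        total terms in this table's metadata
    num_tables:       b, the total number of tables in the collection
    tables_with_term: number of tables containing the term
    """
    ttf = term_count / max(table_len, 1)                     # frequency within the table
    ittf = math.log(num_tables / max(tables_with_term, 1))   # rarity across tables
    return ttf * ittf
```

A term appearing in only a few tables gets a large weight; a term appearing in every table gets a weight of zero, matching the discriminator intuition above.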
Table Level Boost and Document Level Boost
Btbf is the boost value of the table frequency, Btrt is the boost value of the table reference text (e.g., the normalized length), and Btp is the boost value of the table position. r is a parameter: 1 if the user specifies the table position in the query, 0 otherwise.
IVj: document Importance Value (IV). If a table comes from a document with a high IV, all the table terms of that document receive a high document-level boost. ICj: the inherited citation value. DOj: source value (the rank of the journal/conference proceedings). DFj: document freshness.
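One plausible way to combine these boost factors; the multiplicative combination is an assumption for illustration, not the paper's exact formula:

```python
def table_level_boost(b_tbf, b_trt, b_tp, r):
    """Combine table-frequency and reference-text boosts; apply the
    table-position boost only when the user specified a position (r = 1)."""
    return b_tbf * b_trt * (b_tp if r == 1 else 1.0)

def document_level_boost(ic, do, df):
    """Document-level boost from inherited citation value (IC),
    source rank (DO) and document freshness (DF), as a simple product."""
    return ic * do * df
```

With r = 0 the position boost drops out entirely, mirroring the role of the r parameter described above.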
Table citation network
• Similar to the PageRank network
– Documents construct a network from citations
– The "incoming links" are the documents that cite the document in which the table is located
– Exponential decay is used to damp the impact of propagated importance
• Unlike the PageRank network
– A directed acyclic graph
– The Importance Value (IV) of a document is not decreased as the number of citations increases
– IV is not divided by the number of outbound links
• A document may have multiple tables, one table, or none
• Each table consists of a set of metadata
• The same keywords may appear in different metadata fields in different tables
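The differences from PageRank can be sketched in a few lines. This is an illustrative propagation over a citation DAG, with an assumed decay constant; the paper's exact scheme may differ:

```python
def importance_values(citations, decay=0.5):
    """Propagate importance through a citation DAG.

    `citations` maps each document to the documents it cites. Every
    document starts with importance 1.0; a citer passes `decay` times
    its own importance to each cited document. Unlike PageRank, the
    passed value is NOT divided by the number of outbound links, and a
    document's IV only grows as its citation count grows.

    For simplicity, assumes the dict keys are topologically ordered
    (citers appear before the works they cite).
    """
    iv = {d: 1.0 for d in citations}
    for doc in citations:
        for cited in citations[doc]:
            iv[cited] += decay * iv[doc]
    return iv
```

For example, a document cited by two others ends up with IV 1.0 + 0.5 + 0.5 = 2.0; more citers can only increase it.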
Table Search Summary
• A novel table ranking algorithm, TableRank, the first of its kind
• A tailored table term vector space
• A table term weighting scheme: TTF-ITTF
– Aggregating impact factors from three levels: the term, the table, and the document
• Indexes table reference texts, term locations, and document backgrounds
• Designed and implemented the first table search engine, TableSeer, to evaluate TableRank and compare with popular web search engines
• Code released
• Currently implemented in CiteSeerX with millions of tables
• Improving extraction, with Dow Chemical support
Automated Figure Data Extraction and Search
• A large amount of results in digital documents are recorded in figures and time series of experimental results (e.g., NMR spectra, income growth), and this is often the only record of the data
• Extraction for purposes of:
– Further modeling using the presented data
– Indexing and metadata creation for storage and search on figures, enabling data reuse
• Current extraction is done manually!
[Architecture diagram: documents → extracted plots and extracted info → plot index + document index → merged index → user / digital library]
Seer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures.
Tools that automate data extraction from figures:
• Increase our understanding of key concepts of papers
• Provide data for automatic comparative analyses
• Enable regeneration of figures in different contexts
• Enable search for documents whose figures contain specific experimental results
X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
Metadata and data to extract from a 2-dimensional plot (document snapshot vs. extracted 2D plot): X-axis label, Y-axis labels, legend, axis units, ticks, data points
Our Approach to Plot Data Extraction
• Identify and extract figures from digital documents
– ASCII and image extraction (xpdf)
– OCR for bitmap, raster PDFs
• Identify figures that are images of 2D plots using an SVM (only for bitmap images)
– Hough transform
– Wavelet coefficients of the image
– Surrounding text features
• Binarization of the identified 2D plots for preprocessing (not needed for vectorized images)
– Adaptive thresholding
• Image segmentation to identify regions
– Profiling or image signature
• Text block detection
– Nearest neighbor
• Data point detection
– K-means filtering
• Data point disambiguation for overlapping points
– Simulated annealing
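The adaptive-thresholding step in the pipeline above can be sketched as follows. This is a minimal pure-Python version comparing each pixel to its local mean; the production system would use a tuned window size and offset:

```python
def adaptive_threshold(image, window=3, c=0):
    """Binarize a grayscale image (list of lists, values 0-255).

    A pixel is foreground (1) if it exceeds the mean of its local
    window minus a constant c, otherwise background (0).
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            vals = [image[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            local_mean = sum(vals) / len(vals)
            out[y][x] = 1 if image[y][x] > local_mean - c else 0
    return out
```

A bright data point on a dark background survives binarization as a 1 while the uniform background maps to 0, which is what the downstream segmentation and data-point detection steps consume.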
• System integration within ChemXSeer or CiteSeerX
– XML data generation
– Open source tool in Lucene/Solr
• Extension to other figures (3D, …)
Future Directions
ChemXSeer Highlights • Portal for academic researchers in environmental chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
• Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
• Chemical formulae and names
• Tables
• Figures
• Publication functions as in CiteSeerX
• Interoperability: ORE-Chem development
• Novel ranking required
• After extraction, data is stored as API-accessible XML for users
• Hybrid repository (not fully open): serves as a federated information interoperation system
– Scientific papers crawled and indexed from the web
– User-submitted papers and datasets (e.g., Excel worksheets, Gaussian and CHARMM toolkit outputs)
– Scientific documents and metadata from publishers (e.g., Royal Society of Chemistry)
• Access control for publisher-provided content and user-submitted experiment data
• Takes advantage of developments in other funded cyberinfrastructure and open source projects
• CiteSeerX, PlanetLab, Lucene/Solr, ORE, others • Some released open source
• CollabSeer currently supports 400k authors
• http://collabseer.ist.psu.edu
Experimental Collaborator recommendation system
Collaboration recommendation
• Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
• Use the social network and topics to recommend collaborators of collaborators (friends of friends)
• Devise a social-network index and ranking scheme
• Explore models of vertex similarity
• Built on SeerSuite
• Other recommendations?
– Experimental methods
– Chemicals?
Gou JCDL’10, Gou MIR’10 Chen JCDL’11, SAC’12
Recommendation list and the user's topics of interest
• Users refine the recommendation list by clicking on a topic of interest (left: refined by "query processing"; right: default recommendation list)
• How two potential collaborators are linked by common collaborators
CollabSeer Framework
Integration of Vertex Similarity and Textual Similarity
• S: vertex similarity
• SC.O.T.: the collaborator's contribution to a specified topic
• Use the product of exponential functions so that a zero vertex similarity score or a zero contribution (textual similarity) score does not turn the whole measure into zero
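The product-of-exponentials idea can be sketched like this; the weights alpha and beta are illustrative assumptions, not CollabSeer's exact parameterization:

```python
import math

def collab_score(vertex_sim, topic_contrib, alpha=1.0, beta=1.0):
    """Combine vertex similarity S and the collaborator's contribution
    to a topic (S_C.O.T.) as a product of exponentials.

    Because exp(x) > 0 for all x, a zero in either component scales the
    score rather than annihilating it, unlike a plain product."""
    return math.exp(alpha * vertex_sim) * math.exp(beta * topic_contrib)
```

A candidate with zero vertex similarity but a strong topical contribution still receives a positive score, which a plain S * S_C.O.T. product would zero out.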
• Other measures?
• RefSeerX: recommend citations for papers
• Based on
– Existing citations
– Citation context
– Venue and importance
– Contemporary vs. seminal paper citations
• For authors unaware of related work, who do not know what they are looking for, it recommends related citations
He, WWW ‘10, WSDM ’11; Kataria, CIKM ’10, IJCAI’11,
Expert Search
• Expert search for authors, currently in alpha
Keyphrase Extraction for Experts
[Pipeline: text document → section parser → candidate extractor (DBLP data) → random forest (training data) → top keyphrases]
• Parse the document into sections with regular expressions
• Use DBLP statistics to extract keyphrase candidates
• Train a random forest to classify and rank whether a phrase is a keyphrase
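The first step, a regular-expression section parser, can be sketched as follows. The header patterns here are illustrative only; SEERLAB's actual section headers and regexes differ:

```python
import re

# Illustrative header set; the real system recognizes many more sections.
SECTION_RE = re.compile(
    r"^(abstract|introduction|related work|conclusions?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_sections(text):
    """Split a plain-text document into (header, body) pairs."""
    matches = list(SECTION_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).lower(), text[start:end].strip()))
    return sections
```

Candidate keyphrases would then be extracted per section, so that, e.g., abstract terms can be weighted differently from related-work terms.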
Treeratpituk, P., Teregowda, P., Huang, J. and Giles, C. L. SEERLAB: A System for Extracting Keyphrases from Scholarly Documents. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. ACL Workshop on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
GrantSeer
• Prototype search engine for PI profiles and their grant information, to assist funding agencies, deans of research, and foundations
• Links PIs with their
– Grants
– Publications
– Citations
– Organization
– Expertise
– Others?
• Data that can be shared
– CiteSeerX or Google Scholar data
– Databases of funded research
Funded by NSF – Julia Lane
Cover page NSF XML extraction
GrantSeer: PI profile
• Grants awarded
• Publications + citations
• PI's expertise
Algorithm Search
• Homepage search for authors, currently in alpha
AlgorithmSeer
Algorithm Search
– Extraction
– Indexing
– Ranking
SUITE Workshop, ICSE '11
Metadata extraction
• Extract
– Pseudo-code and its metadata
– Captions
– Reference sentences
– Synopsis
– Etc.
• Index the metadata using Solr to make the pseudo-code searchable
• Each search result has a pointer to the page in the document where the pseudo-code appears
Index Fields
id <string>
caption <text>
reftext <text> (reference sentences)
synopsis <text> (summarizing text)
page <sint> (page number)
paperid <string> (document ID)
year <sint> (year of publication)
ncites <sint> (number of citations)
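A document matching these fields could be assembled for Solr's JSON update format like this. The field names follow the list above; the helper itself is an illustrative sketch (posting the payload to the /update handler is left out):

```python
import json

def make_solr_doc(meta):
    """Build a Solr add-document JSON payload for one pseudo-code entry."""
    doc = {
        "id": meta["id"],             # <string>
        "caption": meta["caption"],   # <text>
        "reftext": meta["reftext"],   # reference sentences
        "synopsis": meta["synopsis"], # summarizing text
        "page": meta["page"],         # page number <sint>
        "paperid": meta["paperid"],   # document ID
        "year": meta["year"],         # year of publication
        "ncites": meta["ncites"],     # number of citations
    }
    return json.dumps([doc])
```

The `page` field is what lets each search result point back to the page of the source document where the pseudo-code appears.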
AckSeer
Name | Acknowledgements | Total Citations | C/A Metric

Funding Agencies:
National Science Foundation | 12287 | 144643 | 11.77
Defense Advanced Research Projects Agency | 4712 | 80659 | 17.12
Office of Naval Research | 3080 | 48873 | 15.87
Deutsche Forschungsgemeinschaft | 2780 | 9782 | 3.52
National Aeronautics and Space Administration | 2408 | 21242 | 8.82
Engineering and Physical Science Research Council | 2007 | 16582 | 8.26
Air Force Office of Scientific Research | 1657 | 16850 | 10.17
National Sciences and Engineering Research Council of Canada | 1422 | 12050 | 8.47
Department of Energy | 1054 | 5562 | 5.28
Australian Research Council | 1010 | 5464 | 5.41
European Union Information Technologies Program | 825 | 9594 | 11.63
National Institutes of Health | 709 | 7279 | 10.27
Army Research Office | 666 | 7709 | 11.58
Netherlands Organization for Scientific Research | 646 | 2843 | 4.4
Science and Engineering Research Council | 489 | 6976 | 14.27

Educational Institutions:
Carnegie Mellon University | 640 | 10840 | 16.94
Massachusetts Institute of Technology | 500 | 10509 | 21.02
California Institute of Technology | 464 | 4170 | 8.99
Santa Fe Institute | 368 | 3387 | 9.2
French National Institute for Research in Computer Science | 321 | 3399 | 10.59
Stanford University | 314 | 3693 | 11.76
University of California at Berkeley | 306 | 10439 | 34.11
National Center for Supercomputing Applications | 261 | 4777 | 18.3
International Computer Science Institute | 180 | 2078 | 11.54
Cornell University | 180 | 1656 | 9.2
University of Illinois at Urbana-Champaign | 177 | 5304 | 29.97
USC Information Sciences Institute | 176 | 3283 | 18.65
University of California Los Angeles | 176 | 2003 | 11.38
McGill University | 152 | 3001 | 19.74
Australian National University | 123 | 549 | 4.46

Companies:
International Business Machines | 1380 | 23948 | 17.35
Intel Corporation | 962 | 14441 | 15.01
Digital Equipment Corporation | 831 | 16390 | 19.72
Hewlett-Packard | 735 | 11186 | 15.22
Sun Microsystems | 651 | 12042 | 18.5
Microsoft Corporation | 368 | 6061 | 16.47
Silicon Graphics, Inc | 279 | 3898 | 13.97
Xerox Corporation | 265 | 4309 | 16.26
Siemens Corporation | 241 | 8395 | 34.83
Bellcore | 192 | 2393 | 12.46
Nippon Electric Company | 164 | 942 | 5.74
AT&T Bell Labs | 146 | 1549 | 10.61
Apple Computer | 135 | 3159 | 23.4
Motorola | 122 | 1352 | 11.08
Texas Instruments | 92 | 1165 | 12.66

Individuals:
Olivier Danvy | 268 | 8000 | 29.85
Oded Goldreich | 259 | 4615 | 17.82
Luca Cardelli | 247 | 10846 | 43.91
Tom Mitchell | 226 | 5494 | 24.31
Martin Abadi | 222 | 9647 | 43.46
Phil Wadler | 181 | 7252 | 40.07
Moshe Vardi | 180 | 6094 | 33.86
Peter Lee | 167 | 8941 | 53.54
Avi Wigderson | 160 | 2901 | 18.13
Matthias Felleisen | 154 | 4705 | 30.55
Benjamin Pierce | 152 | 4641 | 30.53
Noga Alon | 152 | 2388 | 15.71
John Ousterhout | 152 | 6369 | 41.9
Frank Pfenning | 148 | 2049 | 13.84
Andrew Appel | 144 | 7630 | 52.99
Funding agency impact
• Based on acknowledgement indexing
• Number of acknowledgements
• Total citations
• C/A metric: #citations / #acknowledgements
Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer
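The C/A metric in the tables above is simply citations per acknowledgement, rounded to two places:

```python
def ca_metric(total_citations, num_acknowledgements):
    """Citations-per-acknowledgement metric used to compare funding
    agencies, institutions, companies, and individuals."""
    if num_acknowledgements == 0:
        return 0.0
    return round(total_citations / num_acknowledgements, 2)

# e.g. NSF: 144643 citations over 12287 acknowledgements -> 11.77
```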
Giles, PNAS, 2004
New system, AckSeer, available this spring
Funding Agency Impact
Author | Citations | Acknowledgements | C/A Metric
Olivier Danvy | 847 | 268 | 29.85
Oded Goldreich | 3277 | 259 | 17.82
Luca Cardelli | 3847 | 247 | 43.91
Tom Mitchell | 3336 | 226 | 24.31
Martin Abadi | 3507 | 222 | 43.46
Phil Wadler | 3780 | 181 | 40.07
Moshe Vardi | 3786 | 180 | 33.86
Peter Lee | 1790 | 167 | 53.54
Avi Wigderson | 2566 | 160 | 18.13
Matthias Felleisen | 1622 | 154 | 30.55
Benjamin Pierce | 1484 | 152 | 30.53
Noga Alon | 2640 | 152 | 15.71
John Ousterhout | 3693 | 152 | 41.9
Frank Pfenning | 1639 | 148 | 13.84
Andrew Appel | 2064 | 144 | 52.99
Most Acknowledged Authors and Impact Factor
Interviewed by Nature as to why he was the most acknowledged computer scientist
Who is most acknowledged?
• Mom or dad?
• Theorists or experimentalists?
Who has a better metric?
Clouding CiteSeerX • Hosting cloud CiteSeerX instances
• Economic issues • Cost of hosting • Cost of refactoring the source to be hosted in the cloud.
• Computational/technical issues • What workflow to cloudize • Component modification for efficient operation • VM size: storage, memory and CPU sizing as a function of
needs • Establishing computational needs and availability clusters • Appropriate load balancing across multiple sites. • Security of data stored including metadata and user data.
• Policy issues • Privacy of user data • Copyright issues.
Teregowda Cloud’10 USENIX’10
SeerSuite Research/Development Opportunities
• Old Seers
– Improve or revive old systems and port them into the competitive SeerX space
• eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
• New Seers
– New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
– MyCiteSeerX
• Better features
– Parsing
– Entity disambiguation
– Citation analysis
– Ranking; ranking, ranking
• New features
– New parsing, indexing, ranking
• Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
– Homepage linking
– ORE search and data integration
– Collaborative spaces
– API/web services
– Integration with DLs such as Fedora
– New clusters
• Topics, venues, affiliations
– Recommender systems
– SNA analysis
– Others
Collaborations welcomed! Data and software available
Research SeerSuite supports
• Many uses as a research testbed and support structure
– Scaling of algorithms for IR, IE, data mining, social networks, …
– NLP methods on large text collections
– ML methods to automatically extract data
– Novel indexing and ranking
– Federated search
– Collaborative and social networks
– Focused crawling for new data resources
– Interface design and integration
– Systems analysis
• Many development and applied research issues
– Integration with other DLs
– Automated feature development
– Transfer to nontechnical use
– Cloud-based delivery
Summary
• Proposes an infrastructure for academic and scientific search engine / digital library creation: SeerSuite
– Modular, scalable, extensible, robust
– Based on commercial-grade open source (Solr/Lucene); easy to use
– Easy to apply to other domains (separable indexes and projects; integration)
• Allows scalable data mining and information extraction for actual systems
– Unique information extraction plugins
– Focus on unique scalable extraction / data mining methods
• Most methods less than O(N^2) complexity
– Automatically populates databases or data structures
• Demonstrated with beta systems in
– Computer science, archaeology, chemistry, robots.txt, PubMed, YouSeer, tables, figures, maps, references, collaborations, disambiguation
– Personal features
• Systems are reasonably easy to build; the issues are
– Data collection or data access
– Information extraction, indexing, ranking
• Many uses as a research testbed
– Data sharing models
• To find a Seer, search Google or use my homepage
Opportunities
• Science is being flooded with data
– Simulations, sensors, the web
• Digital humanities is right behind
• Needs in
– Large-scale data management (tera to peta)
• NoSQL databases: graphs, documents, floating point
– Large-scale
• data mining
• information extraction
• search
• Domain expertise crucial
• Reuse, don't reinvent (much is out there)
• Solr/Lucene is great for demos, production, and research
• clgiles.ist.psu.edu • [email protected] • SourceForge.com
“Human attention is the scarce resource, not information.” Herbert A. Simon, Nobel Laureate, 1997.
For more information