Using Lucene/Solr to Build CiteSeerX and Friends
DESCRIPTION
Presented by C. Lee Giles, Pennsylvania State University. See complete conference videos: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Cyberinfrastructure, or e-science, has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building an integrated search engine and digital library that covers all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formula search, table indexing, etc. We propose the open source SeerSuite architecture, a modular, extensible system built on successful OS projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built, all using Solr/Lucene: computer science (CiteSeerX), chemistry (ChemXSeer), archaeology (ArchSeer), acknowledgements (AckSeer), reference recommendation (RefSeer), collaboration recommendation (CollabSeer), and others. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, and sequence mining, are critical for performance.

TRANSCRIPT
Using Lucene/Solr to Build CiteSeerX and Friends
Dr. C. Lee Giles Information Sciences and Technology Computer Science and Engineering The Pennsylvania State University
University Park, PA, USA [email protected]
http://clgiles.ist.psu.edu
Prof. C. Lee Giles
• Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
– Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
– Large heterogeneous data and information systems
– Specialty search engines and portals for knowledge integration
• CiteSeerX (computer and information science)
• ChemXSeer (e-chemistry portal)
• GrantSeer (grant search)
• RefSeer (recommendation of paper references)
• Scalable intelligent tools/agents/methods/algorithms
– Information, knowledge and data integration
– Information and metadata extraction; entity disambiguation
– Unique search, knowledge discovery, information integration, data mining algorithms
– Web 2.0 methods
• Automated tagging for search and information retrieval
• Social network analysis
SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
• P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, others.
• Current funding: NSF, Dow Chemical
Outline
• Motivation
– Data science; cyberinfrastructure
– Vast growth in domain science data and documents
• SeerSuite
– Tool for creating Seers
– Specialized data and document search and recommendations
• Tables, formulae, figures, references …
– Use of Solr/Lucene
• Disciplinary sciences, indexes & information extraction (the Seers)
– Computer science
– Chemistry
– Briefly other Seers
• Opportunities for Research
• Conclusions and Directions
The Evolution of Science - the 4th Paradigm
• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Driven Science
– Data captured from the web, by instruments, or from documents
– Data generated by simulation
– Placed in data structures / files
– Scientist(s) analyze(s) data
– Access & search crucial
Jim Gray’s paradigm
Data Access Varies with Discipline or Small vs Big Science
• Small vs Big science
– “Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time, Small Science will generate 2-3 times more data than Big Science.”
• ‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
– Data is local – Data will not be shared
• At some point there will be needed – indices to control search – parallel data search and analysis
• Cyberinfrastructure can help
– If you can’t move the data around, take the analysis to the data!
– (Consider the bandwidth of a van loaded with disks.)
– Do all data manipulations locally
• Build custom procedures and functions locally
SeerSuite
• Open source search engine and digital library toolkit used to build search engines and digital libraries
– CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
• Supports research in
– Indexing and search
– Digital libraries
– Data mining & structures
– Information and knowledge extraction
– Social networks
– Scientometrics/infometrics
– Systems engineering, user design
– Software engineering and management
– Web crawling
• Trains students in search and software systems
– Educational tool for search engine creation
– Students highly sought in industry and government
SeerSuite - properties
• Modular, scalable, extensible, robust design
– Extensible to many problems and disciplines
• Integrated features
– Focused crawler - Heritrix
– Indexer - Solr/Lucene
– Metadata extraction - modular
– Ranked results
• Builds on experience with other domain engines and OS tools
– Lucene and Solr
– The MySQL Database and InnoDB Storage Engine
– Apache Tomcat
– Spring Framework
– Acegi Security
– ActiveMQ
– ActiveBPEL Open Source Engine
– Apache Commons Libraries
– SVMlight support vector machine package
– CRF++ conditional random field package
• Hardware independent; Linux
• Reuse not reinvent
Data Mining & Information Extraction in Seers
• Data acquisition
• SeerSuite systems often crawl the public web for new data
• Many data types available
• Richness of data offers unique data mining features • CiteSeerX as testbed/sandbox
• Large scale data resources • Millions of documents, authors, etc. • Some common features/metadata
• Commercial grade indexer (Solr/Lucene)
• Scalable to G’s of documents and M’s of users • “Watson”
• Modular design • Cloudable
• State of the art algorithms (machine learning) for large scale unique metadata (information) extraction & mining
• Unique parsers and indexing • Quality of extraction • Precision/recall • Ranking • Architecture/integration
Seer Friends • In various stages of the system lifecycle with various data resources
and indexes: – Mature and developing, code released
• CiteSeer, now CiteSeerX • ChemXSeer • TableSeer • YouSeer
– New, future TBD, not all aspects public • ArchSeer • AlgoSeer • CollabSeer • RefSeer • SeerSeer • GrantSeer
– Dead or limping by (could be revived) • AckSeer (acknowledgement indexing) (revived!) • BizSeer • BotSeer
– Proposed, but do not exist • BrainSeer • CensorSeer • ArXivSeer
Why Solr/Lucene?
• Only open source considered – cost
• Competitors:
– Indri
– Wumpus
– Terrier
– Others?
• Must scale for both number of documents and users
• Easily integrable and customizable
– Other indexes, crawlers, ingestion, metadata extractors
• Well used (Watson)
• Active community of support
– Enterprise platform a plus
• Easy to transition to government/industry/academia
– Apache license
http://citeseerx.ist.psu.edu
Next Generation CiteSeer, CiteSeerX
• 2 M documents
• 40 M citations
• 2 to 5 M authors
• 2 to 4 M hits/day
• 800K individual users
• entire data shared
• Index - 50 G
History: CiteSeer (aka ResearchIndex)
C. Lee Giles
Kurt Bollacker
Steve Lawrence
Project at NEC Research Institute, Princeton. First academic document search engine. Very popular with computer science.
Hosted at NEC from 1997 – 2004. Moved to Penn State as collaborators left.
Provided a broad range of unique services including automatic citation indexing, reference linking, full text indexing, similar document listing, automated metadata extraction and several other pioneering features.
Refactored and redesigned as CiteSeerX. Released 2008. Lucene based indexing.
CiteSeer continuously running for 15 years!
SeerSuite/CiteSeerX Architecture
• Web Application
• Focused Crawler
• Document Conversion and Extraction
• Document Ingestion
• Data Storage
• Maintenance Services
• Federated Services
Teregowda, USENIX ‘10
4 systems:
• Production • Crawling • Staging • Research
All or some can be cloudized
CiteSeerX Services
CiteSeerX is a highly automated system:
Full OAI metadata if available
Full text indexing (many different indexes)
- Documents
- Citations
- Tables
- More forthcoming (algorithms, figures, acknowledgements)
Citation graph. Ranking based on citations. Linking documents
- Co-citations
- Citing documents
Author disambiguation. Distinguish between authors with similar names. Profiles and publication information for each author.
Automatic crawling from lists and submissions
Personalization
- Login based access to features on CiteSeerX.
- Corrections to metadata.
- Storage of queries.
- Collection of papers.
- Follows document metadata changes.
Focused Crawling
• Maintain a list of parent URLs where documents were previously found
– Parent URLs are usually academic homepages.
• 300,000 unique parent URLs, as of summer 2011
– Parent URLs are stored in a database table with two additional fields for scheduling:
• Last time changed, to get new documents from the page.
• Estimated change rate according to previous crawls of this page.
• The crawling process starts with the scheduler selecting the 1000 parent URLs with the highest probability of having new documents available.
– Assume a Poisson process for the change behavior of a parent page.
• Suppose a parent page P’s last observed change occurred at time t1, and its estimated change rate is R; then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 − exp(−R·Δ)
• Larger R or larger Δ gives a larger probability.
• After each crawl, the change rate of the scheduled parent URL is recalculated.
• Crawling runs incrementally daily (invoked by a Linux cron job at 12 am)
– Most discovered documents have been crawled before.
• Use hash table comparison for detection of new documents
• Normally retrieve a few thousand NEW documents per day, sometimes less than 1K.
• Moved to a whitelist vs. blacklist
Zheng, CIKM’09
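The Poisson change-probability scheduling above can be sketched in a few lines. The parent-URL record layout and field names here are hypothetical, not CiteSeerX's actual schema:

```python
import math

def change_probability(rate, elapsed):
    """P(page changed since last visit) under a Poisson change model:
    1 - exp(-R * delta), where R is the estimated change rate."""
    return 1.0 - math.exp(-rate * elapsed)

def schedule(parents, now, k=1000):
    """Pick the k parent URLs most likely to have new documents."""
    scored = [(change_probability(p["rate"], now - p["last_change"]), p["url"])
              for p in parents]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]

parents = [
    {"url": "http://example.edu/~alice/pubs", "rate": 0.5, "last_change": 0.0},
    {"url": "http://example.edu/~bob/papers", "rate": 0.1, "last_change": 0.0},
]
# A larger estimated rate R or a longer elapsed time delta gives a larger probability.
print(schedule(parents, now=2.0, k=1))
```

After each crawl the scheduler would update each page's estimated rate R from the observed change history, as the slide describes.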
Documents from crawled URLs:
- 90% of all citations come from the first 550 sites
- 90% of all documents come from the first 1250 sites
How will we get metadata for fields?
Metadata Extraction
• Documents are converted from PDF/PS to text using converters.
– Converters include TET, pdfbox, pdftotext, gs.
• Documents are filtered by checking for the existence of references and for duplication (checksum).
• Use tools or build your own
– The metadata extraction system uses machine learning methods like SVM (Header Parser) and CRF (ParsCit) to extract various entities from the document.
• Rule based templates are applied before extraction.
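A minimal sketch of the filtering step above, assuming a simple marker-based reference check and a SHA-1 checksum for duplicate detection; the real converters and filters are more involved:

```python
import hashlib

def sha1_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def accept_document(text: str, raw: bytes, seen_checksums: set) -> bool:
    """Filter sketch: keep a converted document only if it looks like a
    scholarly paper (has a reference section) and is not a duplicate."""
    has_references = any(marker in text.lower()
                         for marker in ("references", "bibliography"))
    checksum = sha1_of(raw)
    if not has_references or checksum in seen_checksums:
        return False
    seen_checksums.add(checksum)
    return True

seen = set()
paper = b"...paper body..."
print(accept_document("Intro ... References [1] ...", paper, seen))  # True
print(accept_document("Intro ... References [1] ...", paper, seen))  # duplicate -> False
```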
Automatically Created DB Record of a Paper in CSX

Field        | Value
id           | 10.1.1.130.782
version      | 2
cluster      | 9248987
title        | Tensor Decompositions and Applications
abstract     | This ..
year         | 2009
venue        | SIAM REVIEW
venueType    | JOURNAL
pages        | 455-500
publisher    | SIAM
public       | True
n-cites      | 34
selfCites    | 6
crawldate    | 12/30/2008
repositoryID | 10

Assigned by: System, Extractor, User, or Inference, depending on the field.

Source record: “Tensor Decompositions and Applications”, SIAM REVIEW, 2009, pp 455-500. Abstract: This …. Cited 34 times, 6 times by author.
3 Tier Architecture
(Diagram: user requests and queries pass through a load balancer to the web application tier (Web 1, Web 2); the web application queries the indexes (full text and tables), the repository, and the database; extraction, ingestion, and the crawler feed the storage tier.)
CiteSeerX Software Overview
• Ingestion process: responsible for obtaining and preparing a document and the related metadata.
– Process the document
• Submitted by the user or crawler
– Extract metadata
• Header
• Citations
• Acknowledgements
– Store the metadata and documents.
• Citation matching
– Identifying the underlying graph structure – documents citing this document and the relationship between documents and citations
• Inference matching and graph generation
– User corrections (version maintenance) – determine and accept valid user corrections
– Regular notification mechanisms – ensure that the user is notified when new documents are added to the collection
• Linked to MyCiteSeer.
• Update and maintenance
– Update and validate the full text index and various statistics.
– Statistics
– Index updates
CiteSeerX Search
Enabling search
Fulltext
Fields created
- Title
- Authors
- Citations
- Venue
- Keywords
- Abstract
- Range (publication)
- Citations
Field Schema

Field               | Type    | Indexed/Stored
DOI                 | String  | Y/Y (unique)
Citation/Document   | String  | Y/Y
Title               | Text    | Y/Y
Author              | A Text* | Y/Y
Authors Normalized  | A Text* | Y/N
ncites (# cited by) | Integer | Y/Y
URL                 | String  | Y/Y
cites               | Tokens^ | Y/N
citedby             | Tokens^ | Y/N
Timestamp           | Date    | Y/Y

* A Text is a Text field without a stopword filter or stemming
^ Tokens is a Text field with only duplicate removal and a whitespace tokenizer
CiteSeerX Search Results
Results sorting
Relevance (default)
- Based on dismax query handling with boosting.
Citations
- Citations received by the document in the collection, plus a default.
Year
- Publication date.
Recency
- Date of acquisition.
CiteSeerX Citation Graph
Relationships
Citation graph
- Store Cited by and Cites in the index
Build
- Build the document graph by querying the index for relationships.
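The Cites / Cited-by fields make graph construction a matter of index lookups. A sketch with an in-memory dict standing in for the Solr index; the record layout and document IDs are illustrative:

```python
# In-memory stand-in for the index: each record carries "cites" and
# "citedby" token fields, as in the schema described earlier.
index = {
    "A": {"cites": ["C", "D"], "citedby": []},
    "B": {"cites": ["C"], "citedby": []},
    "C": {"cites": ["E"], "citedby": ["A", "B"]},
    "D": {"cites": [], "citedby": ["A"]},
    "E": {"cites": [], "citedby": ["C"]},
}

def neighborhood(doc_id, depth=1):
    """Build a local citation graph by repeatedly querying the index
    for each document's Cites / Cited-by relationships."""
    edges, frontier = set(), {doc_id}
    for _ in range(depth):
        nxt = set()
        for d in frontier:
            rec = index.get(d, {"cites": [], "citedby": []})
            for tgt in rec["cites"]:
                edges.add((d, tgt))
                nxt.add(tgt)
            for src in rec["citedby"]:
                edges.add((src, d))
                nxt.add(src)
        frontier = nxt
    return edges

print(sorted(neighborhood("C")))
```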
(Diagram: a small citation graph over documents A-E, with directed Cites and Cited-by edges.)
Adding documents
Ingest documents from new crawls
- Add metadata to collection
- Add full text to system
- Link metadata in collection
Run maintenance scripts
- Poll updates and post to Solr.
Fulltext
Metadata
Relationships
Challenge: maintain data freshness.
Web Interface
• Query forwarded to Solr from the presentation layer (JSP)
• Solr generates a ranked response in JSON
• Each record is built in XML with fields added from the database (e.g., Abstract)
• Presentation layer (JSP) formats records based on ranking.
Ranking with Boosting (Relevance)
Use of boost function, minimum match, query fields
Boost function – the effect of citations
- Map number of citations > 1 to 500
Minimum match – 2
Query fields
- Text (1)
- Title (4)
- Abstract (2)
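These settings map naturally onto Solr's dismax request parameters. The parameter set below is illustrative, not CiteSeerX's production configuration; in particular the boost function on `ncites` is an assumption:

```python
# Hypothetical dismax parameter set mirroring the slide: text (1), title (4),
# abstract (2) query fields, a minimum match of 2, and a citation-count boost.
dismax_params = {
    "defType": "dismax",
    "qf": "text^1 title^4 abstract^2",  # query fields with boosts
    "mm": "2",                          # minimum match
    "bf": "ncites",                     # boost function: more citations, higher score
}

def build_query(user_query: str) -> dict:
    """Assemble the full parameter map sent to Solr for one search."""
    params = dict(dismax_params)
    params["q"] = user_query
    return params

print(build_query("tensor decomposition")["qf"])
```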
Query Response
- Query at interface (JSP)
- Hand over to web application (Java/Spring)
- Hand over to Solr
- Ranked response from Solr (JSON)
- Response unwrapped and more details included with information from the DB
- Present response at interface (JSP)
(Diagram: a query Q flows from the web interface through the web application to the index and the DB; the ranked response R returns as JSON, is unwrapped into HashMaps, and is presented as text.)
Name Disambiguation
• Name disambiguation (NER)
– A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
• Three types of name ambiguities:
– Aliases - one person with multiple aliases, name variations, or a name change, e.g. CL Giles & Lee Giles, Superman & Clark Kent
– Common names - more than one person shares a common name, e.g. Jian Huang – 103 papers in DBLP
– Typography errors - resulting from human input or automatic extraction
• Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
• Entity disambiguation problem
– Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, information from crawling such as host server, etc.
– Entity normalization
• Motivation
– Enhance search functionalities for digital repositories
• Fielded search by author name
– Improve metadata quality
– Improved social network analysis
– Government and business intelligence
• E.g. census data and credit records
• Challenges
– Accuracy
– Scalability
– Expandability
Efficient Large Scale Entity Disambiguation. Testbed: CiteSeerX and PubMedSeer
(Diagram: disambiguation pipeline. A metadata extraction module feeds a blocking module; similarity functions (Jaccard similarity, Soft-TFIDF) and an online SVM distance function trained with active learning score candidate author/paper pairs; a DBSCAN clustering module groups them into candidate classes.)
• Key features
– LASVM distance function
• Active learning
– Simpler and more accurate model
– Better generalization power
• Online learning
– Expandable to new training data
– DBSCAN clustering
• Ameliorates labeling inconsistency (transitivity problem)
• Efficient solution to find name clusters
• N log N scaling
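A minimal DBSCAN over an arbitrary pairwise distance, standing in for the learned LASVM distance described above. This naive version scans all pairs; the N log N behavior quoted above would need an indexed neighborhood query. The sample "author records" are plain numbers for illustration:

```python
def dbscan(points, dist, eps=0.5, min_pts=2):
    """Minimal DBSCAN over an arbitrary (possibly learned) distance function.
    Returns one cluster label per point; -1 marks noise."""
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)
    cluster = 0

    def region(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:      # border point: absorb, don't expand
                labels[j] = cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            nn = region(j)
            if len(nn) >= min_pts:      # core point: expand the cluster
                queue.extend(nn)
        cluster += 1
    return labels

# Two tight groups of "author records" on a line, plus one outlier.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
print(dbscan(pts, dist=lambda a, b: abs(a - b), eps=0.3, min_pts=2))
```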
Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009
Author Disambiguation Field
• Currently uses author fields
– For author search (both for author mentions and for disambiguated authors)
• Future direction
– Use the Lucene index for blocking in author disambiguation – creating the candidate set of author mentions that could belong to the same cluster
Author Disambiguation
• Random Forest (RF)
– Uses random feature selection + bootstrap sampling to construct multiple decision trees from one training set
– Aggregates the votes of a collection of decision trees as the final decision
– The more independent each tree is, the better the improvement over a single decision tree
• Author disambiguation with Random Forest
– Various metadata is used as features in the Random Forest to determine whether two author names from two papers refer to the same person
• E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
– Multiple distance functions are used for each type of metadata
• E.g. TFIDF, Jaccard distance, for comparing affiliations
• Compared with the previous SVM-based approach
– Shown to provide higher accuracy than SVM in the pairwise author disambiguation task
– Easy parameterization in the training phase (only the number of trees and the randomness at each node, no choice of kernel function needed), and performance is not sensitive to the parameters chosen
– Provides a measure of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for an SVM with a non-linear kernel
– Training time & classification time are linear in the number of trees and the data size
• Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)
Treeratpituk, Giles, JCDL09
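A sketch of the pairwise feature vector such a classifier would consume. The feature set, record fields, and sample values below are illustrative, not the published system's:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(rec1, rec2):
    """Feature vector for one (author mention, author mention) pair,
    fed to a pairwise same-person classifier."""
    return [
        1.0 if rec1["last"] == rec2["last"] else 0.0,            # same surname
        1.0 if rec1["first"][:1] == rec2["first"][:1] else 0.0,  # first initial
        jaccard(rec1["coauthors"], rec2["coauthors"]),           # shared coauthors
        jaccard(rec1["affil"].lower().split(),
                rec2["affil"].lower().split()),                  # affiliation overlap
    ]

a = {"first": "C", "last": "Giles", "coauthors": {"Mitra", "Councill"},
     "affil": "Pennsylvania State University"}
b = {"first": "Lee", "last": "Giles", "coauthors": {"Mitra", "Treeratpituk"},
     "affil": "Penn State University"}
print(pair_features(a, b))
```

A Random Forest trained on many labeled pairs of this shape then votes on whether the two mentions refer to the same person.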
Data and Publications in the Field of Chemistry
Chemistry
• Not physics (no arXiv) or computer science (no CiteSeer)
• Legacy of early information access - Chem Abstracts
• Cheminformatics is not bioinformatics
Chemistry has until recently been a data-poor field
- Data sharing tradition just being established
- Data creation is exploding - local (small science)
Journals and societies sensitive to their IP issues dominate the field
- Unsubstantiated IP claims, such as that data in the paper belongs to the publisher
- Discourage online versions of publications - ACS
Large powerful international companies have a vested interest in research
- Chemical information extraction tools are easily monetized
- Standards exist - CML, InChI
“Fixing the past so we can fix the future.” – Jeremy Frey
Chemistry is an old discipline with publications going back 100 years
Chemistry is compound centric, not algorithm centric
- Search is about the compound!
- Compounds have a rich data environment
- 3D graph structure, energies, etc.
ChemXSeer Architecture
Integrate and implement well-used open source tools
- Use CiteSeerX tools when possible
- Integrate into SeerSuite
Search
- Chemical formulae unique search
- Table search
- Figure search
- More data (grey literature) than documents
• Automated information extraction modules based on machine learning methods
• Lucene/Solr indices for extracted fields
• Relational databases for datasets
Work closely with chemists to understand their needs
- Tools for data conversion
Provide a public portal and repository for easy use
- User access controls
- Integrated visualization tools like JMOL for Gaussian data residing in our repository
- APIs for users for extracted data
Data and document standards de facto: XML, PDF, etc.
chemxseer.ist.psu.edu
ChemXSeer Formula Search
• Extraction and search of chemical formulae in scientific documents has been shown to be very useful.
• Intersection of two research areas: • Information retrieval • Chemoinformatics
• Formulae cannot be treated as text. • Domain knowledge (formula identification) • Structural knowledge (substructure finding and search)
B. Sun, WWW’07, WWW’08, TOIS’11 D. Yuan, ICDE’12
Challenges in Formula Search
How to identify a formula in scientific documents?
Non-Formula “… This work was funded under NIH grants …” “ … YSI 5301, Yellow Springs, OH, USA …” “… action and disease. He has published over …”
Formula “… such as hydroxyl radical OH, superoxide O2- …” “ and the other He emissions scarcely changed …”
Machine learning algorithms (SVM + CRF) yield high accuracies for correct formula identification.
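The "OH" ambiguity above is a token-classification problem; a toy extractor of the kind of context features an SVM/CRF could be trained on. The cue word lists are invented for illustration, and the real systems use far richer feature sets:

```python
def token_features(tokens, i):
    """Context features for deciding whether token i is a chemical formula
    (e.g. 'OH' the hydroxyl radical) or ordinary text (e.g. 'OH' the US state)."""
    tok = tokens[i]
    prev_w = tokens[i - 1].lower() if i > 0 else ""
    next_w = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "all_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        # chemistry-context cue in the preceding word
        "prev_chemical_cue": prev_w in {"radical", "superoxide", "acid", "compound"},
        # address-context cue in the following word
        "next_is_state_cue": next_w in {"USA", "US"},
    }

sent = "such as hydroxyl radical OH , superoxide".split()
print(token_features(sent, sent.index("OH")))
```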
Segmenting chemical names
• Goal: to discover semantically meaningful sub-terms in chemical names
– Methylethyl alcohol
– methionylglutaminylarginyltyrosylglutamylserylleucyl
phenylalanylalanylglutaminylleucyllysylglutamylarginyl lysylglutamylglycylalanylphenylalanylvalylprolylphenyl alanylvalylthreonylleucylglycylaspartylprolylglycylisol eucylglutamylglutaminylserylleucyllysylisoleucylaspartyl threonylleucylisoleucylglutamylalanylglycylalanylaspartyl alanylleucylglutamylleucylglycylisoleucylprolylphenyl alanylserylaspartylprolylleucylalanylaspartylglycylprolyl threonylisoleucylglutaminylasparaginylalanylthreonylleucyl arginylalanylphenylalanylalanylalanylglycylvalylthreonyl prolylalanylglutaminylcysteinylphenylalanylglutamyl methionylleucylalanylleucylisoleucylarginylglutaminyllysyl histidylprolylthreonylisoleucylprolylisoleucylglycylleucyl leucylmethionyltyrosylalanylasparaginylleucylvalylphenyl alanylasparaginyllysylglycylisoleucylaspartylglutamylphenyl alanyltyrosylalanylglutaminylcysteinylglutamyllysylvalyl glycylvalylaspartylserylvalylleucylvalylalanylaspartylvalyl prolylvalylglutaminylglutamylserylalanylprolylphenylalanyl arginylglutaminylalanylalanylleucylarginylhistidylasparaginyl valylalanylprolylisoleucylphenylalanylisoleucylcysteinyl prolylprolylaspartylalanylaspartylaspartylaspartylleucyl leucylarginylglutaminylisoleucylalanylseryltyrosylglycyl arginylglycyltyrosylthreonyltyrosylleucylleucylserylarginyl alanylglycylvalylthreonylglycylalanylglutamylasparaginyl
Chemical Search Aspects
• Parsing • Extraction and tagging • Indexing • Ranking
Chemical Entity Extraction and Tagging
• Name tagging
– Each chemical name can be a phrase
– Example
• "... Determination of lactic acid and ..."
• "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
• Formula tagging
– Each formula is a single term
– Example
• "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
• "... YSI 5301, Yellow Springs, OH, USA ..."
• Tagging examples
– Name tagging:
"... of <name-type>lactic acid</name-type> and ..."
– Formula tagging:
"... radical <formula-type>OH</formula-type>, superoxide ..."
Textual Chemical Molecule Information Indexing and Search
• Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically and then index the substrings at each node
(Diagram: segmentation tree splitting methylethyl / ethylmethyl into methyl and ethyl, then meth + yl and eth + yl, then me + th.)
• Frequency-and-discrimination-based index scheme
– Used for indexing chemical formulas
– Sequentially select frequent and discriminative subsequences of a formula, from the shortest to the longest
• Index schemes:
– Which tokens to index?
– Indexing all subsequences generates a large index
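A sketch of the segmentation-based scheme: split a name into known sub-terms recursively and index the substring at each node, rather than every possible substring. The sub-term lexicon here is a hand-made stand-in; how the real system finds split points is not shown:

```python
def segment_index_tokens(name, lexicon):
    """Collect the index tokens for one chemical name: the name itself plus
    the substring at every node of a (greedy) hierarchical segmentation."""
    tokens = {name}  # root node

    def split(term):
        for sub in lexicon:
            if term.startswith(sub) and len(sub) < len(term):
                rest = term[len(sub):]
                tokens.add(sub)
                tokens.add(rest)
                split(sub)
                split(rest)
                return

    split(name)
    return tokens

lexicon = ["methyl", "ethyl", "meth", "eth", "yl"]
print(sorted(segment_index_tokens("methylethyl", lexicon)))
```

Far fewer tokens are indexed than the full set of substrings, which is the point of the scheme.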
Features for Formula Indexing
• Formula
– A sequence of chemical elements or partial formulas with corresponding frequencies
– E.g. CH3(CH2)2OH
• Partial formula
– Partial formula: a subsequence of a formula
– E.g. C, H, O, CH3, CH2, OH, CH3(CH2)2, H3(CH2)2, CH3(CH2)2O, etc.
• Index construction
– Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
– Too many partial formulas, need feature selection
Criteria of Feature Selection
• Criteria of feature selection
– Frequent features (Freq_s ≥ Freq_min)
– Discriminative features (α_s ≥ α_min)
• If a sequence’s selected subsequences are enough to distinguish the formulas containing them from other formulas, this sequence is redundant.
• Discrimination score
α_s = | ⋂_{s' ∈ F, s' ⊑ s} D_{s'} | / | D_s |
where F is the selected feature set, and D_s is the set of formulas containing s.
An Example for Formula Indexing
• Data set:
– 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
• Parameters:
– Freq_min = 2, α_min = 1.1
• Steps:
– Length=1: Candidates={C,H,O}, F={C,H,O}
– Length=2: Candidates={CH3,H3C,CO,OO,OH,CH2}, Frequent Candidates={CH3,CO,OO,OH,CH2}
Frequent & Discriminative Candidates={CO,OO,CH2}, F={C,H,O,CO,OO,CH2}
– Length=3, …
• E.g. α_CH3 = |D_C ∩ D_H3| / |D_CH3| = |{1,2,3} ∩ {1,2,3}| / |{1,2,3}| = 1 (redundant, < α_min)
α_CO = |D_C ∩ D_O| / |D_CO| = |{1,2,3} ∩ {1,2,3}| / |{1,3}| = 1.5 (discriminative, ≥ α_min)
Formula Search
• SF.IEF: Subsequence Frequency & Inverse Entity Frequency
• Exact formula search
– Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.
• Frequency formula search
– Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
– Partial frequency search: similar but allows unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
– Ranking function
SF(s,e) = Freq(s,e) / |e|,    IEF(s) = log( |C| / |{e : s ⊑ e}| )
score(q,e) = Σ_{s∈q} SF(s,e) · IEF(s) / ( |e| · √( Σ_{s∈q} IEF(s)² ) )
where C is the formula collection, |e| is the length of formula e, and s ⊑ e means partial formula s occurs in e.
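The SF and IEF components of the SF.IEF ranking described above can be computed over a toy collection. This treats formulas as plain strings and s ⊑ e as substring occurrence, a simplification of partial-formula matching:

```python
import math

def SF(s, formula):
    """Subsequence frequency of partial formula s in formula e,
    normalized by the formula length |e|: SF(s,e) = Freq(s,e) / |e|."""
    count = sum(1 for i in range(len(formula) - len(s) + 1)
                if formula[i:i + len(s)] == s)
    return count / len(formula)

def IEF(s, collection):
    """Inverse entity frequency: log(|C| / |{e : s occurs in e}|).
    Partial formulas appearing in few formulas score higher."""
    containing = sum(1 for e in collection if s in e)
    return math.log(len(collection) / containing) if containing else 0.0

C = ["CH3COOH", "CH3CH2OH", "CH4"]
print(round(SF("OH", "CH3COOH"), 3))   # one occurrence in a length-7 formula
print(round(IEF("COOH", C), 3))        # appears in 1 of 3 formulas
```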
Formula Search - substructure
• Substructure formula search
– Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
– Ranking function: score(q,e) = W_match(q,e) · SF(q,e) · IEF(q) / |e|, where W_match(q,e) is the weight for exact match, reverse match, or parsed match
• Similarity formula search
– Search for formulas with a similar structure to the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
– Ranking function: score(q,e) = Σ_{s⊑q} W_match(q,e) · W(s) · SF(s,q) · SF(s,e) · IEF(s) / |e|
• Conjunctive search of the basic types of formula searches
– E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.
• Document query rewriting
– E.g. the document query atom formula:=CH4 is rewritten to atom (CH4 OR CD4), if formula search of =CH4 matches CH4 and CD4.
Formula Search - Query Models
Many models are possible, from exact to semantic
Models are discriminated by their matching algorithms
• Exact search
– Search for exact representations
– E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2
• Frequency searches
– Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, no unspecified elements
– E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
– Partial frequency search: similar but allows unspecified elements
– E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well
• Substructure search
– Search for formulae that may have a substructure
– E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
• Similarity search
– Search for formulae with a similar structure to the query formula. Feature-based approach using partial formula matching.
– E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
Ranking formulae
• Ranking formulae has to depend on need and importance
• Focus on structural methods and frequency
• Importance can be introduced by citation rank or PageRank or others
• SF.IFF
– Substructure frequency and inverse formula frequency
• Frequency searches
– score(q,f) = Σ_{e∈q} SF(e,f) · IFF(e) / ( |f| · √( Σ_{e∈q} IFF(e)² ) )
– where |f| is the total frequency of elements
• Substructure search
– score(q,f) = W_match(q,f) · SF(q,f) · IFF(q) / |f|
– where W_match(q,f) is the weight for exact match, reverse match, and parsed match
• Similarity search
– score(q,f) = Σ_{s⊑q} W_match(q,f) · W(s) · SF(s,q) · SF(s,f) · IFF(s) / |f|
Chemical compounds as graphs
• Chemical compound modeled as a semantic graph with properties
Above figures are copied from eMolecules.com
Atom: vertex/node in the graph. Bond: edge in the graph. Dimensions: 3 or 4.
What’s Chemical Structure Search
• Substructure Search
– Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
• Superstructure Search
– Given an input chemical structure sketch, find all the important descriptors (substructures/functional groups) contained in the input.
• Similarity Search
– Given an input chemical structure sketch, find all the chemical compounds “similar” to the input.
Table Search
Tables are widely used to present experimental results or statistical data in scientific documents; some data only exists in these tables.
Current search engines treat tabular data as regular text • Structural information and semantics not preserved.
Goal: automatically identify tables, extract table metadata from pdf documents into xml and rank data
Table Metadata Representation: • Environment metadata: (document specifics: type, title,…) • Frame metadata: (border left, right, top, bottom, …) • Affiliated metadata: (Caption, footnote, …) • Layout metadata: (number of rows, columns, headers,…) • Cell content metadata: (values in cells) • Type metadata: (numeric, symbolic, hybrid, …)
Y. Liu AAA’07, JCDL’07.
Tables • A history that pre-dates that of sentential text
– Cuneiform clay tablets • Not received the same level of formal characterization
enjoyed by sentential text • Varying and irregular formats • Different intuitive understanding of what a “table” is.
– Is the Periodic Table of the Elements a table? – Tables vs. Lists? – Tables vs. Forms? – Tables vs. Figures? – Genuine table vs. non-genuine table? [12]
• Our definition: scientific genuine table – Caption + tabular structure – Ruling lines are not required
TableSeer Beta design of a table search engine
TableSeer System
Architecture
Page Box-Cutting Algorithm
• Improves the table detection performance by excluding more than 93.6% of document content at the start
Sample Table Metadata Extracted File
• <Table>
• <DocumentOrigin>Analyst</DocumentOrigin>
• <DocumentName>b006011i.pdf</DocumentName>
• <Year>2001</Year>
• <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors</DocumentTitle>
• <Author>Sang Hyun Park, a Young-Chan Son, a Brenda R. Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, CT 06269-3060</Author>
• <TheNumOfCiters></TheNumOfCiters>
• <Citers></Citers>
• <TableCaption>Table 1 Temperature effect on resistance change (ΔR) and response time of tin oxide thin film with 1% CCl4</TableCaption>
• <TableColumnHeading>Temperature/°C ΔR a /Ω ΔR (R,O2) (%) Response time Reproducibility</TableColumnHeading>
• <TableContent>100 223 5 ~22 min Yes 200 270 9 ~7-8 min Yes 300 1027 21 < 20 s Yes 400 993 31 ~10 s No</TableContent>
• <TableFootnote> a ΔR = (R, CCl4) − (R, O2).</TableFootnote>
• <ColumnNum>5</ColumnNum>
• <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
• <PageNumOfTable>3</PageNumOfTable>
• <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
• </Table>
TableRank
• Ranks tables by scoring <query, table> pairs instead of <query, document> pairs, preventing many of the false-positive hits for table search that frequently occur in current web search engines
• The similarity of a <query, table> pair is the cosine of the angle between their vectors
• A tailored term vector space yields table vectors: query vectors and table vectors, instead of document vectors
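As a minimal sketch (not the authors' exact implementation), the cosine similarity between a query vector and a table vector over a shared term space looks like this:

```python
import math

def cosine_similarity(query_vec, table_vec):
    """Cosine of the angle between a query vector and a table vector.

    Vectors are dicts mapping term -> weight (e.g., TTF-ITTF weights).
    """
    shared = set(query_vec) & set(table_vec)
    dot = sum(query_vec[t] * table_vec[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    t_norm = math.sqrt(sum(w * w for w in table_vec.values()))
    if q_norm == 0 or t_norm == 0:
        return 0.0
    return dot / (q_norm * t_norm)
```

Identical vectors score 1.0; vectors with no shared terms score 0.0, which is what lets TableRank rate a table rather than its whole host document.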
Table Index
Index captions, footnotes, reference text
Boosting Captions
• Function: inversely (reciprocally) proportional to #cites
Term Weighting for Tables
– TTF-ITTF (Table Term Frequency-Inverse Table Term Frequency)
– TLB: Table-Level Boost factors (e.g., table frequency)
– DLB: Document-Level Boost factors (e.g., journal/proceedings order, document citation)
Table term ranking
• A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables
• Like document abstracts, table metadata and table queries should be treated as semi-structured text: not complete sentences, expressing a summary
• P = 0.5 (G. Salton 1988)
• b is the total number of tables
• IDF(ijk): the number of tables in which term t(i) occurs in metadata field m(k)
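A hedged sketch of a TF-IDF analogue at the table level (the paper's exact TTF-ITTF formula, with P and per-metadata-field counts, may differ):

```python
import math

def ttf_ittf(term_count, table_len, num_tables, tables_with_term):
    """Table term frequency times inverse table term frequency.

    term_count:       occurrences of the term in this table's metadata
    table_len:        total terms in this table's metadata
    num_tables:       b, the total number of tables in the collection
    tables_with_term: number of tables containing the term
    """
    ttf = term_count / max(table_len, 1)                     # frequency within the table
    ittf = math.log(num_tables / max(tables_with_term, 1))   # rarity across tables
    return ttf * ittf
```

A term appearing in only a few tables gets a large weight; a term appearing in every table gets a weight of zero, matching the discriminator intuition above.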
Table Level Boost and Document Level Boost
Btbf is the boost value of the table frequency, Btrt is the boost value of the table reference text (e.g., the normalized length), and Btp is the boost value of the table position. r is a parameter: 1 if the user specifies the table position in the query, 0 otherwise.
IVj: document Importance Value (IV). If a table comes from a document with a high IV, all the table terms of that document receive a high document-level boost. ICj: the inherited citation value. DOj: source value (the rank of the journal/conference proceedings). DFj: document freshness.
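One plausible way to combine these boost factors; the multiplicative combination is an assumption for illustration, not the paper's exact formula:

```python
def table_level_boost(b_tbf, b_trt, b_tp, r):
    """Combine table-frequency and reference-text boosts; apply the
    table-position boost only when the user specified a position (r = 1)."""
    return b_tbf * b_trt * (b_tp if r == 1 else 1.0)

def document_level_boost(ic, do, df):
    """Document-level boost from inherited citation value (IC),
    source rank (DO) and document freshness (DF), as a simple product."""
    return ic * do * df
```

With r = 0 the position boost drops out entirely, mirroring the role of the r parameter described above.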
Table citation network
• Similar to the PageRank network
– Documents construct a network from citations
– The "incoming links" are the documents that cite the document in which the table is located
– Exponential decay is used to damp the impact of propagated importance
• Unlike the PageRank network
– A directed acyclic graph
– The Importance Value (IV) of a document is not decreased as the number of citations increases
– IV is not divided by the number of outbound links
• A document may have multiple tables, one table, or none
• Each table consists of a set of metadata
• The same keywords may appear in different metadata fields in different tables
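The differences from PageRank can be sketched in a few lines. This is an illustrative propagation over a citation DAG, with an assumed decay constant; the paper's exact scheme may differ:

```python
def importance_values(citations, decay=0.5):
    """Propagate importance through a citation DAG.

    `citations` maps each document to the documents it cites. Every
    document starts with importance 1.0; a citer passes `decay` times
    its own importance to each cited document. Unlike PageRank, the
    passed value is NOT divided by the number of outbound links, and a
    document's IV only grows as its citation count grows.

    For simplicity, assumes the dict keys are topologically ordered
    (citers appear before the works they cite).
    """
    iv = {d: 1.0 for d in citations}
    for doc in citations:
        for cited in citations[doc]:
            iv[cited] += decay * iv[doc]
    return iv
```

For example, a document cited by two others ends up with IV 1.0 + 0.5 + 0.5 = 2.0; more citers can only increase it.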
Table Search Summary
• A novel table ranking algorithm, TableRank, the first of its kind
• A tailored table term vector space
• A table term weighting scheme: TTF-ITTF
– Aggregating impact factors from three levels: the term, the table, and the document
• Indexes table reference texts, term locations, and document backgrounds
• Designed and implemented the first table search engine, TableSeer, to evaluate TableRank and compare with popular web search engines
• Code released
• Currently implemented in CiteSeerX with millions of tables
• Improving extraction, with Dow Chemical support
Automated Figure Data Extraction and Search
• A large amount of results in digital documents are recorded in figures and time series of experimental results (e.g., NMR spectra, income growth), and this is often the only record of the data
• Extraction for purposes of:
– Further modeling using the presented data
– Indexing and metadata creation for storage and search on figures, enabling data reuse
• Current extraction is done manually!
[Architecture diagram: documents → extracted plots and extracted info → plot index + document index → merged index → user / digital library]
Seer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures.
Tools that automate data extraction from figures:
• Increase our understanding of key concepts of papers
• Provide data for automatic comparative analyses
• Enable regeneration of figures in different contexts
• Enable search for documents whose figures contain specific experimental results
X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
Metadata and data to extract from a 2-dimensional plot (document snapshot vs. extracted 2D plot): X-axis label, Y-axis labels, legend, axis units, ticks, data points
Our Approach to Plot Data Extraction
• Identify and extract figures from digital documents
– ASCII and image extraction (xpdf)
– OCR for bitmap, raster PDFs
• Identify figures that are images of 2D plots using an SVM (only for bitmap images)
– Hough transform
– Wavelet coefficients of the image
– Surrounding text features
• Binarization of the identified 2D plots for preprocessing (not needed for vectorized images)
– Adaptive thresholding
• Image segmentation to identify regions
– Profiling or image signature
• Text block detection
– Nearest neighbor
• Data point detection
– K-means filtering
• Data point disambiguation for overlapping points
– Simulated annealing
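The adaptive-thresholding step in the pipeline above can be sketched as follows. This is a minimal pure-Python version comparing each pixel to its local mean; the production system would use a tuned window size and offset:

```python
def adaptive_threshold(image, window=3, c=0):
    """Binarize a grayscale image (list of lists, values 0-255).

    A pixel is foreground (1) if it exceeds the mean of its local
    window minus a constant c, otherwise background (0).
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            vals = [image[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            local_mean = sum(vals) / len(vals)
            out[y][x] = 1 if image[y][x] > local_mean - c else 0
    return out
```

A bright data point on a dark background survives binarization as a 1 while the uniform background maps to 0, which is what the downstream segmentation and data-point detection steps consume.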
• System integration within ChemXSeer or CiteSeerX
– XML data generation
– Open source tool in Lucene/Solr
• Extension to other figures (3D, …)
Future Directions
ChemXSeer Highlights • Portal for academic researchers in environmental chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
• Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
• Chemical formulae and names
• Tables
• Figures
• Publication functions as in CiteSeerX
• Interoperability: ORE-Chem development
• Novel ranking required
• After extraction, data is stored as API-accessible XML for users
• Hybrid repository (not fully open): serves as a federated information interoperation system
– Scientific papers crawled and indexed from the web
– User-submitted papers and datasets (e.g., Excel worksheets, Gaussian and CHARMM toolkit outputs)
– Scientific documents and metadata from publishers (e.g., Royal Society of Chemistry)
• Access control for publisher-provided content and user-submitted experiment data
• Takes advantage of developments in other funded cyberinfrastructure and open source projects
• CiteSeerX, PlanetLab, Lucene/Solr, ORE, others • Some released open source
• CollabSeer currently supports 400k authors
• http://collabseer.ist.psu.edu
Experimental Collaborator recommendation system
Collaboration recommendation
• Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
• Use the social network and topics to recommend collaborators of collaborators (friends of friends)
• Devise a social-network index and ranking scheme
• Explore models of vertex similarity
• Built on SeerSuite
• Other recommendations?
– Experimental methods
– Chemicals?
Gou JCDL’10, Gou MIR’10 Chen JCDL’11, SAC’12
Recommendation list and the user's topics of interest
• Users refine the recommendation list by clicking on a topic of interest (left: refined by "query processing"; right: default recommendation list)
• How two potential collaborators are linked by common collaborators
CollabSeer Framework
Integration of Vertex Similarity and Textual Similarity
• S: vertex similarity
• SC.O.T.: the collaborator's contribution to a specified topic
• Use the product of exponential functions so that a zero vertex similarity score or a zero contribution (textual similarity) score does not turn the whole measure into zero
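The product-of-exponentials idea can be sketched like this; the weights alpha and beta are illustrative assumptions, not CollabSeer's exact parameterization:

```python
import math

def collab_score(vertex_sim, topic_contrib, alpha=1.0, beta=1.0):
    """Combine vertex similarity S and the collaborator's contribution
    to a topic (S_C.O.T.) as a product of exponentials.

    Because exp(x) > 0 for all x, a zero in either component scales the
    score rather than annihilating it, unlike a plain product."""
    return math.exp(alpha * vertex_sim) * math.exp(beta * topic_contrib)
```

A candidate with zero vertex similarity but a strong topical contribution still receives a positive score, which a plain S * S_C.O.T. product would zero out.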
• Other measures?
• RefSeerX: recommend citations for papers
• Based on
– Existing citations
– Citation context
– Venue and importance
– Contemporary vs. seminal paper citations
• For authors unaware of related work, who do not know what they are looking for, it recommends related citations
He, WWW ‘10, WSDM ’11; Kataria, CIKM ’10, IJCAI’11,
Expert Search
• Expert search for authors, currently in alpha
Keyphrase Extraction for Experts
[Pipeline: text document → section parser → candidate extractor (DBLP data) → random forest (training data) → top keyphrases]
• Parse the document into sections with regular expressions
• Use DBLP statistics to extract keyphrase candidates
• Train a random forest to classify and rank whether a phrase is a keyphrase
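The first step, a regular-expression section parser, can be sketched as follows. The header patterns here are illustrative only; SEERLAB's actual section headers and regexes differ:

```python
import re

# Illustrative header set; the real system recognizes many more sections.
SECTION_RE = re.compile(
    r"^(abstract|introduction|related work|conclusions?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def parse_sections(text):
    """Split a plain-text document into (header, body) pairs."""
    matches = list(SECTION_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).lower(), text[start:end].strip()))
    return sections
```

Candidate keyphrases would then be extracted per section, so that, e.g., abstract terms can be weighted differently from related-work terms.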
Treeratpituk, P., Teregowda, P., Huang, J. and Giles, C. L. SEERLAB: A System for Extracting Keyphrases from Scholarly Documents. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. ACL Workshop on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
GrantSeer
• Prototype search engine for PI profiles and their grant information, to assist funding agencies, deans of research, and foundations
• Links PIs with their
– Grants
– Publications
– Citations
– Organization
– Expertise
– Others?
• Data that can be shared
– CiteSeerX or Google Scholar data
– Databases of funded research
Funded by NSF – Julia Lane
Cover page NSF XML extraction
GrantSeer: PI profile
• Grants awarded
• Publications + citations
• PI's expertise
Algorithm Search
• Homepage search for authors, currently in alpha
AlgorithmSeer
Algorithm Search
– Extraction
– Indexing
– Ranking
SUITE Workshop, ICSE '11
Metadata extraction
• Extract
– Pseudo-code and its metadata
– Captions
– Reference sentences
– Synopsis
– Etc.
• Index the metadata using Solr to make the pseudo-code searchable
• Each search result has a pointer to the page in the document where the pseudo-code appears
Index Fields
id <string>
caption <text>
reftext <text> (reference sentences)
synopsis <text> (summarizing text)
page <sint> (page number)
paperid <string> (document ID)
year <sint> (year of publication)
ncites <sint> (number of citations)
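A document matching these fields could be assembled for Solr's JSON update format like this. The field names follow the list above; the helper itself is an illustrative sketch (posting the payload to the /update handler is left out):

```python
import json

def make_solr_doc(meta):
    """Build a Solr add-document JSON payload for one pseudo-code entry."""
    doc = {
        "id": meta["id"],             # <string>
        "caption": meta["caption"],   # <text>
        "reftext": meta["reftext"],   # reference sentences
        "synopsis": meta["synopsis"], # summarizing text
        "page": meta["page"],         # page number <sint>
        "paperid": meta["paperid"],   # document ID
        "year": meta["year"],         # year of publication
        "ncites": meta["ncites"],     # number of citations
    }
    return json.dumps([doc])
```

The `page` field is what lets each search result point back to the page of the source document where the pseudo-code appears.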
AckSeer
Name | Acknowledgements | Total Citations | C/A Metric

Funding Agencies:
National Science Foundation | 12287 | 144643 | 11.77
Defense Advanced Research Projects Agency | 4712 | 80659 | 17.12
Office of Naval Research | 3080 | 48873 | 15.87
Deutsche Forschungsgemeinschaft | 2780 | 9782 | 3.52
National Aeronautics and Space Administration | 2408 | 21242 | 8.82
Engineering and Physical Science Research Council | 2007 | 16582 | 8.26
Air Force Office of Scientific Research | 1657 | 16850 | 10.17
National Sciences and Engineering Research Council of Canada | 1422 | 12050 | 8.47
Department of Energy | 1054 | 5562 | 5.28
Australian Research Council | 1010 | 5464 | 5.41
European Union Information Technologies Program | 825 | 9594 | 11.63
National Institutes of Health | 709 | 7279 | 10.27
Army Research Office | 666 | 7709 | 11.58
Netherlands Organization for Scientific Research | 646 | 2843 | 4.4
Science and Engineering Research Council | 489 | 6976 | 14.27

Educational Institutions:
Carnegie Mellon University | 640 | 10840 | 16.94
Massachusetts Institute of Technology | 500 | 10509 | 21.02
California Institute of Technology | 464 | 4170 | 8.99
Santa Fe Institute | 368 | 3387 | 9.2
French National Institute for Research in Computer Science | 321 | 3399 | 10.59
Stanford University | 314 | 3693 | 11.76
University of California at Berkeley | 306 | 10439 | 34.11
National Center for Supercomputing Applications | 261 | 4777 | 18.3
International Computer Science Institute | 180 | 2078 | 11.54
Cornell University | 180 | 1656 | 9.2
University of Illinois at Urbana-Champaign | 177 | 5304 | 29.97
USC Information Sciences Institute | 176 | 3283 | 18.65
University of California Los Angeles | 176 | 2003 | 11.38
McGill University | 152 | 3001 | 19.74
Australian National University | 123 | 549 | 4.46

Companies:
International Business Machines | 1380 | 23948 | 17.35
Intel Corporation | 962 | 14441 | 15.01
Digital Equipment Corporation | 831 | 16390 | 19.72
Hewlett-Packard | 735 | 11186 | 15.22
Sun Microsystems | 651 | 12042 | 18.5
Microsoft Corporation | 368 | 6061 | 16.47
Silicon Graphics, Inc | 279 | 3898 | 13.97
Xerox Corporation | 265 | 4309 | 16.26
Siemens Corporation | 241 | 8395 | 34.83
Bellcore | 192 | 2393 | 12.46
Nippon Electric Company | 164 | 942 | 5.74
AT&T Bell Labs | 146 | 1549 | 10.61
Apple Computer | 135 | 3159 | 23.4
Motorola | 122 | 1352 | 11.08
Texas Instruments | 92 | 1165 | 12.66

Individuals:
Olivier Danvy | 268 | 8000 | 29.85
Oded Goldreich | 259 | 4615 | 17.82
Luca Cardelli | 247 | 10846 | 43.91
Tom Mitchell | 226 | 5494 | 24.31
Martin Abadi | 222 | 9647 | 43.46
Phil Wadler | 181 | 7252 | 40.07
Moshe Vardi | 180 | 6094 | 33.86
Peter Lee | 167 | 8941 | 53.54
Avi Wigderson | 160 | 2901 | 18.13
Matthias Felleisen | 154 | 4705 | 30.55
Benjamin Pierce | 152 | 4641 | 30.53
Noga Alon | 152 | 2388 | 15.71
John Ousterhout | 152 | 6369 | 41.9
Frank Pfenning | 148 | 2049 | 13.84
Andrew Appel | 144 | 7630 | 52.99
Funding agency impact
• Based on acknowledgement indexing
• Number of acknowledgements
• Total citations
• C/A metric: #citations / #acknowledgements
Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer
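The C/A metric in the tables above is simply citations per acknowledgement, rounded to two places:

```python
def ca_metric(total_citations, num_acknowledgements):
    """Citations-per-acknowledgement metric used to compare funding
    agencies, institutions, companies, and individuals."""
    if num_acknowledgements == 0:
        return 0.0
    return round(total_citations / num_acknowledgements, 2)

# e.g. NSF: 144643 citations over 12287 acknowledgements -> 11.77
```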
Giles, PNAS, 2004
New system, AckSeer, available this spring
Funding Agency Impact
Author | Citations | Acknowledgements | C/A Metric
Olivier Danvy | 847 | 268 | 29.85
Oded Goldreich | 3277 | 259 | 17.82
Luca Cardelli | 3847 | 247 | 43.91
Tom Mitchell | 3336 | 226 | 24.31
Martin Abadi | 3507 | 222 | 43.46
Phil Wadler | 3780 | 181 | 40.07
Moshe Vardi | 3786 | 180 | 33.86
Peter Lee | 1790 | 167 | 53.54
Avi Wigderson | 2566 | 160 | 18.13
Matthias Felleisen | 1622 | 154 | 30.55
Benjamin Pierce | 1484 | 152 | 30.53
Noga Alon | 2640 | 152 | 15.71
John Ousterhout | 3693 | 152 | 41.9
Frank Pfenning | 1639 | 148 | 13.84
Andrew Appel | 2064 | 144 | 52.99
Most Acknowledged Authors and Impact Factor
Interviewed by Nature as to why he was the most acknowledged computer scientist
Who is most acknowledged?
• Mom or dad?
• Theorists or experimentalists?
Who has a better metric?
Clouding CiteSeerX • Hosting cloud CiteSeerX instances
• Economic issues • Cost of hosting • Cost of refactoring the source to be hosted in the cloud.
• Computational/technical issues • What workflow to cloudize • Component modification for efficient operation • VM size: storage, memory and CPU sizing as a function of
needs • Establishing computational needs and availability clusters • Appropriate load balancing across multiple sites. • Security of data stored including metadata and user data.
• Policy issues • Privacy of user data • Copyright issues.
Teregowda Cloud’10 USENIX’10
SeerSuite Research/Development Opportunities
• Old Seers
– Improve or revive old systems and port them into the competitive SeerX space
• eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
• New Seers
– New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
– MyCiteSeerX
• Better features
– Parsing
– Entity disambiguation
– Citation analysis
– Ranking; ranking, ranking
• New features
– New parsing, indexing, ranking
• Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
– Homepage linking
– ORE search and data integration
– Collaborative spaces
– API/web services
– Integration with DLs such as Fedora
– New clusters
• Topics, venues, affiliations
– Recommender systems
– SNA analysis
– Others
Collaborations welcomed! Data and software available
Research SeerSuite supports
• Many uses as a research testbed and support structure
– Scaling of algorithms for IR, IE, data mining, social networks, …
– NLP methods on large text collections
– ML methods to automatically extract data
– Novel indexing and ranking
– Federated search
– Collaborative and social networks
– Focused crawling for new data resources
– Interface design and integration
– Systems analysis
• Many development and applied research issues
– Integration with other DLs
– Automated feature development
– Transfer to nontechnical use
– Cloud-based delivery
Summary
• Proposes an infrastructure for academic and scientific search engine / digital library creation: SeerSuite
– Modular, scalable, extensible, robust
– Based on commercial-grade open source (Solr/Lucene); easy to use
– Easy to apply to other domains (separable indexes and projects; integration)
• Allows scalable data mining and information extraction for actual systems
– Unique information extraction plugins
– Focus on unique scalable extraction / data mining methods
• Most methods less than O(N^2) complexity
– Automatically populates databases or data structures
• Demonstrated with beta systems in
– Computer science, archaeology, chemistry, robots.txt, PubMed, YouSeer, tables, figures, maps, references, collaborations, disambiguation
– Personal features
• Systems are reasonably easy to build; the issues are
– Data collection or data access
– Information extraction, indexing, ranking
• Many uses as a research testbed
– Data sharing models
• To find a Seer, search Google or use my homepage
Opportunities
• Science is being flooded with data
– Simulations, sensors, the web
• Digital humanities is right behind
• Needs in
– Large-scale data management (tera to peta)
• NoSQL databases: graphs, documents, floating point
– Large-scale
• data mining
• information extraction
• search
• Domain expertise crucial
• Reuse, don't reinvent (much is out there)
• Solr/Lucene is great for demos, production, and research
• clgiles.ist.psu.edu • [email protected] • SourceForge.com
“Human attention is the scarce resource, not information.” Herbert A. Simon, Nobel Laureate, 1997.
For more information