1
Indexing and searching heterogeneous information
LLNL – Nov. 3, 2006
Edward A. Fox
Virginia Tech
[email protected]
http://fox.cs.vt.edu
2
Outline
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
3
Acknowledgements: Students
• Pavel Calado, William Cameron, Yuxin Chen, Fernando Das Neves, Robert France, Marcos Gonçalves, S.H. Kim, Aaron Krowne, Ming Luo, Paul Mather, Sanghee Oh, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Manas Tungare, Wensi Xi, Seungwon Yang, Xiaoyan Yu, Baoping Zhang, Qinwei Zhu, …
4
Acknowledgements: Faculty, Staff
• Lillian Cassel, Lois Delcambre, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, David Maier, Gail McMillan, Claudia Medeiros, Manuel Perez-Quinones, Jeffrey Pomerantz, Naren Ramakrishnan, Layne Watson, Barbara Wildemuth, …
5
Other Collaborators (Selected)
• Brazil: FUA, UFMG, UNICAMP
• Case Western Reserve University
• Emory, Notre Dame, Oregon State
• Germany: Univ. Oldenburg
• Mexico: UDLA (Puebla), Monterrey
• College of NJ, Hofstra, Penn State, Villanova
• University of Arizona
• University of Florida, Univ. of Illinois
• University of Virginia
6
Acknowledgements: Support
• ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579, 0535057; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601, 0435059, 0532825), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS
7
Publications – 1 of 2
• N. J. Belkin, P. Kantor, E. A. Fox and J. A. Shaw. Combining the Evidence of Multiple Query Representations for Information Retrieval. Information Processing & Management, 31(3), 431-448, May-June 1995.
• Fan, W., Luo, M., Wang, L., Xi, W., and Fox, E. A. Tuning before feedback: Combining ranking discovery and blind feedback for robust retrieval. SIGIR 2004, 27th Annual Int’l ACM SIGIR Conf. on R&D in Information Retrieval, Sheffield, England, 25-29 July
• Weiguo Fan; Gordon, M.D.; Pathak, P.; Wensi Xi; Fox, E.A.; Ranking function optimization for effective web search by genetic programming: an empirical study, in the Proceedings of 37th Hawaii International Conf. on System Sciences (HICSS), 5-8 Jan. 2004, 105 - 112
• Edward A. Fox, Fernando Das Neves, Xiaoyan Yu, Rao Shen, Seonho Kim, and Weiguo Fan. Exploring the computing literature with visualization and stepping stones & pathways. CACM 49(4): 52-58, April 2006
• Edward A. Fox and Paul Mather. Scalable Storage for Digital Libraries. Chapter 12 in Multimedia Information Retrieval and Management: Technological Fundamentals and Applications, eds. D. Feng, W.C. Siu and H.J. Zhang, Berlin: Springer, 2003, pp. 265-288
• E. Fox and J. Shaw. Combination of Multiple Searches. In Proc. of The Second Text REtrieval Conference (TREC-2) (Aug. 30 - Sept. 1, 1993, NIST, Gaithersburg, MD), NIST Special Pub. 500-215, 1994, ed. D. K. Harman, 243-252
• Marcos Andre Goncalves, Robert K. France, and Edward A. Fox, MARIAN: Flexible Interoperability for Federated Digital Libraries. In Proc. 5th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'2001, September 4-8, 2001, Darmstadt, Germany, Springer, LNCS 2163 / 2001, pp. 173-186
• Ananth Raghavan, Naga Srinivas Vemuri, Rao Shen, Marcos Andre Goncalves, Weiguo Fan, and Edward A. Fox. Incremental, Semi-automatic, Mapping-Based Integration of Heterogeneous Collections into Archaeological Digital Libraries: Megiddo Case Study. In Proc. ECDL2005, Vienna, Sept. 18-23, 2005, 139-150
8
Publications – 2 of 2
• Rao Shen, Naga Srinivas Vemuri, Weiguo Fan, Ricardo da S. Torres, and Edward A. Fox. Exploring Digital Libraries: Integrating Browsing, Searching, and Visualization. In Proc. JCDL 2006, June 11-15, 2006, Chapel Hill, NC, 1-10
• Ricardo da Silva Torres, Alexandre X. Falcao, Baoping Zhang, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Pavel Calado. A new framework to combine descriptors for content-based image retrieval. In Proc. 14th Conf. Information and Knowledge Management, CIKM 2005, 31 Oct. - 5 Nov. 2005 Bremen, Germany, 335-336
• Li Wang, Weiguo Fan, Rui Yang, Wensi Xi, Ming Luo, Ye Zhou, Edward A. Fox, Ranking Function Discovery by Genetic Programming for Robust Retrieval, Text REtrieval Conference (TREC 2003), Nov 17-23, NIST, Washington DC, 9 pages
• Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, Dong Zhuang. SimFusion: Measuring Similarity using Unified Relationship Matrix. In Proc. SIGIR 2005, 28th Annual International ACM SIGIR Conf., Salvador, Brazil, August 15-19, 2005, 130-137, http://doi.acm.org/10.1145/1076034.1076059
• W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan, W.Y. Ma, E.A. Fox. Link Fusion: A Unified Link Analysis Framework for Multi-type Inter-related Data Objects. In Proc. Thirteenth International World Wide Web Conf., WWW2004, NY, U.S.A. 19-22 May 2004, 10 pages
• Wensi Xi, Ohm Sornil, Ming Luo, and Edward A. Fox. Hybrid Partition Inverted Files: Experimental Validation. In "Research and Advanced Technology for Digital Libraries, 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002, Proceedings", eds. Maristella Agosti and Constantino Thanos, LNCS 2458, Springer, pp. 422-431.
• Wensi Xi, Ohm Sornil, and Edward A. Fox. Hybrid Partition Inverted Files for Large-Scale Digital Libraries. Proc. Digital Library: IT Opportunities and Challenges in the New Millennium, July 9-11, 2002, Beijing Library Press, Beijing, China, 404-418
• Baoping Zhang, Yuxin Chen, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Marco Cristo, Pavel Calado. Intelligent GP Fusion from Multiple Sources for Text Classification. In Proc. 14th Conf. on Information and Knowledge Management, CIKM 2005, 31st October - 5 Nov 2005 Bremen, Germany, 477-484
10
Problem Characterization
• Distributed (space)
• Content (streams)
• Indexing (space, structure)
– Features
– Type/sub-type: Image, texture; link, citation
– Descriptors: words or phrases or concepts
– High dimensionality
• Searching (scenario)
11
Efficiency / Effectiveness
• Effectiveness
– Very common measures: Precision, Recall, F1, 10-precision, R-Precision
– Usefulness, usability, task support, …
• Efficiency
– Time
– Space
– Performance, Resource use, …
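The effectiveness measures named above can be illustrated with a minimal sketch. It assumes that "10-precision" means precision over the top 10 ranked documents; the function names and the sample ranking are hypothetical and are not taken from the talk.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant)) if relevant else 0.0


# Hypothetical example: seven ranked documents, four of them relevant.
ranking = ["A", "B", "C", "D", "E", "F", "G"]
relevant = {"A", "D", "F", "G"}

# Treat the top four documents as the retrieved set.
p, r, f1 = precision_recall_f1(ranking[:4], relevant)
rp = r_precision(ranking, relevant)
```

Precision, recall, and F1 ignore rank order within the retrieved set, while precision at k and R-Precision depend on where the relevant documents fall in the ranking.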
12
13
14
15
CC2001 Information Management Areas
IM1. Information models and systems*
IM2. Database systems*
IM3. Data modeling*
IM4. Relational DBs
IM5. Database query languages
IM6. Relational DB design
IM7. Transaction processing
IM8. Distributed DBs
IM9. Physical DB design
IM10. Data mining
IM11. Information storage and retrieval
IM12. Hypertext and hypermedia
IM13. Multimedia information & systems
IM14. Digital libraries
* Core components
16
DL Curriculum Framework
[Figure: curriculum framework chart with three bands.]
• Course structure
– Semester 1: DL collections: development/creation
– Semester 2: DL services and sustainability
• Core DL topics
– Digitization, Storage, Interchange
– Digital objects, Composites, Packages
– Metadata, Cataloging, Author submission
– Naming, Repositories, Archives
– Spaces (conceptual, geographic, 2/3D, VR)
– Architectures (agents, buses, wrappers/mediators), Interoperability
– Services (searching, linking, browsing, etc.)
– Intellectual property rights mgmt., Privacy, Protection (watermarking)
– Archiving and preservation, Integrity
• Related topics
– Documents, E-publishing, Markup
– Info. Needs, Relevance, Evaluation, Effectiveness
– Thesauri, Ontologies, Classification, Categorization
– Bibliographic information, Bibliometrics, Citations
– Routing, Filtering, Community filtering
– Search & search strategy, Info seeking behavior, User modeling, Feedback
– Info summarization, Visualization
– Multimedia streams/structures, Capture/representation, Compression/coding
– Content-based analysis, Multimedia indexing
– Multimedia presentation, rendering
17
Digital Library Content
[Figure: content types.]
• Text Documents: Articles, Reports, Books
• Audio: Speech, Music
• Video
• Images and Graphics: (Aerial) Photos; 2D, 3D, VR, CAT
• Geographic Information
• Models, Simulations
• Software, Programs
• Bio Information: Genome (Human, animal, plant)
19
Personalizing A Course Website Using the NSDL
William Cameron (2), Boots Cassel (2), Edward Fox (1), Manuel Perez-Quinones (1), Manas Tungare (1), Xiaoyan Yu (1)
(1) Virginia Tech, (2) Villanova
20
Syllabus Collection … Towards an intelligent educational system
[Figure: system diagram. Components: Crawler, Syllabus Classifier, Extractor, Editor, Searcher, Recommender, Resource Classifier, Publisher, Services. Data: Potential Syllabus Text, Unstructured Syllabus Text, Structured Syllabus Text, Syllabus Ontology, Classification Scheme, Other NSDL Resources.]
21
Search
• With collection, we have a full text search
• Results point to local copy in our collection as well as to original document
• Try it out: http://doc.cs.vt.edu/search/
22
Syllabus Ontology
• Standard, machine understandable
• Ontology Editor: Protégé
• Syllabus Schema: SylVia
• http://doc.cs.vt.edu/ontologies/
23
Creating new syllabus
• Web-based application to support entry of syllabi into collection
• Moodle Plug-in in the works
• Uses CC 2001 to select topics for a course
24
Information Extraction
• Plans to automatically extract information from syllabi documents collected
• Rule-based Approach
• Statistics-based Approach
• Apply the best extractor on the unstructured syllabi
25
Superimposed Tools for VT
Uma Murthy and Edward A. Fox
Department of Computer Science, Virginia Tech
18 October 2006
26
Origin of SI
• This basic need has been addressed in diverse ways, with varying degrees of success, for many years:
– concordances, annotations, comments
– bookmarks, concept maps, digital annotations, …
• The term “SI” (superimposed information) was coined in 1999 by researchers now at Portland State University, who are currently collaborating with us:
– Lois Delcambre
– David Maier
27
Layers in an SI system
[Figure: a superimposed layer sits above a base layer made up of information sources 1 through n; marks connect the superimposed layer to the base layer.]
* Source: ICDE04 presentation by Murthy et al.
28
Annotating an image
29
Searching over annotations
30
Searching over images/sub-images
31
Summary
[Figure: a superimposed layer sits above a base layer made up of information sources 1 through n; marks connect the superimposed layer to the base layer.]
* Source: ICDE04 presentation by Murthy et al.
33
Informal 5S & DL Definitions
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)
34
5Ss
Ss | Examples | Objectives
Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
Structures | Collection; catalog; hypertext; document; metadata | Specifies organizational aspects of the DL content
Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components
Scenarios | Searching, browsing, recommending | Details the behavior of DL services
Societies | Service managers, learners, teachers, etc. | Defines managers, responsible for running DL services; actors, that use those services; and relationships among them
35
Taxonomy of DL Services
• Infrastructure Services
– Repository-Building
• Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
• Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
– Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
• Information Satisfaction Services: Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
36
5S and DL formal definitions and compositions (April 2004 TOIS)
[Figure: dependency graph of the formal definitions, labeled with definition numbers as shown on the slide.]
• Mathematical foundations: relation (d.1), function (d.2), sequence (d.3), tuple (d.4), language (d.5), graph (d.6), grammar (d.7); measurable (d.12), measure (d.13), probability (d.14), vector (d.15), topological (d.16) spaces; event (d.10), state (d.18); transmission (d.23)
• 5S: streams (d.9), structures (d.10), spaces (d.18), scenarios (d.21), societies (d.24); services (d.22)
• DL constructs: structural metadata specification (d.25), descriptive metadata specification (d.26), structured stream (d.29), digital object (d.30), collection (d.31), metadata catalog (d.32), repository (d.33), indexing service (d.34), searching service (d.35), hypertext (d.36), browsing service (d.37)
• Result: digital library (minimal) (d.38)
37
A Minimal DL in the 5S Framework
[Figure: Streams, Structures, Spaces, Scenarios, and Societies at the base; above them Structured Stream, Structural Metadata Specification, Descriptive Metadata Specification, hypertext, and services (indexing, browsing, searching); then Digital Object, Metadata Catalog, Collection, and Repository; with Minimal DL at the top.]
38
39
ETANA-DL
• Archaeological DL
• Integrated DL
– Heterogeneous data handling
• Applies and extends the OAI-PMH
– Open Archives Initiative Protocol for Metadata Harvesting
• Design considerations
– Componentized
– Extensible
– Portable
40
41
Heterogeneous data handling
Site | Artifact type | Original data source | Number of records harvested
Bab edh-Dhra’ | Pottery | cp6 database file | 786
Lahav | Figurine | Tab-delimited text file | 563
Madaba | Locus field record | Tables in Access DB | 786
Mozan | Publication | PDF files | 19
Nimrin | Bone field record | Table in Oracle DB | 7419
Nimrin | Seed field record | Table in Oracle DB | 429
Nimrin | Locus field record | Table in Oracle DB | 2101
Umayri | Bone field record | 2 tables in Access DB | 2122
Total | | | 18404
42
ETANA Spaces
1. Geographic distribution of found artifacts
2. Temporal dimension (as inferred by archaeologists)
3. Metric or vector spaces
   1. used to support retrieval operations, and to calculate distance (and similarity)
   2. used to browse / constrain searches spatially
4. 3D models of the past, used to reconstruct and visualize archaeological ruins
5. 2D interfaces for human-computer interaction
43
ETANA Structures
1. Site Organization
   1. Region, site, partition, sub-partition, locus, …
2. Temporal orderings (ages, periods)
3. Taxonomies
   1. for bones, seeds, building materials, …
4. Stratigraphic relationships
   1. above, beneath, coexistent
44
ETANA Streams
1. successive photos and drawings of excavation sites, loci, unearthed artifacts
2. audio and video recordings of excavation activities and discussions
3. textual reports
4. 3D models used to reconstruct and visualize archaeological ruins.
45
Degree of Structure
Chaotic Organized Structured
Web DLs DBs
46
Digital Objects (DOs)
• Born digital
• Digitized version of “real” object
– Is the DO version the same, better, or worse?
– Decision for ETDs: structured + rendered
• Surrogate for “real” object
– Not covered explicitly in metamodel for a minimal DL
– Crucial in metamodel for archaeology DL
47
Metadata Objects (MDOs)
• MARC
• Dublin Core
• RDF
• IMS
• OAI (Open Archives Initiative)
• Crosswalks, mappings
• Ontologies
• Topic maps, concept maps
48
Also Important: Epub, SGML, XML
• 5S perspective: streams, structures, scenarios
• Authoring
• Rendering, presenting
• Tagging, Markup, DOM
• Semi-structured information
• Dual-publishing, eBooks
• Styles (XSL, XSLT)
• Structured queries
49
XML-based DL Log Standard
• Log analysis
– Is a source of information on:
  • How patrons really use DL services
  • How systems behave while supporting user information seeking activities
– Used to:
  • Evaluate and enhance services
  • Guide allocation of resources
• Common practice in the web setting
– Supported by web servers, proxy caches
• DL Logging can be more detailed
50
DL Logging Features
• Captures high level user and system behaviors
• Organized according to the 5S framework
– Hierarchical organization (XML-based)
– Centered on the notions of events
• Record only events related to initial user inputs and final system outputs
• Help to understand user interactions and the perceived value of responses
51
The XML Log Format
[Figure: tree of XML log elements rooted at Log. Element names shown: Transaction, SessionId, MachineInfo, Statement, Timestamp, SessionInfo, RegisterInfo, Event, Action, StatusInfo, Search, Browse, Update, StoreSysInfo, SearchBy, QueryString, Catalog, Collection, PresentationInfo, Timeout.]
53
Hybrid Partitioned Inverted Indices for Large-Scale Digital Libraries
Ohm Sornil
The National Institute for Development Administration (NIDA)
Bangkok, Thailand
Edward A. Fox
Department of Computer Science
Virginia Tech, USA
54
Inverted Index
Document 1 = Information retrieval is searching and indexing
Document 2 = Indexing is building an index
Document 3 = An inverted file is an index
Document 4 = Building an inverted file is indexing
Vocabulary | Inverted List (document; position)
an | (2;4), (3;1), (3;5), (4;2)
and | (1;5)
building | (2;3), (4;1)
file | (3;3), (4;4)
index | (2;5), (3;6)
indexing | (1;6), (2;1), (4;6)
information | (1;1)
inverted | (3;2), (4;3)
is | (1;3), (2;2), (3;4), (4;5)
retrieval | (1;2)
searching | (1;4)
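As a minimal sketch, the positional inverted index above can be built as follows. The function name is hypothetical, and lowercasing plus whitespace tokenization are assumptions that happen to suffice for these four documents.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (document id, word position) postings.

    Positions are 1-based, matching the slide's (document; position) pairs.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id].lower().split(), start=1):
            index[word].append((doc_id, position))
    return dict(index)


docs = {
    1: "Information retrieval is searching and indexing",
    2: "Indexing is building an index",
    3: "An inverted file is an index",
    4: "Building an inverted file is indexing",
}
index = build_inverted_index(docs)
```

Because documents are visited in id order and words in position order, each posting list comes out already sorted, which is the usual layout for merging lists at query time.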
55
Inverted Index Partitioning
• The Inverted Index Partitioning Problem is NP-complete
TWO PREVIOUSLY PROPOSED SCHEMES
– Document Partitioning
  • Postings are stored at the same node as are their documents
  • Aggressively balance the load
– Term Partitioning
  • Every posting of a term is stored in one node
  • Normally no attempt to balance the load
56
Hybrid Partitioning Scheme
\[ C = \frac{1}{N} \sum_{i=1}^{V} q(i)\,\frac{Z(i)}{c} \]
C: Average number of chunks required for a node to retrieve an average term
c: Chunk size
q(i): Query selection distribution
Z(i): Term-frequency distribution
N: Number of nodes in the system
• Attempts to balance the load
• Groups postings into chunks
Chunk Size Selection Scheme
– Suggests a reasonable chunk size for a particular operating condition
– Based on the cost of processing a batch of queries
57
Inverted Index Partitioning
Given
• 4 Disks
• Collection (4 docs)
d1: <a, b, a, c, b> d2: <a, d, e, a> d3: <b, c, a, b> d4: <b>
Term Partitioning
Node1: a = (d1;1), (d1;3), (d2;1), (d2;4), (d3;3)
Node2: b = (d1;2), (d1;5), (d3;1), (d3;4), (d4;1)
Node3: c = (d1;4), (d3;2)
Node4: d = (d2;2) e = (d2;3)
Document Partitioning
Node1: a = (d1;1), (d1;3) b = (d1;2), (d1;5) c = (d1;4)
Node2: a = (d2;1), (d2;4) d = (d2;2) e = (d2;3)
Node3: a = (d3;3) b = (d3;1), (d3;4) c = (d3;2)
Node4: b = (d4;1)
Hybrid Partitioning
• Assume: Chunk Size = 4 postings
Node1: a = (d1;1), (d1;3), (d2;1), (d2;4)
Node2: b = (d1;2), (d1;5), (d3;1), (d3;4)
Node3: a = (d3;3)
c = (d1;4), (d3;2)
Node4: b = (d4;1) d = (d2;2) e = (d2;3)
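The three schemes can be compared on the same four-document collection with a short sketch. The allocation policy used here (largest chunk first, placed on the least-loaded node) is an assumption for illustration only, not the allocation procedure from the talk, so individual chunks may land on different nodes than on the slide. All function names are hypothetical.

```python
from collections import defaultdict


def postings_of(docs):
    """Return {term: [(doc_id, position), ...]} with 1-based positions."""
    postings = defaultdict(list)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms, start=1):
            postings[term].append((doc_id, position))
    return dict(postings)


def document_partition(docs, num_nodes):
    """Each node indexes only its own documents (round-robin assignment)."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for i, (doc_id, terms) in enumerate(docs.items()):
        for position, term in enumerate(terms, start=1):
            nodes[i % num_nodes][term].append((doc_id, position))
    return [dict(node) for node in nodes]


def chunk_partition(postings, num_nodes, chunk_size=None):
    """Split posting lists into chunks and place each on the least-loaded node.

    chunk_size=None keeps every list whole, which gives term partitioning;
    a finite chunk_size gives a hybrid scheme. The greedy placement is an
    assumed policy, chosen only to keep the sketch short.
    """
    chunks = []
    for term, plist in postings.items():
        size = chunk_size or len(plist)
        for start in range(0, len(plist), size):
            chunks.append((term, plist[start:start + size]))
    chunks.sort(key=lambda chunk: -len(chunk[1]))  # stable: ties keep input order
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    loads = [0] * num_nodes
    for term, chunk in chunks:
        target = loads.index(min(loads))
        nodes[target][term].extend(chunk)
        loads[target] += len(chunk)
    return [dict(node) for node in nodes], loads


docs = {
    "d1": ["a", "b", "a", "c", "b"],
    "d2": ["a", "d", "e", "a"],
    "d3": ["b", "c", "a", "b"],
    "d4": ["b"],
}
postings = postings_of(docs)
doc_nodes = document_partition(docs, 4)
term_nodes, term_loads = chunk_partition(postings, 4)
hybrid_nodes, hybrid_loads = chunk_partition(postings, 4, chunk_size=4)
```

With a chunk size of 4 postings, the long lists for `a` and `b` are split, so the per-node posting counts are more even than when each term's list is kept whole.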
58
Conclusion
• Performance Measures:
– Hybrid > Term > Document
• Hybrid partitioning scheme performs better than the other two schemes in a variety of conditions
– Large collection
– Multiprogramming level
– Query skew
– System scaling
59
Conclusion (cont.)
• Observations from the results
– Node Utilization (best is middle range)
• Results: Document > Hybrid > Term
– Load Fluctuation
• Results: Term > Hybrid > Document
61
Tuning Before Feedback: Combining Ranking Discovery and Blind Feedback for Robust Retrieval
• Ranking function plays an important role in IR performance
• Blind feedback (pseudo-relevance feedback) was found very useful for ad hoc retrieval
• Why not combine ranking function optimization with blind feedback to improve the robustness?
62
Blind Feedback
• Automatically adds more terms to a user’s query to enhance the performance of search engines, by assuming that the top-ranked docs are relevant
• Some examples
– Rocchio (performs better in our exp.)
– Dec-Hi
– Kullback-Leibler Divergence (KLD)
– Chi-Square
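A minimal sketch of Rocchio-style blind feedback follows, assuming queries and documents are term-weight dictionaries. The parameter values (`alpha`, `beta`, `top_k`, `max_terms`) and the function name are illustrative assumptions, not settings reported in the talk.

```python
from collections import Counter


def rocchio_blind_feedback(query, ranked_docs, top_k=2, alpha=1.0, beta=0.75,
                           max_terms=5):
    """Expand a query using the top-k ranked documents as if they were relevant.

    new weight = alpha * original query weight
               + beta * average weight of the term in the top-k documents
    Only the max_terms strongest new terms are added to the query.
    """
    expanded = Counter({term: alpha * weight for term, weight in query.items()})
    top_docs = ranked_docs[:top_k]
    for doc in top_docs:
        for term, weight in doc.items():
            expanded[term] += beta * weight / len(top_docs)
    # Rank candidate expansion terms by weight, breaking ties alphabetically.
    new_terms = sorted((t for t in expanded if t not in query),
                       key=lambda t: (-expanded[t], t))[:max_terms]
    keep = set(query) | set(new_terms)
    return {term: expanded[term] for term in keep}


# Hypothetical first-pass ranking for the query "index".
query = {"index": 1.0}
ranked_docs = [
    {"index": 2.0, "inverted": 1.0},
    {"index": 1.0, "file": 1.0},
    {"music": 5.0},  # ranked third, so ignored when top_k=2
]
expanded_query = rocchio_blind_feedback(query, ranked_docs, top_k=2, max_terms=1)
```

No negative (non-relevant) component is used here, since blind feedback has no judged non-relevant documents to subtract.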
63
RF Discovery Problem
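[Figure: feedback training data is the input to ranking function discovery, whose output is a ranking function f. Two rankings of documents A–G are shown with their relevance judgments (Rele.).]
Order | Doc. | Rele.
1 | A | 1
2 | D | 1
3 | F | 1
4 | G | 1
5 | B | 0
6 | C | 0
7 | E | 0

Order | Doc. | Rele.
1 | A | 1
2 | B | 0
3 | C | 0
4 | D | 1
5 | E | 0
6 | F | 1
7 | G | 1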
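To make the difference between the two rankings concrete, the sketch below scores each with average precision. Using average precision as the quality (fitness) measure is an assumption made for illustration; the slide does not say which measure is used.

```python
def average_precision(relevance_flags):
    """Average precision of one ranking given 0/1 relevance flags in rank order.

    Assumes every relevant document appears somewhere in the ranking, so the
    number of hits equals the number of relevant documents.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Relevance flags read off the two rankings of documents A-G.
all_relevant_first = [1, 1, 1, 1, 0, 0, 0]   # order A, D, F, G, B, C, E
interleaved = [1, 0, 0, 1, 0, 1, 1]          # order A, B, C, D, E, F, G

ap_first = average_precision(all_relevant_first)
ap_interleaved = average_precision(interleaved)
```

A ranking function that places all relevant documents first reaches the maximum score of 1.0, so a discovery procedure driven by such a measure would prefer it over the interleaved ranking.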
64
Ranking Function Optimization
• Ranking Function Tuning is an art! – Paul Kantor
• Why not adaptively discover RF by Genetic Programming?
– Huge search space
– Discrete objective function
– Modeling advantage
• What is GP?
– Problem solving systems designed based on principles of evolution and heredity. Widely used for structure discovery, functional form discovery, other data mining and optimization tasks
65
An Example of GP-based RF
(log (+ (* df (log (log (* (* (/ n df) (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col)))) (* (/ (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) tf_avg_Col)) (log tf_avg_Col)))))) (+ (* (* df_max_Col tf) (/ (* (* (/ (/ (* tf 6.720) (/ df N)) (* df_max_Col tf)) (* (* tf N) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) (* (* (/ tf tf_max) (+ (* length df) (* 2.812 1))) tf_avg)))) (+ (/ df tf_avg) tf))))
Symbol | Meaning
tf | Query term frequency in the document (vector)
tf_query | Query term frequency in the query (vector)
tf_max | The maximum term frequency in a document (scalar)
Length | Document length in the number of words (scalar)
Length_avg | Average document length in the number of words (scalar)
N | Number of documents in the collection (scalar)
tf_avg | Average term frequency in the current document (scalar)
tf_avg_Col | Average term frequency for all the documents in the collection (scalar)
df_max_Col | Maximum document frequency for a word in the collection (scalar)
df | Document frequency for the query words (vector)
66
The ARRANGER Engine
1. Split the training data into training and validation sets
2. Generate an initial population of random “ranking functions”
3. Evaluate the fitness of each “ranking function” in the population and record the 10 best ones
4. If the stopping criterion is not met, generate the next generation of the population by genetic transformation, and go to Step 3
5. Validate the recorded best “ranking functions” and select the best one as the RF
[Figure: flowchart with the boxes Start, Initialize Population, Evaluate Fitness, Stop?, Apply Crossover, Validate and Output, End, alongside a ranked list of documents A–G with relevance judgments and a population of individuals with fitness values.]
67
An Integrated Model
[Figure: multiple user queries with relevance information feed Ranking Tuning, which produces a new ranking function; user queries are then processed with the new ranking function and Blind Feedback to produce new search results.]
69
Text + CBIR + Metadata + GIS
• Combined retrieval across multiple types of information
• Ex.: bio-diversity information systems
• Architecture, approach, prototype, validation
• Novel aspects:
– Learn set of descriptors for a collection
– Application to fisheries, archaeology
70
Textual information retrieval
Query on Google using Sunset and Rio de Janeiro
Query result
71
Content-Based Information Retrieval
72
Motivation for Integration
• Query 1:
– List all metadata information related to fishes which have been observed at Mississippi River.
• Query 2:
– Retrieve fish images which contain a shape similar to this example
• Query 3:
– List all metadata information related to fishes which both have been observed at Mississippi River and contain a shape similar to a given example.
73
Longer Integrated Query
• Retrieve fish descriptions of all fish whose shape is similar to that shown in Figure below, which belong to genus “Notropis”, which have “large eyes” and “dorsal stripe”, and have been observed within the catchments of the “Tennessee” river
74
System’s Architecture
[Figure: architecture diagram. Components shown: Interface; Mediator; Data Insertion Module; Query Processing Module; GIS; DBMS; Databases (Geo. DB, Metadata, Image DB).]
75
[Figure: detailed architecture. An Interface (Query Specification, Visualization) talks to a Query Mediator (Analysis, Execution, Merging), which sends HTTP requests to three search components:]
• Content-Based Image Search Component (CBISC): ListDescriptors and GetImages requests; Image Collection with image metadata, image descriptors, and images; BIS Manager
• Metadata-Based Search Component (ESSEX): keywords requests; OAI; EcoCollection with metadata and taxonomic trees
• Geographic Data Search Component (GDSC): GetCapabilities, GetFeatureType, and GetFeature requests to a Web Feature Server (WFS); GeoCollection with metadata and maps
76
Feature Extraction Model
[Figure: an image, with its R, G, and B color channels labeled, is mapped to a feature vector, e.g. [0.98, 0.91, 0.73, ……].]
77
CBISC Architecture
78
CBISC Configuration Tool
80
Meta-Search
• Contexts:
– Web: search engine built atop others
– Federated search: bring together results from distributed partial content sites
• Approach
– Send query out to multiple sites
– Merge results from sites
– Combine those results for ranking as follows:
81
Combination
• For a document, combine the sim values from each system involved:
• CombMIN
• CombMAX
• CombSUM
• CombMNZ = CombSUM * no. systems with non-zero similarities
• CombMNZ is often best; otherwise CombSUM
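The combination rules listed above can be sketched in a few lines. The sketch assumes each system returns a dictionary of document similarities and that a document missing from a system's results has similarity 0 there; the function name and the sample scores are hypothetical.

```python
def combine(scores_by_system, method="CombMNZ"):
    """Fuse per-system similarity scores into one score per document."""
    docs = set().union(*[scores.keys() for scores in scores_by_system])
    fused = {}
    for doc in docs:
        # Assumption: a document a system did not return has similarity 0.
        sims = [scores.get(doc, 0.0) for scores in scores_by_system]
        nonzero = sum(1 for s in sims if s != 0)
        if method == "CombMIN":
            fused[doc] = min(sims)
        elif method == "CombMAX":
            fused[doc] = max(sims)
        elif method == "CombSUM":
            fused[doc] = sum(sims)
        elif method == "CombMNZ":
            fused[doc] = sum(sims) * nonzero
        else:
            raise ValueError(f"unknown method: {method}")
    return fused


# Hypothetical similarities from two search systems.
system_a = {"d1": 0.75, "d2": 0.25}
system_b = {"d1": 0.5, "d3": 0.5}

comb_sum = combine([system_a, system_b], "CombSUM")
comb_mnz = combine([system_a, system_b], "CombMNZ")
ranking = sorted(comb_mnz, key=lambda d: (-comb_mnz[d], d))
```

CombMNZ rewards documents that several systems agree on: `d1` is returned by both systems, so its summed score is doubled, while `d2` and `d3` keep their single-system scores.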
82
DL Integration
• What is “DL Integration”
– Hide distribution
– Hide heterogeneity
– Enable autonomy of individual component
• Why Integration
– island-DLs
– inability to seamlessly and transparently access knowledge across DLs
Utilize various autonomous DLs in concert
83
Introduction and Problem Description
[Figure: five separate DLs (DL1 to DL5) shown on their own, and the same five DLs brought together in a UnionDL.]
84
Related Work on Integrating Services in DLs
[Figure: concept map of related work. Two strands: integrating searching and browsing, and integrating searching and browsing with other services (which includes clustering and visualization). Systems are grouped by decade (1980s, 1990s, 2000s). Systems named: Stepping Stones & Pathways, EtanaViz, CitiViz, I3R, RABBIT, CODER, DataWeb, PESTO, SenseMaker, MIX, ScentTrails, BBQ, ODL, MARIAN.]
85
Reconceptualization of Related Work on Semantic Interoperability
[Figure: concept map centered on semantic interoperability in DLs. Nodes: intermediary-based, mapping-based, mediator, wrapper, ontology, schema mapping, federation, union archiving, proactive standardization, reactive interpretation, and the example systems CITIDEL, Dienst, Fedora, NDLTD, Infobus, … Link labels: consists of, use, used in, interrelated with, achieved by, automatically generate.]
Key: Blue indicates focus of our work.
86
Formal Definition of DL Integration
• DL_i = (R_i, DM_i, Serv_i, Soc_i), 1 ≤ i ≤ n
– R_i is a network accessible repository
– DM_i is a set of metadata catalogs for all collections
– Serv_i is a set of services
– Soc_i is a society
• UnionRep
• UnionCat
• UnionServices
• UnionSociety
87
Formal Definition of DL Integration (Cont.)
• DL integration problem definition:
Given n individual libraries, integrate the n DLs to create a UnionDL.
88
Taxonomy of Union Services
Infrastructure Services Information Satisfaction Services
Essential Add_value Essential Add_value
indexing
harvesting
mapping
(Schema registry with analyses & mapping)
(data) cleaning
(focused) crawling
copying (replicating)
logging
(format) translating
(Service to support annotation)
(Metadata validation)
searching
browsing
access control
binding
comparison
(forum) discussion
(query) expansion
filtering
recommendation
visualization
Note: Suggested NSDL services are shown in blue.
89
Union Catalog Integration
[Figure: Union ArchDL. Virtual Nimrin (VN) and Halif DigMaster (HD) each have their own catalog and metadata format. For each, a mapping tool and a wrapper relate the local metadata format to the global metadata format, and the results feed the Union Catalog.]
90
Data Mapping (state-of-the-art)
91
local schema global schema
92
Mapping recommendation
93
Mapping confirmation
Mapping history
95
Link Fusion
A Unified Link Analysis for Multi-Type Interrelated Data Objects
Wensi Xi (1), Benyu Zhang (2), Zheng Chen (2), Yizhou Lu (3), Shuicheng Yan (3), Wei-Ying Ma (2), Edward A. Fox (1)
(1) Virginia Tech, (2) Microsoft Research Asia, (3) Peking University
96
Traditional ways of representing relationships
• Space: Vector and probability spaces (e.g., Salton and Wang’s works)
• Database: Sets of attributes represent relationships and are used to design databases (e.g., Fuhr and Frieder’s works)
• Networks: Belief, inference, and spreading activation networks (e.g., Turtle, Ribeiro-Neto, and Acid’s works)
Problems:
• Not easily used to combine multiple types of data objects and relationships
• Need to find representations that are closer to reality and are more dynamic
97
Example: Collaborative Filtering, Recommender System
[Figure: User objects linked by human relationships; Web page objects linked by hyperlink/content similarity; Browse links connect users to Web pages.]
• Inter-type relationship: Browse.
• Intra-type relationships: Hyperlink, Human relationships.
98
Example: Query Expansion, Web Clustering
[Figure: Query objects linked by content similarity; Web page objects linked by hyperlink/content similarity; Reference links connect queries to Web pages.]
• Inter-type relationship: Reference.
• Intra-type relationships: Hyperlink, Content Similarity.
99
Attribute Reinforcement Assumption
The specific attribute of a data object in one data type can be reinforced both by the attributes of related data objects in the same data space and by the attributes of related data objects from other data spaces.
[Figure legend: data space, data object, intra-type relationship, inter-type relationship.]
100
Link Fusion algorithm (1)
• Consider two types of objects, $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$. Their intra-type relationships are $R_x$ and $R_y$; their inter-type relationships are $R_{xy}$ and $R_{yx}$.
• Adjacency matrices $L_x$, $L_y$, $L_{xy}$, and $L_{yx}$ represent the relationships $R_x$, $R_y$, $R_{xy}$, and $R_{yx}$, respectively.
• Suppose $w_x$ is the attribute vector of objects in $X$ and $w_y$ is the attribute vector of objects in $Y$.
• $w_x$ is reinforced by both the intra- and inter-type relationships from $X$ and $Y$, and so is $w_y$. The Link Fusion algorithm can be represented as:
$$w_y = L_y^T w_y + L_{xy}^T w_x$$
$$w_x = L_x^T w_x + L_{yx}^T w_y$$
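The coupled two-type updates on this slide can be sketched as a fixed-point iteration. The joint L1 normalisation after each step, the uniform starting vectors, and the toy matrices are assumptions added here so the values stay bounded; the slide itself does not specify them, and convergence is not guaranteed for every input.

```python
def mat_t_vec(L, w):
    """Return L^T w for a matrix stored as a list of rows."""
    rows, cols = len(L), len(L[0])
    return [sum(L[i][j] * w[i] for i in range(rows)) for j in range(cols)]


def link_fusion(Lx, Ly, Lxy, Lyx, iterations=50):
    """Iterate w_x = Lx^T w_x + Lyx^T w_y and w_y = Ly^T w_y + Lxy^T w_x."""
    wx = [1.0 / len(Lx)] * len(Lx)
    wy = [1.0 / len(Ly)] * len(Ly)
    for _ in range(iterations):
        new_wx = [a + b for a, b in zip(mat_t_vec(Lx, wx), mat_t_vec(Lyx, wy))]
        new_wy = [a + b for a, b in zip(mat_t_vec(Ly, wy), mat_t_vec(Lxy, wx))]
        total = sum(new_wx) + sum(new_wy)
        if total == 0:
            break
        # Assumed normalisation: rescale so all scores sum to 1.
        wx = [v / total for v in new_wx]
        wy = [v / total for v in new_wy]
    return wx, wy


# Toy data (hypothetical): two users (X) and two web pages (Y).
Lx = [[0, 1], [1, 0]]     # users know each other
Ly = [[0, 1], [0, 0]]     # page 0 links to page 1
Lxy = [[1, 1], [0, 1]]    # user 0 browses both pages, user 1 browses page 1
Lyx = [[0, 0], [0, 0]]    # no page-to-user links
wx, wy = link_fusion(Lx, Ly, Lxy, Lyx)
```

In this toy example page 1 ends with the higher score, because it receives both a hyperlink and more browse links than page 0.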
101
Link Fusion algorithm (2)
• In a more generalized scenario, suppose there are n data types. The importance attribute of one type of object, M, can be reinforced by both intra-type links and inter-type links from every other type N as:
$$w_M = L_M^T w_M + \sum_{N \neq M} L_{NM}^T w_N$$
• It can also be written in matrix form as $w = A^T w$. In the matrix, $\alpha$ and $\beta$ are weights for different attributes:
$$A = \begin{pmatrix} \alpha_1 L'_1 & \beta_{12} L'_{12} & \cdots & \beta_{1n} L'_{1n} \\ \beta_{21} L'_{21} & \alpha_2 L'_2 & \cdots & \beta_{2n} L'_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{n1} L'_{n1} & \beta_{n2} L'_{n2} & \cdots & \alpha_n L'_n \end{pmatrix}$$
• Iterative calculation yields the principal eigenvector of A, which can be interpreted as the value of the data objects with respect to a specific attribute.
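A sketch of assembling the weighted block matrix A and running power iteration on w = A^T w. The helper names, the dictionary-of-blocks layout, the weights, and the L1 normalisation are illustrative assumptions; the blocks are used exactly as passed in, with no extra preprocessing.

```python
def assemble(blocks, sizes, alpha, beta):
    """Build A from blocks[(i, j)]; diagonal blocks get alpha[i], others beta[(i, j)]."""
    offsets = [sum(sizes[:k]) for k in range(len(sizes))]
    total = sum(sizes)
    A = [[0.0] * total for _ in range(total)]
    for (i, j), block in blocks.items():
        weight = alpha[i] if i == j else beta[(i, j)]
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                A[offsets[i] + r][offsets[j] + c] = weight * value
    return A


def principal_vector(A, iterations=100):
    """Power iteration on w = A^T w with L1 normalisation at each step."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(iterations):
        new_w = [sum(A[i][j] * w[i] for i in range(n)) for j in range(n)]
        norm = sum(abs(v) for v in new_w)
        if norm == 0:
            break
        w = [v / norm for v in new_w]
    return w


# Hypothetical example: type 0 has two objects, type 1 has one object.
blocks = {
    (0, 0): [[0, 1], [1, 0]],   # intra-type links inside type 0
    (0, 1): [[1], [1]],         # links from type 0 to type 1
    (1, 0): [[1, 1]],           # links from type 1 to type 0
}
A = assemble(blocks, sizes=[2, 1], alpha=[0.5, 0.5], beta={(0, 1): 0.5, (1, 0): 0.5})
w = principal_vector(A)
```

Missing blocks, such as the intra-type block of type 1 here, are simply left as zeros.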
102
Link Fusion algorithm (3)
• If we consider webpages as a homogeneous data space connected via intra-type relationships (hyperlinks), Link Fusion reduces to the PageRank algorithm.
• If we consider the Hub and Authority attributes of webpages as two different types of objects connected via inter-type relationships (hyperlinks), Link Fusion reduces to the HITS algorithm.
• Thus, Link Fusion can be considered an extension of traditional link analysis algorithms.
103
SimFusion: Measuring Similarity Using the Unified Relationship Matrix
Wensi Xi¹, Edward Fox¹, Weiguo Fan¹, Benyu Zhang², Zheng Chen², Jun Yan³, Dong Zhuang⁴
¹Department of Computer Science, Virginia Tech
²Microsoft Research Asia
³School of Mathematical Science, Peking University
⁴Department of Computer Science, Beijing Institute of Technology
SIGIR 2005, Salvador, Brazil, August 15-19, 2005
104
Motivation
• To achieve desired improvements with advanced information systems, we need to combine and integrate information from a variety of sources.
• Entities from different domains can be considered as objects containing information:
– Web pages or scientific papers
– Queries
– Users
• Information contained by objects may include:
– Contents: papers, web pages
– Attributes: popularity, authority
– Relationships: reference, hyperlink, similarity
105
Research Statement
Problem:
“How can the broad variety of heterogeneous data and relationships be effectively and efficiently integrated to improve the performance of various information retrieval related tasks?”
Solution:
Use matrices to represent multi-relationships, and use matrix calculations to integrate them (so as to improve searching and clustering).
106
Example
[Figure: block relationship matrices over Users (U), Queries (Q), and Documents (D), with x marking populated entries. Panels: Collaborative filtering; Log based clustering; Cluster based retrieval; User modeling.]
107
Unified Relationship Matrix (URM)
• Consider two types of objects, $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$. Their intra-type relationships are $R_x$ and $R_y$, while the inter-type relationships are $R_{xy}$ and $R_{yx}$.
• Adjacency matrices $L_x$, $L_y$, $L_{xy}$, and $L_{yx}$ represent the relationships $R_x$, $R_y$, $R_{xy}$, and $R_{yx}$, respectively.
• All the relationships can be represented in a single unified matrix:
$$L_{urm} = \begin{pmatrix} L_x & L_{xy} \\ L_{yx} & L_y \end{pmatrix}$$
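As a small illustration, the two-type unified matrix can be built by gluing the four adjacency matrices together. The function name `build_urm` and the toy matrices are hypothetical.

```python
def build_urm(Lx, Lxy, Lyx, Ly):
    """Stack [[Lx, Lxy], [Lyx, Ly]] into one matrix (lists of rows)."""
    top = [rx + rxy for rx, rxy in zip(Lx, Lxy)]       # rows for objects in X
    bottom = [ryx + ry for ryx, ry in zip(Lyx, Ly)]    # rows for objects in Y
    return top + bottom


# Toy data: two objects in X, one object in Y.
Lx = [[0, 1], [1, 0]]
Lxy = [[1], [0]]
Lyx = [[1, 0]]
Ly = [[0]]
L_urm = build_urm(Lx, Lxy, Lyx, Ly)
```

Row and column k of the result refer to the k-th object in the combined ordering, with the objects of X first and those of Y after them.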
108
Unified Relationship Matrix (2)
• In a more generalized scenario, suppose there are N data types, and objects from different data spaces are connected by intra- and inter-type relationships.
• All the relationships can be represented in a Unified Relationship Matrix:
$$L_{urm} = \begin{pmatrix} L_1 & L_{12} & \cdots & L_{1N} \\ L_{21} & L_2 & \cdots & L_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ L_{N1} & L_{N2} & \cdots & L_N \end{pmatrix}$$
• Diagonal sub-matrices are for intra-relationships; others are for inter-relationships.
109
Unified Relationship Matrix (3)
• Provides a general way of viewing data objects and relationships
• Data objects from different spaces are now all in the “unified” space.
• Previous inter- and intra-type relationships are now all intra-type relationships in the “unified” space.
110
Similarity Reinforcement Assumption
The similarity of two data objects in one data type can be reinforced by the similarity values of other data objects to which they are related.
[Figure: User Space, Query Space, and Document Space, connected by reference, select, and reading relationships.]
111
SimFusion Algorithm
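• Suppose there are N data spaces; objects in these spaces are connected by inter- and intra-type relationships as in the URM below:
$$L_{urm} = \begin{pmatrix} \lambda_{11} L_1 & \lambda_{12} L_{12} & \cdots & \lambda_{1N} L_{1N} \\ \lambda_{21} L_{21} & \lambda_{22} L_2 & \cdots & \lambda_{2N} L_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{N1} L_{N1} & \lambda_{N2} L_{N2} & \cdots & \lambda_{NN} L_N \end{pmatrix}$$
• A Unified Similarity Matrix is built to represent the pair-wise similarities of data objects:
$$S_{usm} = \begin{pmatrix} 1 & s_{12} & \cdots & s_{1T} \\ s_{21} & 1 & \cdots & s_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ s_{T1} & s_{T2} & \cdots & 1 \end{pmatrix}$$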
112
SimFusion Algorithm (2)
• The similarity reinforcement assumption can be represented as:
$$S_{usm}^{new} = L_{urm} \, S_{usm}^{original} \, L_{urm}^T$$
• Such reinforcement calculation can be continued as:
$$S_{usm}^{n} = L_{urm} \, S_{usm}^{n-1} \, L_{urm}^T = L_{urm}^{n} \, S_{usm}^{0} \, (L_{urm}^T)^{n}$$
• The calculation has been proven to converge, and is named the SimFusion algorithm.
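The reinforcement step, repeated n times from a starting similarity matrix, can be sketched in a few lines. Starting from the identity matrix and the three-object toy URM are assumptions for illustration; the helper names are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def simfusion(L_urm, iterations=10):
    """Apply S <- L_urm S L_urm^T repeatedly, starting from S = I (assumed)."""
    n = len(L_urm)
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Lt = transpose(L_urm)
    for _ in range(iterations):
        S = matmul(matmul(L_urm, S), Lt)
    return S


# Toy URM (hypothetical): objects 0 and 1 are documents, object 2 is a term.
# Each row sums to 1: both documents point only to the term, and the term
# points to each document with weight 0.5.
L = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
S1 = simfusion(L, iterations=1)
S2 = simfusion(L, iterations=2)
```

After one step the two documents have similarity 1.0 because they relate to exactly the same term. The second step matches the closed form with n = 2, which gives 0.5 for the same pair in this toy case.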
113
Real World Examples
• Consider the space of scientific papers. Then with reference relationships (intra-type), SimFusion reduces to co-citation or bibliographic coupling algorithms.
• Consider the document-term “contain” relationship and build a URM as below:
$$L_{urm} = \begin{pmatrix} 0 & L_{dt} \\ L_{dt}^T & 0 \end{pmatrix}$$
Here the USM is the identity matrix, and SimFusion reduces to VSM document similarity calculation.
• Others…(Raghavan, Beeferman)
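A short worked step, assuming a single reinforcement iteration starting from the identity USM, shows why the document-term case reduces to the vector space model. Here $I$ denotes the identity matrix and $S_{usm}^{1}$ the similarity matrix after one step.

```latex
S_{usm}^{1}
  = L_{urm} \, I \, L_{urm}^{T}
  = \begin{pmatrix} 0 & L_{dt} \\ L_{dt}^{T} & 0 \end{pmatrix}
    \begin{pmatrix} 0 & L_{dt} \\ L_{dt}^{T} & 0 \end{pmatrix}
  = \begin{pmatrix} L_{dt} L_{dt}^{T} & 0 \\ 0 & L_{dt}^{T} L_{dt} \end{pmatrix}
```

Entry (i, j) of the document block $L_{dt} L_{dt}^{T}$ is the inner product of the term vectors of documents i and j, which is the VSM similarity (the cosine measure when the rows of $L_{dt}$ have unit length). The term block $L_{dt}^{T} L_{dt}$ gives the analogous term-term similarity.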
114
Summary
• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary
115
Selected Links
• http://fox.cs.vt.edu
• CITIDEL (computing education resources)
– www.citidel.org
• NDLTD (electronic theses and dissertations worldwide)
– www.ndltd.org and etdguide.org
• Virginia Tech Digital Library Research Laboratory
– DLRL, www.dlib.vt.edu
116
Questions? Discussion?
Thank You!