outline

1

Indexing and searching heterogeneous information

LLNL – Nov. 3, 2006

Edward A. Fox
Virginia Tech
[email protected]

http://fox.cs.vt.edu

2

Outline

• Acknowledgements, Publications
• Introduction: Problem, Digital Libraries
• New Efforts: Personalization, Superimposed Info
• 5S, ETANA, Structure
• Hybrid Partitioned Inverted Indices
• Discovering Ranking Functions
• Text + CBIR + Metadata + GIS
• Meta-search, Union DLs
• LinkFusion, SimFusion
• Summary

3

Acknowledgements: Students

• Pavel Calado, William Cameron, Yuxin Chen, Fernando Das Neves, Robert France, Marcos Gonçalves, S.H. Kim, Aaron Krowne, Ming Luo, Paul Mather, Sanghee Oh, Unni Ravindranathan, Ryan Richardson, Rao Shen, Ohm Sornil, Hussein Suleman, Ricardo Torres, Manas Tungare, Wensi Xi, Seungwon Yang, Xiaoyan Yu, Baoping Zhang, Qinwei Zhu, …

4

Acknowledgements: Faculty, Staff

• Lillian Cassel, Lois Delcambre, Debra Dudley, Roger Ehrich, Joanne Eustis, Weiguo Fan, James Flanagan, C. Lee Giles, Rohit Kelapure, Neill Kipp, Douglas Knight, Deborah Knox, Aaron Krowne, Alberto Laender, David Maier, Gail McMillan, Claudia Medeiros, Manuel Perez-Quinones, Jeffrey Pomerantz, Naren Ramakrishnan, Layne Watson, Barbara Wildemuth, …

5

Other Collaborators (Selected)

• Brazil: FUA, UFMG, UNICAMP
• Case Western Reserve University
• Emory, Notre Dame, Oregon State
• Germany: Univ. Oldenburg
• Mexico: UDLA (Puebla), Monterrey
• College of NJ, Hofstra, Penn State, Villanova
• University of Arizona
• University of Florida, Univ. of Illinois
• University of Virginia

Acknowledgements: Support

• ACM, Adobe, AOL, CAPES, CNI, CONACyT, DFG, IBM, Microsoft, NASA, NDLTD, NLM, NSF (IIS-9986089, 0086227, 0080748, 0325579, 0535057; ITR-0325579; DUE-0121679, 0136690, 0121741, 0333601, 0435059, 0532825), OCLC, SOLINET, SUN, SURA, UNESCO, US Dept. Ed. (FIPSE), VTLS

7

Publications – 1 of 2

• N. J. Belkin, P. Kantor, E. A. Fox and J. A. Shaw. Combining the Evidence of Multiple Query Representations for Information Retrieval. Information Processing & Management, 31(3), 431-448, May-June 1995.

• Fan, W., Luo, M., Wang, L., Xi, W., and Fox, E. A. Tuning before feedback: Combining ranking discovery and blind feedback for robust retrieval. SIGIR 2004, 27th Annual Int’l ACM SIGIR Conf. on R&D in Information Retrieval, Sheffield, England, 25-29 July

• Weiguo Fan; Gordon, M.D.; Pathak, P.; Wensi Xi; Fox, E.A.; Ranking function optimization for effective web search by genetic programming: an empirical study, in the Proceedings of 37th Hawaii International Conf. on System Sciences (HICSS), 5-8 Jan. 2004, 105 - 112

• Edward A. Fox, Fernando Das Neves, Xiaoyan Yu, Rao Shen, Seonho Kim, and Weiguo Fan. Exploring the computing literature with visualization and stepping stones & pathways. CACM 49(4): 52-58, April 2006

• Edward A. Fox and Paul Mather. Scalable Storage for Digital Libraries. Chapter 12 in Multimedia Information Retrieval and Management: Technological Fundamentals and Applications, eds. D. Feng, W.C. Siu and H.J. Zhang, Berlin: Springer, 2003, pp. 265-288

• E. Fox and J. Shaw. Combination of Multiple Searches. In Proc. of The Second Text REtrieval Conference (TREC-2) (Aug. 30 - Sept. 1, 1993, NIST, Gaithersburg, MD), NIST Special Pub. 500-215, 1994, ed. D. K. Harman, 243-252

• Marcos Andre Goncalves, Robert K. France, and Edward A. Fox, MARIAN: Flexible Interoperability for Federated Digital Libraries. In Proc. 5th European Conference on Research and Advanced Technology for Digital Libraries, ECDL'2001, September 4-8, 2001, Darmstadt, Germany, Springer, LNCS 2163 / 2001, pp. 173-186

• Ananth Raghavan, Naga Srinivas Vemuri, Rao Shen, Marcos Andre Goncalves, Weiguo Fan, and Edward A. Fox. Incremental, Semi-automatic, Mapping-Based Integration of Heterogeneous Collections into Archaeological Digital Libraries: Megiddo Case Study. In Proc. ECDL2005, Vienna, Sept. 18-23, 2005, 139-150

8

Publications – 2 of 2

• Rao Shen, Naga Srinivas Vemuri, Weiguo Fan, Ricardo da S. Torres, and Edward A. Fox. Exploring Digital Libraries: Integrating Browsing, Searching, and Visualization. In Proc. JCDL 2006, June 11-15, 2006, Chapel Hill, NC, 1-10

• Ricardo da Silva Torres, Alexandre X. Falcao, Baoping Zhang, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Pavel Calado. A new framework to combine descriptors for content-based image retrieval. In Proc. 14th Conf. Information and Knowledge Management, CIKM 2005, 31 Oct. - 5 Nov. 2005 Bremen, Germany, 335-336

• Li Wang, Weiguo Fan, Rui Yang, Wensi Xi, Ming Luo, Ye Zhou, Edward A. Fox, Ranking Function Discovery by Genetic Programming for Robust Retrieval, Text Retrieval Evaluation Conference-2003, Nov 17-23, NIST, Washington DC, 9 pages

• Wensi Xi, Edward A. Fox, Weiguo Fan, Benyu Zhang, Zheng Chen, Jun Yan, Dong Zhuang. SimFusion: Measuring Similarity using Unified Relationship Matrix. In Proc. SIGIR 2005, 28th Annual International ACM SIGIR Conf., Salvador, Brazil, August 15-19, 2005, 130-137, http://doi.acm.org/10.1145/1076034.1076059

• W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan, W.Y. Ma, E.A. Fox. Link Fusion: A Unified Link Analysis Framework for Multi-type Inter-related Data Objects. In Proc. Thirteenth International World Wide Web Conf., WWW2004, NY, U.S.A. 19-22 May 2004, 10 pages

• Wensi Xi, Ohm Sornil, Ming Luo, and Edward A. Fox. Hybrid Partition Inverted Files: Experimental Validation. In "Research and Advanced Technology for Digital Libraries, 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002, Proceedings", eds. Maristella Agosti and Constantino Thanos, LNCS 2458, Springer, pp. 422-431.

• Wensi Xi, Ohm Sornil, and Edward A. Fox. Hybrid Partition Inverted Files for Large-Scale Digital Libraries. Proc. Digital Library: IT Opportunities and Challenges in the New Millennium, July 9-11, 2002, Beijing Library Press, Beijing, China, 404-418

• Baoping Zhang, Yuxin Chen, Weiguo Fan, Edward A. Fox, Marcos Andre Goncalves, Marco Cristo, Pavel Calado. Intelligent GP Fusion from Multiple Sources for Text Classification. In Proc. 14th Conf. on Information and Knowledge Management, CIKM 2005, 31st October - 5 Nov 2005 Bremen, Germany, 477-484


10

Problem Characterization

• Distributed (space)

• Content (streams)

• Indexing (space, structure)
– Features
– Type/sub-type: Image, texture; link, citation
– Descriptors: words or phrases or concepts
– High dimensionality

• Searching (scenario)

11

Efficiency / Effectiveness

• Effectiveness
– Very common measures: Precision, Recall, F1, 10-precision, R-Precision
– Usefulness, usability, task support, …

• Efficiency
– Time
– Space
– Performance, Resource use, …
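
The effectiveness measures named above can be sketched in a few lines. This is an illustrative sketch, not code from the talk: the function names are hypothetical, and "10-precision" is assumed to mean precision over the top 10 results (here generalized to a cutoff `k`).

```python
def precision_recall_f1(ranked, relevant, k=None):
    """Precision, recall, and F1 over the top-k results (all results if k is None)."""
    cutoff = len(ranked) if k is None else k
    hits = sum(1 for doc in ranked[:cutoff] if doc in relevant)
    precision = hits / cutoff if cutoff else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


# Hypothetical ranking of seven documents, four of which are relevant.
ranked = ["A", "B", "C", "D", "E", "F", "G"]
relevant = {"A", "D", "F", "G"}
p_at_4, recall_at_4, f1_at_4 = precision_recall_f1(ranked, relevant, k=4)
r_prec = r_precision(ranked, relevant)
```

With a cutoff of 4, two of the four retrieved documents are relevant, so precision, recall, and F1 all come out to 0.5 in this toy case.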

15

CC2001 Information Management Areas

IM1. Information models and systems*
IM2. Database systems*
IM3. Data modeling*
IM4. Relational DBs
IM5. Database query languages
IM6. Relational DB design
IM7. Transaction processing
IM8. Distributed DBs
IM9. Physical DB design
IM10. Data mining
IM11. Information storage and retrieval
IM12. Hypertext and hypermedia
IM13. Multimedia information & systems
IM14. Digital libraries

* Core components

16

DL Curriculum Framework

COURSE STRUCTURE
• Semester 1: DL collections: development/creation
• Semester 2: DL services and sustainability

CORE DL TOPICS
• Digitization, Storage, Interchange
• Digital objects, Composites, Packages
• Metadata, Cataloging, Author submission
• Naming, Repositories, Archives
• Spaces (conceptual, geographic, 2/3D, VR)
• Architectures (agents, buses, wrappers/mediators), Interoperability
• Services (searching, linking, browsing, etc.)
• Intellectual property rights mgmt., Privacy, Protection (watermarking)
• Archiving and preservation, Integrity

RELATED TOPICS
• Documents, E-publishing, Markup
• Info. Needs, Relevance, Evaluation, Effectiveness
• Thesauri, Ontologies, Classification, Categorization
• Bibliographic information, Bibliometrics, Citations
• Routing, Filtering, Community filtering
• Search & search strategy, Info seeking behavior, User modeling, Feedback
• Info summarization, Visualization
• Multimedia streams/structures, Capture/representation, Compression/coding
• Content-based analysis, Multimedia indexing
• Multimedia presentation, rendering

17

Digital Library Content

Content Types
• Text Documents: Articles, Reports, Books
• Audio: Speech, Music
• Video
• Images and Graphics: 2D, 3D, VR, CAT
• Geographic Information: (Aerial) Photos
• Software, Programs: Models, Simulations
• Bio Information: Genome (Human, animal, plant)


19

Personalizing A Course Website Using the NSDL

William Cameron (2), Boots Cassel (2), Edward Fox (1), Manuel Perez-Quinones (1), Manas Tungare (1), Xiaoyan Yu (1)

(1) Virginia Tech, (2) Villanova

20

Syllabus Collection … Towards an intelligent educational system

[Figure: system diagram. Components: Crawler, Syllabus Classifier, Extractor, Editor, Publisher, Searcher, Recommender, Resource Classifier, Services. Data and knowledge: Potential Syllabus Text, Unstructured Syllabus Text, Structured Syllabus Text, Syllabus Ontology, Classification Scheme, Other NSDL Resources.]

21

Search

• With collection, we have a full text search

• Results point to local copy in our collection as well as to original document

• Try it out: http://doc.cs.vt.edu/search/

22

Syllabus Ontology

• Standard, machine understandable

• Ontology Editor: Protégé

• Syllabus Schema: SylVia

• http://doc.cs.vt.edu/ontologies/

23

Creating new syllabus

• Web-based application to support entry of syllabi into collection

• Moodle Plug-in in the works

• Uses CC 2001 to select topics for a course

24

Information Extraction

• Plans to automatically extract information from syllabi documents collected

• Rule-based Approach

• Statistics-based Approach

• Apply the best extractor on the unstructured syllabi

25

Superimposed Tools for VT

Uma Murthy and Edward A. Fox
Department of Computer Science, Virginia Tech

18 October 2006

26

Origin of SI

• This basic need had been addressed in diverse ways, with varying degrees of success, for many years:
– concordances, annotations, comments
– bookmarks, concept maps, digital annotations, …

• The term “SI” was coined in 1999 by researchers, currently collaborating with us, now at Portland State University
– Lois Delcambre
– David Maier

27

Layers in an SI system

[Figure: a Superimposed Layer sits above a Base Layer of Information Source 1, Information Source 2, …, Information Source n; marks connect the superimposed layer to the base sources.]

* Source: ICDE04 presentation by Murthy et al.

28

Annotating an image

29

Searching over annotations

30

Searching over images/sub-images

31

Summary

[Figure: the Superimposed Layer / Base Layer diagram, repeated from the "Layers in an SI system" slide.]

* Source: ICDE04 presentation by Murthy et al.


33

Informal 5S & DL Definitions

DLs are complex systems that

• help satisfy info needs of users (societies)

• provide info services (scenarios)

• organize info in usable ways (structures)

• present info in usable ways (spaces)

• communicate info with users (streams)

34

5Ss

• Streams
– Examples: Text; video; audio; image
– Objectives: Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data

• Structures
– Examples: Collection; catalog; hypertext; document; metadata
– Objectives: Specifies organizational aspects of the DL content

• Spaces
– Examples: Measure; measurable, topological, vector, probabilistic
– Objectives: Defines logical and presentational views of several DL components

• Scenarios
– Examples: Searching, browsing, recommending
– Objectives: Details the behavior of DL services

• Societies
– Examples: Service managers, learners, teachers, etc.
– Objectives: Defines managers, responsible for running DL services; actors, that use those services; and relationships among them

35

Taxonomy of DL Services

• Infrastructure Services
– Repository-Building, Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
– Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
– Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)

• Information Satisfaction Services
– Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing

36

5S and DL formal definitions and compositions (April 2004 TOIS)

[Figure: concept map showing how the formal definitions build on one another. Definition numbers as shown on the slide:]

• Mathematical foundations: relation (d.1), function (d.2), sequence (d.3), tuple (d.4)*, language (d.5), graph (d.6), grammar (d.7)
• 5S: streams (d.9), structures (d.10), spaces (d.18), scenarios (d.21), societies (d.24)
• Spaces: measurable (d.12), measure (d.13), probability (d.14), vector (d.15), topological (d.16)
• Scenarios and societies: event (d.10), state (d.18), services (d.22), transmission (d.23)
• DL constructs: structural metadata specification (d.25), descriptive metadata specification (d.26), structured stream (d.29), digital object (d.30), collection (d.31), metadata catalog (d.32), repository (d.33), indexing service (d.34), searching service (d.35), hypertext (d.36), browsing service (d.37), digital library (minimal) (d.38)

37

A Minimal DL in the 5S Framework

[Figure: layered diagram. Bottom layer: Streams, Structures, Spaces, Scenarios, Societies. Middle layers: Structured Stream, Structural Metadata Specification, Descriptive Metadata Specification, hypertext, and services (indexing, browsing, searching). Upper layers: Digital Object, Metadata Catalog, Collection, Repository. Top: Minimal DL.]

39

ETANA-DL

• Archaeological DL
• Integrated DL
– Heterogeneous data handling

• Applies and extends the OAI-PMH
– Open Archives Initiative Protocol for Metadata Harvesting

• Design considerations
– Componentized
– Extensible
– Portable
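
To make the OAI-PMH reference concrete, here is a minimal sketch of how a harvester might form a `ListRecords` request. The `verb`, `metadataPrefix`, `set`, and `from` arguments are part of the OAI-PMH protocol; the base URL and the helper name `list_records_url` are placeholders invented for illustration and do not refer to an actual ETANA-DL endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (no network access is performed)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec       # optional selective harvesting by set
    if from_date:
        params["from"] = from_date     # optional lower bound on datestamp
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository base URL, for illustration only.
url = list_records_url("http://example.org/oai", from_date="2006-01-01")
```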

41

Heterogeneous data handling

Site          | Artifact type      | Original data source    | Number of records harvested
Bab edh-Dhra’ | Pottery            | cp6 database file       | 786
Lahav         | Figurine           | Tab-delimited text file | 563
Madaba        | Locus field record | Tables in Access DB     | 786
Mozan         | Publication        | PDF files               | 19
Nimrin        | Bone field record  | Table in Oracle DB      | 7419
Nimrin        | Seed field record  | Table in Oracle DB      | 429
Nimrin        | Locus field record | Table in Oracle DB      | 2101
Umayri        | Bone field record  | 2 tables in Access DB   | 2122
Total         |                    |                         | 18404

42

ETANA Spaces

1. Geographic distribution of found artifacts
2. Temporal dimension (as inferred by archaeologists)
3. Metric or vector spaces
   1. used to support retrieval operations, and to calculate distance (and similarity)
   2. used to browse / constrain searches spatially
4. 3D models of the past, used to reconstruct and visualize archaeological ruins
5. 2D interfaces for human-computer interaction

43

ETANA Structures

1. Site Organization
   1. Region, site, partition, sub-partition, locus, …
2. Temporal orderings (ages, periods)
3. Taxonomies
   1. for bones, seeds, building materials, …
4. Stratigraphic relationships
   1. above, beneath, coexistent

44

ETANA Streams

1. successive photos and drawings of excavation sites, loci, unearthed artifacts

2. audio and video recordings of excavation activities and discussions

3. textual reports

4. 3D models used to reconstruct and visualize archaeological ruins.

45

Degree of Structure

[Figure: spectrum of structure, from Chaotic (Web) through Organized (DLs) to Structured (DBs).]

46

Digital Objects (DOs)

• Born digital

• Digitized version of “real” object
– Is the DO version the same, better, or worse?
– Decision for ETDs: structured + rendered

• Surrogate for “real” object
– Not covered explicitly in metamodel for a minimal DL
– Crucial in metamodel for archaeology DL

47

Metadata Objects (MDOs)

• MARC

• Dublin Core

• RDF

• IMS

• OAI (Open Archives Initiative)

• Crosswalks, mappings

• Ontologies

• Topic maps, concept maps

48

Also Important: Epub, SGML, XML

• 5S perspective: streams, structures, scenarios

• Authoring

• Rendering, presenting

• Tagging, Markup, DOM

• Semi-structured information

• Dual-publishing, eBooks

• Styles (XSL, XSLT)

• Structured queries

49

XML-based DL Log Standard

• Log analysis
– Is a source of information on:
  • How patrons really use DL services
  • How systems behave while supporting user information seeking activities
– Used to:
  • Evaluate and enhance services
  • Guide allocation of resources

• Common practice in the web setting
– Supported by web servers, proxy caches

• DL logging can be more detailed

50

DL Logging Features

• Captures high level user and system behaviors

• Organized according to the 5S framework
– Hierarchical organization (XML-based)
– Centered on the notion of events

• Record only events related to initial user inputs and final system outputs

• Help to understand user interactions and the perceived value of responses

51

The XML Log Format

[Figure: tree of the XML log format. Element names shown: Log; Transaction, Timestamp; SessionId, MachineInfo, Statement; SessionInfo, RegisterInfo, Event, Timestamp; Action, StatusInfo; Search, Browse, Update, StoreSysInfo; SearchBy, QueryString, Collection, Catalog, PresentationInfo; Timeout.]
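
As a rough illustration of an event-centered XML log entry, the sketch below builds one search event with Python's standard `xml.etree.ElementTree`. The element names come from the slide, but the exact nesting, the helper name `search_event`, and all sample values are assumptions made for this example, not the published log standard.

```python
import xml.etree.ElementTree as ET


def search_event(session_id, timestamp, query, collection):
    """Build a Log element holding one search event (nesting is assumed)."""
    log = ET.Element("Log")
    transaction = ET.SubElement(log, "Transaction")
    ET.SubElement(transaction, "SessionId").text = session_id
    ET.SubElement(transaction, "Timestamp").text = timestamp
    statement = ET.SubElement(transaction, "Statement")
    event = ET.SubElement(statement, "Event")
    action = ET.SubElement(event, "Action")
    search = ET.SubElement(action, "Search")
    ET.SubElement(search, "Collection").text = collection
    ET.SubElement(search, "QueryString").text = query
    return log


# Hypothetical values, for illustration only.
log = search_event("s-001", "2006-11-03T10:15:00", "inverted index", "etana")
xml_text = ET.tostring(log, encoding="unicode")
```

Recording only the user's input (the query) and the final system response keeps such entries small while still supporting analysis of how services are used.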


53

Hybrid Partitioned Inverted Indices for Large-Scale

Digital Libraries

Ohm Sornil
The National Institute for Development Administration (NIDA)
Bangkok, Thailand
[email protected]

Edward A. Fox
Department of Computer Science
Virginia Tech, USA
[email protected]

54

Inverted Index

Document 1 = Information retrieval is searching and indexing
Document 2 = Indexing is building an index
Document 3 = An inverted file is an index
Document 4 = Building an inverted file is indexing

Vocabulary: Inverted List (document; position)

an: (2;4), (3;1), (3;5), (4;2)
and: (1;5)
building: (2;3), (4;1)
file: (3;3), (4;4)
index: (2;5), (3;6)
indexing: (1;6), (2;1), (4;6)
information: (1;1)
inverted: (3;2), (4;3)
is: (1;3), (2;2), (3;4), (4;5)
retrieval: (1;2)
searching: (1;4)
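
The positional inverted index for the four example documents can be rebuilt with a short sketch. The function name `build_inverted_index` and the whitespace tokenization with lowercasing are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (document id, word position) postings."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # Positions are 1-based, matching the (document; position) pairs on the slide.
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    return dict(index)


docs = {
    1: "Information retrieval is searching and indexing",
    2: "Indexing is building an index",
    3: "An inverted file is an index",
    4: "Building an inverted file is indexing",
}
index = build_inverted_index(docs)
vocabulary = sorted(index)
```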

55

Inverted Index Partitioning

• The Inverted Index Partitioning Problem is NP-complete

TWO PREVIOUSLY PROPOSED SCHEMES

– Document Partitioning
  • Postings are stored at the same node as are their documents
  • Aggressively balance the load

– Term Partitioning
  • Every posting of a term is stored in one node
  • Normally no attempt to balance the load

56

Hybrid Partitioning Scheme

$$C = \frac{1}{N} \sum_{i=1}^{V} \frac{q(i)\, Z(i)}{c}$$

C: Average number of chunks required for a node to retrieve an average term
c: Chunk size
q(i): Query selection distribution
Z(i): Term-frequency distribution
N: Number of nodes in the system

• Attempts to balance the load
• Groups postings into chunks

Chunk Size Selection Scheme
– Suggests a reasonable chunk size for a particular operating condition
– Based on the cost of processing a batch of queries

57

Inverted Index Partitioning

Given
• 4 Disks
• Collection (4 docs)

d1: <a, b, a, c, b>
d2: <a, d, e, a>
d3: <b, c, a, b>
d4: <b>

Term Partitioning

Node1: a = (d1;1), (d1;3), (d2;1), (d2;4), (d3;3)

Node2: b = (d1;2), (d1;5), (d3;1), (d3;4), (d4;1)

Node3: c = (d1;4), (d3;2)

Node4: d = (d2;2) e = (d2;3)

Document Partitioning

Node1: a = (d1;1), (d1;3) b = (d1;2), (d1;5) c = (d1;4)

Node2: a = (d2;1), (d2;4) d = (d2;2) e = (d2;3)

Node3: a = (d3;3) b = (d3;1), (d3;4) c = (d3;2)

Node4: b = (d4;1)

Hybrid Partitioning

• Assume: Chunk Size = 4 postings

Node1: a = (d1;1), (d1;3), (d2;1), (d2;4)

Node2: b = (d1;2), (d1;5), (d3;1), (d3;4)

Node3: a = (d3;3)

c = (d1;4), (d3;2)

Node4: b = (d4;1) d = (d2;2) e = (d2;3)
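
The three schemes can be compared on the same four-document collection with a small sketch. The placement policies here are assumptions chosen for illustration: terms are dealt to nodes round-robin for term partitioning, and each chunk goes to the currently least-loaded node for hybrid partitioning. The slide's own assignment differs in detail, and the talk does not specify its placement rule.

```python
from collections import defaultdict


def postings(docs):
    """term -> [(doc_id, position), ...] in document order (1-based positions)."""
    out = defaultdict(list)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id], start=1):
            out[term].append((doc_id, position))
    return out


def term_partition(docs, n_nodes):
    """All postings of a term live on one node (round-robin over terms: assumed policy)."""
    nodes = [defaultdict(list) for _ in range(n_nodes)]
    for i, (term, plist) in enumerate(sorted(postings(docs).items())):
        nodes[i % n_nodes][term].extend(plist)
    return [dict(node) for node in nodes]


def document_partition(docs, n_nodes):
    """Postings are stored on the node that holds their document."""
    nodes = [defaultdict(list) for _ in range(n_nodes)]
    for i, doc_id in enumerate(sorted(docs)):
        for position, term in enumerate(docs[doc_id], start=1):
            nodes[i % n_nodes][term].append((doc_id, position))
    return [dict(node) for node in nodes]


def hybrid_partition(docs, n_nodes, chunk_size):
    """Split each inverted list into chunks; place each chunk on the least-loaded node."""
    nodes = [defaultdict(list) for _ in range(n_nodes)]
    load = [0] * n_nodes  # number of postings per node
    for term, plist in sorted(postings(docs).items()):
        for start in range(0, len(plist), chunk_size):
            chunk = plist[start:start + chunk_size]
            target = load.index(min(load))
            nodes[target][term].extend(chunk)
            load[target] += len(chunk)
    return [dict(node) for node in nodes], load


docs = {
    "d1": ["a", "b", "a", "c", "b"],
    "d2": ["a", "d", "e", "a"],
    "d3": ["b", "c", "a", "b"],
    "d4": ["b"],
}
by_term = term_partition(docs, 4)
by_doc = document_partition(docs, 4)
hybrid_nodes, hybrid_load = hybrid_partition(docs, 4, chunk_size=4)
```

Under these assumptions the hybrid scheme spreads the 14 postings almost evenly across the four nodes, whereas term partitioning leaves the nodes holding the frequent terms a and b with most of the load.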

58

Conclusion

• Performance Measures:
– Hybrid > Term > Document

• Hybrid partitioning scheme performs better than the other two schemes in a variety of conditions
– Large collection
– Multiprogramming level
– Query skew
– System scaling

59

Conclusion (cont.)

• Observations from the results

– Node Utilization (best is middle range)

• Results: Document > Hybrid > Term

– Load Fluctuation

• Results: Term > Hybrid > Document


61

Tuning Before Feedback: Combining Ranking Discovery and Blind Feedback for Robust Retrieval

• Ranking function plays an important role in IR performance

• Blind feedback (pseudo-relevance feedback) was found very useful for ad hoc retrieval

• Why not combine ranking function optimization with blind feedback to improve the robustness?

62

Blind Feedback

• Automatically adds more terms to a user’s query to enhance the performance of search engines by assuming top ranked docs relevant

• Some examples
– Rocchio (performs better in our exp.)
– Dec-Hi
– Kullback-Leibler Divergence (KLD)
– Chi-Square
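
A minimal sketch of Rocchio-style blind feedback follows: the top-k documents of an initial ranking are assumed relevant, and their centroid is added to the query. The parameter values (`k`, `alpha`, `beta`, `n_terms`), the raw term-frequency weighting, and the omission of a negative-feedback term are simplifying assumptions for illustration, not the configuration used in the experiments.

```python
from collections import Counter


def rocchio_blind_feedback(query, ranked_docs, k=2, alpha=1.0, beta=0.75, n_terms=3):
    """Expand a query using the centroid of the top-k ranked documents."""
    top = ranked_docs[:k]
    centroid = Counter()
    for doc in top:
        for term, tf in Counter(doc).items():
            centroid[term] += tf / len(top)
    new_query = Counter({term: alpha * w for term, w in Counter(query).items()})
    # Pick the strongest centroid terms that are not already in the query.
    expansion = [t for t, _ in centroid.most_common() if t not in new_query][:n_terms]
    for term in set(query) | set(expansion):
        new_query[term] += beta * centroid[term]
    return dict(new_query)


# Hypothetical tokenized documents, already ranked by an initial retrieval run.
ranked_docs = [
    ["inverted", "file", "index", "index"],
    ["index", "partitioning", "file"],
    ["genetic", "programming"],
]
expanded = rocchio_blind_feedback(["inverted", "index"], ranked_docs)
```

Here the expanded query gains "file" and "partitioning" from the two top-ranked documents, while terms from the third document are ignored.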

63

RF Discovery Problem

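[Figure: feedback (training data) is the input to Ranking Function Discovery; the output is a ranking function f. Two example rankings are shown, with columns Order, Doc., Rel.:]

Ranking 1: 1 A 1; 2 D 1; 3 F 1; 4 G 1; 5 B 0; 6 C 0; 7 E 0
Ranking 2: 1 A 1; 2 B 0; 3 C 0; 4 D 1; 5 E 0; 6 F 1; 7 G 1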

64

Ranking Function Optimization

• Ranking Function Tuning is an art! – Paul Kantor

• Why not adaptively discover RF by Genetic Programming?
– Huge search space
– Discrete objective function
– Modeling advantage

• What is GP?
– Problem solving systems designed based on principles of evolution and heredity. Widely used for structure discovery, functional form discovery, other data mining and optimization tasks

65

An Example of GP-based RF

(log (+ (* df (log (log (* (* (/ n df) (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col)))) (* (/ (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) tf_avg_Col)) (log tf_avg_Col)))))) (+ (* (* df_max_Col tf) (/ (* (* (/ (/ (* tf 6.720) (/ df N)) (* df_max_Col tf)) (* (* tf N) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) (* (* (/ tf tf_max) (+ (* length df) (* 2.812 1))) tf_avg)))) (+ (/ df tf_avg) tf))))

tf Query term frequency in the document ( vector )

tf_query Query term frequency in the query ( vector )

tf_max The maximum term frequency in a document ( scalar )

Length Document length in the number of words ( scalar )

Length_avg Average document length in the number of words ( scalar )

N Number of documents in the collection ( scalar )

tf_avg Average term frequency in the current document (scalar)

tf_avg_Col Average term frequency for all the documents in the collection ( scalar )

df_max_Col Maximum document frequency for a word in the collection ( scalar )

df Document frequency for the query words ( vector )


66

The ARRANGER Engine

1. Split the training data into training and validation sets
2. Generate an initial population of random “ranking functions”
3. Evaluate the fitness of each “ranking function” in the population and record the 10 best ones
4. If the stopping criterion is not met, generate the next generation of the population by genetic transformation and go to Step 3.
5. Validate the recorded best “ranking functions” and select the best one as the RF

[Figure: flowchart: Start, Initialize Population, Evaluate Fitness, Stop?, Apply Crossover (loop back to Evaluate Fitness), Validate and Output, End. The training data table (Order, Doc., Rel.) feeds a population of 50 candidate functions with example fitness values such as 0.4, 0.3, 0.8.]

67

An Integrated Model

[Figure: multiple user queries with relevance information feed Ranking Tuning, which produces a new ranking function; user queries are then run with the new ranking function and Blind Feedback to produce new search results.]


69

Text + CBIR + Metadata + GIS

• Combined retrieval across multiple types of information

• Ex.: bio-diversity information systems

• Architecture, approach, prototype, validation

• Novel aspects:
– Learn set of descriptors for a collection
– Application to fisheries, archaeology

70

Textual information retrieval

Query on Google using Sunset and Rio de Janeiro

Query result

71

Content-Based Information Retrieval

72

Motivation for Integration

• Query 1:
– List all metadata information related to fishes which have been observed at the Mississippi River.

• Query 2:
– Retrieve fish images which contain a shape similar to this example.

• Query 3:
– List all metadata information related to fishes which both have been observed at the Mississippi River and contain a shape similar to a given example.

73

Longer Integrated Query

• Retrieve fish descriptions of all fish whose shape is similar to that shown in Figure below, which belong to genus “Notropis”, which have “large eyes” and “dorsal stripe”, and have been observed within the catchments of the “Tennessee” river

74

System’s Architecture

[Figure: an Interface sits above a Mediator containing a Data Insertion Module and a Query Processing Module; below are a GIS and a DBMS over the Databases (Geo. DB, Metadata, Image DB).]

75

[Figure: detailed architecture.
• Interface: Query Specification, Visualization
• BIS Manager with Query Mediator: Analysis, Execution, Merging
• Content-Based Image Search Component (CBISC) over the Image Collection: Images, Image Descriptors, Image Metadata; HTTP requests: ListDescriptors, GetImages
• Metadata-Based Search Component (ESSEX) over the EcoCollection: Metadata, Taxonomic Trees; accessed via OAI; HTTP request: keywords
• Geographic Data Search Component (GDSC) with Web Feature Server (WFS) over the GeoCollection: Metadata, Maps; HTTP requests: GetCapabilities, GetFeatureType, GetFeature]

76

Feature Extraction Model

[Figure: an image is split into R, G, and B channels, from which a feature vector such as [0.98, 0.91, 0.73, …] is extracted.]

77

CBISC Architecture

78

CBISC Configuration Tool


80

Meta-Search

• Contexts:
– Web: search engine built atop others
– Federated search: bring together results from distributed partial content sites

• Approach
– Send query out to multiple sites
– Merge results from sites
– Combine those results for ranking as follows:

81

Combination

• For a document, combine the sim values from each system involved:

• CombMIN

• CombMAX

• CombSUM

• CombMNZ = CombSUM * no. systems with non-zero similarities

• CombMNZ is often best; otherwise CombSUM
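
The combination rules can be sketched as follows. Each system is represented as a dictionary from document to similarity; treating a missing document as a zero similarity, and taking CombMIN and CombMAX over the non-zero values only, are assumptions of this sketch. The function name `combine` and the sample scores are illustrative.

```python
def combine(scores_by_system, method="CombMNZ"):
    """Fuse per-system similarity scores and return documents ranked best first."""
    all_docs = set().union(*scores_by_system)
    fused = {}
    for doc in all_docs:
        sims = [s[doc] for s in scores_by_system if s.get(doc, 0.0) > 0.0]
        if not sims:
            fused[doc] = 0.0
        elif method == "CombMIN":
            fused[doc] = min(sims)
        elif method == "CombMAX":
            fused[doc] = max(sims)
        elif method == "CombSUM":
            fused[doc] = sum(sims)
        elif method == "CombMNZ":
            # CombSUM multiplied by the number of systems with non-zero similarity.
            fused[doc] = sum(sims) * len(sims)
        else:
            raise ValueError(f"unknown method: {method}")
    ranking = sorted(fused, key=lambda d: (-fused[d], d))
    return ranking, fused


# Hypothetical similarity scores from three search systems.
systems = [
    {"d1": 0.75, "d2": 0.5},
    {"d1": 0.5, "d3": 0.75},
    {"d1": 0.25, "d3": 0.25},
]
mnz_ranking, mnz_scores = combine(systems, "CombMNZ")
sum_ranking, sum_scores = combine(systems, "CombSUM")
```

CombMNZ rewards documents retrieved by several systems: d1, found by all three, scores 1.5 × 3 = 4.5, well ahead of d3 at 1.0 × 2 = 2.0.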

82

DL Integration

• What is “DL Integration”
– Hide distribution
– Hide heterogeneity
– Enable autonomy of individual component

• Why Integration
– island-DLs
– inability to seamlessly and transparently access knowledge across DLs

Utilize various autonomous DLs in concert

83

Introduction and Problem Description

[Figure: five autonomous DLs (DL1 to DL5), shown first as separate islands and then integrated into a UnionDL.]

84

Related Work on Integrating Services in DLs

[Figure: concept map. Integrating searching and browsing is found in systems in the 1980s, 1990s, and 2000s; integrating searching and browsing with other services includes clustering and visualization. Example systems named on the map: I3R, RABBIT, CODER, DataWeb, PESTO, SenseMaker, MIX, BBQ, ScentTrails, ODL, MARIAN, Stepping Stones & Pathways, EtanaViz, CitiViz.]

85

Reconceptualization of Related Work on Semantic Interoperability

[Figure: concept map. Semantic interoperability in DLs is achieved by proactive standardization or reactive interpretation. Approaches are intermediary-based (consisting of mediator, wrapper, ontology) and mapping-based (using schema mapping, which may be automatically generated). These are used in federation, union, and archiving architectures, with examples CITIDEL, Dienst, Fedora, NDLTD, Infobus, …]

Key: Blue indicates focus of our work.

86

Formal Definition of DL Integration

• DL_i = (R_i, DM_i, Serv_i, Soc_i), 1 ≤ i ≤ n
– R_i is a network accessible repository
– DM_i is a set of metadata catalogs for all collections
– Serv_i is a set of services
– Soc_i is a society

• UnionRep
• UnionCat
• UnionServices
• UnionSociety

87

Formal Definition of DL Integration (Cont.)

• DL integration problem definition:

Given n individual libraries, integrate the n DLs to create a UnionDL.

88

Taxonomy of Union Services

Infrastructure Services Information Satisfaction Services

Essential Add_Value Essential Add_Value

indexing

harvesting

mapping

(Schema registry with analyses & mapping)

(data) cleaning

(focused) crawling

copying (replicating)

logging

(format) translating

(Service to support annotation)

(Metadata validation)

searching

browsing

access control

binding

comparison

(forum) discussion

(query) expansion

filtering

recommendation

visualization

Note: Suggested NSDL services are shown in blue.

89

Union Catalog Integration

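[Figure: Virtual Nimrin (VN) and Halif DigMaster (HD) each have a catalog in their own metadata format (VN Metadata Format, HD Metadata Format). For each, a Mapping Tool and a Wrapper relate the local format to the Global Metadata Format, and the results are combined into the Union Catalog of the Union ArchDL.]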

90

Data Mapping (state-of-the-art)

91

local schema global schema

92

Mapping recommendation

93

Mapping confirmation

Mapping history


95

Link Fusion

A Unified Link Analysis for Multi-Type Interrelated Data Objects

Wensi Xi1, Benyu Zhang2, Zheng Chen2, Yizhou Lu3, Shuicheng Yan3, Wei-Ying Ma2, Edward A. Fox1

1Virginia Tech 2Microsoft Research Asia 3Peking University

96

Traditional way of representing relationships

• Space: Vector and probability spaces (e.g., Salton and Wang’s works)

• Database: Sets of attributes represent relationships, and are used to design databases (e.g., Fuhr and Frieder’s works).

• Networks: Belief, inference and spreading activation networks (e.g., Turtle, Ribeiro-Neto, and Acid’s works)

Problems:
• Not easily used to combine multiple types of data objects and relationships
• Need to find representations that are closer to reality and are more dynamic

97

Example: Collaborative Filtering, Recommender System

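[Figure: users are linked to one another by human relationships; web pages are linked to one another by hyperlinks / content similarity; users are linked to web pages by browsing.]

• Inter-type relationship: Browse.
• Intra-type relationship: Hyperlink, Human relationships.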

98

Example: Query Expansion, Web Clustering

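[Figure: queries are linked to one another by content similarity; web pages are linked to one another by hyperlinks / content similarity; queries are linked to web pages by reference.]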

• Inter-type relationship: Reference.

• Intra-type relationship: Hyperlink, Content Similarity

99

Attribute Reinforcement Assumption

The specific attribute of a data object in one data type can be reinforced by both the attributes of related data objects in the same data space and attributes of related data objects from other data space.

[Figure: data objects grouped into data spaces, connected by intra-type relationships within a space and inter-type relationships across spaces.]

100

Link Fusion algorithm (1)

• Consider two types of objects, X={x1…xm} and Y={y1…yn}. Their intra-type relationships are Rx and Ry; their inter-type relationships are Rxy and Ryx.

• Adjacency matrices Lx, Ly, Lxy, and Lyx represent the relationships Rx, Ry, Rxy, and Ryx, respectively.

• Suppose wx is the attribute vector of objects in X, and wy is the attribute vector of objects in Y.

• wx is reinforced by both the intra- and inter-type relationships from X and Y, and so is wy. The Link Fusion algorithm can be represented as:

\[
\begin{aligned}
w_y &= L_y^{T} w_y + L_{xy}^{T} w_x \\
w_x &= L_x^{T} w_x + L_{yx}^{T} w_y
\end{aligned}
\]
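The two-type update on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the slide does not say how the vectors are rescaled between iterations, so the joint rescaling step, the function names, the iteration count, and the toy user/page data are all hypothetical.

```python
def transpose_times(L, w):
    """Return L^T w, where L is a list of rows and w has one entry per row of L."""
    return [sum(L[i][j] * w[i] for i in range(len(L))) for j in range(len(L[0]))]


def link_fusion_two_types(Lx, Ly, Lxy, Lyx, iterations=100):
    """Iterate w_x = Lx^T w_x + Lyx^T w_y and w_y = Ly^T w_y + Lxy^T w_x."""
    m, n = len(Lx), len(Ly)
    wx = [1.0 / (m + n)] * m
    wy = [1.0 / (m + n)] * n
    for _ in range(iterations):
        new_wx = [a + b for a, b in zip(transpose_times(Lx, wx), transpose_times(Lyx, wy))]
        new_wy = [a + b for a, b in zip(transpose_times(Ly, wy), transpose_times(Lxy, wx))]
        # Assumed step (not on the slide): rescale jointly so all scores sum to 1.
        total = sum(new_wx) + sum(new_wy)
        if total == 0:
            break
        wx = [v / total for v in new_wx]
        wy = [v / total for v in new_wy]
    return wx, wy


# Hypothetical toy data: users u0, u1 and pages p0, p1.
Lx = [[0.0, 1.0], [1.0, 0.0]]    # user-user (human relationship)
Ly = [[0.0, 1.0], [1.0, 0.0]]    # page-page (hyperlink)
Lxy = [[1.0, 1.0], [1.0, 0.0]]   # user -> page (browse): u0 browses both, u1 only p0
Lyx = [[1.0, 1.0], [1.0, 0.0]]   # page -> user (same relation, reversed)
wx, wy = link_fusion_two_types(Lx, Ly, Lxy, Lyx)
```

In this toy graph u0 and p0 each have three links while u1 and p1 have two, so the better-connected user and page end up with the larger scores.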

101

Link Fusion algorithm (2)

• In a more generalized scenario, suppose there are N data types. The importance attribute of one type of object can be reinforced by both inter- and intra-type links as:

\[
w_M = L_M^{T} w_M + \sum_{N \neq M} L_{NM}^{T} w_N
\]

• It can also be written in matrix form as w = A^T w. In the matrix, α and β are weights for different attributes:

\[
A = \begin{bmatrix}
\alpha_1 L'_1 & \beta_{12} L'_{12} & \cdots & \beta_{1n} L'_{1n} \\
\beta_{21} L'_{21} & \alpha_2 L'_2 & \cdots & \beta_{2n} L'_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{n1} L'_{n1} & \beta_{n2} L'_{n2} & \cdots & \alpha_n L'_n
\end{bmatrix}
\]

• Iterative calculation would result in the principal eigenvector of A, which can be explained as the value of data objects regarding a specific attribute.
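The matrix form can be illustrated with a small power iteration. The sketch below assembles A from per-type blocks and then repeats w ← A^T w; the helper names, the rescaling to unit sum, the iteration count, and the weights (0.3 for intra-type blocks, 0.7 for inter-type blocks) are assumptions chosen for illustration, not values from the talk.

```python
def assemble_weighted_matrix(blocks, weights):
    """Build A from blocks[i][j] (list-of-rows matrices) scaled by weights[i][j]."""
    A = []
    for i, block_row in enumerate(blocks):
        # The diagonal block has one row per object of type i.
        for r in range(len(block_row[i])):
            row = []
            for j, block in enumerate(block_row):
                row.extend(weights[i][j] * value for value in block[r])
            A.append(row)
    return A


def principal_eigenvector(A, iterations=200):
    """Power iteration for w = A^T w, rescaled each step so that sum(w) == 1."""
    size = len(A)
    w = [1.0 / size] * size
    for _ in range(iterations):
        w = [sum(A[i][j] * w[i] for i in range(size)) for j in range(size)]
        total = sum(w)
        if total == 0:
            break
        w = [value / total for value in w]
    return w


# Hypothetical two-type example: users (type 1) and pages (type 2).
L1 = [[0.0, 1.0], [1.0, 0.0]]     # user-user
L2 = [[0.0, 1.0], [1.0, 0.0]]     # page-page
L12 = [[1.0, 1.0], [1.0, 0.0]]    # user -> page
L21 = [[1.0, 1.0], [1.0, 0.0]]    # page -> user
alpha, beta = 0.3, 0.7            # assumed intra-type and inter-type weights
A = assemble_weighted_matrix([[L1, L12], [L21, L2]],
                             [[alpha, beta], [beta, alpha]])
w = principal_eigenvector(A)      # scores ordered as [u0, u1, p0, p1]
```

Plain power iteration only settles when the largest eigenvalue is strictly dominant in magnitude; a purely bipartite structure with no intra-type links can oscillate instead.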

102

Link Fusion algorithm (3)

• If we consider web pages as a homogeneous data space, connected via intra-type relationships (hyperlinks), Link Fusion reduces to the PageRank algorithm.

• If we consider the Hub and Authority attributes of web pages as two different types of objects, connected via inter-type relationships (hyperlinks), Link Fusion reduces to the HITS algorithm.

• Thus, Link Fusion can be considered an extension of traditional link analysis algorithms.
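The HITS case can be written out directly as a two-type update with empty intra-type blocks. This is a sketch under assumptions: hubs play the role of X and authorities the role of Y, with Lxy = L and Lyx = L^T for a hyperlink matrix L; the separate rescaling of the two vectors, the function name, and the three-page graph are illustrative choices.

```python
def hits_as_link_fusion(L, iterations=50):
    """HITS as a two-type Link Fusion update with no intra-type links.

    X = hub scores, Y = authority scores, Lxy = L and Lyx = L^T, so that
    w_y = Lxy^T w_x (authorities) and w_x = Lyx^T w_y (hubs).
    """
    n = len(L)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        new_auth = [sum(L[i][j] * hub[i] for i in range(n)) for j in range(n)]  # L^T hub
        new_hub = [sum(L[i][j] * auth[j] for j in range(n)) for i in range(n)]  # L auth
        # Assumed step: rescale each vector separately so that it sums to 1.
        auth_total, hub_total = sum(new_auth), sum(new_hub)
        if auth_total == 0 or hub_total == 0:
            break
        auth = [v / auth_total for v in new_auth]
        hub = [v / hub_total for v in new_hub]
    return hub, auth


# Hypothetical graph: page 0 links to pages 1 and 2; page 1 links to page 2.
links = [[0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
hub, auth = hits_as_link_fusion(links)
```

Page 2 receives the most links and gets the highest authority score, while page 0 links out the most and gets the highest hub score.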

103

SimFusion: Measuring Similarity Using the Unified Relationship Matrix

Wensi Xi¹, Edward Fox¹, Weiguo Fan¹, Benyu Zhang², Zheng Chen², Jun Yan³, Dong Zhuang⁴

¹Department of Computer Science, Virginia Tech

²Microsoft Research Asia

³School of Mathematical Science, Peking University

⁴Department of Computer Science, Beijing Institute of Technology

SIGIR 2005

Salvador, Brazil, August 15-19, 2005

104

Motivation

• To achieve desired improvements with advanced information systems, we need to combine and integrate information from a variety of sources.

• Entities from different domains can be considered as objects containing information:
– Web pages or scientific papers
– Queries
– Users

• Information contained by objects may include:
– Contents: papers, web pages
– Attributes: popularity, authority
– Relationships: reference, hyperlink, similarity

105

Research Statement

Problem:

“How can the broad variety of heterogeneous data and relationships be effectively and efficiently integrated to improve the performance of various information retrieval related tasks?”

Solution:

Use matrices to represent multi-relationships, and use matrix calculations to integrate them (so as to improve searching and clustering).

106

Example

[Figure: block matrices over Users (U), Queries (Q), and Documents (D), with "x" entries marking related pairs of objects, together with a second set of blocks labeled U', Q', and D'. Regions of the figure are labeled with the tasks they support: Collaborative filtering, Log-based clustering, Cluster-based retrieval, and User modeling.]

107

Unified Relationship Matrix (URM)

• Consider two types of objects, X={x1…xm} and Y={y1…yn}. Their intra-type relationships are Rx and Ry, while the inter-type relationships are Rxy and Ryx.

• Adjacency matrices Lx, Ly, Lxy, and Lyx represent the relationships of Rx, Ry, Rxy, and Ryx, respectively.

• All the relationships can be represented in a single unified matrix:

\[
L_{urm} = \begin{bmatrix} L_x & L_{xy} \\ L_{yx} & L_y \end{bmatrix}
\]

108

Unified Relationship Matrix (2)

• In a more generalized scenario, suppose there are N data types, and objects from different data spaces are connected by intra- and inter-type relationships.

• All the relationships can be represented in a Unified Relationship Matrix:

\[
L_{urm} = \begin{bmatrix}
L_1 & L_{12} & \cdots & L_{1N} \\
L_{21} & L_2 & \cdots & L_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
L_{N1} & L_{N2} & \cdots & L_N
\end{bmatrix}
\]

• Diagonal sub-matrices are for intra-type relationships; the others are for inter-type relationships.

109

Unified Relationship Matrix (3)

• Provides a general way of viewing data objects and relationships

• Data objects from different spaces are now all in the “unified” space.

• Previous inter- and intra-type relationships are now all intra-type relationships in the “unified” space.

110

Similarity Reinforcement Assumption

The similarity of two data objects in one data type can be reinforced by the similarity values of other data objects to which they are related.

[Figure: User Space, Query Space, and Document Space, connected by Reference, Select, and Reading relationships.]

111

SimFusion Algorithm

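• Suppose there are N data spaces, and objects in these spaces are connected by inter- and intra-type relationships as in the URM below:

\[
L_{urm} = \begin{bmatrix}
\lambda_{11} L_1 & \lambda_{12} L_{12} & \cdots & \lambda_{1N} L_{1N} \\
\lambda_{21} L_{21} & \lambda_{22} L_2 & \cdots & \lambda_{2N} L_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{N1} L_{N1} & \lambda_{N2} L_{N2} & \cdots & \lambda_{NN} L_N
\end{bmatrix}
\]

• A Unified Similarity Matrix is built to represent the pair-wise similarities of data objects:

\[
S_{usm} = \begin{bmatrix}
1 & s_{12} & \cdots & s_{1T} \\
s_{21} & 1 & \cdots & s_{2T} \\
\vdots & \vdots & \ddots & \vdots \\
s_{T1} & s_{T2} & \cdots & 1
\end{bmatrix}
\]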

112

SimFusion Algorithm (2)

• The similarity reinforcement assumption can be represented as:

\[
S_{usm}^{new} = L_{urm} \, S_{usm}^{original} \, L_{urm}^{T}
\]

• Such reinforcement calculation can be continued as:

\[
S_{usm}^{n} = L_{urm} \, S_{usm}^{n-1} \, L_{urm}^{T} = L_{urm}^{n} \, S_{usm}^{0} \, (L_{urm}^{T})^{n}
\]

• The calculation has been proven to converge, and is named the SimFusion algorithm.
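The reinforcement step on this slide amounts to repeated matrix products. A minimal sketch, assuming a small row-normalized URM, the identity matrix as the starting similarity S^0, and a fixed number of iterations; the function names and the three-object example (two documents and one query) are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def simfusion(L_urm, iterations=1):
    """Apply S <- L_urm S L_urm^T the given number of times, starting from S = I."""
    size = len(L_urm)
    S = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    Lt = transpose(L_urm)
    for _ in range(iterations):
        S = matmul(matmul(L_urm, S), Lt)
    return S


# Hypothetical URM over [d0, d1, q]: each row sums to 1 (assumed normalization).
# The documents have no direct link to each other, only to the shared query q.
L_urm = [[0.50, 0.00, 0.50],
         [0.00, 0.50, 0.50],
         [0.25, 0.25, 0.50]]
S1 = simfusion(L_urm, iterations=1)
S5 = simfusion(L_urm, iterations=5)
```

After a single step the two documents already have a nonzero similarity (`S1[0][1]`), which comes entirely from their shared relationship with the query.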

113

Real World Examples

• Consider the space of scientific papers. Then with reference relationships (intra-type), SimFusion reduces to co-citation or bibliographic coupling algorithms.

• Consider the document-term “contain” relationship and build a URM as below:

\[
L_{urm} = \begin{bmatrix} 0 & L_{dt} \\ L_{dt}^{T} & 0 \end{bmatrix}
\]

Here the USM is the identity matrix, and SimFusion reduces to VSM document similarity calculation.

• Others… (Raghavan, Beeferman)
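The document-term case can be checked numerically. In the sketch below, one reinforcement step from the identity matrix gives L_urm L_urm^T, whose document-document block equals L_dt L_dt^T, the inner products of the document term vectors. The helper names and the binary 2-document, 3-term matrix are assumptions for illustration; cosine similarity would additionally need length-normalized rows.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def document_term_urm(L_dt):
    """Build [[0, L_dt], [L_dt^T, 0]] for a documents-by-terms matrix L_dt."""
    d, t = len(L_dt), len(L_dt[0])
    top = [[0.0] * d + list(row) for row in L_dt]
    bottom = [list(col) + [0.0] * t for col in zip(*L_dt)]
    return top + bottom


# Hypothetical "contain" matrix: doc 0 has terms 0 and 1, doc 1 has terms 1 and 2.
L_dt = [[1.0, 1.0, 0.0],
        [0.0, 1.0, 1.0]]
urm = document_term_urm(L_dt)

# One step from S = I: L_urm I L_urm^T = L_urm L_urm^T.
S1 = matmul(urm, transpose(urm))
doc_block = [row[:2] for row in S1[:2]]   # document-document similarities
vsm = matmul(L_dt, transpose(L_dt))       # inner products of the term vectors
```

The off-diagonal entry of `doc_block` counts the single term the two documents share.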

114

Summary

• Acknowledgements, Publications

• Introduction: Problem, Digital Libraries

• New Efforts: Personalization, Superimposed Info

• 5S, ETANA, Structure

• Hybrid Partitioned Inverted Indices

• Discovering Ranking Functions

• Text + CBIR + Metadata + GIS

• Meta-search, Union DLs

• LinkFusion, SimFusion

• Summary

115

Selected Links

• http://fox.cs.vt.edu

• CITIDEL (computing education resources): www.citidel.org

• NDLTD (electronic theses and dissertations worldwide): www.ndltd.org and etdguide.org

• Virginia Tech Digital Library Research Laboratory (DLRL): www.dlib.vt.edu

116

Questions? Discussion?

Thank You!
