Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Upload: johnathon-rhode

Post on 01-Apr-2015


Page 1: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Retrieval of Information from

Distributed Databases

By

Ananth Anandhakrishnan

Page 2: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Outline

Introduction to Distributed Computer Systems

The need for Distributed IR

Distributed IR

Problems of Distributed IR - system components

Federated search engine

Other examples of Distributed IR

Conclusion

Page 3: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Introduction to Distributed Computer Systems

What is it?

A distributed system is a collection of independent computers that appears to its users as a single coherent system.

An early precursor: the Compatible Time-Sharing System (CTSS), demonstrated at MIT in 1961 on IBM hardware.

More recent times

The WWW concept was proposed in 1989 at CERN and came into widespread use in the 1990s.

ARPANET (first links in 1969, publicly demonstrated in 1972) - building block of the Internet

Internet - a network of networks

Page 4: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Other types of Distributed systems

ORACLE – Distributed Database Management System

Air Traffic Control System – Real-time Distributed System

University Network - Client-Server system

Question: is a search engine a distributed system?

Yes - it presents a single interface; search engines like Google use a cluster of 4000 computers for web crawling.

No - the user is aware of where the retrieved documents come from (the web address, or URL), and Google's control is centralised (index and presentation).

Goals

• Share and access resources located on remote sites

• Scalability

• Transparency

• Fault Tolerance

Page 5: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

The need for Distributed IR

Benefits of Centralised IR

• Centralised control of resources is easier to manage

• A centralised system selects more relevant resources for a user query

Why is there a need for Distributed IR?

Problems of centralised systems

• It does not scale to millions of users accessing a single server

- Increases network traffic

- Increases server load

• There is a single point of failure

Problems in IR

• Information is constantly growing.

• Different types of information are emerging with different formats and standards, residing on heterogeneous networks. This makes services hard to integrate.

Page 6: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

The need for Distributed IR

Improves Scalability

Distribute the information to a network of servers.

Apply Standards and Protocols

Z39.50 search protocol - allows uniform access to a large number of diverse and heterogeneous information sources, using client-server computing.

Dublin Core - a standard applied to metadata (data about data) to make searching for information more efficient.

Replication model for information retrieval

• Removes single point of failure

• Improves scalability

Page 7: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Distributed IR

What is distributed information retrieval?

The goal of distributed information retrieval is to enable the identification and retrieval of data sets relevant to a general description or query, wherever those data sets may be located or hosted.

[Figure: USER - APPLICATION - DISTRIBUTED DATABASES]

Environments: cooperative or uncooperative

Page 8: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Distributed IR

How does it work?

Library Example

• A library organisation has sites in different locations, each with different internet-accessible resources (e-journals, e-books) in different categories (literature, science, computing, geography, history, sports).

• Each library maintains its own database, holding the resources, a unique identifier for each resource, a detailed description of each resource, and statistical information about the resource content.

• Different libraries may hold resources of the same type.

The library organisation has an online search engine, which enables users to search for any online resource in any category across all the libraries' databases.

Example of query

A user enters a query into the search application, which passes the request to all the individual library databases. Each database returns a list of unique identifiers of the relevant resources; the application merges these lists and presents a single ranked list to the user. If the user finds a resource they want to view, the resource identifier is used to retrieve it.
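The query flow in the library example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LibraryDatabase` class, its term-frequency scoring and the `federated_search` function are hypothetical names invented for this sketch, and it assumes that scores from different libraries are directly comparable, which real systems cannot take for granted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    resource_id: str  # unique identifier returned by a library database
    score: float      # relevance score (assumed comparable across libraries)


class LibraryDatabase:
    """Hypothetical stand-in for one library's own database."""

    def __init__(self, name, resources):
        self.name = name
        self.resources = resources  # resource_id -> description text

    def search(self, query):
        """Return hits whose description contains at least one query term."""
        terms = query.lower().split()
        hits = []
        for resource_id, description in self.resources.items():
            words = description.lower().split()
            # Toy score: query-term occurrences normalised by description length.
            score = sum(words.count(t) for t in terms) / max(len(words), 1)
            if score > 0:
                hits.append(Hit(resource_id, score))
        return hits

    def fetch(self, resource_id):
        """Retrieve a resource description by its identifier (None if absent)."""
        return self.resources.get(resource_id)


def federated_search(query, libraries):
    """Send the query to every library and merge the answers into one list."""
    best = {}
    for library in libraries:
        for hit in library.search(query):
            current = best.get(hit.resource_id)
            # Keep one entry per identifier so duplicates are removed.
            if current is None or hit.score > current.score:
                best[hit.resource_id] = hit
    # Highest score first; ties broken by identifier for a stable order.
    return sorted(best.values(), key=lambda h: (-h.score, h.resource_id))
```

Calling `federated_search("distributed databases", [north, south])` on two such libraries returns a single ranked list of `Hit` objects; the chosen identifier can then be passed to `fetch` on the library that holds it.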

Page 9: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Problems of Distributed IR - system sub components

main components

• resource description

• resource selection

• query translation

• resource merging

Resource description: database files which contain detailed information about the resources.

cooperative environments - STARTS protocol

uncooperative environments - query-based sampling
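As a rough illustration of query-based sampling, the sketch below builds a resource description for a database that only exposes a search interface. It is a simplification under stated assumptions: `search` is a hypothetical callable that returns the top `(doc_id, text)` matches for a one-word query, and the next query term is drawn at random from the vocabulary learned so far.

```python
import random
from collections import Counter


def query_based_sample(search, seed_term, max_docs=4, max_queries=20, rng=None):
    """Sample documents from an uncooperative database through its search box.

    Returns the sampled documents and a term -> document-frequency table,
    which together act as the learned resource description.
    """
    rng = rng or random.Random(0)
    sampled = {}            # doc_id -> text
    vocabulary = Counter()  # term -> number of sampled documents containing it
    tried = set()
    term = seed_term
    for _ in range(max_queries):
        tried.add(term)
        for doc_id, text in search(term):
            if doc_id not in sampled:
                sampled[doc_id] = text
                vocabulary.update(set(text.lower().split()))
        if len(sampled) >= max_docs:
            break
        # Pick the next query from terms seen in the sample but not yet tried.
        candidates = sorted(set(vocabulary) - tried)
        if not candidates:
            break
        term = rng.choice(candidates)
    return sampled, vocabulary


def make_toy_search(docs, top_n=2):
    """Hypothetical database: returns up to top_n documents containing the term."""
    def search(term):
        matches = [(d, t) for d, t in sorted(docs.items()) if term in t.lower().split()]
        return matches[:top_n]
    return search
```

The sampled documents from several databases can be pooled to form a centralised sample database of the kind used later for result merging.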

Dublin Core - a standard used to improve the indexing information in resource descriptions. 15 elements, used to describe and identify information or resources. Embedded into XML or HTML.

Example

TITLE: Information Retrieval from Distributed Databases
CREATOR: Ananth Anandhakrishnan
DATE: 24-11-2004
FORMAT: WORD DOCUMENT
LANGUAGE: ENGLISH

Page 10: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Dublin Core Metadata in HTML and XML

<html>
  <head>
    <title>Distributed Information Retrieval</title>
    <meta name="DC.Title" content="Retrieval of information from DIR">
    <meta name="DC.Creator" content="Ananth Anandhakrishnan">
    <meta name="DC.Date" content="2004-11-24">
    <meta name="DC.Format" content="text/html">
    <meta name="DC.Language" content="en">
  </head>
  <body>
  </body>
</html>

HTML has a tag called META

In XML, Dublin Core is embedded using the RDF framework

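<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:creator>Ananth Anandhakrishnan</dc:creator>
    <dc:title>Distributed Information Retrieval</dc:title>
    <dc:description>How information retrieval from distributed databases works</dc:description>
    <dc:date>2004-11-24</dc:date>
  </rdf:Description>
</rdf:RDF>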
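To show how an indexer might read Dublin Core metadata back out of an HTML page, here is a minimal sketch using Python's standard `html.parser` module. The `DublinCoreParser` class and `extract_dublin_core` function are hypothetical names for this illustration; the sketch only looks at `meta` tags whose `name` starts with `DC.` and ignores everything else.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collects <meta name="DC.xxx" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.elements = {}  # lower-cased element name (e.g. "title") -> content

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            # Strip the "DC." prefix and surrounding whitespace in the value.
            content = (attributes.get("content") or "").strip()
            self.elements[name[3:].lower()] = content


def extract_dublin_core(html_text):
    """Return a dict of the Dublin Core elements found in html_text."""
    parser = DublinCoreParser()
    parser.feed(html_text)
    parser.close()
    return parser.elements
```

Applied to a page like the one above, `extract_dublin_core` would return entries such as `title`, `creator` and `language`, which could then be stored in a resource description.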

Page 11: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Resource Selection Component

Resource Selection

two jobs:

• identifying a small set of databases from the distributed information retrieval system that contain documents relevant to a query

• after databases are selected, a ranked list is produced

This process is based on algorithms:

• CORI

• KL Divergence

• Relevant Document Distribution Estimation (ReDDE)

Which is the best?

ReDDE is reported to be the most effective of the three for resource selection.

It estimates the distribution of relevant documents across the databases for each user query and ranks the databases according to this distribution.
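The idea of ranking databases by an estimated distribution of relevant documents can be sketched as follows. This is a heavily simplified, ReDDE-style illustration and not the published algorithm: it assumes the query has already been run against a centralised sample database, treats the top `top_n` sampled documents as relevant, and lets each one stand for `estimated_size / sample_size` documents in its source database. All names and parameters are hypothetical.

```python
from collections import defaultdict


def redde_style_ranking(sample_ranking, sample_sizes, estimated_sizes, top_n=3):
    """Rank databases by their estimated share of relevant documents.

    sample_ranking  -- source database of each sampled document, best match first
    sample_sizes    -- database -> number of documents sampled from it
    estimated_sizes -- database -> estimated total number of documents
    """
    scores = defaultdict(float)
    for database in sample_ranking[:top_n]:
        # One sampled document represents this many documents in its database.
        scores[database] += estimated_sizes[database] / sample_sizes[database]
    total = sum(scores.values())
    if total == 0:
        return []
    # Normalise to a distribution and sort, largest share first.
    return sorted(
        ((database, score / total) for database, score in scores.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
```

In this sketch a large database with one highly ranked sampled document can outrank a small database with two, which reflects the intent of weighting by database size.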

Page 12: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Resource Merging

Result Merging

Selected resources are compiled into a single result.

• removes any duplication of resources

Problems

• different databases use different algorithms to select and score documents, so their results are difficult to merge

solution

• use standard selection algorithms

more problems

• current merging methods take place at the client end, isolated from the DIR system

• current methods are not very good

round robin - takes results from the databases in turn, starting with the first database it hits; does not take relevance into account

raw score merge - orders results by the document scores each database returns
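The two simple merging methods can be contrasted with a short sketch. It assumes each database returns a list of `(doc_id, score)` pairs, best first; the function names are hypothetical. Round robin ignores the scores entirely, while raw score merge trusts them even though different databases may score on different scales.

```python
from itertools import zip_longest


def round_robin_merge(ranked_lists):
    """Interleave results: first item of every list, then second, and so on."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for item in tier:
            if item is None:  # this list ran out of results
                continue
            doc_id, _score = item
            if doc_id not in seen:  # drop duplicates across databases
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def raw_score_merge(ranked_lists):
    """Sort all results by the score each database reported, highest first."""
    best = {}
    for results in ranked_lists:
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, score))
    return sorted(best, key=lambda doc_id: (-best[doc_id], doc_id))
```

With the same input the two functions can return different orders, which illustrates why neither is a reliable substitute for a ranking based on comparable scores.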

solution

• place the merging component near the selection component

Semi Supervised Learning model - a resource merging method.

aim: produce a ranked list similar to that of a centralised information retrieval system.

achieved by: running a centralised sample database in parallel with the distributed databases.

centralised sample database - built by using query-based sampling to create resource descriptions.

Page 13: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Semi Supervised Learning Model

[Figure: Semi Supervised Learning Model. After query entry, resource selection sends the query to the selected distributed databases and, in parallel, to a centralised sample database (resource descriptions of documents held on all databases, obtained by querying). The distributed databases return individual ranked lists with database-specific scores; the centralised sample database returns a ranked list of documents with database-independent scores. The document rankings are combined into a merged list, ranked by relevance and presented as a ranked list of document links.]

Page 14: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Semi Supervised Learning Model

How distributed information retrieval works in more detail

• A user enters a query.

• The query is used to rank the collection of databases, from which a set of databases is selected.

• The query is then broadcast to all the selected databases, each of which produces a ranked list of matches with document ids and scores. The document ids and scores are passed to the merging algorithm.

• The query is also broadcast to the centralised sample database running in parallel, and its ranked list of document ids and scores is also passed to the merging algorithm. The ranked list provided by the central database influences how the results from the distributed databases are merged.

SSL

The SSL algorithm specifically models result merging as a task of transforming sets of database-specific document scores into a single set of database-independent document scores by using the documents acquired by query-based sampling as training data.

Uses a regression algorithm to do this.
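The score transformation can be illustrated with a small regression sketch. It is a simplification under stated assumptions, not the published SSL implementation: for each database it fits a straight line by ordinary least squares on the documents that appear both in that database's results and in the centralised sample ranking, then applies the line to all of that database's scores. Databases with fewer than two overlapping documents are skipped here; a real system would need a fallback. All names are hypothetical.

```python
def fit_line(pairs):
    """Least-squares fit of y = slope * x + intercept over (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        return 0.0, mean_y  # all x equal: fall back to a constant prediction
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / var_x
    return slope, mean_y - slope * mean_x


def ssl_style_merge(db_results, central_scores):
    """Map database-specific scores to database-independent ones and merge.

    db_results     -- database -> list of (doc_id, database-specific score)
    central_scores -- doc_id -> score from the centralised sample database
    """
    merged = {}
    for database, results in db_results.items():
        # Training data: documents scored by both this database and the sample.
        training = [(s, central_scores[d]) for d, s in results if d in central_scores]
        if len(training) < 2:
            continue  # assumption of this sketch: no fallback is modelled
        slope, intercept = fit_line(training)
        for doc_id, score in results:
            merged[doc_id] = slope * score + intercept
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))
```

Because every score ends up on the scale of the centralised sample database, documents from databases with very different scoring ranges can be compared in one ranked list.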

Page 15: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

ISI Web of Knowledge

ISI products are registered trademarks and service marks used under license.

An incredible wealth of content: ISI-Derwent + Partners = depth and diversity

Engineered to work as a single resource.

Uniquely integrated like no other platform.

What makes the Web of Knowledge so unique?

CrossSearch:

• 9,000+ International Journals

• 100,000+ meetings, symposia, and reports

• 11.3 million Patented Inventions

Page 16: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Our research interests involve the development of plant species that will actually assist in the clean-up of polluted soils.

Page 17: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

We can choose to explore our results using the CrossSearch results summary list as a base.

Page 18: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan


We can also filter results by specific database.

This is especially helpful in identifying particular information, such as patent data, within the results list.

Page 19: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan
Page 20: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Other Examples

Emerge

Emerge is software built for information retrieval of scientific data. It makes use of Dublin Core and the Z39.50 search protocol.

It has an XML-based translation engine which can perform metadata mapping and query translation.

Harvest

• collects information from the internet and intranets using HTTP and FTP, and from local files such as data on hard disks, CD-ROMs and file servers

• makes it searchable through a web interface

• supports a wide range of formats

• Summary Object Interchange Format (SOIF) - metadata mapping

[Figure: Harvest architecture. Gatherer: collects information available at the providers (Provider 1, Provider 2, Provider 3). Broker: collects, stores and manages the information for clients to query. Client: queries the broker.]

Page 21: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Other Examples

SETI@Home

SETI@Home is a screensaver program used to aid the search for extraterrestrial life. It uses client computers' CPU power to process data packets.

User information is used to keep track of processing data.

Page 22: Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan

Conclusion

Distributed computing concepts help information retrieval systems

Distributed IR depends on Centralised IR - tries to emulate it

Current State of Distributed Search

GRUB

• screensaver program which uses your bandwidth and CPU power

• produces the most up-to-date indexes

• has not gained wide support

P2P search

• well known from Napster and Kazaa

• more dynamic than Google - allows users to upload whatever they want and make it searchable

• Google is in a controlled environment

• not considered in the commercial field - they don't see the benefits