geographical information retrieval system

Upload: sumitsharma

Post on 14-Oct-2015


An introduction to Geographic Information Retrieval Systems

Ajay Kumar Garg Engineering College, Ghaziabad
Uttar Pradesh Technical University
June 2014
by Nishant Shekhar (1002710065), Nishank Garg (1002710064), Nikhil Babu (1002710062), Sumit Jha (1002710106)

Table of Contents

1. INTRODUCTION
2. OBJECTIVE
3. BASIC CONCEPTS OF GIRS
4. INVERTED INDEX
5. TRADITIONAL SEARCH VS GEO SEARCH
6. COMPARISON WITH EXISTING SEARCH ENGINE
7. CONCLUSION

INTRODUCTION

Geographic information retrieval is a fast-developing area concerned with providing access to geo-referenced information sources. A Geographic Information Retrieval System (GIRS) is a system that provides such access.

A useful premise is that every document in a collection, and every query issued to an information retrieval (IR) system, is geography-dependent. If we can determine what area a document as a whole is about (i.e., its geographical scope), we can reasonably assume that the people, places and organizations named in it are located in that area.

OBJECTIVE

The project presents our work on automatically identifying the geographical scope of Web documents, which provides the means to develop retrieval tools that take the geographical context into consideration.

Other objectives are:

- Detection of geographic references in the documents.
- Modeling of the geographic scope of documents.
- Relevance ranking according to geographic context.
- Efficient index techniques that cope with both the textual and the spatial dimension.
- User interfaces that make it easy to work with both dimensions.
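The first two objectives can be illustrated with a minimal sketch. It assumes a gazetteer lookup, which is one common way to detect place names and is not necessarily the method used in this project. The gazetteer contents and the function names `find_geo_references` and `geographic_scope` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer: place name -> enclosing region.
GAZETTEER = {
    "ghaziabad": "Uttar Pradesh",
    "lucknow": "Uttar Pradesh",
    "mumbai": "Maharashtra",
}

def find_geo_references(text):
    """Return the gazetteer place names mentioned in the text, in order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w in GAZETTEER]

def geographic_scope(text):
    """Guess the scope as the region referred to most often, or None."""
    regions = Counter(GAZETTEER[p] for p in find_geo_references(text))
    return regions.most_common(1)[0][0] if regions else None

doc = "Colleges in Ghaziabad and Lucknow compete with those in Mumbai."
references = find_geo_references(doc)
scope = geographic_scope(doc)
print(references, scope)
```

A single-word lookup like this cannot handle multi-word or ambiguous place names; it only shows how detected references can be aggregated into one scope per document.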

BASIC CONCEPTS OF GIRS

Web crawling and indexes: Web crawling is the process by which we gather pages from the Web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. The program that performs the crawl is called a crawler, or sometimes a spider.

The basic operation of any hypertext crawler is as follows. The crawler begins with one or more URLs that constitute a seed set. It picks a URL from this seed set and fetches the web page at that URL. The fetched page is then parsed to extract both the text and the links from the page (each of which points to another URL). The extracted links are added to the set of URLs still to be fetched, and the process repeats.
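The crawl loop described above can be sketched as follows. To keep the example runnable without network access, pages are served from a hypothetical in-memory dictionary `FAKE_WEB`; the names `PageParser` and `crawl` are also illustrative assumptions. A real crawler would fetch over HTTP and respect robots.txt and politeness limits.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML source.
FAKE_WEB = {
    "http://a.example/": '<p>Home</p><a href="http://b.example/">B</a>',
    "http://b.example/": '<p>Page B</p><a href="http://a.example/">A</a>'
                         '<a href="http://c.example/">C</a>',
    "http://c.example/": "<p>Page C</p>",
}

class PageParser(HTMLParser):
    """Collects the visible text and the outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def crawl(seed_urls, fetch=FAKE_WEB.get, max_pages=100):
    """Breadth-first crawl from the seed set; returns URL -> page text."""
    frontier = deque(seed_urls)   # URLs still to be fetched
    seen = set(seed_urls)         # URLs already queued, to avoid refetching
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # unknown URL in the fake web
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        pages[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl(["http://a.example/"])
print(list(pages))
```

Starting from the single seed, the loop reaches all three pages, and the `seen` set keeps the back-link from B to A from causing a second fetch.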

Document Parsing: Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called text segmentation, content analysis, text analysis, text mining, speech segmentation, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.
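As a minimal sketch of tokenization, the function below lowercases the text and splits it into alphanumeric tokens. The function name `tokenize` and the choice of splitting on non-alphanumeric characters are assumptions made for illustration; production tokenizers also deal with stemming, stop words and language-specific rules.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Hotels near Ghaziabad, Uttar Pradesh!")
print(tokens)
```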

Document Retrieval: Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Document retrieval is sometimes referred to as text retrieval, or treated as a branch of it. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized.
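Matching a query against free-text records can be sketched in a few lines. The scoring used here, the number of distinct query words that occur in a record, is a deliberately simple assumption and not the ranking used by the project; the record identifiers and the function name `retrieve` are hypothetical.

```python
import re

def word_set(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, records):
    """Return ids of records sharing words with the query, best match first."""
    query_words = word_set(query)
    scored = []
    for rec_id, text in records.items():
        overlap = len(query_words & word_set(text))
        if overlap:
            scored.append((overlap, rec_id))
    # Highest overlap first; ties broken by record id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rec_id for _, rec_id in scored]

records = {
    "news-1": "Floods close roads in Lucknow",
    "manual-7": "How to reset the router",
    "estate-3": "Two bedroom flat for sale in Lucknow city centre",
}
results = retrieve("flat for sale in Lucknow", records)
print(results)
```

The real estate record matches all five query words and ranks first, the news article matches two, and the manual paragraph matches none and is left out.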

INVERTED INDEX

An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.

There are two main variants of inverted indexes:

- A record level inverted index (or inverted file index, or just inverted file) contains a list of references to documents for each word.
- A word level inverted index (or full inverted index, or inverted list) additionally contains the positions of each word within a document.
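The two variants can be sketched side by side. The function names `build_record_level` and `build_word_level` and the three sample documents are assumptions chosen for illustration; positions are counted in tokens, starting at 0.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_record_level(docs):
    """Map each word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def build_word_level(docs):
    """Map each word to {doc id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token][doc_id].append(position)
    return index

docs = {1: "it is what it is", 2: "what is it", 3: "it is a banana"}
record_index = build_record_level(docs)
word_index = build_word_level(docs)

# Boolean AND query on the record level index: documents with both words.
both = record_index["what"] & record_index["it"]
print(both, word_index["it"][1])
```

The record level index is enough to answer Boolean queries such as the AND above, while the word level index also supports phrase and proximity queries because it keeps positions.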


TRADITIONAL SEARCH VS GEO SEARCH

Traditional Search:
1. User enters keywords.
2. Boolean operations on the inverted index.
3. Ranking according to subject relevance.

Geographic Search:
1. User enters keywords and geographic details.
2. Boolean operations on the spatial database, followed by the inverted index.
3. Ranking according to subject relevance and geographic attributes.

COMPARISON WITH EXISTING SEARCH ENGINE

[Figure: Bing result]

[Figure: Yahoo and Google result]

CONCLUSION

This project aims to determine the geographical scope of web documents: it fetches them from the web and indexes them both by geographical location and by textual content. At query time, the intersection of the geographic matches and the textual matches is taken to select the relevant documents, which are then ranked and displayed.
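The intersection step can be sketched as below. A linear scan over bounding boxes stands in for the spatial database, which is an assumption made to keep the example self-contained; the sample documents, their approximate coordinates and the names `build_text_index`, `in_box` and `geo_search` are hypothetical. Ranking is omitted.

```python
import re
from collections import defaultdict

# Hypothetical documents: id -> (text, (latitude, longitude)).
DOCS = {
    "d1": ("engineering college admissions", (28.67, 77.45)),  # near Ghaziabad
    "d2": ("engineering college rankings", (19.08, 72.88)),    # near Mumbai
    "d3": ("street food guide", (28.66, 77.43)),               # near Ghaziabad
}

def build_text_index(docs):
    """Record level inverted index: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, (text, _) in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index

def in_box(point, box):
    """True if (lat, lon) lies inside (min_lat, min_lon, max_lat, max_lon)."""
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def geo_search(keywords, box, docs, text_index):
    """Intersect the spatial matches with the keyword (Boolean AND) matches."""
    spatial_hits = {d for d, (_, loc) in docs.items() if in_box(loc, box)}
    text_hits = set(docs)
    for word in keywords:
        text_hits &= text_index.get(word.lower(), set())
    return sorted(spatial_hits & text_hits)

text_index = build_text_index(DOCS)
ghaziabad_box = (28.5, 77.3, 28.8, 77.6)
hits = geo_search(["engineering", "college"], ghaziabad_box, DOCS, text_index)
print(hits)
```

Without the spatial filter, the same keywords would match both d1 and d2; the bounding box removes the document located outside the queried area.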

To support information retrieval, it is fundamental that the geographic classification of Web pages be very accurate and that each Web page be assigned to a very narrow region (for example, a city or a street). To improve the usability of the project, the following functionality should be added:

- Support for geographic queries with multiple geographic scopes.
- Support for complex semantic relations between the query object and the geographic scope.
- Use of the user's disambiguation history to improve geographic disambiguation.
- Generation of document summaries that let the user see the most important information of each result without consulting the full document.