CHAPTER - 1

1. INTRODUCTION

1.1 INTRODUCTION TO DATA MINING

Data mining is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Apart from the raw analysis, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

1.1.1 Data Mining Task

The data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records, unusual records, and dependencies. This usually involves database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data and may be used in further analysis or, for example, in machine learning and predictive analytics. Data mining can identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system.
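As an illustration only, the following minimal sketch shows one way such groups of data records might be identified, here with k-means clustering from scikit-learn; the library choice, the two hypothetical attributes, and the example records are assumptions for illustration, not part of the proposed work.

```python
# Illustrative sketch: grouping data records with k-means clustering
# (hypothetical numeric features; scikit-learn assumed to be installed).
import numpy as np
from sklearn.cluster import KMeans

# Each row is one data record described by two assumed attributes,
# e.g. pages visited per session and average time on page (seconds).
records = np.array([
    [2, 30], [3, 45], [2, 40],        # light users
    [10, 300], [12, 280], [11, 320],  # heavy users
    [5, 120], [6, 150],               # moderate users
])

# Ask for three groups; the resulting labels summarise the records and
# could feed a downstream decision support system.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(records)
print(labels)
```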

1.1.2 Intrinsic: Text and Metadata Analysis

Metadata extraction is the process of describing the extrinsic and intrinsic qualities of a resource such as a document, image, or video. As a result, textual descriptions of webpages are produced, which enable the efficient search, sort, and mining functionalities provided by websites. Another source of data for similarity analysis is document metadata, the information stored by the data repository about the webpage. Metadata that might be useful in a similarity query includes the author or the creation date of the document.
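As a rough illustration, the sketch below pulls the title and the <meta> tags out of an HTML document; the bs4 (BeautifulSoup) dependency and the sample markup are assumptions and not part of the original system.

```python
# Minimal sketch of intrinsic metadata extraction from a webpage,
# assuming the BeautifulSoup (bs4) package is installed.
from bs4 import BeautifulSoup

html = """
<html><head>
  <title>Example Page</title>
  <meta name="author" content="A. Author">
  <meta name="keywords" content="data mining, web page similarity">
  <meta name="description" content="A sample page used for illustration.">
</head><body>Body text.</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
metadata = {"title": soup.title.string if soup.title and soup.title.string else ""}
for tag in soup.find_all("meta"):
    name, content = tag.get("name"), tag.get("content")
    if name and content:                 # skip charset and other unnamed tags
        metadata[name.lower()] = content

print(metadata)  # e.g. {'title': 'Example Page', 'author': 'A. Author', ...}
```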

1.1.3 Extrinsic: Web Link Analysis

Link- or network-analysis algorithms have the prospect of being scalable, language- and media-independent, and robust in the face of link spam and topic complexity. A webpage might contain links to other webpages, such as advertisements or AdWords in the form of Google's AdSense, which are irrelevant for this research. Such links should be discarded, since only the similarity between the two websites under study is of interest, not their extrinsic links to other websites, whether relevant or not.

1.1.4 Data Warehouses

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading solution, an online analytical processing engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

1.2 WEB MINING

Web mining is the process of extracting useful information from websites or webpages, e.g. finding out what data users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns, in the form of text, from web data in order to understand and better serve the needs of web-based applications. The proposed research uses content- and structure-based mining to find similarity, since usage data is not mandatory.
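As a rough sketch of the link filtering described in Section 1.1.3, the code below keeps only links that stay on the page's own site and drops likely advertisement links before any content analysis; the bs4 dependency, the ad-pattern heuristics, and the example URLs are assumptions, not part of the proposed method.

```python
# Sketch: discard extrinsic links (other sites, advertisement/AdSense-style
# URLs) and keep only intrinsic links of the page's own website.
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

AD_HINTS = ("doubleclick", "googlesyndication", "adservice", "ads.")  # assumed patterns

def intrinsic_links(html, base_url):
    base_host = urlparse(base_url).netloc
    kept = []
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        host = urlparse(url).netloc
        if host != base_host:                # extrinsic link to another website
            continue
        if any(h in url for h in AD_HINTS):  # likely advertisement link
            continue
        kept.append(url)
    return kept

html = '<a href="/about.html">About</a> <a href="https://ads.example-ads.com/x">Ad</a>'
print(intrinsic_links(html, "http://www.example.com/index.html"))
# ['http://www.example.com/about.html']
```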

Figure 1. Web Mining Process (taxonomy: web mining branches into Content Mining, Structure Mining, and Usage Mining, with Agent Based, Database, Customized, and Psychographic sub-categories)

1.2.1 Web Content Mining

Web content mining is the extraction and integration of useful data, information, and knowledge from web page content. When a URL is given as input, the content of the webpage is extracted. This content plays a vital role in finding the similarity between two webpages or websites.

1.2.2 Web Structure Mining

Web structure mining deals with discovering and modelling the link structure of the web. This can help in discovering similarity between sites or in discovering web communities. Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds: extracting patterns from hyperlinks in the web, where a hyperlink is a structural component that connects a web page to a different location; and mining the document structure, i.e. analysis of the tree-like structure of a page to describe HTML or XML tag usage.

Based on the content and structure, the similarity is calculated with the cosine technique over the words matched between two webpages. In addition, the structure of a webpage is used to find the metadata and keyword details, and comparing these between two pages identifies the structural information of the webpages, which can be used to compare and analyze them.
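A minimal sketch of such a cosine computation over the matched words of two pages follows; the simple tokenization and raw term counts are simplifying assumptions, not necessarily the exact representation used in the proposed system.

```python
# Cosine similarity over the words shared by two webpages; plain term counts
# stand in for the extracted page content.
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    tokens_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    tokens_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    # Dot product taken over the matched words only.
    shared = set(tokens_a) & set(tokens_b)
    dot = sum(tokens_a[w] * tokens_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in tokens_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tokens_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

page1 = "Data mining discovers patterns in large data sets."
page2 = "Web mining applies data mining techniques to web data."
print(round(cosine_similarity(page1, page2), 3))
```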

Figure 2. Data Mining Process

Data Selection Process

Resource finding is the process of extracting data from online or offline text resources available on the web. Information selection and pre-processing is the automatic selection of particular information from the retrieved web resources; this step transforms the original retrieved data into information. The transformation could be removal of stop words and stemming, or it may aim at obtaining a desired representation, such as finding phrases in the training corpus and representing the text in first-order logic form. Generalization automatically discovers general patterns at individual web sites; data mining techniques and machine learning are used in generalization. Analysis involves the validation and interpretation of the mined patterns and plays an important role in pattern mining.
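A minimal sketch of this pre-processing step is shown below; it assumes the nltk package for Porter stemming and uses a small illustrative stop-word list rather than a complete one.

```python
# Pre-processing sketch: stop-word removal followed by stemming
# (nltk assumed installed; STOP_WORDS is a tiny illustrative subset).
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "for", "on"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Finding phrases in the training corpus and representing the text"))
# e.g. ['find', 'phrase', 'train', 'corpu', 'repres', 'text']
```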

1.3 INTRODUCTION TO THE PROPOSED SYSTEM

Search engines can be used to find websites that contain duplicate or similar content. The content of one webpage could be similar to that of another webpage; this happens when a template is copied from an online provider, or when code is copied from one website and modified. The proposed technique determines the percentage of similarity between two web pages, as shown in the block diagram below.

[Block diagram of the proposed system: URL 1 and URL 2 each have their Title, Keywords, and Meta-Data extracted; the extracted features feed a Cosine Similarity Measurement, and Similarity Compliance is declared when the score exceeds 50%.]
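A hedged end-to-end sketch of this pipeline is given below; the fetching code, the chosen meta tag names, and the way the 50% threshold is applied are illustrative assumptions, and cosine_similarity refers to the helper sketched after Section 1.2.2.

```python
# Sketch of the proposed comparison pipeline: fetch two URLs, extract
# title/keywords/meta-data, and report whether cosine similarity exceeds 50%.
import urllib.request
from bs4 import BeautifulSoup

def extract_features(url):
    """Return the title and selected meta-data of a page as one text string."""
    html = urllib.request.urlopen(url, timeout=10).read()
    soup = BeautifulSoup(html, "html.parser")
    parts = [soup.title.string if soup.title and soup.title.string else ""]
    for tag in soup.find_all("meta"):
        if tag.get("name", "").lower() in ("keywords", "description", "author"):
            parts.append(tag.get("content", ""))
    return " ".join(parts)

def similarity_compliance(url1, url2, threshold=0.5):
    # cosine_similarity: the word-overlap helper from the earlier sketch.
    score = cosine_similarity(extract_features(url1), extract_features(url2))
    return score, score > threshold   # compliant when similarity > 50%

# Example usage (hypothetical URLs):
# score, compliant = similarity_compliance("http://site-a.example/", "http://site-b.example/")
# print(f"similarity = {score:.0%}, compliant = {compliant}")
```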