Annotating Full Document

Post on 01-Mar-2018

  • 7/26/2019 -Annotating Full Document

    ABSTRACT

    An increasing number of databases have become web accessible through HTML
    form-based search interfaces. The data units returned from the underlying
    database are usually encoded into the result pages dynamically for human
    browsing. For the encoded data units to be machine processable, which is
    essential for many applications such as deep web data collection and Internet
    comparison shopping, they need to be extracted and assigned meaningful labels.
    In this paper, we present an automatic annotation approach that first aligns
    the data units on a result page into different groups such that the data in the
    same group have the same semantics. Then, for each group, we annotate it from
    different aspects and aggregate the different annotations to predict a final
    annotation label for it. An annotation wrapper for the search site is
    automatically constructed and can be used to annotate new result pages from the
    same web database. Our experiments indicate that the proposed approach is
    highly effective.

    INTRODUCTION

    What is Data Mining?

    Structure of Data Mining

    Generally, data mining (sometimes called data or knowledge discovery) is the
    process of analyzing data from different perspectives and summarizing it into
    useful information: information that can be used to increase revenue, cut
    costs, or both. Data mining software is one of a number of analytical tools for
    analyzing data. It allows users to analyze data from many different dimensions
    or angles, categorize it, and summarize the relationships identified.
    Technically, data mining is the process of finding correlations or patterns
    among dozens of fields in large relational databases.

    How Does Data Mining Work?

    While large-scale information technology has been evolving separate
    transaction and analytical systems, data mining provides the link between the
    two. Data mining software analyzes relationships and patterns in stored
    transaction data based on open-ended user queries. Several types of analytical
    software are available: statistical, machine learning, and neural networks.

    Generally, any of four types of relationships are sought:

    Classes: Stored data is used to locate data in predetermined groups. For
    example, a restaurant chain could mine customer purchase data to determine
    when customers visit and what they typically order. This information could
    be used to increase traffic by having daily specials.

    Clusters: Data items are grouped according to logical relationships or
    consumer preferences. For example, data can be mined to identify market
    segments or consumer affinities.

    Associations: Data can be mined to identify associations. The beer-diaper
    example is an example of associative mining.

    Sequential patterns: Data is mined to anticipate behavior patterns and
    trends. For example, an outdoor equipment retailer could predict the
    likelihood of a backpack being purchased based on a consumer's purchase of
    sleeping bags and hiking shoes.
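The association relationship described above can be illustrated with a minimal sketch that computes support and confidence for item pairs. The transactions, the item names, the `pair_rules` helper, and the 0.4 support threshold are all hypothetical choices made for illustration; the text does not prescribe a specific algorithm.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def pair_rules(transactions, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Support measures how often the two items occur together, while confidence measures how often the consequent appears given the antecedent. Counting only pairs keeps the sketch short; larger itemsets would need a fuller method such as Apriori.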

    Data mining consists of five major elements:

    1) Extract, transform, and load transaction data onto the data warehouse
    system.

    2) Store and manage the data in a multidimensional database system.

    3) Provide data access to business analysts and information technology
    professionals.

    4) Analyze the data by application software.

    5) Present the data in a useful format, such as a graph or table.
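The five elements above can be walked through in a small sketch. It assumes an in-memory SQLite table as a stand-in for the data warehouse (SQLite is relational, not multidimensional), and the sample rows, the `sales` table, and the `run_pipeline` helper are hypothetical.

```python
import sqlite3

# Hypothetical raw transaction records: (store, item, amount as text).
raw_rows = [
    ("north", "tent", "120.50"),
    ("north", "boots", "80.00"),
    ("south", "tent", "110.00"),
]

def run_pipeline(rows):
    # 1) Extract and transform: convert amounts to numbers, normalize store names.
    cleaned = [(store.upper(), item, float(amount)) for store, item, amount in rows]

    # 2) Load and store: an in-memory SQLite table stands in for the warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)

    # 3) and 4) Access and analyze: total revenue per store.
    totals = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
    conn.close()
    return totals

totals = run_pipeline(raw_rows)

# 5) Present: print the result as a small text table.
for store, total in totals:
    print(f"{store:<8}{total:>10.2f}")
```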

    Different levels of analysis are available:

    Artificial neural networks: Non-linear predictive models that learn through
    training and resemble biological neural networks in structure.

    Genetic algorithms: Optimization techniques that use processes such as
    genetic combination, mutation, and natural selection in a design based on the
    concepts of natural evolution.

    Decision trees: Tree-shaped structures that represent sets of decisions. These
    decisions generate rules for the classification of a dataset. Specific decision
    tree methods include Classification and Regression Trees (CART) and Chi

    Square Automatic Interaction Detection (CHAID). CART and CHAID are
    decision tree techniques used for classification of a dataset. They provide a
    set of rules that you can apply to a new (unclassified) dataset to predict
    which records will have a given outcome. CART segments a dataset by
    creating 2-way splits while CHAID segments using chi square tests to create
    multi-way splits. CART typically requires less data preparation than
    CHAID.

    Nearest neighbor method: A technique that classifies each record in a
    dataset based on a combination of the classes of the k record(s) most similar
    to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest
    neighbor technique.

    Rule induction: The extraction of useful if-then rules from data based on
    statistical significance.

    Data visualization: The visual interpretation of complex relationships in
    multidimensional data. Graphics tools are used to illustrate data
    relationships.
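Of the techniques listed above, the nearest neighbor method is the simplest to show in code. The following is a minimal sketch that assumes Euclidean distance and a plain majority vote as the "combination of the classes"; the historical records and the `knn_classify` helper are hypothetical.

```python
from collections import Counter
import math

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort historical records by Euclidean distance to the query point.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (features, class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
]

label = knn_classify(history, (1.1, 1.0), k=3)
print(label)
```

Other distance measures or weighted votes are equally valid; the choice here only keeps the example short.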

    Characteristics of Data Mining:

    Large quantities of data: The volume of data is so great that it has to be
    analyzed by automated techniques, e.g. satellite information, credit card
    transactions, etc.

    Noisy, incomplete data: Imprecise data is a characteristic of all data
    collection.

    Complex data structure: Conventional statistical analysis is not possible.

    Heterogeneous data stored in legacy systems.

    Benefits of Data Mining:

    1) It is one of the most effective services available today. With the help
    of data mining, one can discover valuable information about customers and
    their behavior for a specific set of products, and evaluate, analyze, store,
    mine, and load data related to them.

    2) An analytical CRM model and strategic business decisions can be made with
    the help of data mining, as it provides a complete synopsis of customers.

    3) A large number of organizations have installed data mining projects, which
    have helped them make unprecedented improvements in their marketing
    strategies (campaigns).

    4) Data mining is generally used by organizations with a solid customer focus.
    Because it is flexible in its applicability, it is used heavily in
    applications that forecast crucial data, including industry analysis and
    consumer buying behavior.

    5) Fast, prompt access to data, along with economical processing techniques,
    has made data mining one of the most suitable services a company can seek.

    Advantages of Data Mining:

    1. Marketing / Retail:

    Data mining helps marketing companies build models based on historical data

    to predict who will respond to new marketing campaigns such as direct mail,
    online marketing campaigns, etc. Through the results, marketers can take an
    appropriate approach to selling profitable products to targeted customers.

    Data mining brings many benefits to retail companies in the same way as
    marketing. Through market basket analysis, a store can arrange its products
    so that customers can conveniently buy frequently purchased products
    together. In addition, it helps retail companies offer discounts on
    particular products that will attract more customers.

    2. Finance / Banking:

    Data mining gives financial institutions information about loans and
    credit reporting.

    financial transaction to build patterns that can detect money laundering or criminal

    activities.

    5. Law enforcement:

    Data mining can aid law enforcers in identifying criminal suspects as well as
    apprehending these criminals by examining trends in location, crime type,
    habit, and other patterns of behavior.

    6. Researchers:

    Data mining can assist researchers by speeding up their data analysis, thus
    allowing them more time to work on other projects.

    SYSTEM ANALYSIS

    EXISTING SYSTEM:

    In this existing system, a data unit is a piece of text that semantically
    represents one concept of an entity. It corresponds to the value of a record
    under an attribute. It is different from a text node, which refers to a
    sequence of text surrounded by a pair of HTML tags. It describes the
    relationships between text nodes and data units in detail. In this paper, we
    perform data unit level annotation. There is a high demand for collecting data
    of interest from multiple WD

    the semantics of each data unit. Unfortunately, the semantic labels of data
    units are often not provided in result pages. For instance, no semantic labels
    for the values of title, author, publisher, etc., are given. Having semantic
    labels for data units is important not only for the above record linkage task,
    but also for storing collected SRRs into a database table.
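The distinction drawn in this section between text nodes and data units, and the need for semantic labels before a record can be stored in a table, can be sketched as follows. The sample markup, the comma-based split rule, and the hand-assigned labels are all assumptions made for illustration; they are not the automatic annotation method of the paper.

```python
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    """Collect text nodes: runs of text delimited by HTML tags."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_nodes.append(text)

# Hypothetical search result record; the markup is illustrative only.
record_html = (
    "<div><a>Data Mining Basics</a><br>"
    "<span>J. Smith</span><span>Example Press, 2001</span></div>"
)

collector = TextNodeCollector()
collector.feed(record_html)
collector.close()
text_nodes = collector.text_nodes

# Assumed split rule: the last text node holds two data units separated by a comma.
publisher, year = [part.strip() for part in text_nodes[2].split(",")]

# Hand-assigned labels turn unlabeled data units into a row a table could store.
labeled_record = {
    "title": text_nodes[0],
    "author": text_nodes[1],
    "publisher": publisher,
    "year": year,
}
print(labeled_record)
```

Here three text nodes yield four data units, which shows why annotation at the data unit level differs from labeling text nodes directly.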

    we are the first to utilize IIS for annotating SRRs.

    We employ six ba