annotating full document
Post on 01-Mar-2018
7/26/2019 -Annotating Full Document
1/48
ABSTRACT
An increasing number of databases have become web accessible through HTML
form-based search interfaces. The data units returned from the underlying database
are usually encoded into the result pages dynamically for human browsing. For the
encoded data units to be machine processable, which is essential for many
applications such as deep web data collection and Internet comparison shopping,
they need to be extracted and assigned meaningful labels. In this paper, we
present an automatic annotation approach that first aligns the data units on a result
page into different groups such that the data in the same group have the same
semantics. Then, for each group, we annotate it from different aspects and aggregate
the different annotations to predict a final annotation label for it. An annotation
wrapper for the search site is automatically constructed and can be used to annotate
new result pages from the same web database. Our experiments indicate that the
proposed approach is highly effective.
INTRODUCTION
What is Data Mining?
Structure of Data Mining
Generally, data mining (sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and summarizing it into
useful information - information that can be used to increase revenue, cut costs, or
both. Data mining software is one of a number of analytical tools for analyzing
data. It allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified. Technically, data mining
is the process of finding correlations or patterns among dozens of fields in large
relational databases.
How Does Data Mining Work?
While large-scale information technology has been evolving separate transaction
and analytical systems, data mining provides the link between the two. Data
mining software analyzes relationships and patterns in stored transaction data
based on open-ended user queries. Several types of analytical software are
available: statistical, machine learning, and neural networks.
Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For
example, a restaurant chain could mine customer purchase data to determine
when customers visit and what they typically order. This information could
be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or
consumer preferences. For example, data can be mined to identify market
segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper
example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and
trends. For example, an outdoor equipment retailer could predict the
likelihood of a backpack being purchased based on a consumer's purchase of
sleeping bags and hiking shoes.
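Association relationships such as the beer-diaper example are commonly scored with
support and confidence. The following is a minimal sketch only: the transactions are
made-up illustrative data, and the function names support and confidence are
hypothetical helpers, not part of any system described in this document.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Score the rule "diapers -> beer".
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually kept only when both values exceed thresholds chosen by the analyst;
the thresholds themselves are a design choice and are not given in the text.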
Data mining consists of five major elements:
1) Extract, transform, and load transaction data onto the data warehouse
system.
2) Store and manage the data in a multidimensional database system.
3) Provide data access to business analysts and information technology
professionals.
4) Analyze the data with application software.
5) Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as
genetic combination, mutation, and natural selection in a design based on the
concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These
decisions generate rules for the classification of a dataset. Specific decision
tree methods include Classification and Regression Trees (CART) and Chi
Square Automatic Interaction Detection (CHAID). CART and CHAID are
decision tree techniques used for classification of a dataset. They provide a
set of rules that you can apply to a new (unclassified) dataset to predict
which records will have a given outcome. CART segments a dataset by
creating 2-way splits while CHAID segments using chi square tests to create
multi-way splits. CART typically requires less data preparation than
CHAID.
Nearest neighbor method: A technique that classifies each record in a
dataset based on a combination of the classes of the k record(s) most similar
to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest
neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on
statistical significance.
Data visualization: The visual interpretation of complex relationships in
multidimensional data. Graphics tools are used to illustrate data
relationships.
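Of the techniques listed above, the nearest neighbor method is the simplest to
illustrate. The sketch below assumes Euclidean distance as the similarity measure and
a majority vote as the way of combining classes; the text specifies neither, and the
function name knn_classify and the sample records are hypothetical.

```python
import math
from collections import Counter


def knn_classify(record, history, k=3):
    """Label a record by majority vote among its k closest historical records.

    history is a list of (features, label) pairs; features are numeric tuples.
    """
    # Rank historical records by Euclidean distance to the new record.
    nearest = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    # Combine the classes of the k nearest records by majority vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Illustrative historical dataset: (features, class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
]

prediction = knn_classify((1.1, 1.0), history, k=3)
print(prediction)
```

With k = 1 the method reduces to copying the class of the single most similar record;
larger values of k make the prediction less sensitive to noisy records.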
Characteristics of Data Mining:
Large quantities of data: The volume of data is so great that it has to be analyzed
by automated techniques, e.g. satellite information, credit card transactions,
etc.
Noisy, incomplete data: Imprecise data is a characteristic of all data
collection.
Complex data structure: Conventional statistical analysis is not possible.
Heterogeneous data stored in legacy systems.
Benefits of Data Mining:
1) It's one of the most effective services available today. With the help
of data mining, one can discover valuable information about customers
and their behavior for a specific set of products, and evaluate, analyze,
store, mine, and load data related to them.
2) An analytical CRM model and strategic business-related decisions can be
made with the help of data mining, as it helps in providing a complete
synopsis of customers.
3) A large number of organizations have installed data mining projects, which
have helped them make unprecedented improvements in their marketing
strategies (campaigns).
4) Data mining is generally used by organizations with a solid customer focus.
Because of its flexibility, it is widely used in applications that forecast
crucial data, including industry analysis and consumer buying behaviors.
5) Fast-paced and prompt access to data, along with economical processing
techniques, has made data mining one of the most suitable services that a
company seeks.
Advantages of Data Mining:
1. Marketing / Retail:
Data mining helps marketing companies build models based on historical data
to predict who will respond to new marketing campaigns such as direct mail,
online marketing campaigns, etc. Through the results, marketers will have an
appropriate approach to sell profitable products to targeted customers.
Data mining brings a lot of benefits to retail companies in the same way as
marketing. Through market basket analysis, a store can arrange its products
so that customers can conveniently buy frequently purchased products
together. In addition, it also helps retail companies offer
discounts on particular products that will attract more customers.
2. Finance / Banking:
Data mining gives financial institutions information about loans and
credit reporting.
... financial transaction to build patterns that can detect money laundering or criminal
activities.
5. Law enforcement:
Data mining can aid law enforcers in identifying criminal suspects as well as
apprehending these criminals by examining trends in location, crime type, habit,
and other patterns of behavior.
6. Researchers:
Data mining can assist researchers by speeding up their data analysis process,
thus allowing them more time to work on other projects.
SYSTEM ANALYSIS
EXISTING SYSTEM:
In this existing system, a data unit is a piece of text that semantically represents
one concept of an entity. It corresponds to the value of a record under an attribute.
It is different from a text node, which refers to a sequence of text surrounded by a
pair of HTML tags. It describes the relationships between text nodes and data units
in detail. In this paper, we perform data unit level annotation. There is a high
demand for collecting data of interest from multiple WD...
... the semantics of each data unit. Unfortunately, the semantic labels of data units are
often not provided in result pages. For instance, no semantic labels for the values
of title, author, publisher, etc., are given. Having semantic labels for data units is
not only important for the above record linkage task, but also for storing collected
SRRs into a database table.
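The distinction between a text node and a data unit can be made concrete with a small
sketch. It uses Python's standard html.parser module to collect text nodes from a
made-up result-row snippet, and then assumes, purely for illustration, that a comma
separates data units inside one text node. The class name TextNodeCollector, the
snippet, and the comma rule are all hypothetical; the system described here does not
necessarily split text nodes this way.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect each non-empty run of text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_nodes.append(text)


# Made-up fragment of a search result record.
snippet = "<tr><td><a>Data Mining Basics</a></td><td>J. Doe, Example Press</td></tr>"

collector = TextNodeCollector()
collector.feed(snippet)
collector.close()
text_nodes = collector.text_nodes

# Illustrative assumption: a comma separates data units within a text node,
# so one text node may yield several data units.
data_units = []
for node in text_nodes:
    data_units.extend(part.strip() for part in node.split(","))

print(text_nodes)
print(data_units)
```

Here the second text node holds two data units (an author value and a publisher
value), which is why annotation at the data unit level needs more than tag boundaries.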
... we are the first to utilize IIS for annotating SRRs.
We employ six ba