7/26/2019 - Annotating Full Document
ABSTRACT
An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantic. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
INTRODUCTION
What is Data Mining?
Structure of Data Mining
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
How Does Data Mining Work?
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Cl"sters $ata items are grouped according to logical relationships or
consumer preferences. For e!ample, data can be mined to identify mar'et
segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is a classic example of associative mining.
Se&"ential #atterns $ata is mined to anticipate behavior patterns and
trends. For e!ample, an outdoor e+uipment retailer could predict the
li'elihood of a bac'pac' being purchased based on a consumers purchase of
sleeping bags and hi'ing shoes.
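As a toy illustration of the associative mining mentioned above (the beer-diaper style of analysis), pairwise co-occurrence counts over transactions can be computed directly. This is a minimal sketch; the basket data and the support threshold are hypothetical.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count how often each pair of items is bought together and
    keep only the pairs meeting the minimum support threshold."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each unordered pair a canonical form
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase data, for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]
print(frequent_pairs(baskets, min_support=2))
```

Real association-rule miners (e.g., Apriori) extend this idea to item sets of any size and add confidence measures, but the counting step is the same in spirit.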
Data mining consists of five major elements:
1) Extract, transform, and load transaction data onto the data warehouse system.
2) Store and manage the data in a multidimensional database system.
3) Provide data access to business analysts and information technology professionals.
4) Analyze the data by application software.
5) Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
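The 2-way split that CART performs can be sketched with a Gini-impurity search over thresholds of a single numeric field. This is a minimal, single-feature illustration, not a full CART implementation; the data values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Find the threshold on one numeric field that gives the purest
    2-way split, as CART does. Returns (threshold, weighted_gini)."""
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: purchase amount vs. whether the customer returned.
amounts = [10, 15, 40, 45, 50]
returned = ["no", "no", "yes", "yes", "yes"]
print(best_binary_split(amounts, returned))
```

A full CART builder would apply this search recursively on each side of the split and across all fields; CHAID would instead use chi-square tests to choose multi-way splits.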
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
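The k-nearest neighbor classification just described can be sketched in a few lines: rank historical records by distance and take a majority vote among the k closest. The records and classes here are hypothetical.

```python
from collections import Counter
import math

def knn_classify(history, point, k=1):
    """Classify a record by majority vote among the k most similar
    records in a historical dataset (k >= 1). `history` is a list of
    (feature_tuple, class_label) pairs."""
    ranked = sorted(history, key=lambda rec: math.dist(rec[0], point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (features, class).
history = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((5.0, 5.0), "high")]
print(knn_classify(history, (1.1, 0.9), k=3))
```

Production k-NN implementations add distance weighting and spatial indexes (k-d trees and the like) so the search does not scan the whole historical dataset, but the voting logic is the same.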
R"le in+"%tion The e!traction of useful if-then rules from data based on
statistical significance.
Data (is"ali,ation The visual interpretation of comple! relationships in
multidimensional data. %raphics tools are used to illustrate data
relationships.
Characteristics of Data Mining:
Large quantities of data: The volume of data is so great that it has to be analyzed by automated techniques, e.g., satellite information, credit card transactions, etc.
Noisy, incomplete data: Imprecise data is a characteristic of all data collection.
Complex data structure: Conventional statistical analysis is not possible.
Heterogeneous data stored in legacy systems.
Benefits of Data Mining:
1) It is one of the most effective services available today. With the help of data mining, one can discover precious information about customers and their behavior for a specific set of products, and evaluate, analyze, store, mine, and load data related to them.
2) An analytical CRM model and strategic business-related decisions can be made with the help of data mining, as it helps in providing a complete synopsis of customers.
3) Countless organizations have installed data mining projects, and it has helped them make unprecedented improvements in their marketing strategies (campaigns).
4) Data mining is generally used by organizations with a solid customer focus. Because of its flexible nature as far as applicability is concerned, it is used vehemently in applications to foresee crucial data, including industry analysis and consumer buying behaviors.
5) Fast-paced and prompt access to data, along with economic processing techniques, have made data mining one of the most suitable services that a company seeks.
Advantages of Data Mining:
1. Marketing / Retail:
Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. Through the results, marketers will have an appropriate approach for selling profitable products to targeted customers.
Data mining brings a lot of benefits to retail companies in the same way as marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it also helps retail companies offer discounts on particular products that will attract more customers.
2. Finance / Banking:
Data mining gives financial institutions information about loans and credit reporting. It can also analyze financial transactions to build patterns that can detect money laundering or other criminal activities.
3. Law Enforcement:
Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these criminals by examining trends in location, crime type, habit, and other patterns of behavior.
4. Researchers:
Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.
SYSTEM ANALYSIS
EXISTING SYSTEM:
In this existing system, a data unit is a piece of text that semantically represents one concept of an entity. It corresponds to the value of a record under an attribute. It is different from a text node, which refers to a sequence of text surrounded by a pair of HTML tags. It describes the relationships between text nodes and data units in detail. In this paper, we perform data unit level annotation. There is a high demand for collecting data of interest from multiple WDBs, and for this it is important to understand the semantic of each data unit. Unfortunately, the semantic labels of data units are often not provided in result pages. For instance, no semantic labels for the values of title, author, publisher, etc., are given. Having semantic labels for data units is not only important for the above record linkage task, but also for storing collected SRRs into a database table.
PROPOSED SYSTEM:
To the best of our knowledge, we are the first to utilize the IIS (integrated interface schema) for annotating SRRs.
We employ six basic annotators; each annotator can independently assign labels to data units based on certain features of the data units. We also employ a probabilistic model to combine the results from different annotators into a single label. This model is highly flexible, so the existing basic annotators may be modified and new annotators may be added easily without affecting the operation of other annotators.
We construct an annotation wrapper for any given WDB.
MODULES:
Query-based Annotator
The attributes on the local search interface of the WDB will most likely appear in some retrieved SRRs. For example, the query term "machine" is submitted through the Title field on the search interface of the WDB, and all three titles of the returned SRRs contain this query term. Thus, we can use the name of the search field, Title, to annotate the title values of these SRRs. In general, query terms against an attribute may be entered in a textbox or chosen from a selection list on the local search interface. Our Query-based Annotator works as follows: given a query with a set of query terms submitted against an attribute A on the local search interface, first find the group that has the largest total occurrences of these query terms, and then assign gn(A) as the label to the group.
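The procedure just described can be sketched directly: count occurrences of the query terms in each aligned group and label the winner. This is a minimal sketch; the group ids, sample data, and the "Title" label are hypothetical.

```python
def query_based_annotate(groups, query_terms, attribute_label):
    """Query-based annotation as described above: assign the queried
    attribute's label to the aligned group whose data units contain
    the query terms most often. `groups` maps a group id to a list of
    data-unit strings; returns (group_id, label), or None if no group
    contains any query term."""
    def occurrences(units):
        return sum(unit.lower().count(term.lower())
                   for unit in units for term in query_terms)
    best = max(groups, key=lambda g: occurrences(groups[g]))
    if occurrences(groups[best]) == 0:
        return None
    return best, attribute_label

# Hypothetical SRR groups from a book-search result page.
groups = {
    "g1": ["Machine Learning", "Machine Vision"],  # title values
    "g2": ["Tom Mitchell", "Ramesh Jain"],         # author values
}
print(query_based_annotate(groups, ["machine"], "Title"))
```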
Schema Value Annotator
Many attributes on a search interface have predefined values on the interface. For example, the attribute Publishers may have a set of predefined values (i.e., publishers) in its selection list. More attributes in the IIS tend to have predefined values, and these attributes are likely to have more such values than those in LISs, because when attributes from multiple interfaces are integrated, their values are also combined. Our Schema Value Annotator utilizes the combined value set to perform annotation.
The Schema Value Annotator first identifies the attribute Aj that has the highest matching score among all attributes, and then uses gn(Aj) to annotate the group Gi. Note that multiplying the above sum by the number of nonzero similarities gives preference to attributes that have more matches (i.e., having nonzero similarities) over those that have fewer matches. This is found to be very effective in improving the retrieval effectiveness of combination systems in information retrieval.
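The matching score described above (a sum of value similarities, multiplied by the count of nonzero similarities) can be sketched as follows. The token-overlap similarity used here is a stand-in, since the exact similarity measure is not specified in this section, and the schema value sets are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for the similarity
    measure, which is not fully specified here."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def schema_value_score(group_units, attribute_values):
    """Sum of each unit's best-match similarity, multiplied by the
    number of nonzero similarities to favor attributes with more
    matches, as described above."""
    sims = [max(jaccard(u, v) for v in attribute_values)
            for u in group_units]
    nonzero = sum(1 for s in sims if s > 0)
    return sum(sims) * nonzero

def schema_value_annotate(group_units, schema):
    """Pick the attribute whose predefined value set best matches
    the group; its name would then serve as the annotation label."""
    return max(schema, key=lambda attr: schema_value_score(group_units, schema[attr]))

# Hypothetical integrated-schema value sets.
schema = {
    "Publisher": ["Prentice Hall", "Springer", "McGraw Hill"],
    "Format": ["Hardcover", "Paperback"],
}
print(schema_value_annotate(["Springer", "Prentice Hall"], schema))
```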
Common Knowledge Annotator
Some data units on the result page are self-explanatory because of the common knowledge shared by human beings. For example, "in stock" and "out of stock" occur in many SRRs from e-commerce sites. Human users understand that this is about the availability of the product because it is common knowledge. So our Common Knowledge Annotator tries to exploit this situation by using some predefined common concepts.
Each common concept contains a label and a set of patterns or values. For example, a country concept has the label "country" and a set of values such as "U.S.A.," "Canada," and so on. It should be pointed out that our common concepts are different from the ontologies that are widely used in some works on the Semantic Web. First, our common concepts are domain independent. Second, they can be obtained from existing information resources with little additional human effort.
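A concept with a label and a set of patterns or values, as described above, maps naturally onto a small lookup structure. This is a minimal sketch; the concept table below is hypothetical and far smaller than a real one would be.

```python
import re

# Hypothetical common concepts: each has a label plus values and/or a pattern.
CONCEPTS = {
    "availability": {"values": {"in stock", "out of stock"}, "pattern": None},
    "country": {"values": {"u.s.a.", "canada"}, "pattern": None},
    # A 10-character ISBN-like pattern, purely illustrative.
    "isbn": {"values": set(), "pattern": re.compile(r"^\d{9}[\dX]$")},
}

def common_knowledge_annotate(data_unit):
    """Return the label of the first common concept whose values or
    pattern match the data unit, or None if no concept applies."""
    text = data_unit.strip().lower()
    for label, concept in CONCEPTS.items():
        if text in concept["values"]:
            return label
        if concept["pattern"] and concept["pattern"].match(data_unit.strip()):
            return label
    return None

print(common_knowledge_annotate("In Stock"))
```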
Combining Annotators
Our analysis indicates that no single annotator is capable of fully labeling all the data units on different result pages. The applicability of an annotator is the percentage of the attributes to which the annotator can be applied. For example, if four out of ten attributes appear in tables, then the applicability of the table annotator is 40 percent. We measured the average applicability of each basic annotator across all testing domains in our data set. This indicates that the results of different basic annotators should be combined in order to annotate a higher percentage of data units. Moreover, different annotators may produce different labels for a given group of data units. Therefore, we need a method to select the most suitable one for the group. Our annotators are fairly independent from each other, since each exploits an independent feature.
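The combination step can be illustrated with a simplified weighted vote. This is only a stand-in for the probabilistic model described above (the paper's model estimates label probabilities from annotator success rates); the annotator names, suggested labels, and weights below are hypothetical.

```python
from collections import defaultdict

def combine_annotations(candidate_labels, weights):
    """Pick one label from several annotators' suggestions by
    weighted vote - a simplified stand-in for the probabilistic
    combination model. `candidate_labels` maps an annotator name to
    the label it proposed; `weights` reflect annotator reliability."""
    scores = defaultdict(float)
    for annotator, label in candidate_labels.items():
        scores[label] += weights.get(annotator, 1.0)
    return max(scores, key=scores.get)

# Hypothetical annotator outputs and reliability weights.
suggestions = {"table": "Price", "query": "Price", "frequency": "Discount"}
weights = {"table": 0.9, "query": 0.8, "frequency": 0.6}
print(combine_annotations(suggestions, weights))
```

Because each annotator exploits an independent feature, agreement between two of them is strong evidence for a label, which is what the additive scoring captures.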
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test. Each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application. It is done after the completion of an individual unit, before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
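A unit test of the kind described above could look like the following sketch. The helper function and its behavior are hypothetical, invented only to show the valid-input and invalid-input cases; the tests would be run with `python -m unittest`.

```python
import unittest

def normalize_keyword(text):
    """Hypothetical helper under test: trim and lowercase a search
    keyword, rejecting empty input."""
    cleaned = text.strip().lower()
    if not cleaned:
        raise ValueError("empty keyword")
    return cleaned

class NormalizeKeywordTest(unittest.TestCase):
    def test_valid_input_produces_valid_output(self):
        # A defined input with a clearly defined expected result.
        self.assertEqual(normalize_keyword("  Machine "), "machine")

    def test_invalid_input_is_rejected(self):
        # The invalid branch of the program logic is also exercised.
        with self.assertRaises(ValueError):
            normalize_keyword("   ")
```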
Integration testing
Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified processes must be considered for testing.
Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases.
Test strategy and approach
Field testing will be performed manually, and functional tests will be written in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages, and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format.
No duplicate entries should be allowed.
All links should take the user to the correct page.
6.2 Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.g., components in a software system or, one step up, software applications at the company level, interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects were encountered.
6.3 Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects were encountered.
SYSTEM DESIGN
[System architecture diagram: a Presentation Layer (user search interface; user query input/output of HTML pages and annotated results over the World Wide Web), an Analysis Layer containing the six annotators (Table Annotator, Query-based Annotator, Schema Value Annotator, Frequency-based Annotator, Prefix/Suffix Annotator, and Common Knowledge Annotator) plus result projection, a Knowledge Layer (annotations stored as XML/OWL), and a Preprocessing Layer operating on HTML pages, query result data, and profiles.]
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system, and the information flows in the system.
3. A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.
[Data flow diagram: the Admin adds sources (URL, author, title, content, year, price) to the database; the User searches by URL, year, author name, or title.]
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show which system functions are performed for which actor. Roles of the actors in the system can be depicted.
[Use case diagram: the Admin adds sources and information; the User views details and searches by URL, author name, year, or title.]
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among the classes. It explains which class contains which information.
[Class diagram: a USER class (view source by URL, title, author name, year; browse; sign in) and an ADMIN class (add source, add content, add information), linked by a provide/receive-services association.]
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams.
[Sequence diagram: the Admin adds keywords and sources to Storage; the User searches and gets information; Storage provides the information.]
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and actions, with support for choice, iteration, and concurrency. In the Unified Modeling Language, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. An activity diagram shows the overall flow of control.
[Activity diagram: Begin; Sign Up or Login; validate user; the Admin adds keywords, sources, details, and information to the database; the User searches and gets information; End.]

INPUT DESIGN
Input can be achieved by instructing the computer to read data from a written or printed document, or it can occur by having people keying the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
The dialog to guide the operating personnel in providing input.
Methods for preparing input validations, and steps to follow when errors occur.
OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all the data manipulations can be performed. It also provides record viewing facilities.
3. When the data is entered, it will be checked for validity. Data can be entered with the help of screens. Appropriate messages are provided as and when needed, so that the user is not left in confusion. Thus the objective of input design is to create an input layout that is easy to follow.
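The validity checks and user-facing messages described above can be sketched as a small validation routine. The field names and rules here are hypothetical, invented only to illustrate the pattern of collecting appropriate messages for the user.

```python
def validate_input(fields):
    """Check required search-input fields and collect user-facing
    error messages, as the input design above prescribes. `fields`
    maps field names to raw string values."""
    errors = []
    if not fields.get("keyword", "").strip():
        errors.append("Keyword must not be empty.")
    year = fields.get("year", "").strip()
    # The year field is optional, but must be a 4-digit number if given.
    if year and not (year.isdigit() and len(year) == 4):
        errors.append("Year must be a 4-digit number.")
    return errors

print(validate_input({"keyword": "", "year": "19x9"}))
```

An empty returned list means the input passed validation; otherwise each message can be shown next to the offending field on the data entry screen.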
OUTPUT SCREENS
Index Page:
Register Page:
User:
User Page:
Search by Title:
Search by URL:
Search by Year:
Search by Author Name:
Result Page:
Additional Information:
Wrong Keyword:
Admin Entry:
Admin Home:
Admin Image Upload:
Adding Information:
Adding Web Content:
About:
End:
CONCLUSION
In this paper, we studied the data annotation problem and proposed a multi-annotator approach to automatically constructing an annotation wrapper for annotating the search result records retrieved from any given web database. This approach consists of six basic annotators and a probabilistic method to combine the basic annotators. Each of these annotators exploits one type of feature for annotation, and our experimental results show that each of the annotators is useful and that together they are capable of generating high quality annotation. A special feature of our method is that, when annotating the results retrieved from a web database, it utilizes both the LIS of the web database and the IIS of multiple web databases in the same domain. We also explained how the use of the IIS can help alleviate the local interface schema inadequacy problem and the inconsistent label problem.
In this paper, we also studied the automatic data alignment problem. Accurate alignment is critical to achieving holistic and accurate annotation. Our method is a clustering-based shifting method utilizing richer yet automatically obtainable features. This method is capable of handling a variety of relationships between HTML text nodes and data units, including one-to-one, one-to-many, many-to-one, and one-to-nothing. Our experimental results show that the precision and recall of this method are both above 98 percent. There is still room for improvement in several areas. For example, we need to enhance our method to split composite text nodes when there are no explicit separators. We would also like to try using different machine learning techniques, and using more sample pages from each training site to obtain the feature weights, so that we can identify the best technique for the data alignment problem.
REFERENCES
[1] A. Arasu and H. Garcia-Molina, "Extracting Structured Data from Web Pages," Proc. SIGMOD Int'l Conf. Management of Data, 2003.
[2] L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo, "Automatic Annotation of Data Extracted from Large Web Sites," Proc. Sixth Int'l Workshop on the Web and Databases (WebDB), 2003.
[8] H. Elmeleegy, J. Madhavan, and A. Halevy, "Harvesting Relational Tables from Lists on the Web," Proc. Very Large Databases (VLDB) Conf., 2009.
[15] H. He, W. Meng, C. Yu, and Z. Wu, "Constructing Interface Schemas for Search Interfaces of Web Databases," Proc. Web Information Systems Eng. (WISE) Conf., 2005.
[16] J. Heflin and J. Hendler, "Searching the Web with SHOE," Proc. AAAI Workshop, 2000.
[17] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[18] N. Kushmerick, D. Weld, and R. Doorenbos, "Wrapper Induction for Information Extraction," Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI), 1997.
[19] J. Lee, "Analyses of Multiple Evidence Combination," Proc. 20th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 1997.
[20] L. Liu, C. Pu, and W. Han, "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources," Proc. IEEE 16th Int'l Conf. Data Eng. (ICDE), 2000.
[21] W. Liu, X. Meng, and W. Meng, "ViDE: A Vision-Based Approach for Deep Web Data Extraction," IEEE Trans. Knowledge and Data Eng., 2010.
[23] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A.Y. Halevy, "Google's Deep Web Crawl," Proc. VLDB Endowment, vol. 1, no. 2, pp. 1241-1252, 2008.
[24] W. Meng, C. Yu, and K. Liu, ...
... Databases," Proc. 12th Int'l Conf. World Wide Web (WWW), 2003.
[31] Z. Wu et al., "Towards Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine," Proc. IEEE/WIC Int'l Conf. Web Intelligence (WI '03), 2003.
[32] O. Zamir and O. Etzioni, "Web Document Clustering: A Feasibility Demonstration," Proc. ACM 21st Int'l SIGIR Conf. Research and Development in Information Retrieval, 1998.
[33] Y. Zhai and B. Liu, ...
1. Extracting Structured Data from Web Sites
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However, due to the automatic nature of the approach, the data extracted by these wrappers have anonymous names. In the framework of our ongoing project RoadRunner, we have developed a prototype, called Labeller, that automatically annotates data extracted by automatically generated wrappers. Although Labeller has been developed as a companion system to our wrapper generator, its underlying approach has a general validity and can therefore be applied together with other wrapper generator systems. We have experimented with the prototype over several real-life web sites, obtaining encouraging results.
2. Experiments on Multistrategy Learning by Meta-Learning
AUTHORS: P. Chan and S. Stolfo
In this paper, we propose meta-learning as a general technique to combine the results of multiple learning algorithms, each applied to a set of training data. We detail several meta-learning strategies for combining independently learned classifiers, each computed by different algorithms, to improve overall prediction accuracy. The overall resulting classifier is composed of the classifiers generated by the different learning algorithms and a meta-classifier generated by a meta-learning strategy. The strategies described here are independent of the learning algorithms used. Preliminary experiments using different strategies and learning algorithms on two molecular biology sequence analysis data sets demonstrate encouraging results. Machine learning techniques are central to automated knowledge discovery systems, and hence our approach can enhance the effectiveness of such systems.
3. Combining Approaches for Information Retrieval
AUTHORS: W.B. Croft
Combining approaches has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the "meta-search" engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
4. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation
AUTHORS: S. Dill et al.
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.