7/26/2019 - Annotating Full Document
ABSTRACT
An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantic. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
INTRODUCTION
What is Data Mining?
Structure of Data Mining
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
How Does Data Mining Work?
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Cl"sters $ata items are grouped according to logical relationships or
consumer preferences. For e!ample, data can be mined to identify mar'et
segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is a classic example of associative mining.
Se&"ential #atterns $ata is mined to anticipate behavior patterns and
trends. For e!ample, an outdoor e+uipment retailer could predict the
li'elihood of a bac'pac' being purchased based on a consumers purchase of
sleeping bags and hi'ing shoes.
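As a toy illustration of the associative mining mentioned above (the beer-diaper style of analysis), pairwise co-occurrence counts over transactions can be computed directly. This is a minimal sketch; the basket data and the support threshold are hypothetical.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count how often each pair of items is bought together and
    keep only the pairs meeting the minimum support threshold."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each unordered pair a canonical form
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical purchase data, for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]
print(frequent_pairs(baskets, min_support=2))
```

Real association-rule miners (e.g., Apriori) extend this idea to item sets of any size and add confidence measures, but the counting step is the same in spirit.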
Data mining consists of five major elements:
1) Extract, transform, and load transaction data onto the data warehouse system.
2) Store and manage the data in a multidimensional database system.
3) Provide data access to business analysts and information technology professionals.
4) Analyze the data by application software.
5) Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
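The 2-way split that CART performs can be sketched with a Gini-impurity search over thresholds of a single numeric field. This is a minimal, single-feature illustration, not a full CART implementation; the data values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Find the threshold on one numeric field that gives the purest
    2-way split, as CART does. Returns (threshold, weighted_gini)."""
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: purchase amount vs. whether the customer returned.
amounts = [10, 15, 40, 45, 50]
returned = ["no", "no", "yes", "yes", "yes"]
print(best_binary_split(amounts, returned))
```

A full CART builder would apply this search recursively on each side of the split and across all fields; CHAID would instead use chi-square tests to choose multi-way splits.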
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
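The k-nearest neighbor classification just described can be sketched in a few lines: rank historical records by distance and take a majority vote among the k closest. The records and classes here are hypothetical.

```python
from collections import Counter
import math

def knn_classify(history, point, k=1):
    """Classify a record by majority vote among the k most similar
    records in a historical dataset (k >= 1). `history` is a list of
    (feature_tuple, class_label) pairs."""
    ranked = sorted(history, key=lambda rec: math.dist(rec[0], point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (features, class).
history = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((5.0, 5.0), "high")]
print(knn_classify(history, (1.1, 0.9), k=3))
```

Production k-NN implementations add distance weighting and spatial indexes (k-d trees and the like) so the search does not scan the whole historical dataset, but the voting logic is the same.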
R"le in+"%tion The e!traction of useful if-then rules from data based on
statistical significance.
Data (is"ali,ation The visual interpretation of comple! relationships in
multidimensional data. %raphics tools are used to illustrate data
relationships.
Characteristics of Data Mining:
Large quantities of data: The volume of data is so great that it has to be analyzed by automated techniques, e.g., satellite information, credit card transactions, etc.
Noisy, incomplete data: Imprecise data is a characteristic of all data collection.
Complex data structure: Conventional statistical analysis is not possible.
Heterogeneous data stored in legacy systems.
Benefits of Data Mining:
1) It is one of the most effective services available today. With the help of data mining, one can discover precious information about customers and their behavior for a specific set of products, and evaluate, analyze, store, mine, and load data related to them.
2) An analytical CRM model and strategic business-related decisions can be made with the help of data mining, as it helps in providing a complete synopsis of customers.
3) Countless organizations have installed data mining projects, and it has helped them make unprecedented improvements in their marketing strategies (campaigns).
4) Data mining is generally used by organizations with a solid customer focus. Because of its flexible nature as far as applicability is concerned, it is used vehemently in applications to foresee crucial data, including industry analysis and consumer buying behaviors.
5) Fast-paced and prompt access to data, along with economic processing techniques, have made data mining one of the most suitable services that a company seeks.
Advantages of Data Mining:
1. Marketing / Retail:
Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. Through the results, marketers will have an appropriate approach for selling profitable products to targeted customers.
Data mining brings a lot of benefits to retail companies in the same way as marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it also helps retail companies offer discounts on particular products that will attract more customers.
2. Finance / Banking:
Data mining gives financial institutions information about loans and credit reporting. It can also analyze financial transactions to build patterns that can detect money laundering or other criminal activities.
3. Law Enforcement:
Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these criminals by examining trends in location, crime type, habit, and other patterns of behavior.
4. Researchers:
Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.
SYSTEM ANALYSIS
EXISTING SYSTEM:
In this existing system, a data unit is a piece of text that semantically represents one concept of an entity. It corresponds to the value of a record under an attribute. It is different from a text node, which refers to a sequence of text surrounded by a pair of HTML tags. It describes the relationships between text nodes and data units in detail. In this paper, we perform data unit level annotation. There is a high demand for collecting data of interest from multiple WDBs, and for this it is important to understand the semantic of each data unit. Unfortunately, the semantic labels of data units are often not provided in result pages. For instance, no semantic labels for the values of title, author, publisher, etc., are given. Having semantic labels for data units is not only important for the above record linkage task, but also for storing collected SRRs into a database table.
PROPOSED SYSTEM:
To the best of our knowledge, we are the first to utilize the IIS (integrated interface schema) for annotating SRRs.
We employ six basic annotators; each annotator can independently assign labels to data units based on certain features of the data units. We also employ a probabilistic model to combine the results from different annotators into a single label. This model is highly flexible, so the existing basic annotators may be modified and new annotators may be added easily without affecting the operation of other annotators.
We construct an annotation wrapper for any given WDB.
MODULES:
Query-based Annotator
The attributes on the local search interface of the WDB will most likely appear in some retrieved SRRs. For example, the query term "machine" is submitted through the Title field on the search interface of the WDB, and all three titles of the returned SRRs contain this query term. Thus, we can use the name of the search field, Title, to annotate the title values of these SRRs. In general, query terms against an attribute may be entered in a textbox or chosen from a selection list on the local search interface. Our Query-based Annotator works as follows: given a query with a set of query terms submitted against an attribute A on the local search interface, first find the group that has the largest total occurrences of these query terms, and then assign gn(A) as the label to the group.
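The procedure just described can be sketched directly: count occurrences of the query terms in each aligned group and label the winner. This is a minimal sketch; the group ids, sample data, and the "Title" label are hypothetical.

```python
def query_based_annotate(groups, query_terms, attribute_label):
    """Query-based annotation as described above: assign the queried
    attribute's label to the aligned group whose data units contain
    the query terms most often. `groups` maps a group id to a list of
    data-unit strings; returns (group_id, label), or None if no group
    contains any query term."""
    def occurrences(units):
        return sum(unit.lower().count(term.lower())
                   for unit in units for term in query_terms)
    best = max(groups, key=lambda g: occurrences(groups[g]))
    if occurrences(groups[best]) == 0:
        return None
    return best, attribute_label

# Hypothetical SRR groups from a book-search result page.
groups = {
    "g1": ["Machine Learning", "Machine Vision"],  # title values
    "g2": ["Tom Mitchell", "Ramesh Jain"],         # author values
}
print(query_based_annotate(groups, ["machine"], "Title"))
```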
Schema Value Annotator
Many attributes on a search interface have predefined values on the interface. For example, the attribute Publishers may have a set of predefined values (i.e., publishers) in its selection list. More attributes in the IIS tend to have predefined values, and these attributes are likely to have more such values than those in LISs, because when attributes from multiple interfaces are integrated, their values are also combined. Our Schema Value Annotator utilizes the combined value set to perform annotation.
The Schema Value Annotator first identifies the attribute Aj that has the highest matching score among all attributes, and then uses gn(Aj) to annotate the group Gi. Note that multiplying the above sum by the number of nonzero similarities gives preference to attributes that have more matches (i.e., having nonzero similarities) over those that have fewer matches. This is found to be very effective in improving the retrieval effectiveness of combination systems in information retrieval.
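The matching score described above (a sum of value similarities, multiplied by the count of nonzero similarities) can be sketched as follows. The token-overlap similarity used here is a stand-in, since the exact similarity measure is not specified in this section, and the schema value sets are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for the similarity
    measure, which is not fully specified here."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def schema_value_score(group_units, attribute_values):
    """Sum of each unit's best-match similarity, multiplied by the
    number of nonzero similarities to favor attributes with more
    matches, as described above."""
    sims = [max(jaccard(u, v) for v in attribute_values)
            for u in group_units]
    nonzero = sum(1 for s in sims if s > 0)
    return sum(sims) * nonzero

def schema_value_annotate(group_units, schema):
    """Pick the attribute whose predefined value set best matches
    the group; its name would then serve as the annotation label."""
    return max(schema, key=lambda attr: schema_value_score(group_units, schema[attr]))

# Hypothetical integrated-schema value sets.
schema = {
    "Publisher": ["Prentice Hall", "Springer", "McGraw Hill"],
    "Format": ["Hardcover", "Paperback"],
}
print(schema_value_annotate(["Springer", "Prentice Hall"], schema))
```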
Common Knowledge Annotator
Some data units on the result page are self-explanatory because of the common knowledge shared by human beings. For example, "in stock" and "out of stock" occur in many SRRs from e-commerce sites. Human users understand that this is about the availability of the product because it is common knowledge. So our Common Knowledge Annotator tries to exploit this situation by using some predefined common concepts.
Each common concept contains a label and a set of patterns or values. For example, a country concept has the label "country" and a set of values such as "U.S.A.," "Canada," and so on. It should be pointed out that our common concepts are different from the ontologies that are widely used in some works on the Semantic Web. First, our common concepts are domain independent. Second, they can be obtained from existing information resources with little additional human effort.
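A concept with a label and a set of patterns or values, as described above, maps naturally onto a small lookup structure. This is a minimal sketch; the concept table below is hypothetical and far smaller than a real one would be.

```python
import re

# Hypothetical common concepts: each has a label plus values and/or a pattern.
CONCEPTS = {
    "availability": {"values": {"in stock", "out of stock"}, "pattern": None},
    "country": {"values": {"u.s.a.", "canada"}, "pattern": None},
    # A 10-character ISBN-like pattern, purely illustrative.
    "isbn": {"values": set(), "pattern": re.compile(r"^\d{9}[\dX]$")},
}

def common_knowledge_annotate(data_unit):
    """Return the label of the first common concept whose values or
    pattern match the data unit, or None if no concept applies."""
    text = data_unit.strip().lower()
    for label, concept in CONCEPTS.items():
        if text in concept["values"]:
            return label
        if concept["pattern"] and concept["pattern"].match(data_unit.strip()):
            return label
    return None

print(common_knowledge_annotate("In Stock"))
```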
Combining Annotators
Our analysis indicates that no single annotator is capable of fully labeling all the data units on different result pages. The applicability of an annotator is the percentage of the attributes to which the annotator can be applied. For example, if four out of ten attributes appear in tables, then the applicability of the table annotator is 40 percent. We measured the average applicability of each basic annotator across all testing domains in our data set. This indicates that the results of different basic annotators should be combined in order to annotate a higher percentage of data units. Moreover, different annotators may produce different labels for a given group of data units. Therefore, we need a method to select the most suitable one for the group. Our annotators are fairly independent from each other, since each exploits an independent feature.
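The combination step can be illustrated with a simplified weighted vote. This is only a stand-in for the probabilistic model described above (the paper's model estimates label probabilities from annotator success rates); the annotator names, suggested labels, and weights below are hypothetical.

```python
from collections import defaultdict

def combine_annotations(candidate_labels, weights):
    """Pick one label from several annotators' suggestions by
    weighted vote - a simplified stand-in for the probabilistic
    combination model. `candidate_labels` maps an annotator name to
    the label it proposed; `weights` reflect annotator reliability."""
    scores = defaultdict(float)
    for annotator, label in candidate_labels.items():
        scores[label] += weights.get(annotator, 1.0)
    return max(scores, key=scores.get)

# Hypothetical annotator outputs and reliability weights.
suggestions = {"table": "Price", "query": "Price", "frequency": "Discount"}
weights = {"table": 0.9, "query": 0.8, "frequency": 0.6}
print(combine_annotations(suggestions, weights))
```

Because each annotator exploits an independent feature, agreement between two of them is strong evidence for a label, which is what the additive scoring captures.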
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test. Each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application. It is done after the completion of an individual unit, before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
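A unit test of the kind described above could look like the following sketch. The helper function and its behavior are hypothetical, invented only to show the valid-input and invalid-input cases; the tests would be run with `python -m unittest`.

```python
import unittest

def normalize_keyword(text):
    """Hypothetical helper under test: trim and lowercase a search
    keyword, rejecting empty input."""
    cleaned = text.strip().lower()
    if not cleaned:
        raise ValueError("empty keyword")
    return cleaned

class NormalizeKeywordTest(unittest.TestCase):
    def test_valid_input_produces_valid_output(self):
        # A defined input with a clearly defined expected result.
        self.assertEqual(normalize_keyword("  Machine "), "machine")

    def test_invalid_input_is_rejected(self):
        # The invalid branch of the program logic is also exercised.
        with self.assertRaises(ValueError):
            normalize_keyword("   ")
```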
Integration testing
Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified processes must be considered for testing.
Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases.
Test strategy and approach
Field testing will be performed manually, and functional tests will be written in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages, and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format.
No duplicate entries should be allowed.
All links should take the user to the correct page.
6.2 Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.g., components in a software system or, one step up, software applications at the company level, interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects were encountered.
6.3 Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects were encountered.
SYSTEM DESIGN
[System architecture diagram: a Presentation Layer (user search interface; user query input/output of HTML pages and annotated results over the World Wide Web), an Analysis Layer containing the six annotators (Table Annotator, Query-based Annotator, Schema Value Annotator, Frequency-based Annotator, Prefix/Suffix Annotator, and Common Knowledge Annotator) plus result projection, a Knowledge Layer (annotations stored as XML/OWL), and a Preprocessing Layer operating on HTML pages, query result data, and profiles.]
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system, and the information flows in the system.
3. A DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.
[Data flow diagram: the Admin adds sources (URL, author, title, content, year, price) to the database; the User searches by URL, year, author name, or title.]
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
The primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show which system functions are performed for which actor. Roles of the actors in the system can be depicted.
[Use case diagram: the Admin adds sources and information; the User views details and searches by URL, author name, year, or title.]
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, operations (or methods), and the relationships among the classes. It explains which class contains which information.
[Class diagram: a USER class (view source by URL, title, author name, year; browse; sign in) and an ADMIN class (add source, add content, add information), linked by a provide/receive-services association.]
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a Message Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams.
[Sequence diagram: the Admin adds keywords and sources to Storage; the User searches and gets information; Storage provides the information.]
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and actions, with support for choice, iteration, and concurrency. In the Unified Modeling Language, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. An activity diagram shows the overall flow of control.
[Activity diagram: Begin; Sign Up or Login; validate user; the Admin adds keywords, sources, details, and information to the database; the User searches and gets information; End.]

INPUT DESIGN
Input can be achieved by instructing the computer to read data from a written or printed document, or it can occur by having people keying the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
The dialog to guide the operating personnel in providing input.
Methods for preparing input validations, and steps to follow when errors occur.
OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all the data manipulations can be performed. It also provides record viewing facilities.
3. When the data is entered, it will be checked for validity. Data can be entered with the help of screens. Appropriate messages are provided as and when needed, so that the user is not left in confusion. Thus the objective of input design is to create an input layout that is easy to follow.
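The validity checks and user-facing messages described above can be sketched as a small validation routine. The field names and rules here are hypothetical, invented only to illustrate the pattern of collecting appropriate messages for the user.

```python
def validate_input(fields):
    """Check required search-input fields and collect user-facing
    error messages, as the input design above prescribes. `fields`
    maps field names to raw string values."""
    errors = []
    if not fields.get("keyword", "").strip():
        errors.append("Keyword must not be empty.")
    year = fields.get("year", "").strip()
    # The year field is optional, but must be a 4-digit number if given.
    if year and not (year.isdigit() and len(year) == 4):
        errors.append("Year must be a 4-digit number.")
    return errors

print(validate_input({"keyword": "", "year": "19x9"}))
```

An empty returned list means the input passed validation; otherwise each message can be shown next to the offending field on the data entry screen.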
OUTPUT SCREENS
Index Page:
Register Page:
User:
User Page:
Search by Title:
Search by URL:
Search by Year:
Search by Author Name:
Result Page:
Additional Information:
Wrong Keyword:
Admin Entry:
Admin Home:
Admin Image Upload:
Adding Information:
Adding Web Content:
About:
End:
CONCLUSION
In this paper, we studied the data annotation problem and proposed a multi-annotator approach to automatically constructing an annotation wrapper for annotating the search result records retrieved from any given web database. This approach consists of six basic annotators and a probabilistic method to combine the basic annotators. Each of these annotators exploits one type of feature for annotation, and our experimental results show that each of the annotators is useful and that together they are capable of generating high quality annotation. A special feature of our method is that, when annotating the results retrieved from a web database, it utilizes both the LIS of the web database and the IIS of multiple web databases in the same domain. We also explained how the use of the IIS can help alleviate the local interface schema inadequacy problem and the inconsistent label problem.
In this paper, we also studied the automatic data alignment problem. Accurate alignment is critical to achieving holistic and accurate annotation. Our method is a clustering-based shifting method utilizing richer yet automatically obtainable features. This method is capable of handling a variety of relationships between HTML text nodes and data units, including one-to-one, one-to-many, many-to-one, and one-to-nothing. Our experimental results show that the precision and recall of this method are both above 98 percent. There is still room for improvement in several areas. For example, we need to enhance our method to split composite text nodes when there are no explicit separators. We would also like to try using different machine learning techniques, and using more sample pages from each training site to obtain the feature weights, so that we can identify the best technique for the data alignment problem.
REFERENCES
[1] A. Arasu and H. Garcia-Molina, "Extracting Structured Data from Web Pages," Proc. SIGMOD Int'l Conf. Management of Data, 2003.
[2] L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo, "Automatic Annotation of Data Extracted from Large Web Sites," Proc. Sixth Int'l Workshop on the Web and Databases (WebDB), 2003.
[8] H. Elmeleegy, J. Madhavan, and A. Halevy, "Harvesting Relational Tables from Lists on the Web," Proc. Very Large Databases (VLDB) Conf., 2009.
[15] H. He, W. Meng, C. Yu, and Z. Wu, "Constructing Interface Schemas for Search Interfaces of Web Databases," Proc. Web Information Systems Eng. (WISE) Conf., 2005.
[16] J. Heflin and J. Hendler, "Searching the Web with SHOE," Proc. AAAI Workshop, 2000.
[17] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[18] N. Kushmerick, D. Weld, and R. Doorenbos, "Wrapper Induction for Information Extraction," Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI), 1997.
[19] J. Lee, "Analyses of Multiple Evidence Combination," Proc. 20th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 1997.
[20] L. Liu, C. Pu, and W. Han, "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources," Proc. IEEE 16th Int'l Conf. Data Eng. (ICDE), 2000.
[21] W. Liu, X. Meng, and W. Meng, "ViDE: A Vision-Based Approach for Deep Web Data Extraction," IEEE Trans. Knowledge and Data Eng., 2010.
[23] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A.Y. Halevy, "Google's Deep Web Crawl," Proc. VLDB Endowment, vol. 1, no. 2, pp. 1241-1252, 2008.
[24] W. Meng, C. Yu, and K. Liu, ...
... Databases," Proc. 12th Int'l Conf. World Wide Web (WWW), 2003.
[31] Z. Wu et al., "Towards Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine," Proc. IEEE/WIC Int'l Conf. Web Intelligence (WI '03), 2003.
[32] O. Zamir and O. Etzioni, "Web Document Clustering: A Feasibility Demonstration," Proc. ACM 21st Int'l SIGIR Conf. Research and Development in Information Retrieval, 1998.
[33] Y. Zhai and B. Liu, ...
1. Extracting Structured Data from Web Sites
Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However, due to the automatic nature of the approach, the data extracted by these wrappers have anonymous names. In the framework of our ongoing project RoadRunner, we have developed a prototype, called Labeller, that automatically annotates data extracted by automatically generated wrappers. Although Labeller has been developed as a companion system to our wrapper generator, its underlying approach has a general validity and can therefore be applied together with other wrapper generator systems. We have experimented with the prototype over several real-life web sites, obtaining encouraging results.
2. Experiments on Multistrategy Learning by Meta-Learning
AUTHORS: P. Chan and S. Stolfo
In this paper, we propose meta-learning as a general technique to combine the results of multiple learning algorithms, each applied to a set of training data. We detail several meta-learning strategies for combining independently learned classifiers, each computed by different algorithms, to improve overall prediction accuracy. The overall resulting classifier is composed of the classifiers generated by the different learning algorithms and a meta-classifier generated by a meta-learning strategy. The strategies described here are independent of the learning algorithms used. Preliminary experiments using different strategies and learning algorithms on two molecular biology sequence analysis data sets demonstrate encouraging results. Machine learning techniques are central to automated knowledge discovery systems, and hence our approach can enhance the effectiveness of such systems.
3. Combining Approaches for Information Retrieval
AUTHORS: W.B. Croft
Combining approaches has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the "meta-search" engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
4. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation
AUTHORS: S. Dill et al.
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.