Transcript
  • 7/26/2019 -Annotating Full Document


    ABSTRACT

    An increasing number of databases have become web accessible through HTML
    form-based search interfaces. The data units returned from the underlying database
    are usually encoded into the result pages dynamically for human browsing. For the
    encoded data units to be machine processable, which is essential for many
    applications such as deep web data collection and Internet comparison shopping,
    they need to be extracted out and assigned meaningful labels. In this paper, we
    present an automatic annotation approach that first aligns the data units on a result
    page into different groups such that the data in the same group have the same
    semantic. Then, for each group we annotate it from different aspects and aggregate
    the different annotations to predict a final annotation label for it. An annotation
    wrapper for the search site is automatically constructed and can be used to annotate
    new result pages from the same web database. Our experiments indicate that the
    proposed approach is highly effective.

    INTRODUCTION

    What is Data Mining?

    Structure of Data Mining

    Generally, data mining (sometimes called data or knowledge discovery) is the
    process of analyzing data from different perspectives and summarizing it into
    useful information: information that can be used to increase revenue, cut costs, or
    both. Data mining software is one of a number of analytical tools for analyzing
    data. It allows users to analyze data from many different dimensions or angles,
    categorize it, and summarize the relationships identified. Technically, data mining
    is the process of finding correlations or patterns among dozens of fields in large
    relational databases.

    How Data Mining Works?

    While large-scale information technology has been evolving separate transaction
    and analytical systems, data mining provides the link between the two. Data
    mining software analyzes relationships and patterns in stored transaction data
    based on open-ended user queries. Several types of analytical software are
    available: statistical, machine learning, and neural networks. Generally, any of
    four types of relationships are sought:

    Classes: Stored data is used to locate data in predetermined groups. For
    example, a restaurant chain could mine customer purchase data to determine
    when customers visit and what they typically order. This information could
    be used to increase traffic by having daily specials.

    Clusters: Data items are grouped according to logical relationships or
    consumer preferences. For example, data can be mined to identify market
    segments or consumer affinities.

    Associations: Data can be mined to identify associations. The beer-diaper
    example is an example of associative mining.

    Sequential patterns: Data is mined to anticipate behavior patterns and
    trends. For example, an outdoor equipment retailer could predict the
    likelihood of a backpack being purchased based on a consumer's purchase of
    sleeping bags and hiking shoes.
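    The beer-diaper association mentioned above can be made concrete with two
    standard measures, support and confidence. The sketch below is only an
    illustration: the baskets are invented toy data, and the function names are
    hypothetical rather than taken from any particular data mining tool.

```python
# Toy market baskets; items and counts are invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the whole itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    both = count_containing(set(antecedent) | set(consequent), baskets)
    return both / count_containing(antecedent, baskets)


print(support({"beer", "diapers"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
```

    A rule such as "diapers imply beer" is usually considered interesting only when
    both measures exceed thresholds chosen by the analyst.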

    Data mining consists of five major elements:

    1) Extract, transform, and load transaction data onto the data warehouse
    system.

    2) Store and manage the data in a multidimensional database system.

    3) Provide data access to business analysts and information technology
    professionals.

    4) Analyze the data by application software.

    5) Present the data in a useful format, such as a graph or table.

    Different levels of analysis are available:

    Artificial neural networks: Non-linear predictive models that learn through
    training and resemble biological neural networks in structure.

    Genetic algorithms: Optimization techniques that use processes such as
    genetic combination, mutation, and natural selection in a design based on the
    concepts of natural evolution.

    Decision trees: Tree-shaped structures that represent sets of decisions. These
    decisions generate rules for the classification of a dataset. Specific decision
    tree methods include Classification and Regression Trees (CART) and Chi
    Square Automatic Interaction Detection (CHAID). CART and CHAID are
    decision tree techniques used for classification of a dataset. They provide a
    set of rules that you can apply to a new (unclassified) dataset to predict
    which records will have a given outcome. CART segments a dataset by
    creating 2-way splits while CHAID segments using chi square tests to create
    multi-way splits. CART typically requires less data preparation than
    CHAID.

    Nearest neighbor method: A technique that classifies each record in a
    dataset based on a combination of the classes of the k record(s) most similar
    to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest
    neighbor technique.

    Rule induction: The extraction of useful if-then rules from data based on
    statistical significance.

    Data visualization: The visual interpretation of complex relationships in
    multidimensional data. Graphics tools are used to illustrate data
    relationships.
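    The nearest neighbor method listed above can be sketched in a few lines. This
    is a minimal illustration, assuming numeric feature tuples, Euclidean distance,
    and a simple majority vote; the sample records and the function name
    knn_classify are hypothetical.

```python
import math
from collections import Counter


def knn_classify(record, historical, k=3):
    """Label `record` by majority vote among its k nearest historical records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort (features, label) pairs by Euclidean distance to the new record.
    nearest = sorted(historical, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented historical dataset: (feature vector, class label).
historical = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.2), "high"),
    ((4.8, 5.1), "high"),
]

print(knn_classify((1.1, 1.0), historical, k=3))
```

    With k = 1 the method reduces to copying the class of the single most similar
    historical record.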

    Characteristics of Data Mining:

    Large quantities of data: The volume of data is so great that it has to be
    analyzed by automated techniques, e.g. satellite information, credit card
    transactions, etc.

    Noisy, incomplete data: Imprecise data is a characteristic of all data
    collection.

    Complex data structure: Conventional statistical analysis is not possible.

    Heterogeneous data stored in legacy systems.

    Benefits of Data Mining:

    1) It is one of the most effective services available today. With the help
    of data mining, one can discover valuable information about customers
    and their behavior for a specific set of products, and evaluate, analyze,
    store, mine and load data related to them.

    2) An analytical CRM model and strategic business-related decisions can be
    made with the help of data mining, as it helps in providing a complete
    synopsis of customers.

    3) Many organizations have installed data mining projects, and these have
    helped them make an unprecedented improvement in their marketing
    strategies (campaigns).

    4) Data mining is generally used by organizations with a solid customer focus.
    Because of its flexibility, it is widely used in applications that forecast
    crucial data, including industry analysis and consumer buying behavior.

    5) Fast and prompt access to data, along with economical processing
    techniques, has made data mining one of the most suitable services that a
    company can seek.

    Advantages of Data Mining:

    1. Marketing / Retail:

    Data mining helps marketing companies build models based on historical data
    to predict who will respond to new marketing campaigns such as direct mail,
    online marketing campaigns, etc. Through the results, marketers will have an
    appropriate approach to selling profitable products to targeted customers.

    Data mining brings a lot of benefits to retail companies in the same way as
    marketing. Through market basket analysis, a store can arrange its products
    so that customers can conveniently buy frequently purchased products
    together. In addition, it also helps retail companies offer certain
    discounts for particular products that will attract more customers.

    2. Finance / Banking:

    Data mining gives financial institutions information about loan information and
    credit reporting.

    financial transaction to build patterns that can detect money laundering or criminal
    activities.

    5. Law enforcement:

    Data mining can aid law enforcers in identifying criminal suspects as well as
    apprehending these criminals by examining trends in location, crime type, habit,
    and other patterns of behavior.

    6. Researchers:

    Data mining can assist researchers by speeding up their data analysis process,
    thus allowing them more time to work on other projects.

    SYSTEM ANALYSIS

    EXISTING SYSTEM:

    In this existing system, a data unit is a piece of text that semantically represents
    one concept of an entity. It corresponds to the value of a record under an attribute.
    It is different from a text node, which refers to a sequence of text surrounded by a
    pair of HTML tags. It describes the relationships between text nodes and data units
    in detail. In this paper, we perform data unit level annotation. There is a high
    demand for collecting data of interest from multiple WDB

    the semantic of each data unit. Unfortunately, the semantic labels of data units are
    often not provided in result pages. For instance, no semantic labels for the values
    of title, author, publisher, etc., are given. Having semantic labels for data units is
    not only important for the above record linkage task, but also for storing collected
    SRRs into a database table.

    we are the first to utilize IS for annotating SRRs.

    We employ six basic annotators; each annotator can independently assign
    labels to data units based on certain features of the data units. We also
    employ a probabilistic model to combine the results from different
    annotators into a single label. This model is highly flexible so that the
    existing basic annotators may be modified and new annotators may be added
    easily without affecting the operation of other annotators.

    We construct an annotation wrapper for any given WDB

    MODULES:

    attributes on the local search interface of the WDB will most likely appear in some
    retrieved SRRs. For example, the query term "machine" is submitted through the Title
    field on the search interface of the WDB and all three titles of the returned SRRs
    contain this query term. Thus, we can use the name of the search field Title to annotate
    the title values of these SRRs. In general, query terms against an attribute may be
    entered into a textbox or chosen from a selection list on the local search interface.
    Our Query-based Annotator works as follows: given a query with a set of query
    terms submitted against an attribute A on the local search interface, first find the
    group that has the largest total occurrences of these query terms and then assign
    gn(A) as the label to the group.
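    The query-based annotator steps above can be sketched as follows. This is a
    minimal illustration rather than the authors' implementation: it assumes each
    group is a list of data unit strings, counts whole-word matches after
    lowercasing, and takes the label gn(A) as a ready-made string argument. The
    function name and sample data are hypothetical.

```python
def query_based_annotate(groups, query_terms, attribute_label):
    """Return {group_index: label} for the group with the most query-term hits.

    `attribute_label` stands in for gn(A); how gn(A) is derived is not shown here.
    """
    terms = [term.lower() for term in query_terms]

    def occurrences(group):
        # Total number of times any query term appears as a word in the group.
        return sum(unit.lower().split().count(term) for unit in group for term in terms)

    counts = [occurrences(group) for group in groups]
    if not counts or max(counts) == 0:
        return {}  # no group contains the query terms, so nothing is labeled
    best = counts.index(max(counts))
    return {best: attribute_label}


# Invented aligned groups from three search result records.
groups = [
    ["Machine Learning", "Machine Vision Basics", "The Machine Age"],
    ["T. Mitchell", "R. Davies", "E. Brynjolfsson"],
    ["$45.00", "$30.50", "$18.99"],
]

print(query_based_annotate(groups, ["machine"], "Title"))
```

    In this toy run the first group is labeled because all three of its data units
    contain the query term submitted through the Title field.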

    Schema Value Annotator

    Many attributes on a search interface have predefined values on the interface. For
    example, the attribute Publishers may have a set of predefined values (i.e.,
    publishers) in its selection list. More attributes in the IS tend to have predefined
    values and these attributes are likely to have more such values than those in LISs,
    because when attributes from multiple interfaces are integrated, their values are
    also combined. Our schema value annotator utilizes the combined value set to
    perform annotation.

    The schema value annotator first identifies the attribute Aj that has the highest
    matching score among all attributes and then uses gn(Aj) to annotate the group Gi.
    Note that multiplying the above sum by the number of nonzero similarities is to
    give preference to attributes that have more matches (i.e., having nonzero
    similarities) over those that have fewer matches. This is found to be very effective
    in improving the retrieval effectiveness of combination systems in information
    retrieval.
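    The matching score described above (a sum of similarities multiplied by the
    number of nonzero similarities) can be sketched as below. Several details are
    assumptions made only for illustration: the similarity measure is a simple
    token-overlap (Jaccard) ratio, each data unit contributes its best similarity
    against the attribute's values, and the schema is a plain dictionary. The
    similarity function actually used is not given in this text.

```python
def token_similarity(a, b):
    """Jaccard overlap of lowercase word sets (an assumed stand-in measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def matching_score(group, values):
    """Sum of per-unit best similarities times the count of nonzero ones."""
    sims = [max((token_similarity(unit, v) for v in values), default=0.0) for unit in group]
    nonzero = sum(1 for s in sims if s > 0)
    return sum(sims) * nonzero


def schema_value_annotate(group, schema):
    """Return the attribute name with the highest score, or None if all are zero."""
    scores = {attr: matching_score(group, values) for attr, values in schema.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Invented schema: attribute name -> predefined values from search interfaces.
schema = {
    "Publisher": ["MIT Press", "Springer", "Addison Wesley"],
    "Format": ["Hardcover", "Paperback"],
}
group = ["Springer", "MIT Press", "Wiley"]

print(schema_value_annotate(group, schema))
```

    Multiplying by the nonzero count rewards an attribute that matches many data
    units in the group over one that matches a single unit very strongly.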

    Common Knowledge Annotator

    Some data units on the result page are self-explanatory because of the common
    knowledge shared by human beings. For example, "in stock" and "out of stock"
    occur in many SRRs from e-commerce sites. Human users understand that it is
    about the availability of the product because this is common knowledge. So our
    common knowledge annotator tries to exploit this situation by using some
    predefined common concepts.

    Each common concept contains a label and a set of patterns or values. For
    example, a country concept has a label "country" and a set of values such as
    "U.S.A.," "Canada," and so on. It should be pointed out that our common concepts
    are different from the ontologies that are widely used in some works in Semantic
    Web. First, our common concepts are domain independent. Second, they can be
    obtained from existing information resources with little additional human effort.
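    A common concept, as described above, pairs a label with a set of values or
    patterns. The sketch below shows one possible representation. The
    "availability" label, the price pattern, and the function name are assumptions
    added for illustration; only the country values and the stock phrases come
    from the text.

```python
import re

# Each concept has a label, a set of literal values, and optional regex patterns.
COMMON_CONCEPTS = [
    {"label": "availability", "values": {"in stock", "out of stock"}, "patterns": []},
    {"label": "country", "values": {"u.s.a.", "canada"}, "patterns": []},
    # Hypothetical pattern-based concept, not taken from the source text.
    {"label": "price", "values": set(), "patterns": [re.compile(r"^\$\d+(\.\d{2})?$")]},
]


def common_knowledge_label(data_unit, concepts=COMMON_CONCEPTS):
    """Return the label of the first concept matching the data unit, else None."""
    text = data_unit.strip().lower()
    for concept in concepts:
        if text in concept["values"]:
            return concept["label"]
        if any(pattern.match(text) for pattern in concept["patterns"]):
            return concept["label"]
    return None


for unit in ["In Stock", "Canada", "$12.99", "Machine Learning"]:
    print(unit, "->", common_knowledge_label(unit))
```

    Because the concepts are plain data, new ones can be added without touching the
    matching function, which fits the domain-independent intent stated above.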

    Combining Annotators

    Our analysis indicates that no single annotator is capable of fully labeling all the
    data units on different result pages. The applicability of an annotator is the
    percentage of the attributes to which the annotator can be applied. For example, if
    out of 10 attributes, four appear in tables, then the applicability of the table
    annotator is 40 percent. The average applicability of each basic annotator across all
    testing domains in our data set indicates that the results of different basic
    annotators should be combined in order to annotate a higher percentage of data
    units. Moreover, different annotators may produce different labels for a given
    group of data units. Therefore, we need a method to select the most suitable one for
    the group. Our annotators are fairly independent from each other since each
    exploits an independent feature.
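    The applicability figure above is simple arithmetic, and the independence remark
    suggests one way labels might be combined. The sketch below computes
    applicability and then, purely as an assumption, merges annotator votes with
    the rule 1 - product(1 - p_i) for annotators that agree on a label. The
    combination rule, the per-annotator probabilities, and the function names are
    illustrative and are not stated in this text.

```python
from collections import defaultdict


def applicability(applicable_attributes, total_attributes):
    """Percentage of attributes to which an annotator can be applied."""
    return 100.0 * applicable_attributes / total_attributes


def combine_labels(votes):
    """Pick the label with the highest combined probability.

    `votes` is a list of (label, probability_correct) pairs, one per annotator.
    Annotators are treated as independent (assumed), so the chance that every
    annotator proposing a label is wrong is the product of (1 - p).
    """
    all_wrong = defaultdict(lambda: 1.0)
    for label, probability in votes:
        all_wrong[label] *= 1.0 - probability
    combined = {label: 1.0 - wrong for label, wrong in all_wrong.items()}
    best = max(combined, key=combined.get)
    return best, combined[best]


print(applicability(4, 10))

# Invented votes for one group of data units.
votes = [("Title", 0.8), ("Title", 0.5), ("Author", 0.6)]
print(combine_labels(votes))
```

    Under this assumed rule, two moderately reliable annotators that agree can
    outweigh a single annotator that disagrees.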

    SYSTEM TESTING

    The purpose of testing is to discover errors. Testing is the process of trying
    to discover every conceivable fault or weakness in a work product. It provides a
    way to check the functionality of components, subassemblies, assemblies and/or a
    finished product. It is the process of exercising software with the intent of ensuring
    that the software system meets its requirements and user expectations and does not
    fail in an unacceptable manner. There are various types of test. Each test type
    addresses a specific testing requirement.

    TYPES OF TESTS

    Unit testing

    Unit testing involves the design of test cases that validate that the internal
    program logic is functioning properly, and that program inputs produce valid
    outputs. All decision branches and internal code flow should be validated. It is the
    testing of individual software units of the application. It is done after the
    completion of an individual unit and before integration. This is structural testing
    that relies on knowledge of the unit's construction and is invasive. Unit tests perform
    basic tests at component level and test a specific business process, application, and/or
    system configuration. Unit tests ensure that each unique path of a business process
    performs accurately to the documented specifications and contains clearly defined
    inputs and expected results.

    Integration testing

    Integration tests are designed to test integrated software components to
    determine if they actually run as one program. Testing is event driven and is more
    concerned with the basic outcome of screens or fields. Integration tests
    demonstrate that although the components were individually satisfactory, as shown
    by successful unit testing, the combination of components is correct and
    consistent. Integration testing is specifically aimed at exposing the problems that
    arise from the combination of components.

    Functional test

    Functional tests provide systematic demonstrations that functions tested are
    available as specified by the business and technical requirements, system
    documentation, and user manuals.

    Functional testing is centered on the following items:

    Valid Input: identified classes of valid input must be accepted.

    Invalid Input: identified classes of invalid input must be rejected.

    Functions: identified functions must be exercised.

    Output: identified classes of application outputs must be exercised.

    Systems/Procedures: interfacing systems or procedures must be invoked.

    Organization and preparation of functional tests is focused on requirements, key
    functions, or special test cases. In addition, systematic coverage pertaining to
    identify

    processes must be considered for testing.

    Unit testing is usually conducted as part of a combined code and unit test
    phase of the software lifecycle, although it is not uncommon for coding and unit
    testing to be conducted as two distinct phases.

    Test strategy and approach

    Field testing will be performed manually and functional tests will be written

    in detail.

    Test objectives

    All field entries must work properly.

    Pages must be activated from the identified link.

    The entry screen, messages and responses must not be delayed.

    Features to be tested

    Verify that the entries are of the correct format.

    No duplicate entries should be allowed.

    All links should take the user to the correct page.
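    Two of the features listed above, correct entry format and no duplicate
    entries, lend themselves to a small automated check. The sketch below is
    hypothetical: it assumes a four-digit Year field and an in-memory set of
    existing entries, neither of which is specified in this report.

```python
import re

# Assumed format rule for a Year field: exactly four digits.
YEAR_FORMAT = re.compile(r"^\d{4}$")


def validate_entry(entry, existing_entries):
    """Return a list of problems found with a new field entry (empty if valid)."""
    errors = []
    if not YEAR_FORMAT.match(entry):
        errors.append("invalid format")
    if entry in existing_entries:
        errors.append("duplicate entry")
    return errors


existing = {"2012", "2013"}
print(validate_entry("2014", existing))  # well-formed and new
print(validate_entry("20x4", existing))  # wrong format
print(validate_entry("2013", existing))  # already present
```

    Checks like these could back the manual field testing described above, while
    link checks would still need to be exercised against the running pages.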

    6.2 Integration Testing

    Software integration testing is the incremental integration testing of two or
    more integrated software components on a single platform to produce failures
    caused by interface defects.

    The task of the integration test is to check that components or software
    applications, e.g. components in a software system or, one step up, software
    applications at the company level, interact without error.

    Test Results: All the test cases mentioned above passed successfully. No defects
    were encountered.

    6.3 Acceptance Testing

    User Acceptance Testing is a critical phase of any project and requires
    significant participation by the end user. It also ensures that the system meets the
    functional requirements.

    Test Results: All the test cases mentioned above passed successfully. No defects
    were encountered.

    SYSTEM DESIGN

    [Figure: system architecture. Layers: Preprocessing Layer, Knowledge Layer,
    Analysis Layer, Presentation Layer. Components: World Wide Web, User Search
    Interface (User Query Input/Output), HTML Page, Result Projection, Query Result
    Data, Common Knowledge Annotator, Prefix/Suffix Annotator, Frequency-Based
    Annotator, Query-Based Annotator, Schema Value Annotator, Table Annotator,
    Annotation, Annotated Results, XML/OWL, Profile.]

    1. The DFD is also called a bubble chart. It is a simple graphical formalism
    that can be used to represent a system in terms of input data to the system,
    various processing carried out on this data, and the output data generated
    by this system.

    2. The data flow diagram (DFD) is one of the most important modeling tools. It
    is used to model the system components. These components are the system
    process, the data used by the process, an external entity that interacts with
    the system and the information flows in the system.

    3. A DFD shows how the information moves through the system and how it is
    modified by a series of transformations. It is a graphical technique that
    depicts information flow and the transformations that are applied as data
    moves from input to output.

    4. A DFD may be used to represent a system at any level of abstraction. A DFD
    may be partitioned into levels that represent increasing information flow and
    functional detail.

    [Figure: data flow diagram. Admin: Add Source (URL, AUTHOR, TITLE, YEAR),
    CONTENT, DATABASE. User: Search by URL, Search by Year, Search by Author name,
    Search by Title. END.]

    UML DIAGRAMS

    UML stands for Unified Modeling Language. UML is a standardized
    general-purpose modeling language in the field of object-oriented software
    engineering. The standard is managed, and was created by, the Object Management
    Group.

    The goal is for UML to become a common language for creating models of
    object-oriented computer software. In its current form UML comprises two
    major components: a meta-model and a notation. In the future, some form of
    method or process may also be added to, or associated with, UML.

    The Unified Modeling Language is a standard language for specifying,
    visualizing, constructing and documenting the artifacts of software systems, as
    well as for business modeling and other non-software systems.

    The UML represents a collection of best engineering practices that have
    proven successful in the modeling of large and complex systems.

    The UML is a very important part of developing object-oriented software
    and the software development process. The UML uses mostly graphical notations
    to express the design of software projects.

    GOALS:

    The primary goals in the design of the UML are as follows:

    1. Provide users a ready-to-use, expressive visual modeling language so that
    they can develop and exchange meaningful models.

    2. Provide extendibility and specialization mechanisms to extend the core
    concepts.

    3.

    USE CASE DIAGRAM:

    A use case diagram in the Unified Modeling Language (UML) is a type of
    behavioral diagram defined by and created from a use-case analysis. Its purpose is
    to present a graphical overview of the functionality provided by a system in terms
    of actors, their goals (represented as use cases), and any dependencies between
    those use cases. The main purpose of a use case diagram is to show what system
    functions are performed for which actor. Roles of the actors in the system can be
    depicted.

    [Figure: use case diagram. Actors: ADMIN, USER. Use cases: DETAILS,
    INFORMATION, ADD SOURCE, SEARCH BY URL, AUTHOR NAME, YEAR, TITLE.]

    CLASS DIAGRAM:

    In software engineering, a class diagram in the Unified Modeling Language
    (UML) is a type of static structure diagram that describes the structure of a system
    by showing the system's classes, their attributes, operations (or methods), and the
    relationships among the classes. It explains which class contains information.

    [Figure: class diagram. Class USER: View Source, Url, Title, Author name, Year,
    Browse(). Class ADMIN: Add Source, Add Content, Add Information, AddSource().]

    SEQUENCE DIAGRAM:

    A sequence diagram in the Unified Modeling Language (UML) is a kind of interaction
    diagram that shows how processes operate with one another and in what order. It is
    a construct of a Message Sequence Chart. Sequence diagrams are sometimes called
    event diagrams, event scenarios, and timing diagrams.

    [Figure: sequence diagram. Participants: Admin, Storage, User. Messages: Adding
    keyword, Adding source, Search, Provide Information, Get Information.]

    ACTIVITY DIAGRAM:

    Activity diagrams are graphical representations of workflows of stepwise activities
    and actions with support for choice, iteration and concurrency. In the Unified
    Modeling Language, activity diagrams can be used to describe the business and
    operational step-by-step workflows of components in a system. An activity
    diagram shows the overall flow of control.

    [Figure: activity diagram. Nodes: Begin, Login, Valid User, Condition (Yes/False),
    Sign Up, Admin, User, Add Keyword, Add Details, Add Information, Add Source,
    Search, Get Information, Database, End.]

    IN

    can be achieved by instructing the computer to read data from a written or printed
    document or it can occur by having people key the data directly into the system.
    The design of input focuses on controlling the amount of input required,
    controlling the errors, avoiding delay, avoiding extra steps and keeping the process
    simple. The input is designed in such a way that it provides security and ease of
    use while retaining privacy. Input design considered the following things:

    What data should be given as input?

    How should the data be arranged or coded?

    The dialog to guide the operating personnel in providing input.

    Methods for preparing input validations and steps to follow when errors
    occur.

    OBJECTIVES

    1. Input design is the process of converting a user-oriented description of the input
    into a computer-based system. This design is important to avoid errors in the data
    input process and to show the correct direction to the management for getting correct
    information from the computerized system.

    2. It is achieved by creating user-friendly screens for data entry to handle large
    volumes of data. The goal of designing input is to make data entry easier and
    free from errors. The data entry screen is designed in such a way that all the data
    manipulations can be performed. It also provides record viewing facilities.

    3. When the data is entered, it is checked for validity. Data can be entered with
    the help of screens. Appropriate messages are provided as and when needed so that
    the user is not confused at any moment. Thus the objective of input design is to
    create an input layout that is easy to follow.

    OUT

    Index Page:

    Register Page:

    User:

    User Page:

    Search by Title:

    Search by URL:

    Search by Year:

    Search by Author Name:

    Result Page:

    Additional Information:

    Wrong Keyword:

    Admin Entry:

    Admin Home:

    Admin Image Upload:

    Adding Information:

    Adding Web Content:

    About:

    End:

    CONCLUSION

    In this paper, we studied the data annotation problem and proposed a
    multi-annotator approach to automatically constructing an annotation wrapper for
    annotating the search result records retrieved from any given web database. This
    approach consists of six basic annotators and a probabilistic method to combine the
    basic annotators. Each of these annotators exploits one type of feature for
    annotation and our experimental results show that each of the annotators is useful
    and they together are capable of generating high quality annotation. A special
    feature of our method is that, when annotating the results retrieved from a web
    database, it utilizes both the LIS of the web database and the IS of multiple web
    databases in the same domain. We also explained how the use of the IS can help
    alleviate the local interface schema inadequacy problem and the inconsistent label
    problem.

    In this paper, we also studied the automatic data alignment problem. Accurate
    alignment is critical to achieving holistic and accurate annotation. Our method is a
    clustering based shifting method utilizing richer yet automatically obtainable
    features. This method is capable of handling a variety of relationships between
    HTML text nodes and data units, including one-to-one, one-to-many, many-to-one,
    and one-to-nothing. Our experimental results show that the precision and recall of
    this method are both above 98 percent. There is still room for improvement in
    several areas. For example, we need to enhance our method to split composite text
    nodes when there are no explicit separators. We would also like to try using
    different machine learning techniques and using more sample pages from each
    training site to obtain the feature weights so that we can identify the best technique
    for the data alignment problem.

    R939R9NC9S

    A. Arasu and H. %arcia-Molina, C/!tracting #tructured $ata from *eb 2ages,D

    2roc. #%M"$ nt9l 6onf. Management of $ata, 0EE1.

    0 L. Arlotta, @. 6rescen)i, %. Mecca, and 2. Merialdo, CAutomatic Annotation of

    $ata /!tracted from Large *eb #ites,D 2roc. #i!th nt9l *or'shop the *eb and

    $atabases &*eb$

  • 7/26/2019 -Annotating Full Document

    40/48

    H. /lmeleegy, N. Madhavan, and A. Halevy, CHarvesting 7elational Tables from

    Lists on the *eb,D 2roc. @ery Large $atabases &@L$

  • 7/26/2019 -Annotating Full Document

    41/48

    4 H. He, *. Meng, 6. Ou, and P. *u, C6onstructing nterface #chemas for

    #earch nterfaces of *eb $atabases,D 2roc. *eb nformation #ystems /ng.

    &*#/( 6onf., 0EE4.

    G N. Heflin and N. Hendler, C#earching the *eb with #H"/,D 2roc. AAA

    *or'shop, 0EEE.

    L. aufman and 2. 7ousseeuw, Finding %roups in $ata An ntroduction to

    6luster Analysis. Nohn *iley Q #ons, JJE.

    K 5. rushmeric', $. *eld, and 7. $oorenbos, C*rapper nduction for

    nformation /!traction,D 2roc. nt9l Noint 6onf. Artificial ntelligence &N6A(,

    JJ.

    J N. Lee, CAnalyses of Multiple /vidence 6ombination,D 2roc. 0E thAnn. nt9l

    A6M #%7 6onf. 7esearch and $evelopment in nformation 7etrieval, JJ.

    0E L. Liu, 6. 2u, and *. Han, CR*7A2 An RML-/nabled *rapper

    6onstruction #ystem for *eb nformation #ources,D 2roc. /// Gth nt9l 6onf.

    $ata /ng. &6$/(, 0EE.

    0 *. Liu, R. Meng, and *. Meng, C@i$/ A @ision-

  • 7/26/2019 -Annotating Full Document

    42/48

    01 N. Madhavan, $. o, L. Lot, @. %anapathy, A. 7asmussen, and A.O. Halevy,

    C%oogle9s $eep *eb 6rawl,D 2roc. @L$< /ndowment, vol. , no. 0, pp. 03-

    040, 0EEK.

    03 *. Meng, 6. Ou, and . Liu, C

  • 7/26/2019 -Annotating Full Document

    43/48

    Databases," Proc. 12th Int'l Conf. World Wide Web (WWW), 2003.

    [31] Z. Wu et al., "Towards Automatic Incorporation of Search Engines into a

    Large-Scale Metasearch Engine," Proc. IEEE/WIC Int'l Conf. Web Intelligence

    (WI '03), 2003.

    [32] O. Zamir and O. Etzioni, "Web Document Clustering: A Feasibility

    Demonstration," Proc. ACM 21st Int'l SIGIR Conf. Research Information

    Retrieval, 1998.

    [33] Y. Zhai and

    3 Extracting Structured Data from Web

    Data extraction from web pages is performed by software modules called

    wrappers. Recently, some systems for the automatic generation of wrappers have

    been proposed in the literature. These systems are based on unsupervised inference

    techniques: taking as input a small set of sample pages, they can produce a

    common wrapper to extract relevant data. However, due to the automatic nature of

    the approach, the data extracted by these wrappers have anonymous names. In the

    framework of our ongoing project RoadRunner, we have developed a prototype,

    called Labeller, that automatically annotates data extracted by automatically

    generated wrappers. Although Labeller has been developed as a companion system

    to our wrapper generator, its underlying approach has a general validity and

    therefore it can be applied together with other wrapper generator systems. We have

    tested the prototype on several real-life web sites, obtaining encouraging

    results.

    4 Experiments on Multistrategy Learning by Meta-Learning

    AUTHORS: P. Chan and S. Stolfo

    In this paper, we propose meta-learning as a general technique to combine the

    results of multiple learning algorithms, each applied to a set of training data. We

    detail several meta-learning strategies for combining independently learned

    classifiers, each computed by different algorithms, to improve overall prediction

    accuracy. The overall resulting classifier is composed of the classifiers generated

    by the different learning algorithms and a meta-classifier generated by a

    meta-learning strategy. The strategies described here are independent of the learning

    algorithms used. Preliminary experiments using different strategies and learning

    algorithms on two molecular biology sequence analysis data sets demonstrate

    encouraging results. Machine learning techniques are central to automated

    knowledge discovery systems and hence our approach can enhance the

    effectiveness of such systems.
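
    The idea of a meta-classifier trained on the outputs of base classifiers can be

    illustrated with a small sketch. This is not the authors' implementation: the three

    base rules (rule_a, rule_b, rule_c) are hand-written stand-ins for classifiers that

    would normally be learned by different algorithms, and the meta-level is assumed

    to be a simple lookup table from base-prediction tuples to the most frequent true

    label. All names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical base classifiers. In a real setting each would be learned by a
# different learning algorithm; fixed rules are used here only for illustration.
def rule_a(x):
    return "pos" if x[0] > 0.5 else "neg"

def rule_b(x):
    return "pos" if x[1] > 0.5 else "neg"

def rule_c(x):
    return "pos" if x[0] + x[1] > 1.0 else "neg"

BASE = [rule_a, rule_b, rule_c]

def train_meta(examples, base=BASE):
    """Learn a meta-classifier as a table: base predictions -> most frequent label."""
    votes = defaultdict(Counter)
    for features, label in examples:
        key = tuple(clf(features) for clf in base)
        votes[key][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def predict(features, meta, base=BASE):
    """Classify by consulting the meta table; fall back to a majority vote."""
    key = tuple(clf(features) for clf in base)
    if key in meta:
        return meta[key]
    # Combination of base predictions never seen in training: majority vote.
    return Counter(key).most_common(1)[0][0]

# Toy validation set (assumed data) used to train the meta level.
examples = [
    ((0.9, 0.2), "neg"),
    ((0.9, 0.9), "pos"),
    ((0.1, 0.1), "neg"),
]
meta = train_meta(examples)
```

    In this toy data, a plain majority vote would label (0.95, 0.15) as "pos", whereas

    the meta table has learned that this pattern of base predictions corresponds to

    "neg". The meta level can therefore correct systematic errors of the base classifiers

    without depending on how they were produced.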

    5 Combining Approaches for Information Retrieval

    AUTHORS: W.

    a standard technique for improving the effectiveness of information retrieval.

    Combination, for example, has been studied extensively in the TREC evaluations

    and is the basis of the "meta-search" engines used on the Web. This paper

    examines the development of this technique, including both experimental results

    and the retrieval models that have been proposed as formal frameworks for

    combination. We show that combining approaches for information retrieval can be

    modeled as combining the outputs of multiple classifiers based on one or more

    representations, and that this simple model can provide explanations for many of

    the experimental results. We also show that this view of combination is very

    similar to the inference net model, and that a new approach to retrieval based on

    language models supports combination and can be integrated with the inference net

    model.
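
    As a rough illustration of combining the outputs of several retrieval runs, the

    sketch below uses a CombSUM-style rule: each run's scores are min-max

    normalized and then summed per document. This rule is a commonly used fusion

    technique chosen here for illustration; it is not taken from the paper summarized

    above, and the function names and sample scores are assumptions.

```python
def min_max(scores):
    """Rescale one run's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(runs):
    """Fuse several runs (dicts of doc -> score) by summing normalized scores."""
    fused = {}
    for run in runs:
        if not run:
            continue  # an empty run contributes nothing
        for doc, score in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Two hypothetical runs with scores on different scales.
run1 = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
run2 = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
ranking = comb_sum([run1, run2])
```

    Here d2 is ranked first because both runs retrieve it with a fairly high score,

    which reflects the intuition that agreement among independent sources of

    evidence is itself evidence of relevance.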

    6 SemTag and Seeker: Bootstrapping the Semantic Web via Automated

    Semantic Annotation

    AUTHORS: S. Dill et al.

    This paper describes Seeker, a platform for large-scale text analytics, and SemTag,

    an application written on the platform to perform automated semantic tagging of

    large corpora. We apply SemTag to a collection of approximately 264 million web

    pages, and generate approximately 434 million automatically disambiguated

    semantic tags, published to the web as a label bureau providing metadata regarding

    the 434 million annotations. To our knowledge, this is the largest scale semantic

    tagging effort to date. We describe the Seeker platform, discuss the architecture of

    the SemTag application, describe a new disambiguation algorithm specialized to

    support ontological disambiguation of large-scale data, evaluate the algorithm, and

    present our final results with information about acquiring and making use of the

    semantic tags. We argue that automated large scale semantic tagging of ambiguous

    content can bootstrap and accelerate the creation of the semantic web.

