information extraction

Upload: fabienne-kael

Post on 03-Jan-2016

DESCRIPTION

Information Extraction. Shih-Hung Wu, Assistant Professor, CSIE, Chaoyang University of Technology. Outline: Information Extraction, Introduction, Applications, Table Reading, Citation Extraction, Chinese Named Entity Recognition.

TRANSCRIPT

  • Information Extraction

    Shih-Hung Wu, Assistant Professor, CSIE, Chaoyang University of Technology

  • Outline

    Information Extraction
    Introduction
    Applications
    Table Reading
    Citation Extraction
    Chinese Named Entity Recognition

  • Introduction

  • Information Extraction

    Information extraction extracts pieces of information that are salient to the user's needs.

  • Message Understanding Conferences (MUC) Evaluations

    The MUC evaluations provide prepared data and task definitions, along with fully automated scoring software to measure machine and human performance.

    The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated.

    The multilingual portion was known as the "Multilingual Entity Task (MET)".

  • Examples

    The following fictional news story portrays the levels of detail that systems can extract:

    Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.

  • Entities:

    Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
    Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
    Locations: La Jolla, CA
    Artifacts: Geninfo, Geninfo
    Dates: June 1999

  • Attributes:

    NAME: Fletcher Maddox
    DESCRIPTOR: former Dean of the UCSD Business School; his father; the firm's CEO
    CATEGORY: PERSON

    NAME: La Jolla Genomatics
    DESCRIPTOR:
    CATEGORY: ORGANIZATION

    NAME: Geninfo
    DESCRIPTOR: its product
    CATEGORY: ARTIFACT

    NAME: La Jolla
    DESCRIPTOR: the Maddox family's hometown
    CATEGORY: LOCATION

  • Facts:

    PERSON            Employee_of   ORGANIZATION
    Fletcher Maddox   Employee_of   UCSD Business School
    Fletcher Maddox   Employee_of   La Jolla Genomatics
    Oliver            Employee_of   La Jolla Genomatics
    Ambrose           Employee_of   La Jolla Genomatics

    ARTIFACT          Product_of    ORGANIZATION
    Geninfo           Product_of    La Jolla Genomatics

    LOCATION          Location_of   ORGANIZATION
    La Jolla          Location_of   La Jolla Genomatics
    CA                Location_of   La Jolla Genomatics

  • Events:

    COMPANY-FORMATION_EVENT:
    COMPANY: La Jolla Genomatics
    PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
    DATE:
    CAPITAL:

    RELEASE-EVENT:
    COMPANY: La Jolla Genomatics
    PRODUCT: Geninfo
    DATE: June 1999
    COST:
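    The extraction levels shown on these slides map naturally onto small record types. The following is a minimal sketch of one possible representation, not the MUC data format: the class names `Fact` and `Event` and the helper `employees_of` are hypothetical, while the values come from the La Jolla Genomatics story. Empty slots such as COST are kept as `None`.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    """A binary relation between two extracted entities (hypothetical type)."""
    subject: str
    relation: str
    obj: str

@dataclass
class Event:
    """A template with named slots; unfilled slots stay None (hypothetical type)."""
    kind: str
    slots: dict = field(default_factory=dict)

facts = [
    Fact("Fletcher Maddox", "Employee_of", "UCSD Business School"),
    Fact("Fletcher Maddox", "Employee_of", "La Jolla Genomatics"),
    Fact("Oliver", "Employee_of", "La Jolla Genomatics"),
    Fact("Ambrose", "Employee_of", "La Jolla Genomatics"),
    Fact("Geninfo", "Product_of", "La Jolla Genomatics"),
    Fact("La Jolla", "Location_of", "La Jolla Genomatics"),
    Fact("CA", "Location_of", "La Jolla Genomatics"),
]

release = Event("RELEASE-EVENT", {
    "COMPANY": "La Jolla Genomatics",
    "PRODUCT": "Geninfo",
    "DATE": "June 1999",
    "COST": None,  # not stated in the story
})

def employees_of(org, facts):
    """Return the sorted names of everyone linked to org by Employee_of."""
    return sorted({f.subject for f in facts
                   if f.relation == "Employee_of" and f.obj == org})
```

    Once facts are stored this way, a question such as "who works at La Jolla Genomatics?" becomes a filter over the list rather than a re-reading of the text.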

  • Information Extraction

    Current indicators of the state of the art:

    Items of Information    Percentile Reliability
    Entities                90
    Attributes              80
    Facts                   70
    Events                  60

  • Technical definition of IE

    The process of creating database entries by skimming a text and looking for occurrences of a particular class of object or event and for relationships among those objects and events [Russell, Norvig 2003].

  • Basic IE tasks

    Extract addresses from Web pages. Target: street, city, state, and zip code.
    Extract storms from weather reports. Target: temperature, wind speed, and precipitation.

  • IE Applications

    Competitive intelligence: find instances of corporate mergers and joint ventures.

    Intelligence gathering: terrorist activities, any damage to buildings or the infrastructure, as well as the time and location of the event.

    Health care delivery: summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments.

  • Technology

    Methods in the literature: regular expressions, cascaded finite-state transducers

    Our approaches: ontological domain knowledge, machine learning, hybrid method

  • Regular expression approach example

    From the text: 17in SXGA Monitor for only $249.99
    Extract: ∃ m  m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)

  • Regular Expressions

    [0-9]                        Any digit from 0 to 9
    [0-9]+                       One or more digits
    [.][0-9][0-9]                A period followed by two digits
    ([.][0-9][0-9])?             A period followed by two digits, or nothing

    [$][0-9]+([.][0-9][0-9])?    Matches: $249.99, $1.23, $100000
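    The price pattern on this slide can be tried directly with Python's `re` module. This is a minimal sketch: the function name `first_price` is hypothetical, and the period and dollar sign are written as `[.]` and `[$]` so that they match literally.

```python
import re

# A dollar sign, one or more digits, then optionally a period and two digits.
PRICE = re.compile(r"[$][0-9]+([.][0-9][0-9])?")

def first_price(text):
    """Return the first price-like substring in text, or None if there is none."""
    match = PRICE.search(text)
    return match.group(0) if match else None

print(first_price("17in SXGA Monitor for only $249.99"))
```

    The same pattern accepts all three examples listed on the slide, with or without the cents part.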

  • Weakness

    What's the price?
    List price $99.00, special sale price $78.00, shipping $3.00.
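    To make the weakness concrete, the sketch below runs a price pattern over the slide's sentence and lists every candidate together with the words just before it. The function name `price_candidates` and the 20-character context window are assumptions made for illustration.

```python
import re

PRICE = re.compile(r"[$][0-9]+(?:[.][0-9][0-9])?")

def price_candidates(text, context_chars=20):
    """Return (left context, price) pairs for every price-like match in text."""
    candidates = []
    for match in PRICE.finditer(text):
        left = text[max(0, match.start() - context_chars):match.start()].strip()
        candidates.append((left, match.group(0)))
    return candidates

sentence = "List price $99.00, special sale price $78.00, shipping $3.00."
for left, price in price_candidates(sentence):
    print(repr(left), "->", price)
```

    The pattern alone returns three equally valid matches; choosing the right one needs the surrounding words ("list", "special sale", "shipping"), which a single attribute-level regular expression does not model.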

  • Cascaded finite-state transducers approach example

    From the text: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.
    Extract: ∃ e  e ∈ JointVentures ∧ Product(e, "golf clubs") ∧ Date(e, "Friday") ∧ Entity(e, "Bridgestone Sports Co") ∧ Entity(e, "a local concern") ∧ Entity(e, "a Japanese trading house")

  • Cascaded finite-state transducers

    A typical relational extraction system consists of the following five stages:
    Tokenization
    Complex word handling
    Basic group handling
    Complex phrase handling
    Structure merging

  • Tokenization

    Word segmentation

    Complex word handling

    "Bridgestone Sports Co." is recognized by the rule CapitalizedWord+ (Company|Co|Inc|Ltd)

    "Intel Chairman Andy Grove" is matched, wrongly, by the place-name rule CapitalizedWord+ (Grove|Forest|Village|...)
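    The two complex-word rules can be approximated with ordinary regular expressions. This is a sketch under assumptions: `CapitalizedWord` is taken to mean an ASCII word starting with an uppercase letter, and the names `COMPANY` and `PLACE` are hypothetical.

```python
import re

CAP = r"[A-Z][A-Za-z]*"  # assumed definition of CapitalizedWord

# CapitalizedWord+ (Company|Co|Inc|Ltd), with an optional trailing period
COMPANY = re.compile(rf"(?:{CAP}\s+)+(?:Company|Co|Inc|Ltd)\b\.?")

# CapitalizedWord+ (Grove|Forest|Village), intended for place names
PLACE = re.compile(rf"(?:{CAP}\s+)+(?:Grove|Forest|Village)\b")

sentence = "Bridgestone Sports Co. said Friday it has set up a joint venture"
print(COMPANY.search(sentence).group(0))

# The place-name rule also fires on a person's name.
print(PLACE.search("Intel Chairman Andy Grove").group(0))
```

    The second match shows why such rules need to be tested for precision as well as recall: the pattern has no way to tell that "Andy Grove" is a person.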

  • Basic group handling

    Noun group (NG), verb group (VG), preposition (PR), conjunction (CJ)

    1 NG: Bridgestone Sports Co.
    2 VG: said
    3 NG: Friday
    4 NG: it
    5 VG: had set up
    6 NG: a joint venture
    7 PR: in
    8 NG: Taiwan
    9 PR: with
    10 NG: a local concern
    11 CJ: and
    12 NG: a Japanese trading house
    13 VG: to produce
    14 NG: golf clubs
    15 VG: to be shipped
    16 PR: to
    17 NG: Japan

  • Complex phrase handling

    Company+ SetUp JointVenture ("with" Company+)?

    Structure merging

    If the next sentence says something about the same event, the structures built from the two sentences are merged into one.
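    The complex-phrase stage can be sketched by mapping each basic group to a one-letter symbol and running the slide's pattern over the symbol string. Everything below is a toy illustration, not the FASTUS implementation: the `symbol` classifier is a hand-written stand-in for real company and location recognizers, and the optional "in Location" step in `PATTERN` is an assumption added so that "in Taiwan" does not block the match.

```python
import re

# Basic groups from the previous stage: (tag, text)
CHUNKS = [
    ("NG", "Bridgestone Sports Co."), ("VG", "said"), ("NG", "Friday"),
    ("NG", "it"), ("VG", "had set up"), ("NG", "a joint venture"),
    ("PR", "in"), ("NG", "Taiwan"), ("PR", "with"),
    ("NG", "a local concern"), ("CJ", "and"),
    ("NG", "a Japanese trading house"), ("VG", "to produce"),
    ("NG", "golf clubs"), ("VG", "to be shipped"), ("PR", "to"),
    ("NG", "Japan"),
]

def symbol(tag, text):
    """Map a chunk to one symbol: C company, S set-up, J joint venture,
    I 'in', L location, W 'with', & conjunction, x anything else."""
    low = text.lower()
    if tag == "NG":
        if "joint venture" in low:
            return "J"
        if low in ("taiwan", "japan"):
            return "L"
        if (low == "it" or low.endswith("co.")
                or "concern" in low or "trading house" in low):
            return "C"
    if tag == "VG" and "set up" in low:
        return "S"
    if tag == "PR" and low == "with":
        return "W"
    if tag == "PR" and low == "in":
        return "I"
    if tag == "CJ":
        return "&"
    return "x"

# Company+ SetUp JointVenture (in Location)? ("with" Company (and Company)*)?
PATTERN = re.compile(r"(C+)SJ(?:IL)?(?:W(C(?:&C)*))?")

def extract_joint_venture(chunks):
    """Return the founders and partners of the first joint-venture phrase."""
    symbols = "".join(symbol(tag, text) for tag, text in chunks)
    match = PATTERN.search(symbols)
    if match is None:
        return None

    def companies(group):
        if match.group(group) is None:
            return []
        return [chunks[i][1]
                for i in range(match.start(group), match.end(group))
                if symbols[i] == "C"]

    return {"founders": companies(1), "partners": companies(2)}

print(extract_joint_venture(CHUNKS))
```

    The founder comes out as the pronoun "it"; linking it back to Bridgestone Sports Co. is left to coreference resolution and the structure-merging stage.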

  • A brief remark

    IE works well for a restricted domain.
    Predetermine the subjects and how they are mentioned.

  • Applications

  • Table Reading

    Citation Extraction

    Chinese NER

  • Semantic Search on Internet Tabular Information Extraction for Answering Queries

    CIKM 2000

  • Table Reading

    Gives an algorithm to interpret tables of the type shown below, where some cells span multiple rows or columns.
    An example of interpretation is:
    (Attribute) => (Value)
    (Adult-Price-Single Room-Economic class) => 35,450
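    One way to picture the (Attribute) => (Value) interpretation is to expand every spanning cell into a full grid and then join the row and column headers of each data cell. This is a minimal sketch, not the algorithm from the CIKM 2000 paper. The table contents are hypothetical except for the value 35,450 and its attribute path; "First class", "Double Room", and the placeholders v1 to v3 are invented for illustration.

```python
def expand(rows):
    """Expand cells given as (text, rowspan, colspan) into a rectangular grid."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:      # skip slots filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

def interpret(rows, header_rows, header_cols):
    """Map each data cell to an attribute path built from its headers."""
    grid = expand(rows)
    result = {}
    for r in range(header_rows, len(grid)):
        for c in range(header_cols, len(grid[r])):
            path = grid[r][:header_cols] + [grid[h][c] for h in range(header_rows)]
            result["-".join(p for p in path if p)] = grid[r][c]
    return result

# Hypothetical fare table: "Adult" and "Price" each span two rows.
TABLE = [
    [("", 1, 3), ("Economic class", 1, 1), ("First class", 1, 1)],
    [("Adult", 2, 1), ("Price", 2, 1), ("Single Room", 1, 1),
     ("35,450", 1, 1), ("v1", 1, 1)],
    [("Double Room", 1, 1), ("v2", 1, 1), ("v3", 1, 1)],
]

for attribute, value in interpret(TABLE, header_rows=1, header_cols=3).items():
    print(f"({attribute}) => {value}")
```

    Copying a spanning header into every slot it covers is what lets the third row inherit "Adult" and "Price" even though its own source row starts at "Double Room".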

  • Table Reading

    Example: a duty schedule laid out as a two-dimensional table (rows Mon to Sat, columns Morning and Afternoon) is read into one-dimensional records.

    Date  Duration   Doctor (ID)
    Mon   Morning    Jhon Wang (2002)
    Mon   Afternoon  Indy Lai (2005)
    Tue   Morning    Jimmy Lin (2007)
    Tue   Afternoon  Wendy Lee (2001)
    Wen   Morning    Indy Lai (2005)
    Wen   Afternoon  Jhon Wang (2002)
    Thr   Morning    Jimmy Lin (2007)

  • Method

    Tagging → Layout Recognition → Layout Transformation
    Ambiguous cases: Relations of Cells

  • Method: Tagging, Layout Identifying, Layout Transformation

    Example 1: airline schedule with a two-level header

    Flight                  Departing             Arriving              Aircraft
                            City  Date & Time     City  Date & Time
    EVA Airways 10          TPE   07/14 11:50pm   YVR   07/14 07:40pm   744
    American Airlines 6647  YVR   07/14 09:00pm   LAX   07/14 11:45pm   737
    Air Canada 9800         TPE   07/14 11:50pm   YVR   07/14 07:40pm   744
    American Airlines 6501  YVR   07/15 06:35am   LAX   07/15 09:27am   737
    China Airlines 61       TPE   07/14 08:10pm   FRA   07/15 06:50am   M11
    American Airlines 83    FRA   07/15 10:40am   ORD   07/15 01:05pm   763
    American Airlines 473   ORD   07/15 02:30pm   LAX   07/15 04:35pm   738

    Tagged layout (one I row per flight):

    CFlight  CDep-Info               CArr-Info               CAircraft
             CDepCity  CDepDateTime  CArrCity  CArrDateTime
    IFlight  IDepCity  IDepDateTime  IArrCity  IArrDateTime  IAircraft

    Transformed layout (data rows unchanged):

    CFlight  CDepCity   CDepDateTime      CArrCity   CArrDateTime      CAircraft
    Flight   Dep. City  Dep. Date & Time  Arr. City  Arr. Date & Time  Aircraft

    Example 2: duty schedule

    Time  Morning           Afternoon
    Mon   Jhon Wang (2002)  Indy Lai (2005)
    Tue   Jimmy Lin (2007)  Wendy Lee (2001)
    Wen   Indy Lai (2005)   Jhon Wang (2002)
    Thr   Jimmy Lin (2007)  Jhon Wang (2002)
    Fri   Wendy Lee (2001)  Indy Lai (2005)
    Sat   Jhon Wang (2002)

    Tagged layout:

    CTime   I2rDuration
    I6cDay  I26Name&ID

    Transformed layout:

    CDay     CDuration     CName&ID
    IDay(1)  IDuration(1)  IName&ID(1, 1)
    IDay(1)  IDuration(2)  IName&ID(2, 1)
    IDay(2)  IDuration(1)  IName&ID(1, 2)
    IDay(2)  IDuration(2)  IName&ID(2, 2)
    IDay(3)  IDuration(1)  IName&ID(1, 3)

    Day  Duration   Name&ID
    Mon  Morning    Jhon Wang (2002)
    Mon  Afternoon  Indy Lai (2005)
    Tue  Morning    Jimmy Lin (2007)
    Tue  Afternoon  Wendy Lee (2001)
    Wen  Morning    Indy Lai (2005)

  • Airline Schedule Ontology

  • Tagging

    C: Departure City (concept cell)
    I: Departure City (instance cell)

  • Four Relations of Table Cells

    Relations of Concept - Instances:
    Concept - Instance of the Concept
    Concept - Descendant Concept
    Concept - Instance of Descendant Concept
    Instance - Instance of the same Concept

  • Layout Recognition

    C-I Table → Template Matching against Layout Descriptions → Matched Layout Description
    Layout descriptions are defined by a Layout Syntax Grammar.

  • Layout Transformation

    Origin Layout Description → Destination Layout Description

    Origin (two-dimensional duty schedule):

    CTime   I2rDuration
    I6cDay  I26Name&ID

    Destination (one-dimensional):

    CDay     CDuration     CName&ID
    IDay(0)  IDuration(0)  IName&ID(0, 0)
    IDay(0)  IDuration(1)  IName&ID(1, 0)
    IDay(1)  IDuration(0)  IName&ID(0, 1)
    IDay(1)  IDuration(1)  IName&ID(1, 1)
    IDay(2)  IDuration(0)  IName&ID(0, 2)

    General form: an origin layout with n row instances IX and m column instances IY maps to destination rows IX(j), IY(i), IZ(i, j), for i = 1..m and j = 1..n.
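    For the duty schedule used on these slides, the 2D-to-1D layout transformation amounts to pairing every row header with every column header. The sketch below assumes a table whose tagging and layout recognition are already done (first row holds the Duration instances, first column the Day instances); the function name `to_one_dimensional` is hypothetical, and the names and abbreviations are copied as they appear on the slides.

```python
SCHEDULE = [
    ["Time", "Morning", "Afternoon"],
    ["Mon", "Jhon Wang (2002)", "Indy Lai (2005)"],
    ["Tue", "Jimmy Lin (2007)", "Wendy Lee (2001)"],
    ["Wen", "Indy Lai (2005)", "Jhon Wang (2002)"],
    ["Thr", "Jimmy Lin (2007)", "Jhon Wang (2002)"],
    ["Fri", "Wendy Lee (2001)", "Indy Lai (2005)"],
    ["Sat", "Jhon Wang (2002)", ""],
]

def to_one_dimensional(table):
    """Turn a Day x Duration grid into (Day, Duration, Name&ID) records."""
    header, *rows = table
    durations = header[1:]                     # instances along the top row
    records = [("Day", "Duration", "Name&ID")]
    for row in rows:
        day, cells = row[0], row[1:]           # instance in the first column
        for duration, value in zip(durations, cells):
            if value:                          # skip empty cells
                records.append((day, duration, value))
    return records

for record in to_one_dimensional(SCHEDULE):
    print(record)
```

    The nested loop mirrors the destination layout description: the day index varies slowest, the duration index fastest, and each Name&ID cell is addressed by the pair.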

  • Experiments

    23 tables from 23 web pages: 13 two-dimensional tables and 10 complex tables.
    A table counts as a success only if nothing is missed; any miss counts as a failure.

                 Without preprocessing          With preprocessing
    Cases        Success  Fail  Rate (%)        Success  Fail  Rate (%)
    2D           11       2     84.62           11       2     84.62
    Complex      0        10    0.00            5        5     50.00
    Total        11       12    47.83           16       7     69.57

  • Conclusion & Future Works

    Layout transformation from complex tables to simple tables (1D, 2D).
    A general approach:
    1. Tagging
    2. Semantic Layout Recognition
    3. Layout Transformation
    Ambiguity is reduced by checking cell relations.

  • Reference

    Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu, W. K. Shih, Semantic Search on Internet Tabular Information Extraction for Answering Queries, Ninth International Conference on Information and Knowledge Management (CIKM-2000), McLean, VA, November 6-11, 2000, pp. 243-249. (EI)

    H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining Tables from Large Scale HTML Texts, in Proc. 18th International Conference on Computational Linguistics, Saarbrücken, Germany, July 2000.

  • A Knowledge-based Approach to Citation Extraction

    IRI-2005

  • Introduction

    Integration of the bibliographical information of scholarly publications available on the Internet.
    Accurate reference metadata extraction from heterogeneous reference sources.

    We propose a knowledge-based approach to reference metadata extraction.
    INFOMAP: an ontological knowledge representation framework.
    Automatically extract the reference metadata.

  • Proposed Approach

  • Reference Data Collection (Phase 1)

    Journal Spider (journal agent): collects journal data from the Journal Citation Reports (JCR) indexed by the ISI and from digital libraries on the Web.
    Citation data sources: ISI Web of Science, DBLP, Citeseer, PubMed

  • Domain Knowledge (Phase 2)

  • INFOMAP

    INFOMAP, as an ontological knowledge representation framework, extracts important citation concepts from a natural language text.

    Features of INFOMAP: it can represent and match complicated template structures:
    hierarchical matching
    regular expressions
    semantic template matching
    frame (non-linear relations) matching

    Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.

  • Reference Metadata Extraction (Phase 3)

    Table 1. Examples of different journal reference styles

  • Knowledge-based Reference Metadata Extraction - Online Service (Phase 4)

  • Citation Extraction: From Text to BibTeX

    Figure 3. The system input of knowledge-based RME

    W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
    W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
    W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.

    Figure 5. The system output in BibTeX format

    @article{
      Author = {W. L. Hsu},
      Title = {The coloring and maximum independent set problems on planar perfect graphs,"},
      Journal = {J. Assoc. Comput. Machin.},
      Volume = {},
      Number = {},
      Pages = {535-563},
      Year = {1988 }}

    @article{
      Author = {W. L. Hsu},
      Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"},
      Journal = {Management Science},
      Volume = {29},
      Number = {},
      Pages = {93-105},
      Year = {1983 }}

    @article{
      Author = {W. L. Hsu},
      Title = {The distance-domination numbers of trees,"},
      Journal = {Operations Research Letters},
      Volume = {1},
      Number = {3},
      Pages = {96-100},
      Year = {1982 }}
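    For the single reference style used in this example, the mapping from text to fields can be sketched with one regular expression. This is a stand-in for illustration only, not the INFOMAP template matching the authors describe: the pattern `REF` and the functions `parse_reference` and `to_bibtex` are hypothetical, and the pattern handles only references shaped like the three above.

```python
import re

# Author, "Title," Journal [Volume], [(Number),] (Year), Pages.
REF = re.compile(
    r'^(?P<author>.+?),\s*"(?P<title>.+?),?"\s*'
    r'(?P<journal>.+?)'
    r'(?:\s+(?P<volume>\d+))?,\s*'
    r'(?:\((?P<number>\d{1,3})\),\s*)?'
    r'\((?P<year>\d{4})\),\s*'
    r'(?P<pages>\d+-\d+)\.?$'
)

def parse_reference(text):
    """Return the reference fields as a dict, or None if the style differs."""
    match = REF.match(text.strip())
    return match.groupdict(default="") if match else None

def to_bibtex(fields):
    """Render the fields as an @article entry in the slide's field order."""
    order = ["author", "title", "journal", "volume", "number", "pages", "year"]
    body = ",\n".join(f"  {key.capitalize()} = {{{fields[key]}}}" for key in order)
    return "@article{\n" + body + "\n}"

reference = ('W. L. Hsu, "The distance-domination numbers of trees," '
             'Operations Research Letters 1, (3), (1982), 96-100.')
print(to_bibtex(parse_reference(reference)))
```

    A fixed pattern like this breaks as soon as the field order or punctuation changes, which is the motivation the slides give for a knowledge-based matcher that covers many reference styles.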

  • Figure 6. The online service of knowledge-based RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/)

    System input: plain text. System output: BibTeX.

  • Experimental Results and Discussion

    Experimental data
    We used EndNote to collect Bioinformatics citation data for 2004 from PubMed.
    A total of 907 bibliography records were collected from PubMed digital libraries on the Web.
    Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB).
    For each of the six reference styles, 500 records were randomly selected for testing.

  • Experimental results of citation extraction from six reference styles

  • Example Results

  • The various structures of different styles (analysis of the structures of 30 reference styles)

  • Comparison with related works

    Knowledge-based approach
    Our proposed knowledge-based method for scholarly publications can extract reference information from 907 records in various reference styles with a high degree of precision.
    The overall average field accuracy is:
    97.87% for the six major styles listed in Table 1
    98.20% for the MISQ style
    87% for 30 other randomly selected styles

  • Conclusions

    Citation extraction is a challenging problem because of the diverse nature of reference styles.
    We have proposed a knowledge-based citation extraction method for scholarly publications.
    The experimental results indicate that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of precision.
    The overall average field accuracy of citation extraction is 97.87% for six major reference styles.

  • Future Research

    Integrate the ontological and the machine learning approaches to boost the performance of citation information extraction:
    Maximum-Entropy Method (MEM)
    Hidden Markov Model (HMM)
    Conditional Random Fields (CRF)
    Support Vector Machines (SVM)

  • Reference

    Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung, Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong, and Wen-Lian Hsu, A Knowledge-based Approach to Citation Extraction, to appear in Proceedings of IEEE International Conference on Information Reuse and Integration (IEEE IRI-2005), pp. 50-55. (EI)

  • Chinese Named Entity Recognition Using a Hybrid Approach of Machine Learning and Domain Knowledge

    ROCLING 2003, CLCLP 2004

  • Named Entity Recognition

    Named Entity

  • Sequential Labeling

    Token-based
    Character-based

  • Machine Learning

    named entity, corpus, target named entity, NER

  • Hybrid NER method

    Domain knowledge
    Machine Learning: SVM, Bigram/Trigram Model
    Hybrid: Maximum-Entropy Framework; domain knowledge serves as features

  • Static knowledge is insufficient

    New names (e.g. SARS)
    Ambiguity
    Context dependence

  • Pure machine learning might suffer

    Lack of context information
    Window size: token, tag, NE, NER

  • Basic Concepts of Our ME-based Hybrid Approach

    NE
    Context Information
    Internal/External Features
    Training Data
    Feature, confidence

  • Internal/External Features

    Internal: found within the name string itself
    External: context

  • Tag Set (outcome)

    Each character is a token; each character of a named entity receives one tag from the tag set:
    Person: B-P, I-P, I-P
    Location: B-L, I-L, I-L
    Organization: B-O, I-O, I-O
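    Character-based tagging with this tag set can be sketched as follows. The Chinese examples on the slide did not survive extraction, so the sentence below ("王小明在台北", "Wang Xiaoming is in Taipei") is a hypothetical one, and the tag "N" for characters outside any entity is an assumption, as is the function name `char_tags`.

```python
def char_tags(text, entities, outside="N"):
    """Tag each character: B-<kind> opens an entity, I-<kind> continues it.

    entities is a list of (start, end, kind) spans with end exclusive.
    """
    tags = [outside] * len(text)
    for start, end, kind in entities:
        tags[start] = f"B-{kind}"
        for i in range(start + 1, end):
            tags[i] = f"I-{kind}"
    return list(zip(text, tags))

# Hypothetical sentence: a three-character person name and a two-character location.
sentence = "王小明在台北"
spans = [(0, 3, "P"), (4, 6, "L")]
for char, tag in char_tags(sentence, spans):
    print(f"{char}/{tag}")
```

    Tagging characters rather than words avoids depending on a word segmenter, whose errors would otherwise propagate into entity boundaries.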

  • ME-based NER Framework - Feature Representation

    For example: token,

    Feature f is active!!

  • ME-based NER Framework - Training

    Given a set of features and a training corpus, the ME estimation process produces a model in which every feature f_i has an associated weight.
    With these weights we can compute the conditional probability p(o|h) of an outcome o given a history h.
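    The equation on this slide did not survive extraction. For orientation, the standard conditional maximum-entropy model has the form below; writing the weight of feature f_i as \lambda_i is an assumption about the slide's notation, and this is offered as the textbook form rather than a reconstruction of the slide's Equation 1.

```latex
p(o \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_{i} \lambda_i \, f_i(h, o) \Big),
\qquad
Z(h) = \sum_{o'} \exp\Big( \sum_{i} \lambda_i \, f_i(h, o') \Big)
```

    Here each f_i(h, o) is a binary feature that is 1 when the feature is active for history h and outcome o, so only the weights of active features contribute to the sum. The same model is often written in product form with \alpha_i = e^{\lambda_i}.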

  • ME-based NER Framework - Decoding

    Tokenize the text and preprocess the testing sentence.
    For each token, check which features are active and combine the weights of the active features according to Equation 1.
    Run a Viterbi search to find the highest-probability path.
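    The decoding step can be sketched as a Viterbi search over per-token tag distributions. Several things here are assumptions rather than details from the slides: the tag "N" for non-entity characters, the constraint that I-X may only follow B-X or I-X, and the toy probabilities, which stand in for the p(o|h) values a trained ME model would supply.

```python
import math

TAGS = ["N", "B-P", "I-P", "B-L", "I-L", "B-O", "I-O"]

def allowed(prev, cur):
    """An I-X tag may only continue an entity of the same type X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True

def viterbi(probs):
    """Return the most probable valid tag path.

    probs holds one dict per token, mapping tag -> p(tag | history);
    it must be non-empty and admit at least one valid path.
    """
    best = {}  # tag -> (log-probability, path ending in that tag)
    for tag in TAGS:
        p = probs[0].get(tag, 0.0)
        if p > 0 and not tag.startswith("I-"):   # cannot start inside an entity
            best[tag] = (math.log(p), [tag])
    for dist in probs[1:]:
        nxt = {}
        for cur in TAGS:
            p = dist.get(cur, 0.0)
            if p <= 0:
                continue
            candidates = [(score + math.log(p), path + [cur])
                          for prev, (score, path) in best.items()
                          if allowed(prev, cur)]
            if candidates:
                nxt[cur] = max(candidates, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]

# Toy distributions: picking the best tag per token would give N, I-P (invalid).
toy = [{"N": 0.6, "B-P": 0.4}, {"I-P": 0.9, "N": 0.1}]
print(viterbi(toy))
```

    Choosing the top tag for each token independently can produce sequences that are not well-formed entities; searching over whole paths keeps the output consistent at the cost of a lower score for the first token.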

  • Hybrid NER Example

    The NER problem is formulated as maximizing p(o|h) and finding the corresponding outcome o.

    W0: the current token
    Os, Ls, Ps
    Context (History)
    Feature 1:

  • Advantages of Hybrid NER

    Performance

  • Experiment - Data Set

    United Daily News (December, 2002)

  • Experiment Result

    Use domain knowledge only
    ME-based Hybrid

  • Performance Comparison

    Corpus: MET2 Dataset
    Number of Entities: 3646

  • Conclusion and Future Work

    Conclusion: Hybrid Approach, Precision, Improvement
    Future Work: Named Entity Features, Multi-Iteration NER, Hierarchical Named Entity

  • References

    [Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, (2003) Mencius: A Chinese Named Entity Recognizer Using Hybrid Model, in Proceedings of the Fifteenth Research on Computational Linguistics International Conference (ROCLING XV), pp. 193-209.

    [Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, "Mencius: A Chinese Named Entity Recognizer Based on a Maximum Entropy Framework," Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, pp. 65-82, 2004.

    [Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu, (2004) The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0, in Proceedings of the Fifteenth Conference on Computational Linguistics and Speech Processing (ROCLING XVI), pp. 305-313.

    Flowchart here: Preprocessing for Complex Tables

    Integration of the bibliographical information of scholarly publications available on the Internet is an important task in academic research. To accomplish this task, accurate reference metadata extraction for scholarly publications is essential for the integration of information from heterogeneous reference sources. We propose a knowledge-based approach to literature mining and focus on reference metadata extraction methods for scholarly publications. We adopt an ontological knowledge representation framework called INFOMAP to automatically extract the reference metadata.

    In the data collection stage, we use Journal Spider (a journal agent) to collect journal data from the Journal Citation Reports (JCR) indexed by the ISI and digital libraries on the Web. The major part of the citation data is taken from the ISI Web of Science, DBLP, Citeseer, and PubMed. We then cache the data (around 160,000 records) in the reference database as the knowledge representation data source.
