![Page 1](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/1.jpg)
Rome November 2001 CROSSMARC Third meeting ICDC
French NERC (first version and results)
CROSSMARC Project IST-2000-25366
Third meeting, Rome, November 2001
![Page 2](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/2.jpg)
Summary
• Complete experiment on French corpus
– French mono-product corpus
– Detailed extraction performance
– Examples of limits
• French NERC overview
– XML DTD for named-entity extraction
– Architecture & components description
– Development & maintenance
![Page 3](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/3.jpg)
French Corpus
• 56 mono-product description pages
• 7 manufacturers: SONY, ASUS, DELL…
• 17 models: VAIO, INSPIRON, L8400…
• 6 processors: PENTIUM III, CELERON…
• 5 OS: WIN MILLENIUM, WIN 98…
• Wide ranges of WEIGHTS, PRICES...
![Page 4](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/4.jpg)
Example of extraction
![Page 5](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/5.jpg)
Detailed extraction performances [OK, KO]
• MANUF [56, 0]: small number of cases (7)
• MODEL [56, 0]: large number of configurations (VAIO FX 101, 105, 201, 203, 205, 209, 808, PCG, QR10…)
• PROCESSOR [55, 1]: most of the cases are PENTIUM III & CELERON
• SOFT_OS [51, 5]: small number of cases (WIN XX)
• PRICE [35, 21]: some limits, ambiguities due to component prices
• RESOLUTION [39, 17]: some limits
• SPEED [41, 15]: some limits, ambiguities due to component speeds
• CAPACITY [52, 4]: ambiguities due to component capacities
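The [OK, KO] pairs can be turned into success rates for easier comparison. The sketch below is written in Python for illustration only (the project's own tools are Perl) and assumes each pair counts correctly and incorrectly handled pages out of the 56 in the corpus; the function and variable names are hypothetical.

```python
# [OK, KO] counts copied from the slide; each pair sums to 56.
results = {
    "MANUF": (56, 0),
    "MODEL": (56, 0),
    "PROCESSOR": (55, 1),
    "SOFT_OS": (51, 5),
    "PRICE": (35, 21),
    "RESOLUTION": (39, 17),
    "SPEED": (41, 15),
    "CAPACITY": (52, 4),
}


def success_rate(ok: int, ko: int) -> float:
    """Share of cases counted as OK (assumed meaning of the pair)."""
    return ok / (ok + ko)


rates = {name: success_rate(ok, ko) for name, (ok, ko) in results.items()}

# Print the entities from weakest to strongest.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:<11}{rate:6.1%}")
```

Under that reading, PRICE is the weakest entity at 62.5% (35 of 56), which matches the slide's note about ambiguities with component prices.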
![Page 6](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/6.jpg)
(1a) Limits: Information does not exist
• No weight
![Page 7](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/7.jpg)
(1b) Limits: Information does not exist
• No Soft_OS
![Page 8](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/8.jpg)
(2) Limits: Information inside an image
<big><big><font face="Arial" color="#000080"><strong>13990.00</strong></font></big></big><img src="img/francb.gif">
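A short Python sketch (illustrative only, not the project's Perl code) shows why this case is a limit for a text-based extractor. It assumes the image referenced after the amount carries the currency symbol, which the file name suggests but the slide does not state; the class name `TextAndImages` is hypothetical.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collect visible text and image sources separately."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs).get("src"))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# The snippet from the slide, unchanged.
snippet = ('<big><big><font face="Arial" color="#000080"><strong>13990.00</strong>'
           '</font></big></big><img src="img/francb.gif">')

parser = TextAndImages()
parser.feed(snippet)
parser.close()

visible_text = " ".join(parser.text)
print(visible_text)    # the amount, with no currency in the text
print(parser.images)   # the image that presumably shows the currency
```

The text layer yields only the number, so any rule that expects a currency token next to the amount has nothing to match.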
![Page 9](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/9.jpg)
(3) Limits: One description for several products
![Page 10](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/10.jpg)
(4) Limits: Information outside of the page
![Page 11](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/11.jpg)
(5) Limits: Information contains an error
Soft_OS = windows 200
![Page 12](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/12.jpg)
Perspectives
• Ambiguities will be managed by the Fact Extractor Module
• Limits should be discussed by the Consortium
– Information does not exist
– Information inside an image
– One description for several products
– Information outside of the page
– Information contains an error
![Page 13](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/13.jpg)
French NERC Overview
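[Architecture diagram. Components: laptops.xml, nerc.dtd, xml2nerc, nerc-laptops.pl, Nerc.pm, product.html, extraction.html. Phases: static step, dynamic step. Arrow types: refers to, is processed by, generates. File types in the legend: XML, Perl, HTML, XHTML.]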
![Page 14](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/14.jpg)
nerc.dtd
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- DTD French NERC -->
<!-- Informatique CDC 2001 -->
<!-- Project CROSSMARC -->
<!ELEMENT nerc (feature)+>
<!ATTLIST nerc domain CDATA #REQUIRED>
<!ELEMENT feature (element)+>
<!ATTLIST feature
  no CDATA #REQUIRED
  name CDATA #REQUIRED
  type (STRING|INTEGER|DECIMAL|DOUBLE-INTEGER) #REQUIRED
  if CDATA #REQUIRED
  weak CDATA #IMPLIED>
<!ELEMENT element (form)+>
<!ATTLIST element
  norm CDATA #REQUIRED
  weak CDATA #IMPLIED>
<!ELEMENT form (#PCDATA)>
• DTD file
• Domain-independent rulebase meta-description
• nerc: main element
– domain
• feature: a feature of a product (e.g., SPEED)
– no
– name
– type
– if
– weak
• element: an element of a feature (e.g., MHz)
– norm
– weak
• form: string or regex of an element (e.g., "[Mm][Hh][Zz]")
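As an illustration of how a rulebase shaped after this DTD might be read and applied, here is a minimal Python sketch (the actual tools are Perl). The XML fragment is hypothetical, built only from the examples on the slide (SPEED, MHz, "[Mm][Hh][Zz]"). The slide does not explain the `if` and `weak` attributes, so `if` is left empty and `weak` is omitted. The matching strategy (an integer followed by a unit form) is an assumption, and no DTD validation is performed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rulebase instance following the structure of nerc.dtd.
RULEBASE = """\
<nerc domain="laptops">
  <feature no="1" name="SPEED" type="INTEGER" if="">
    <element norm="MHz">
      <form>[Mm][Hh][Zz]</form>
    </element>
  </feature>
</nerc>
"""


def load_forms(xml_text):
    """Return (feature name, norm, compiled form) triples from a rulebase."""
    root = ET.fromstring(xml_text)
    rules = []
    for feature in root.findall("feature"):
        for element in feature.findall("element"):
            for form in element.findall("form"):
                rules.append((feature.get("name"), element.get("norm"),
                              re.compile(form.text)))
    return rules


def find_entities(text, rules):
    """Assumed strategy: an integer value followed by one of the unit forms."""
    found = []
    for name, norm, form in rules:
        pattern = re.compile(r"(\d+)\s*(?:" + form.pattern + r")")
        for match in pattern.finditer(text):
            # Report the value with the normalized unit, not the surface form.
            found.append((name, int(match.group(1)), norm))
    return found


rules = load_forms(RULEBASE)
entities = find_entities("Pentium III 850 mhz, 128 Mo", rules)
print(entities)
```

The `norm` attribute is what lets different surface forms ("mhz", "MHZ") be reported under one normalized unit.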
![Page 15](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/15.jpg)
laptops.xml (1)
• XML file
• Domain-dependent matching rulebase description
![Page 16](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/16.jpg)
laptops.xml (2)
• Domain-independent disambiguation
![Page 17](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/17.jpg)
xml2nerc
• Perl program
• Domain-independent XML-to-Perl translator
• Refers to nerc.dtd: elements, attributes, PCDATA
• Refers to Nerc.pm: main, matching and disambiguation algorithms
![Page 18](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/18.jpg)
Nerc.pm
• Perl module
• Domain-independent pattern matching
• Domain-independent disambiguation
![Page 19](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/19.jpg)
nerc-laptops.pl
• Generated domain-dependent Perl program
• Applies pattern matching and disambiguation
• Generates the named entities that are recognized
• Refers to Nerc.pm: matching and disambiguation algorithms
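The two-step design described on these slides (a translator that generates a domain-dependent program, which is then run on product pages) can be sketched as follows. This is a Python stand-in for illustration, not the Perl implementation: `generate_program` plays the role attributed to xml2nerc, the generated `extract` function plays the role of nerc-laptops.pl, and the rule tuple, the matching pattern, and the sample page fragment are all hypothetical.

```python
# Hypothetical (feature, norm, form) rules, as they might be read from laptops.xml.
RULES = [("SPEED", "MHz", "[Mm][Hh][Zz]")]


def generate_program(rules):
    """Static step (assumed): emit source code for a domain-dependent extractor."""
    lines = ["import re", "", "def extract(text):", "    found = []"]
    for name, norm, form in rules:
        # Assumed pattern: an integer followed by the unit form.
        pattern = r"(\d+)\s*(?:" + form + ")"
        lines.append(f"    for m in re.finditer({pattern!r}, text):")
        lines.append(f"        found.append(({name!r}, m.group(1), {norm!r}))")
    lines.append("    return found")
    return "\n".join(lines) + "\n"


# Dynamic step (assumed): load the generated program and apply it to a page.
source = generate_program(RULES)
namespace = {}
exec(source, namespace)
entities = namespace["extract"]("<td>Processeur : Celeron 700 MHz</td>")

print(source)
print(entities)
```

Generating the extractor ahead of time keeps the domain knowledge in the XML rulebase, so a new domain would need a new rulebase rather than new matching code.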
![Page 20](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/20.jpg)
FNERC Development & Maintenance
[Maintenance-level diagram. Components: nerc.dtd, xml2nerc / Nerc.pm, laptops.xml. Axis: domain dependent to domain independent.]
• Level 0: New PCDATA string
• Level 1: Attributes value
• Level 2: New PCDATA regex
• Level 3: New attribute value
• Level 4: New attribute enum.
• Level 5: New attribute
![Page 21](https://reader035.vdocument.in/reader035/viewer/2022070605/5a4d1ad17f8b9ab059971541/html5/thumbnails/21.jpg)
Perspectives
• WP1: Experimenting with the NERC as a better evaluation function for the topic spider
• WP2: Improving the FNERC
• WP3: Implementing disambiguation techniques for the Fact Extractor Module