Data Extraction From HTML Tables Cui Tao Department of Computer Science Brigham Young University

Post on 20-Dec-2015


Data Extraction From HTML Tables

Cui Tao

Department of Computer Science

Brigham Young University


Information In Tables

A significant portion of the information on the Web is stored in tables.


The Ontology-Based Extraction



Major Problems

In tables, values and their corresponding attributes appear separately, but the ontology can extract data only when they appear together.

Sometimes the attributes in the table are actually values in the database, and the values in the table serve only as indicators for those attributes.

Sometimes the value in a single table cell may supply several attribute values in the database.


Attribute-Value Pair

Attribute: (part of the) constant/keyword rule


How To Solve This Problem?

Put each attribute-value pair together. Try both orders.
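The pairing step can be sketched roughly as follows. This is only an illustration, not the presenter's implementation: the function name `pair_cells`, the sample headers, and the sample row are all hypothetical, and the sketch assumes the table has already been parsed into a list of column headers and a list of rows.

```python
from typing import Iterable, List


def pair_cells(headers: List[str], rows: Iterable[List[str]]) -> List[List[str]]:
    """Pair every cell with its column header, emitting both orders.

    Returns one list of text lines per table row, so that a recognizer
    expecting the attribute and the value side by side can match either
    "attribute value" or "value attribute".
    """
    records = []
    for row in rows:
        lines = []
        for attribute, value in zip(headers, row):
            value = value.strip()
            if not value:
                continue  # nothing to pair for an empty cell
            lines.append(f"{attribute} {value}")  # attribute first
            lines.append(f"{value} {attribute}")  # value first
        records.append(lines)
    return records


# Hypothetical sample table, used only for illustration.
headers = ["Year", "Make", "Price"]
rows = [["1998", "Honda", "$4,500"]]
for record in pair_cells(headers, rows):
    print("\n".join(record))
```

Emitting both orders doubles the text handed to the extractor, but it avoids having to guess which order the extraction rules expect.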


More General…


The attributes in the table are actually values in the database…

Attribute Value


How To Solve This Problem?

Put the attribute in the file depending on the Boolean value.
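A minimal sketch of this idea, under stated assumptions: the set of cell strings treated as "true" (`TRUE_MARKERS`), the function name `attributes_to_values`, and the sample columns are all hypothetical and not taken from the slides. The column header is written out as a value only when its cell holds a true-like marker.

```python
from typing import List, Set

# Assumed set of cell contents that count as "true"; adjust per table.
TRUE_MARKERS: Set[str] = {"yes", "y", "true", "x", "1"}


def attributes_to_values(
    headers: List[str], row: List[str], true_markers: Set[str] = TRUE_MARKERS
) -> List[str]:
    """Return the column headers whose cell in this row is marked true."""
    return [
        attribute
        for attribute, cell in zip(headers, row)
        if cell.strip().lower() in true_markers
    ]


# Hypothetical sample row, used only for illustration.
headers = ["Air Conditioning", "Cruise Control", "Sunroof"]
row = ["Yes", "No", "yes"]
print(attributes_to_values(headers, row))  # ['Air Conditioning', 'Sunroof']
```

Cells that hold anything outside the marker set are treated as false, so the corresponding attribute is simply left out of the output.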


One Value, Multiple Pieces of Information


More Problems …