[submission] final_presentation

Higgs-Reader Team C. Arif Jafer, Camilo Celis, Marcus Low,

Upload: marcus-low-junxiang

Post on 18-Jul-2015


TRANSCRIPT

Page 1: [submission] Final_Presentation

Higgs-Reader

Team C. Arif Jafer, Camilo Celis, Marcus Low,

Page 2: [submission] Final_Presentation

Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo

Page 3: [submission] Final_Presentation

Overview

Page 4: [submission] Final_Presentation

Overview

● A reader engine is composed of:
  ○ A web-page text extraction algorithm, to find the main article text
  ○ Heuristics to find metadata and images relevant to the main article
  ○ A user interface to embody the reader engine

Page 5: [submission] Final_Presentation

Overview (boilerpipe)

Page 6: [submission] Final_Presentation

Overview

● Higgs-Reader (built upon DOM-Distiller)
  ○ Boilerpipe extended with a Korean Language Model
    ■ Tools to train the model - Weka / C4.5 Decision Trees
    ■ Tools to generate the training set - WebAnn
    ■ Integration of the model back into the reader engine
  ○ Existing heuristics in DOM-Distiller will be tuned to improve the performance for Korean web pages
  ○ Final Reader Chrome Extension

Page 7: [submission] Final_Presentation

Goals Met
● Extended the DOM-Distiller reader engine with enhanced support for Korean web pages.
● Created a new Korean language model for text extraction
● Tuned the existing heuristics to improve performance on Korean websites
● Created a Reader UI to embody the reader engine

Page 8: [submission] Final_Presentation

Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for non-English web pages.
● Korean websites did not commonly follow website markup standards such as the OpenGraph protocol, schema.org, etc.
● The majority of websites still use <div> or <table> tags to separate content. This eliminates the possibility of identifying the semantics of any particular section of the HTML source.
● Poor performance on multi-page websites. The reader should be able to retrieve all of the pages, or at most K of them, at once.
● Poor performance on detection of relevant images and other rich-content media.
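
The missing-markup problem above suggests a fallback chain for metadata: use OpenGraph data when a page provides it, and fall back to plain HTML otherwise. The following is a minimal Python sketch of that idea for the page title only. It is not DOM-Distiller's code; the class and function names are made up for illustration, and only the `og:title` property and the `<title>` element are considered.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the og:title value and the <title> text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.html_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.html_title += data


def find_title(html):
    """Prefer OpenGraph metadata; fall back to <title>; None if neither exists."""
    finder = TitleFinder()
    finder.feed(html)
    return finder.og_title or finder.html_title.strip() or None
```

A page built only from <div> and <table> tags with no <title> yields None here, which is where content-based heuristics would have to take over.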

Page 9: [submission] Final_Presentation

Requirements Met
● A Korean language model was made and integrated into the Boilerpipe algorithm.
  ○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages whose layouts are made with tables.
● Better support for multi-page web pages.
● Enhanced the relevant image detection heuristic
● Chrome Extension implementation (Final Reader)
● Comparison mechanism for testing purposes
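
The multi-page requirement (retrieve all pages, or at most K of them) can be sketched as a bounded walk over "next page" links. This Python sketch is an illustration only, not the project's implementation. `fetch` is a hypothetical callback that returns the article text of a page and the URL of the following page (or None on the last page). How the next link is detected is not covered.

```python
def collect_pages(start_url, fetch, max_pages):
    """Follow next-page links from start_url and return at most max_pages texts.

    fetch(url) is a hypothetical callback returning (article_text, next_url),
    with next_url set to None on the last page.
    """
    texts, seen, url = [], set(), start_url
    # Stop at the last page, at the page limit, or when a link loops back.
    while url is not None and url not in seen and len(texts) < max_pages:
        seen.add(url)
        text, url = fetch(url)
        texts.append(text)
    return texts


# Usage with an in-memory stand-in for a three-page article.
site = {"p1": ("one", "p2"), "p2": ("two", "p3"), "p3": ("three", None)}
first_two = collect_pages("p1", site.get, max_pages=2)
```

The `seen` set guards against pagination links that point back to an earlier page.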

Page 10: [submission] Final_Presentation

Approach

● 4 Stages
  ○ Web Page Annotator
  ○ Korean Language Model for boilerpipe
  ○ Reader Engine tuning
  ○ Reader UI

Page 11: [submission] Final_Presentation

Approach / Architecture (Overall)

Page 12: [submission] Final_Presentation

Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
  ○ Built as a Chrome extension
  ○ Provides a simple UI to annotate different sections of a web page with predefined labels.
    ■ HEADING
    ■ FULL_CONTENT
    ■ SUPPLEMENTARY
    ■ COMMENTS
    ■ RELEVANT_IMAGES
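
One way to picture what an annotation tool like this might store is a small record per annotated section. The Python sketch below uses the five labels from the slide. Everything else is an assumption for illustration: the record fields, the CSS selector, and in particular the mapping of labels onto the two classes (content / boilerplate) used later for training.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The predefined WebAnn labels listed on the slide."""
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass(frozen=True)
class Annotation:
    url: str        # page the annotation was made on
    selector: str   # hypothetical: CSS selector of the annotated element
    label: Label


# Assumption, not from the slides: only these labels count as "content".
CONTENT_LABELS = {Label.HEADING, Label.FULL_CONTENT}


def binary_class(annotation):
    """Collapse a WebAnn label into the two classes of a content classifier."""
    return "content" if annotation.label in CONTENT_LABELS else "boilerplate"
```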

Page 13: [submission] Final_Presentation

WebAnn -- Training Set Creator Tool

Ordinary Webpage

Page 14: [submission] Final_Presentation

WebAnn -- Training Set Creator Tool

Annotator in Action

Page 15: [submission] Final_Presentation

Approach / Architecture (Machine Learning)

Page 16: [submission] Final_Presentation

Approach / Architecture (Language Model)
● Korean Language Model
  ○ A corresponding model for each of the Models listed in Table 3.2
  ○ Will be trained using Shallow Text features listed in Table 3.2

Table 3.2

Models:
  DensityRulesClassifier
  HeuristicsFilterBase
  IgnoreBlocksAfterContentFilter
  IgnoreBlocksAfterContentFromEndFilter
  KeepLargestFulltextBlockFilter
  MinFullTextWordsFilter
  NumWordsRulesClassifier
  TerminatingBlocksFinder

Shallow Text features:
  prev_link_density
  prev_text_density
  prev_num_words
  prev_num_words_in_anchor_text
  curr_link_density
  curr_text_density
  curr_num_words
  curr_num_words_in_anchor_text
  next_link_density
  next_text_density
  next_num_words
  next_num_words_in_anchor_text
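
The twelve feature names follow a pattern: four measurements for the previous, current and next text block. The Python sketch below shows how such a row could be assembled. It is an illustration under assumptions, not Boilerpipe's code: link density is taken as anchor-text words divided by words, text density is supplied as a ready-made number, and missing neighbours at the start and end of the page are filled with zeros.

```python
from dataclasses import dataclass

FEATURES = ("link_density", "text_density", "num_words", "num_words_in_anchor_text")


@dataclass
class Block:
    num_words: int
    num_words_in_anchor_text: int
    text_density: float  # taken as given; how it is computed is not shown here

    @property
    def link_density(self):
        # Share of the block's words that sit inside links.
        return self.num_words_in_anchor_text / self.num_words if self.num_words else 0.0


EMPTY = Block(0, 0, 0.0)  # assumption: stands in for a missing neighbour


def feature_rows(blocks):
    """Build one 12-feature dict (prev_*, curr_*, next_*) per text block."""
    rows = []
    for i, curr in enumerate(blocks):
        prev = blocks[i - 1] if i > 0 else EMPTY
        nxt = blocks[i + 1] if i + 1 < len(blocks) else EMPTY
        row = {}
        for prefix, block in (("prev", prev), ("curr", curr), ("next", nxt)):
            for name in FEATURES:
                row[f"{prefix}_{name}"] = getattr(block, name)
        rows.append(row)
    return rows


# Usage: a navigation link, an article paragraph, a short footer line.
rows = feature_rows([Block(4, 4, 4.0), Block(120, 6, 11.5), Block(3, 0, 3.0)])
```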

Page 17: [submission] Final_Presentation

Approach / Architecture (Language Model)

● Korean Language Models
  ○ Trained using the C4.5 Decision Trees algorithm
    ■ Existing English language models were also trained with this algorithm
    ■ Better performance on multi-category classification problems
    ■ Good performance in supervised learning
  ○ Use the Weka ML toolset
    ■ Provides a wide range of implementations of ML algorithms
    ■ Easy to compare and evaluate different models by tuning the parameters
    ■ Provides cross-validation features, such as k-fold cross-validation

Page 18: [submission] Final_Presentation

Korean Language Heuristics

● Lack of <p> tags
● Terminating Blocks
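
A terminating block is a short block that signals the end of the article, such as the heading of a comment section. The Python sketch below shows the general idea for Korean pages. The phrase list (댓글 "comments", 관련기사 "related articles", 저작권자 "copyright holder") and the word limit are assumptions made for illustration, not the values used in Higgs-Reader.

```python
# Hypothetical phrase list: real markers would come from inspecting Korean sites.
TERMINATING_PREFIXES = ("댓글", "관련기사", "저작권자")
MAX_WORDS = 15  # assumption: only short blocks can act as terminators


def first_terminating_index(blocks):
    """Return the index of the first end-of-article marker, or None."""
    for i, text in enumerate(blocks):
        stripped = text.strip()
        if len(stripped.split()) < MAX_WORDS and stripped.startswith(TERMINATING_PREFIXES):
            return i
    return None


def drop_after_terminator(blocks):
    """Keep only the blocks that come before the first terminating block."""
    end = first_terminating_index(blocks)
    return blocks if end is None else blocks[:end]
```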

Page 19: [submission] Final_Presentation

Korean Language Model

Decision Tree based on Number of Words
Decision Tree based on Density of Words

Page 20: [submission] Final_Presentation

Korean Language Model

Number of Words

English Model
  boilerplate   content
        21032       621
          225       647

Confusion Matrix
  boilerplate   content
        21637        16
          142       730

Correctly Classified Instances      22367    99.2986 %
Incorrectly Classified Instances      158     0.7014 %

Density of Words

English Model
  boilerplate   content
        21105       548
          220       652

Confusion Matrix
  boilerplate   content
        21637        16
          142       730

Correctly Classified Instances      22367    99.2986 %
Incorrectly Classified Instances      158     0.7014 %
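
The "Correctly Classified Instances" figure can be recomputed from a confusion matrix as the sum of its diagonal divided by the total. The Python sketch below does this for the four matrices on the slide, so the "English Model" matrices can be compared on the same footing. It assumes the diagonal entries are the correct classifications, which agrees with the slide: 21637 + 730 = 22367.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Matrices copied from the slide, keyed by panel and by the label shown above them.
MATRICES = {
    ("number of words", "English Model"): [[21032, 621], [225, 647]],
    ("number of words", "Confusion Matrix"): [[21637, 16], [142, 730]],
    ("density of words", "English Model"): [[21105, 548], [220, 652]],
    ("density of words", "Confusion Matrix"): [[21637, 16], [142, 730]],
}

for (panel, name), m in MATRICES.items():
    print(f"{panel:17s} {name:17s} {accuracy(m):.4%}")
```

All four matrices cover the same 22525 instances, so the accuracies are directly comparable.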

Page 21: [submission] Final_Presentation

Approach / Architecture (Language Model)

Page 22: [submission] Final_Presentation

Approach / Architecture (Reader Engine)

● Reader Engine
  ○ Based on the DOM-Distiller project
  ○ New language model will be integrated into Boilerpipe
  ○ Existing heuristics will be tuned to improve performance on Korean web pages
  ○ Built upon Google Web Toolkit (GWT)
    ■ Can use Java libraries
    ■ Can use Java OOP features
    ■ Compiler will produce cross-browser JS code
    ■ Reader engine can be ported into any browser

Page 23: [submission] Final_Presentation

Approach / Architecture (Reader Engine)

Page 24: [submission] Final_Presentation

Final Reader UI (Implementation)

Final Reader
Old Reader

Page 25: [submission] Final_Presentation

OLD READER (Live Demo)

● Small Chrome Extension using old dom-distiller code and old language model

Page 26: [submission] Final_Presentation

FINAL READER (Live Demo)

● Faster build cycles
● Can be used to easily compare with the Old Reader extension

Page 27: [submission] Final_Presentation

Thank you