[submission] final_presentation

Higgs-Reader

Team C. Arif Jafer, Camilo Celis, Marcus Low,

Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo

Overview


● A reader engine is composed of:
  ○ A web-page text extraction algorithm, to find the main article text
  ○ Heuristics to find metadata and images relevant to the main article
  ○ A user interface to embody the reader engine

Overview (boilerpipe)

Overview

● Higgs-Reader (built upon DOM-Distiller)
  ○ Boilerpipe extended with a Korean Language Model
    ■ Tools to train the model: Weka / C4.5 Decision Trees
    ■ Tools to generate the training set: WebAnn
    ■ Integration of the model back into the reader engine
  ○ Existing heuristics in DOM-Distiller will be tuned to improve performance on Korean web pages
  ○ Final Reader Chrome Extension

Goals Met
● Extended the DOM-Distiller reader engine with enhanced support for Korean web pages
● Created a new Korean language model for text extraction
● Tuned the existing heuristics to improve performance on Korean websites
● Created a Reader UI to embody the reader engine

Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for non-English web pages.
● Korean websites did not commonly follow website markup standards, such as the OpenGraph protocol and schema.org.
● The majority of websites still use <div> or <table> tags to separate content. This makes it impossible to identify the semantics of any particular section from the HTML source.
● Poor performance on multi-page websites. The reader should be able to retrieve all of the pages, or at most K pages, at once.
● Poor detection of relevant images and other rich-content media.
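
The multi-page requirement (retrieve all pages, or at most K) can be sketched as a bounded walk over next-page links. This is only an illustration: `collect_pages`, `next_page`, and `max_pages` are hypothetical names, not part of DOM-Distiller, and `next_page` stands in for whatever pagination heuristic the reader engine actually uses.

```python
from typing import Callable, List, Optional


def collect_pages(start_url: str,
                  next_page: Callable[[str], Optional[str]],
                  max_pages: Optional[int] = None) -> List[str]:
    """Follow next-page links from start_url, keeping at most max_pages URLs.

    next_page is a hypothetical callback that returns the URL of the
    following page, or None when there is none.
    max_pages=None means "retrieve all pages".
    """
    urls: List[str] = []
    seen = set()
    current: Optional[str] = start_url
    while current is not None and current not in seen:
        if max_pages is not None and len(urls) >= max_pages:
            break
        urls.append(current)
        seen.add(current)  # guard against pagination loops
        current = next_page(current)
    return urls
```

For example, with a dictionary acting as the link table, `collect_pages("p1", links.get, max_pages=2)` stops after two pages even if more are linked.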

Requirements Met
● A Korean language model was built and integrated into the Boilerpipe algorithm.
  ○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages whose layouts are built with tables.
● Better support for multi-page web pages.
● Enhanced the relevant-image detection heuristic.
● Chrome Extension implementation (Final Reader)
● Comparison mechanism for testing purposes

Approach

● 4 Stages
  ○ Web Page Annotator
  ○ Korean Language Model for boilerpipe
  ○ Reader Engine tuning
  ○ Reader UI

Approach / Architecture (Overall)

Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
  ○ Built as a Chrome extension
  ○ Provides a simple UI to annotate different sections of a web page with predefined labels:
    ■ HEADING
    ■ FULL_CONTENT
    ■ SUPPLEMENTARY
    ■ COMMENTS
    ■ RELEVANT_IMAGES
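
As a rough illustration of how such annotations could feed a training set, the sketch below models one annotated section and maps it to a two-class target. The five label names come from the slide; the `Annotation` record, its fields, and the choice of which labels count as "content" are assumptions made for the example, not WebAnn's actual data format.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The predefined WebAnn labels listed on the slide."""
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one annotated section of a page."""
    url: str        # page the section was annotated on
    selector: str   # assumed way of locating the section, e.g. a CSS selector
    label: Label


# Assumption: only these labels are treated as main-article content.
CONTENT_LABELS = {Label.HEADING, Label.FULL_CONTENT}


def to_binary_class(annotation: Annotation) -> str:
    """Collapse a WebAnn label into a content/boilerplate target."""
    return "content" if annotation.label in CONTENT_LABELS else "boilerplate"
```

A different grouping of labels would only require changing `CONTENT_LABELS`.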

WebAnn -- Training Set Creator Tool

Ordinary Webpage

WebAnn -- Training Set Creator Tool

Annotator in Action

Approach / Architecture (Machine Learning)

Approach / Architecture (Language Model)
● Korean Language Model
  ○ A corresponding model for each of the models listed in Table 3.2
  ○ Will be trained using the shallow text features listed in Table 3.2

Models
  DensityRulesClassifier
  HeuristicsFilterBase
  IgnoreBlocksAfterContentFilter
  IgnoreBlocksAfterContentFromEndFilter
  KeepLargestFulltextBlockFilter
  MinFullTextWordsFilter
  NumWordsRulesClassifier
  TerminatingBlocksFinder

Shallow text features (previous, current, and next block)
  prev_link_density               curr_link_density               next_link_density
  prev_text_density               curr_text_density               next_text_density
  prev_num_words                  curr_num_words                  next_num_words
  prev_num_words_in_anchor_text   curr_num_words_in_anchor_text   next_num_words_in_anchor_text
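
The twelve features listed here can be sketched as four per-block measurements taken for the previous, current, and next text block. The code below is a minimal illustration under stated assumptions: words are split on whitespace, link density is anchor words divided by all words, and text density is words per 80-character wrapped line, which follows the general Boilerpipe idea but is not a copy of its implementation. `Block`, `block_features`, and `feature_row` are hypothetical names.

```python
import textwrap
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    text: str          # all visible text in the block
    anchor_text: str   # the portion of that text inside <a> elements


def block_features(block: Block, line_width: int = 80) -> Dict[str, float]:
    """Shallow text features for a single block (assumed definitions)."""
    num_words = len(block.text.split())
    num_anchor_words = len(block.anchor_text.split())
    # An empty block still counts as one (empty) line to avoid dividing by zero.
    num_lines = len(textwrap.wrap(block.text, width=line_width)) or 1
    return {
        "link_density": num_anchor_words / num_words if num_words else 0.0,
        "text_density": num_words / num_lines,
        "num_words": float(num_words),
        "num_words_in_anchor_text": float(num_anchor_words),
    }


def feature_row(blocks: List[Block], i: int) -> Dict[str, float]:
    """Features of block i together with its previous and next neighbours.

    Neighbours that fall outside the page are treated as empty blocks.
    """
    row: Dict[str, float] = {}
    for prefix, j in (("prev", i - 1), ("curr", i), ("next", i + 1)):
        neighbour = blocks[j] if 0 <= j < len(blocks) else Block("", "")
        for name, value in block_features(neighbour).items():
            row[f"{prefix}_{name}"] = value
    return row
```

Each call to `feature_row` yields one training instance whose keys match the feature names above; whitespace tokenization is a simplification that may need revisiting for Korean text.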

Approach / Architecture (Language Model)

● Korean Language Models
  ○ Trained using the C4.5 decision tree algorithm
    ■ The existing English language models were also trained with this algorithm
    ■ Better performance on multi-category classification problems
    ■ Good performance in supervised learning
  ○ Use the Weka ML toolset
    ■ Provides a wide range of implementations of ML algorithms
    ■ Easy to compare and evaluate different models by tuning the parameters
    ■ Provides cross-validation features, such as k-fold cross-validation
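
For readers unfamiliar with C4.5, its split criterion is the gain ratio: the information gain of a split divided by the entropy of the split itself. The sketch below computes it for a single numeric threshold, such as a cut on the number of words. It is a hand-written illustration, not Weka's implementation (J48), and the function names are made up for the example.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values: Sequence[float], labels: Sequence[str],
               threshold: float) -> float:
    """Gain ratio of the binary split `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the threshold does not actually split the data
    total = len(labels)
    w_left, w_right = len(left) / total, len(right) / total
    info_gain = entropy(labels) - (w_left * entropy(left)
                                   + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info
```

A tree learner would evaluate this for every candidate threshold of every feature and split on the best one, then recurse on each side.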

Korean Language Heuristics

● Lack of <p> tags
● Terminating Blocks

Korean Language Model

Decision Tree based on Number of Words
Decision Tree based on Density of Words

Korean Language Model: Number of Words

English Model

                 boilerplate   content    <- classified as
  boilerplate          21032       621
  content                225       647

Confusion Matrix

                 boilerplate   content    <- classified as
  boilerplate          21637        16
  content                142       730

Correctly Classified Instances      22367    99.2986 %
Incorrectly Classified Instances      158     0.7014 %

Korean Language Model: Density of Words

English Model

                 boilerplate   content    <- classified as
  boilerplate          21105       548
  content                220       652

Confusion Matrix

                 boilerplate   content    <- classified as
  boilerplate          21637        16
  content                142       730

Correctly Classified Instances      22367    99.2986 %
Incorrectly Classified Instances      158     0.7014 %
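
The summary figures can be recomputed from a 2x2 confusion matrix, which also makes the matrices on these slides directly comparable. The sketch below assumes rows are actual classes and columns are predicted classes, in the order boilerplate then content; that layout is an assumption based on the usual Weka output, and `summarize` is a hypothetical helper.

```python
from typing import Dict, List


def summarize(matrix: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus precision and recall for the "content" class.

    Assumed layout: matrix[i][j] counts instances of actual class i
    classified as class j, with index 0 = boilerplate and 1 = content.
    """
    (true_boiler, false_content), (false_boiler, true_content) = matrix
    total = true_boiler + false_content + false_boiler + true_content
    return {
        "accuracy": (true_boiler + true_content) / total,
        "content_precision": true_content / (true_content + false_content),
        "content_recall": true_content / (true_content + false_boiler),
    }


# The matrix whose totals match the reported 22367 correct / 158 incorrect.
reported = summarize([[21637, 16], [142, 730]])
# The English model matrices from the two slides, for comparison.
english_num_words = summarize([[21032, 621], [225, 647]])
english_density = summarize([[21105, 548], [220, 652]])
```

With these inputs, `reported["accuracy"]` rounds to the 99.2986 % shown on the slides, and both English model matrices give a lower accuracy on the same 22525 instances.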

Approach / Architecture (Language Model)

Approach / Architecture (Reader Engine)

● Reader Engine
  ○ Based on the DOM-Distiller project
  ○ New language model will be integrated into Boilerpipe
  ○ Existing heuristics will be tuned to improve performance on Korean web pages
  ○ Built upon Google Web Toolkit (GWT)
    ■ Can use Java libraries
    ■ Can use Java OOP features
    ■ Compiler will produce cross-browser JS code
    ■ Reader engine can be ported into any browser

Approach / Architecture (Reader Engine)

Final Reader UI (Implementation)

Final Reader Old Reader

OLD READER (Live Demo)

● Small Chrome Extension using old dom-distiller code and old language model

FINAL READER (Live Demo)

● Faster build cycles
● Can be used to easily compare with the Old Reader extension

Thank you
