[submission] final_presentation
Higgs-Reader
Team C. Arif Jafer, Camilo Celis, Marcus Low,
Contents
❏ Overview
❏ Problems & Requirements
❏ Goals Met
❏ Approach
❏ Architecture
❏ WebAnn - Training Set Creator Tool
❏ The Korean Language Model
❏ Final Reader
❏ Demo
Overview
● A reader engine is composed of:
  ○ A web-page text extraction algorithm to find the main article text
  ○ Heuristics to find metadata and images relevant to the main article
  ○ A user interface to embody the reader engine
Overview (boilerpipe)
Overview
● Higgs-Reader (built upon DOM-Distiller)
  ○ Boilerpipe extended with a Korean Language Model
    ■ Tools to train the model - Weka / C4.5 Decision Trees
    ■ Tools to generate the training set - WebAnn
    ■ Integration of the model back into the reader engine
  ○ Existing heuristics in DOM-Distiller will be tuned to improve performance on Korean web pages
  ○ Final Reader Chrome Extension
Goals Met
● Extended the DOM-Distiller reader engine with enhanced support for Korean web pages.
● Created a new Korean language model for text extraction.
● Tuned the existing heuristics to improve performance on Korean web sites.
● Created a Reader UI to embody the reader engine.
Problems Encountered
● Existing reader engines, such as DOM-Distiller, had poor support for non-English web pages.
● Korean websites did not commonly follow website markup standards such as the OpenGraph protocol, schema.org, etc.
● The majority of websites still use <div> or <table> tags to separate content, which eliminates the possibility of identifying the semantics of any particular section of the HTML source.
● Poor performance on multi-page websites. The reader should be able to retrieve all of the pages, or at most K of them, at once.
● Poor performance on detection of relevant images and other rich-content media.
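The multi-page requirement above ("all or at most K pages") can be sketched as a bounded walk over next-page links. This is only an illustration: `collect_pages` and the `next_page_of` callback are hypothetical names, not part of DOM-Distiller, and how the next-page link is actually detected is left out.

```python
from typing import Callable, List, Optional


def collect_pages(start_url: str,
                  next_page_of: Callable[[str], Optional[str]],
                  max_pages: int) -> List[str]:
    """Follow next-page links from start_url, visiting at most max_pages pages.

    next_page_of is a hypothetical callback that returns the URL of the
    following page, or None when there is no further page.
    """
    pages: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    # Stop at the last page, at the K-page limit, or when a link loops back.
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        pages.append(url)
        url = next_page_of(url)
    return pages


# Toy link table standing in for real pagination detection.
links = {"article?p=1": "article?p=2", "article?p=2": "article?p=3"}
print(collect_pages("article?p=1", links.get, max_pages=2))
```

The `seen` set guards against pagination links that point back to an earlier page, which would otherwise loop until the K limit is reached.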
Requirements Met
● A Korean language model was made and integrated into the Boilerpipe algorithm.
  ○ Tooling for creating the training set (WebAnn)
● The existing DOM-Distiller was tuned to work with Korean websites.
● Better support for web pages whose layouts are built with tables.
● Better support for multi-page web pages.
● Enhanced the relevant image detection heuristic.
● Chrome Extension implementation (Final Reader)
● Comparison mechanism for testing purposes
Approach
● 4 Stages
  ○ Web Page Annotator
  ○ Korean Language Model for boilerpipe
  ○ Reader Engine tuning
  ○ Reader UI
Approach / Architecture (Overall)
Approach / Architecture (WebAnn)
● Web Page Annotator (WebAnn)
  ○ Built as a Chrome extension
  ○ Provides a simple UI to annotate different sections of a web page with predefined labels.
    ■ HEADING
    ■ FULL_CONTENT
    ■ SUPPLEMENTARY
    ■ COMMENTS
    ■ RELEVANT_IMAGES
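One way to picture how the labels above become training data is the following sketch. The `Annotation` record, its `selector` field, and the choice of which labels count as "content" are assumptions made for illustration; the slides do not describe WebAnn's actual storage format or label mapping.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The five predefined WebAnn labels named on the slide."""
    HEADING = "HEADING"
    FULL_CONTENT = "FULL_CONTENT"
    SUPPLEMENTARY = "SUPPLEMENTARY"
    COMMENTS = "COMMENTS"
    RELEVANT_IMAGES = "RELEVANT_IMAGES"


@dataclass
class Annotation:
    url: str        # page the annotation was made on
    selector: str   # hypothetical: CSS selector of the annotated element
    label: Label


# Assumption: these labels are treated as main-article content when the
# annotations are reduced to a two-class (content / boilerplate) training set.
CONTENT_LABELS = {Label.HEADING, Label.FULL_CONTENT, Label.RELEVANT_IMAGES}


def to_binary_class(annotation: Annotation) -> str:
    """Map a five-way annotation onto the two classes used by the classifier."""
    return "content" if annotation.label in CONTENT_LABELS else "boilerplate"


example = Annotation("http://example.com/news/1", "div#article", Label.FULL_CONTENT)
print(to_binary_class(example))
```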
WebAnn -- Training Set Creator Tool
Ordinary Webpage
WebAnn -- Training Set Creator Tool
Annotator in Action
Approach / Architecture (Machine Learning)
Approach / Architecture (Language Model)
● Korean Language Model
  ○ A corresponding model for each of the Models listed in Table 3.2
  ○ Will be trained using Shallow Text features listed in Table 3.2
DensityRulesClassifier
HeuristicsFilterBase
IgnoreBlocksAfterContentFilter
IgnoreBlocksAfterContentFromEndFilter
KeepLargestFulltextBlockFilter
MinFullTextWordsFilter
NumWordsRulesClassifier
TerminatingBlocksFinder
prev_link_density
prev_text_density
prev_num_words
prev_num_words_in_anchor_text
curr_link_density
curr_text_density
curr_num_words
curr_num_words_in_anchor_text
next_link_density
next_text_density
next_num_words
next_num_words_in_anchor_text
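The twelve shallow text features listed above are the same four measurements taken on the previous, current, and next text block. A minimal sketch of how such a feature row could be computed follows. It is a simplification, not the Boilerpipe implementation: words are split on whitespace, text density is approximated as words per 80-column wrapped line, and the `TextBlock` type and function names are hypothetical.

```python
import textwrap
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextBlock:
    text: str         # all text inside the block
    anchor_text: str  # the portion of that text that sits inside <a> tags


EMPTY = TextBlock("", "")  # stands in for the missing neighbour at either end


def block_features(block: TextBlock) -> Dict[str, float]:
    """Four shallow features of one block (simplified definitions)."""
    words = len(block.text.split())
    anchor_words = len(block.anchor_text.split())
    lines = textwrap.wrap(block.text, width=80)  # [] for empty text
    return {
        "link_density": anchor_words / words if words else 0.0,
        "text_density": words / len(lines) if lines else 0.0,
        "num_words": words,
        "num_words_in_anchor_text": anchor_words,
    }


def window_features(blocks: List[TextBlock], i: int) -> Dict[str, float]:
    """Feature row for block i: prev_*, curr_* and next_* features."""
    prev_block = blocks[i - 1] if i > 0 else EMPTY
    next_block = blocks[i + 1] if i + 1 < len(blocks) else EMPTY
    row: Dict[str, float] = {}
    for prefix, block in (("prev", prev_block), ("curr", blocks[i]), ("next", next_block)):
        for name, value in block_features(block).items():
            row[f"{prefix}_{name}"] = value
    return row


blocks = [
    TextBlock("Home News Sports", "Home News Sports"),            # navigation bar
    TextBlock("The council approved the new budget today", ""),   # article text
    TextBlock("View comments", "View comments"),                  # footer link
]
print(window_features(blocks, 1))
```

A navigation block has a link density near 1 while article text has a link density near 0, which is why the neighbouring blocks' values help classify the current block.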
Approach / Architecture (Language Model)
● Korean Language Models
  ○ Trained using the C4.5 Decision Trees algorithm
    ■ Existing English language models were also trained with this algorithm
    ■ Better performance on multi-category classification problems
    ■ Good performance in supervised learning
  ○ Use the Weka ML toolset
    ■ Provides a wide range of implementations of ML algorithms
    ■ Easy to compare and evaluate different models by tuning the parameters
    ■ Provides cross-validation features, such as k-fold cross-validation
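C4.5 chooses splits by gain ratio (information gain divided by split information). The sketch below shows that criterion for a single numeric feature such as num_words. It is a teaching example under simplifying assumptions (binary threshold splits only, no pruning, no missing values), not Weka's implementation, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values: Sequence[float], labels: Sequence[str], threshold: float) -> float:
    """Gain ratio of splitting into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


def best_threshold(values: Sequence[float], labels: Sequence[str]) -> Optional[float]:
    """Observed value with the highest gain ratio, or None if all values are equal."""
    candidates: List[float] = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t), default=None)


# Toy data: short blocks are boilerplate, long blocks are content.
num_words = [2, 3, 4, 40, 55, 60]
classes = ["boilerplate"] * 3 + ["content"] * 3
print(best_threshold(num_words, classes))
```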
Korean Language Heuristics
● Lack of <p> tags
● Terminating Blocks
Korean Language Model
Decision Tree based on Number of Words
Decision Tree based on Density of Words
Number of Words
Korean Language Model
English Model
  boilerplate   content
  21032         621
  225           647
Confusion Matrix
  boilerplate   content
  21637         16
  142           730
Correctly Classified Instances     22367   99.2986 %
Incorrectly Classified Instances   158     0.7014 %
Density of Words
English Model
  boilerplate   content
  21105         548
  220           652
Confusion Matrix
  boilerplate   content
  21637         16
  142           730
Correctly Classified Instances     22367   99.2986 %
Incorrectly Classified Instances   158     0.7014 %
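The reported percentages can be checked directly from the confusion matrices. The sketch below assumes that the diagonal cells are the correctly classified instances and that the matrices labelled "English Model" are the English baseline evaluated on the same 22525 instances (the cell totals agree). The constant names are made up for this example.

```python
from typing import List, Tuple


def accuracy(matrix: List[List[int]]) -> Tuple[int, int, float]:
    """Return (correct, total, correct / total) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal cells
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


# Cell values copied from the slides; class order is boilerplate, content.
KOREAN_MODEL = [[21637, 16], [142, 730]]
ENGLISH_NUM_WORDS = [[21032, 621], [225, 647]]
ENGLISH_DENSITY = [[21105, 548], [220, 652]]

for name, matrix in [("Korean model", KOREAN_MODEL),
                     ("English model, number of words", ENGLISH_NUM_WORDS),
                     ("English model, density of words", ENGLISH_DENSITY)]:
    correct, total, acc = accuracy(matrix)
    print(f"{name}: {correct}/{total} correct = {100 * acc:.4f} %")
```

For the Korean model this gives 22367 of 22525 correct (99.2986 %), which matches the figures on the slide. Under the same reading, the two English matrices come out at about 96.24 % and 96.59 %.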
Approach / Architecture (Language Model)
Approach / Architecture (Reader Engine)
● Reader Engine
  ○ Based on the DOM-Distiller project
  ○ New language model will be integrated into Boilerpipe
  ○ Existing heuristics will be tuned to improve performance on Korean web pages
  ○ Built upon Google Web Toolkit (GWT)
    ■ Can use Java libraries
    ■ Can use Java OOP features
    ■ Compiler will produce cross-browser JS code
    ■ Reader engine can be ported into any browser
Approach / Architecture (Reader Engine)
Final Reader UI (Implementation)
Final Reader Old Reader
OLD READER (Live Demo)
● Small Chrome Extension using old dom-distiller code and old language model
FINAL READER (Live Demo)
● Faster build cycles
● Can be used to easily compare with the Old Reader extension
Thank you