Logical Structure Recovery in Scholarly Articles with Rich Document Features
Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan
• Logical structure annotation in ForeCiteReader.
• The view shows the object navigation interface, currently focusing on the list of figure captions.
04/21/23 2
• Section navigation in ForeCiteReader environment with generic sections
Overview
• Methodology
  – Problem Formulation
  – Learning Model - CRF
  – Approach overview
  – Classification categories
• Raw-text features
• Rich document representation
• Experiments
• Further analysis
Problem Formulation
Two related subtasks:
• Logical structure (LS) classification
  – treat a scholarly document as an ordered collection of text lines
  – label each text line with a semantic category, e.g. title, author, address, etc.
• Generic section (GS) classification
  – take the headers of each section of text in a paper
  – deduce a generic logical purpose of the section
Both subtasks are sequence labeling tasks, modeled with a CRF.
Learning Model - CRF
CRF in simplified form
f: both state & transition functions
Binary feature
State function
Transition function
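The formula on this slide did not survive in the transcript. As a reference point only, the standard linear-chain CRF in simplified form can be written as below; the symbols (x for the line sequence, y for the label sequence, λ_j for the weights, Z(x) for the normalizer) are the usual textbook notation and are assumed here, not taken from the slide.

```latex
\[
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, \mathbf{x}, i) \Big)
\]
```

Each f_j is a binary feature. A state function depends only on the current label y_i and the observation, while a transition function also depends on the previous label y_{i-1}; writing both as f_j is what gives the simplified form.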
• Utilize the CRF++ package: http://crfpp.sourceforge.net/
• Input for line l_i to CRF++ is of the form “value_1 … value_m category_i”
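A minimal sketch of how one such input row could be produced. CRF++ reads whitespace-separated columns with the label in the last column and a blank line closing each sequence; the function names `to_crfpp_line` and `to_crfpp_document` are hypothetical helpers, not part of CRF++ or of the system described here.

```python
def to_crfpp_line(values, category):
    """Join feature values and the category into one whitespace-separated row."""
    for value in values:
        # A column must be non-empty and must not contain whitespace,
        # otherwise the column count of the row would change.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid column value: {value!r}")
    return " ".join(list(values) + [category])


def to_crfpp_document(lines):
    """Format one document: one row per text line, blank line ends the sequence.

    `lines` is a list of (values, category) pairs in document order.
    """
    rows = [to_crfpp_line(values, category) for values, category in lines]
    return "\n".join(rows) + "\n\n"


if __name__ == "__main__":
    doc = [(["Loc_0", "Align_left"], "title"), (["Loc_1", "Align_left"], "author")]
    print(to_crfpp_document(doc), end="")
```

Every row in a file needs the same number of columns, so each text line should emit the same feature slots in the same order.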
Approach overview
Classification categories - example
Classification categories – full sets
• Logical structure subtask, 23 categories: address, affiliation, author, bodyText, categories, construct, copyright, email, equation, figure, figureCaption, footnote, keywords, listItem, note, page, reference, sectionHeader, subsectionHeader, subsubsectionHeader, table, tableCaption, and title.
• Generic section subtask, 13 categories: abstract, categories, general terms, keywords, introduction, background, relatedWork, methodology, evaluation, discussions, conclusions, acknowledgments, and references.
Raw-text features - LS
• ParsCit token-level features +
• Our line-level features:
  – Location: relative position within the document
  – Number: patterns of subsections, subsubsections, categories, footnotes
  – Punctuation: patterns of emails & web links, bracket numbering of equations
  – Length: 1token, 2token, 3token, 4token, 5+token; helps identify the majority of lines as bodyText
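The Length and Location features above could be computed roughly as follows. This is a sketch under assumptions: the slide gives only the five length buckets, so whitespace tokenization, the number of location bins (`bins=8`), and the `DocLoc_` prefix are illustrative choices, not the authors' settings.

```python
def length_feature(line):
    """Bucket a text line by its token count: 1token .. 4token, 5+token."""
    n = len(line.split())  # assumption: tokens are whitespace-separated
    # An empty line yields "0token"; the slide does not cover that case.
    return f"{n}token" if n < 5 else "5+token"


def location_feature(index, total, bins=8):
    """Quantize the relative position of line `index` among `total` lines."""
    if not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    return f"DocLoc_{index * bins // total}"  # hypothetical feature name


if __name__ == "__main__":
    lines = ["Logical Structure Recovery", "Min-Yen Kan", "We present a system for this task ."]
    for i, text in enumerate(lines):
        print(length_feature(text), location_feature(i, len(lines)))
```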
Raw-text features - GS
• Naïve, yet effective features:
  – Positions
  – First and Second Words
  – Whole Header
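A rough sketch of these three header features. The feature names (`Pos_`, `First_`, `Second_`, `Whole_`), the lowercasing, and the `NONE` placeholder are assumptions made for illustration; the slide names only the feature types.

```python
def header_features(headers):
    """Build position, first/second word, and whole-header features per header.

    `headers` is the list of section headers of one paper, in document order.
    """
    rows = []
    for position, header in enumerate(headers):
        tokens = header.lower().split()  # assumption: case-insensitive matching
        first = tokens[0] if tokens else "NONE"
        second = tokens[1] if len(tokens) > 1 else "NONE"
        whole = "-".join(tokens) if tokens else "NONE"
        rows.append([
            f"Pos_{position}",
            f"First_{first}",
            f"Second_{second}",
            f"Whole_{whole}",
        ])
    return rows


if __name__ == "__main__":
    for row in header_features(["Abstract", "Introduction", "Related Work"]):
        print(" ".join(row))
```

The whole-header feature only fires when the exact header was seen in training, so it acts as a lookup rather than a generalizing feature.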
Rich document representation – OCR output
• Linearize XML output into CRF features: “Don't-Look-Now,-But-We've-Created-a-Bureaucracy. Loc_0 Align_left FontSize_largest Bold_yes Italic_no Picture_no Table_no Bullet_no”.
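The linearization can be sketched as below, assuming the OCR attributes of a line have already been read out of the XML into a dictionary. The function name `linearize` is hypothetical and XML parsing is left out, since the transcript does not show the OCR schema.

```python
def linearize(text, attributes):
    """Turn one OCR text line plus its attributes into a CRF feature row.

    Spaces inside the line are replaced by hyphens so the text stays a single
    column; each attribute becomes a `Name_value` column, in dictionary order.
    """
    token = "-".join(text.split())
    columns = [f"{name}_{value}" for name, value in attributes.items()]
    return " ".join([token] + columns)


if __name__ == "__main__":
    attrs = {
        "Loc": 0, "Align": "left", "FontSize": "largest", "Bold": "yes",
        "Italic": "no", "Picture": "no", "Table": "no", "Bullet": "no",
    }
    print(linearize("Don't Look Now, But We've Created a Bureaucracy.", attrs))
```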
Rich document representation – OCR features
• Position
  – Alignment: left, center, right & justified
  – Location: within-page location
• Format
  – FontSize: quantized based on frequency, e.g. smaller, smaller, base, -2, -1, 0
  – Bold
  – Italic
• Object
  – Bullet
  – Picture
  – Table
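One way the frequency-based FontSize quantization might work is sketched below. The slide does not give the exact scheme, so everything here is an assumption: the most frequent size is treated as the base (0), and other sizes are ranked relative to it as +1, +2, ... or -1, -2, ...

```python
from collections import Counter


def quantize_font_sizes(sizes):
    """Map raw font sizes to labels relative to the most frequent size.

    Assumption: the most frequent size is the body-text base (rank 0);
    larger sizes get +1, +2, ... and smaller sizes get -1, -2, ... by rank.
    """
    if not sizes:
        return []
    base = Counter(sizes).most_common(1)[0][0]
    larger = sorted(s for s in set(sizes) if s > base)
    smaller = sorted((s for s in set(sizes) if s < base), reverse=True)
    rank = {base: 0}
    rank.update({s: i + 1 for i, s in enumerate(larger)})
    rank.update({s: -(i + 1) for i, s in enumerate(smaller)})
    return [f"FontSize_{rank[s]:+d}" if rank[s] else "FontSize_0" for s in sizes]


if __name__ == "__main__":
    print(quantize_font_sizes([18, 14, 10, 10, 10, 8]))
```

Ranking relative to the base size makes the feature comparable across papers that use different absolute point sizes.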
Experiments - datasets
• LS: 20 ACM, 10 CHI 2008, 10 ACL 2009 papers, fully labeled
• GS: 211 ACM papers, headers labeled
Skewed data
Experiments – metrics
TP: # of correctly classified text lines (true positives). FN, FP, and TN are defined similarly for false negatives, false positives, and true negatives.
• Category-specific performance:
  – F1 measure = 2 × P × R / (P + R)
  – Precision P = TP / (TP + FP), Recall R = TP / (TP + FN)
• Overall performance:
  – Macro average: average of all category-specific F1 scores
  – Micro average: percentage of correctly labeled lines
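The metrics above can be computed from gold and predicted line labels as in this sketch. The function name `evaluate` is hypothetical, and two details are assumptions: the macro average runs over every category that appears in either the gold or the predicted labels, and an undefined precision or recall counts as 0.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (per-category F1, macro average, micro average)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    f1 = {}
    for category in sorted(set(gold) | set(predicted)):
        denom_p = tp[category] + fp[category]
        denom_r = tp[category] + fn[category]
        precision = tp[category] / denom_p if denom_p else 0.0
        recall = tp[category] / denom_r if denom_r else 0.0
        total = precision + recall
        f1[category] = 2 * precision * recall / total if total else 0.0
    macro = sum(f1.values()) / len(f1)     # average of category-specific F1
    micro = sum(tp.values()) / len(gold)   # fraction of correctly labeled lines
    return f1, macro, micro


if __name__ == "__main__":
    gold = ["title", "bodyText", "bodyText", "bodyText"]
    pred = ["title", "bodyText", "bodyText", "title"]
    print(evaluate(gold, pred))
```

With skewed data the two averages can diverge: a frequent category such as bodyText dominates the micro average, while rare categories weigh equally in the macro average.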
Experiments – LS results
• LSPC: baseline using only ParsCit features
• LSPC+RT: LSPC + raw-text features
• LSPC+RT+RD: LSPC+RT + rich document features (OCR)
• LSPC+RT+RD and LSPC+RT both outperform LSPC by more than 10 F1 points
• LSPC+RT+RD < LSPC+RT: minor degradation for four categories
• LSPC+RT+RD > LSPC+RT: all other categories (many by more than 4 F1 points)
• Large improvements for footnote and subsubsection headers (sssHeaders)
Experiments – GS results
• GSmaxent: maximum entropy based system (Nguyen and Kan, 2007)
• GSCRF: our system
• GSCRF > GSmaxent in all categories except background
• Large improvements for discussions
Further analysis – Text features
• All contribute to the final composite performance
• Most influential: position
Further analysis – rich doc features
• Format contributes most to the macro average, while object influences the micro average most
• Format features help a wider spectrum of categories: paper metadata & section headers
• Object features enhance fewer categories, but ones with a large amount of training data, e.g. list item, table
Further analysis – rich doc features
• Most features improve both metrics, except align & table, which trade off macro vs. micro
• Location, Font, and Bullet are the most effective features in the position, format, and object groups, respectively
Error analysis - LS
Error analysis - GS
• Whole header: fails when a header shares no tokens with any of the memorized training data instances; body text needs to be used instead (future work)
• Similar relative positions of consecutive headers: background vs. method, method vs. discussions, & discussions vs. conclusions
• The dataset skew also has an impact: there is a large number of method headers but far fewer background and discussions headers, so many headers are mislabelled as method
Q & A
Thank you!