Logical Structure Recovery in Scholarly Articles with Rich Document Features

Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan

• Logical structure annotation in ForeCiteReader.
• The view shows the object navigation interface, currently focusing on the list of figure captions.

04/21/23 2

• Section navigation in ForeCiteReader environment with generic sections

Overview

• Methodology
  – Problem Formulation
  – Learning Model - CRF
  – Approach overview
  – Classification categories
• Raw-text features
• Rich document representation
• Experiments
• Further analysis

Problem Formulation

Two related subtasks:
• Logical structure (LS) classification
  – scholarly document as an ordered collection of text lines
  – label each text line with a semantic category, e.g. title, author, address, etc.
• Generic section (GS) classification
  – take the headers of each section of text in a paper
  – deduce a generic logical purpose of the section

Sequence labeling tasks - CRF

Learning Model - CRF

CRF in simplified form

f: both state & transition functions

Binary feature

State function

Transition function
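
For reference, a linear-chain CRF is commonly written in the form below. This is the standard textbook formulation, offered as a sketch; the symbols (λ_j for weights, Z(x) for the normalizer, n for the number of text lines) are assumptions and may differ from the notation on the original slide.

```latex
p(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, \mathbf{x}, i) \Big)
```

Each f_j is a binary feature. A state function looks only at the current label y_i and the observations x (an illustrative case: f_j = 1 when y_i = title and line i carries the largest font size, else 0). A transition function also looks at the previous label y_{i-1} (an illustrative case: f_j = 1 when y_{i-1} = title and y_i = author, else 0).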

• Utilize CRF++ package http://crfpp.sourceforge.net/
• Input for line l_i to CRF++ is of the form "value_1 … value_m category_i"
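
The input format above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `to_crfpp_rows` and the feature values in the example are hypothetical. It relies only on what CRF++ expects in a training file: whitespace-separated columns with the label last, and a blank line between sequences.

```python
def to_crfpp_rows(lines):
    """Format one document as CRF++ training rows.

    `lines` is a list of (feature_values, category) pairs, one per text line.
    Each output row has the form "value1 ... valuem category".
    """
    rows = []
    for values, category in lines:
        # CRF++ splits columns on whitespace, so a value must not contain spaces.
        cleaned = [v.replace(" ", "-") for v in values]
        rows.append(" ".join(cleaned + [category]))
    # A trailing blank line marks the end of the sequence (here: the document).
    return "\n".join(rows) + "\n\n"


# Hypothetical feature values for two text lines of one document.
doc = [
    (["Logical Structure Recovery", "Loc_0", "Len_3token"], "title"),
    (["Minh-Thang Luong", "Loc_0", "Len_2token"], "author"),
]
print(to_crfpp_rows(doc), end="")
```

Every row must have the same number of columns, so a missing feature needs an explicit placeholder value rather than an omitted column.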

Approach overview

Classification categories - example

Classification categories – full sets

• Logical structure subtask, 23 categories: address, affiliation, author, bodyText, categories, construct, copyright, email, equation, figure, figureCaption, footnote, keywords, listItem, note, page, reference, sectionHeader, subsectionHeader, subsubsectionHeader, table, tableCaption, and title.

• Generic section subtask, 13 categories: abstract, categories, general terms, keywords, introduction, background, relatedWork, methodology, evaluation, discussions, conclusions, acknowledgments, and references.

Raw-text features - LS

• ParsCit token-level features +
• Our line-level features:
  – Location: relative position within document
  – Number: patterns of subsections, subsubsections, categories, footnotes
  – Punctuation: patterns of emails & web links, bracket numbering, equation
  – Length: 1token, 2token, 3token, 4token, 5+token (identify majority of lines as bodyText)
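
The four line-level feature groups can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function name, the feature labels (`Loc_`, `Num_depth`, `Punct_`, `Len_`), the regular expressions, and the choice of 8 location buckets are all hypothetical.

```python
import re


def line_features(line, index, total_lines, n_buckets=8):
    """Return hypothetical line-level feature strings for one text line."""
    tokens = line.split()
    feats = []

    # Location: relative position within the document, quantized into buckets.
    bucket = min(n_buckets - 1, index * n_buckets // max(total_lines, 1))
    feats.append(f"Loc_{bucket}")

    # Number: leading numbering such as "3", "3.1", or "3.1.2".
    m = re.match(r"^(\d+(?:\.\d+){0,2})\.?\s", line)
    depth = m.group(1).count(".") + 1 if m else 0
    feats.append(f"Num_depth{depth}")

    # Punctuation: e-mail addresses, web links, bracketed equation numbers.
    feats.append("Punct_email" if re.search(r"\S+@\S+\.\S+", line) else "Punct_noEmail")
    feats.append("Punct_web" if re.search(r"https?://|www\.", line) else "Punct_noWeb")
    feats.append("Punct_eqNum" if re.search(r"\(\d+\)\s*$", line) else "Punct_noEqNum")

    # Length: 1token .. 4token, then a single 5+token bucket.
    n = len(tokens)
    feats.append(f"Len_{n}token" if n < 5 else "Len_5+token")
    return feats


print(line_features("3.1 Raw-text features", index=10, total_lines=100))
```

The numbering pattern is deliberately crude: a body line that happens to start with a number would also match, which is one reason such features are combined with the others rather than used alone.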

Raw-text features - GS

• Naïve, yet effective features:
  – Positions
  – First and Second Words
  – Whole Header
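
A minimal sketch of these header features follows. The function name and the feature labels (`Pos_`, `RelPos_`, `First_`, `Second_`, `Whole_`) are hypothetical, and quantizing the relative position into tenths is an assumption made for illustration.

```python
def header_features(headers):
    """Build hypothetical GS features for each section header, in document order."""
    rows = []
    total = len(headers)
    for i, header in enumerate(headers):
        words = header.lower().split()
        first = words[0] if words else "NONE"
        second = words[1] if len(words) > 1 else "NONE"
        rows.append([
            f"Pos_{i}",                   # absolute position of the header
            f"RelPos_{i * 10 // total}",  # relative position, in tenths
            f"First_{first}",
            f"Second_{second}",
            "Whole_" + "-".join(words),   # whole header as a single token
        ])
    return rows


headers = ["Abstract", "Introduction", "Related Work", "Conclusions"]
for row in header_features(headers):
    print(" ".join(row))
```

The whole-header feature only helps when the same header text appeared in training, so it acts as a lookup rather than a generalizing feature.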

Rich document representation – OCR output

• Linearize XML output into CRF features: "Don't-Look-Now,-But-We've-Created-a-Bureaucracy. Loc_0 Align_left FontSize_largest Bold_yes Italic_no Picture_no Table_no Bullet_no".
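
The linearization step can be sketched as follows. The OCR XML schema is not shown on the slide, so this sketch assumes the XML has already been parsed into a per-line dict; the dict keys and the function name `linearize` are hypothetical. Only the output format is taken from the example above.

```python
def linearize(line):
    """Turn one OCR line record (a hypothetical dict) into a CRF feature row."""
    def yes_no(flag):
        return "yes" if flag else "no"

    # Join the words with hyphens so the text forms one whitespace-free token.
    text = "-".join(line["text"].split())
    feats = [
        text,
        f"Loc_{line['loc']}",
        f"Align_{line['align']}",
        f"FontSize_{line['font_size']}",
        f"Bold_{yes_no(line['bold'])}",
        f"Italic_{yes_no(line['italic'])}",
        f"Picture_{yes_no(line['picture'])}",
        f"Table_{yes_no(line['table'])}",
        f"Bullet_{yes_no(line['bullet'])}",
    ]
    return " ".join(feats)


record = {
    "text": "Don't Look Now, But We've Created a Bureaucracy.",
    "loc": 0, "align": "left", "font_size": "largest",
    "bold": True, "italic": False,
    "picture": False, "table": False, "bullet": False,
}
print(linearize(record))
```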

Rich document representation – OCR features

• Position
  – Alignment: left, center, right & justified
  – Location: within-page location
• Format
  – FontSize: quantized based on frequency, e.g. smaller, smaller, base, -2, -1, 0
  – Bold
  – Italic
• Object
  – Bullet
  – Picture
  – Table
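
One way to quantize font sizes by frequency is sketched below. The exact scheme used in the slides is not recoverable from the text, so this is an assumption: the most frequent size in the document is treated as the base (offset 0), and every other size is labeled by its rank distance from the base. The function name is hypothetical.

```python
from collections import Counter


def quantize_font_sizes(sizes):
    """Map each raw font size to a rank offset from the most frequent (base) size.

    Assumes `sizes` is non-empty, with one entry per text line.
    """
    base = Counter(sizes).most_common(1)[0][0]  # most frequent size is the base
    distinct = sorted(set(sizes))
    base_rank = distinct.index(base)
    # -1 is one step smaller than the base, +1 is one step larger, and so on.
    return [distinct.index(s) - base_rank for s in sizes]


# Raw sizes for seven lines: a large title, body text, a footnote, a header.
sizes = [18, 10, 10, 10, 8, 12, 10]
print(quantize_font_sizes(sizes))
```

Using rank offsets instead of raw point sizes keeps the feature comparable across documents whose body text is set in different sizes.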

Experiments - datasets

• LS: 20 ACM, 10 CHI 2008, 10 ACL 2009 – fully labeled
• GS: 211 ACM papers – headers labeled

Skewed data

Experiments – metrics

• TP: number of correctly classified text lines (true positives); FP, FN, and TN are defined similarly for false positives, false negatives, and true negatives.
• Category-specific performance:
  – F1 measure = 2 x P x R / (P + R)
  – Precision P = TP / (TP + FP), Recall R = TP / (TP + FN)
• Overall performance:
  – Macro average: average of all category-specific F1 scores
  – Micro average: percentage of correctly labeled lines
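
These metrics can be computed as in the sketch below. The function name `evaluate` and the toy labels are illustrative; averaging the macro score over every category that appears in either the gold or the predicted labels is an assumption.

```python
def evaluate(gold, predicted):
    """Per-category P/R/F1 plus macro and micro averages over text lines."""
    categories = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    per_category = {}
    for c in categories:
        tp = sum(g == c and p == c for g, p in pairs)
        fp = sum(g != c and p == c for g, p in pairs)
        fn = sum(g == c and p != c for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_category[c] = (precision, recall, f1)
    # Macro: unweighted mean of category F1, so rare categories count equally.
    macro = sum(f1 for _, _, f1 in per_category.values()) / len(per_category)
    # Micro: fraction of lines labeled correctly, dominated by frequent categories.
    micro = sum(g == p for g, p in pairs) / len(pairs)
    return per_category, macro, micro


gold = ["title", "author", "bodyText", "bodyText", "bodyText", "footnote"]
pred = ["title", "bodyText", "bodyText", "bodyText", "bodyText", "bodyText"]
per_category, macro, micro = evaluate(gold, pred)
print(round(macro, 4), round(micro, 4))
```

In this toy case a classifier that over-predicts bodyText still labels 4 of 6 lines correctly, while its macro average is much lower because two categories score zero. That gap is why both averages are reported on skewed data.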

Experiments – LS results

• LSPC: baseline using only ParsCit features
• LSPC+RT: LSPC + raw-text features
• LSPC+RT+RD: LSPC+RT + rich document features (OCR)

• LSPC+RT+RD and LSPC+RT both beat LSPC by more than 10 F1 points
• LSPC+RT+RD < LSPC+RT: minor degradation for four categories
• LSPC+RT+RD > LSPC+RT: all other categories (many by more than 4 F1 points)
• Large improvements for footnote and subsubsectionHeader

Experiments – GS results

• GSmaxent: maximum entropy based system (Nguyen and Kan, 2007)
• GSCRF: our system
• GSCRF > GSmaxent in all categories except background
• Large improvements for discussions

Further analysis – Text features

• All contribute to the final composite performance
• Most influential: position

Further analysis – rich doc features

• Format contributes most to the macro average, while object influences the micro average most
• Format features help a wider spectrum of categories: paper metadata & section headers
• Object features enhance fewer categories, but ones with a large amount of training data, e.g. list item, table

Further analysis – rich doc features

• Most features improve both metrics, except align & table, which trade off macro vs. micro
• Location, Font, and Bullet are the most effective features in the position, format, and object groups, respectively

Error analysis - LS

Error analysis - GS

• Whole header: tokens do not overlap with any of the memoized training data instances. Needs to use body text instead (future work)
• Similar relative positions of consecutive headers: background vs. method, method vs. discussions, & discussions vs. conclusions
• The dataset skew also has an impact: there are many method headers but far fewer background and discussions headers, so many headers are mislabelled as method

Q & A

Thank you!
