© Generative Software Technologies Corp. 1
Knowledge Extraction from Technical Documents
*With first-class support for Feature Modeling
Rehan Rauf, Michal Antkiewicz, and Krzysztof Czarnecki
Generative Software Technologies Corp. Waterloo, Canada
http://gensoftech.com
The Idea
Specification Documents
[Figure: a spec doc with headings, text, and a table]
Physical structures: Section, Table, Paragraph
Logical structures (specification elements): Functional Reqs, Business Rules, Use Case
Recognize and extract specification elements based on physical document structure
ET – Extraction Tool searches for template instances
[Figure: ET matches a UC Template against a Spec Doc and finds instances UC 1 and UC 2]
Precondition: Documents have been authored with some template in mind
Application scenarios
Import to Requirements Mgmt Tools
[Figure: ET extracts Functional Reqs, Business Rules, and Use Cases from a Spec Doc for import]
Target tools: Doors, HP Quality Center, Requisite Pro, …
Structured Query
[Figure: QT runs a structured query over Use Cases, Functional Reqs, and Business Rules extracted from Spec Docs]
Example query: All use cases with actor = ‘customer’
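The example query on this slide can be illustrated with a minimal Python sketch. It assumes the extracted use cases are available as simple records; the `UseCase` class, the `query_by_actor` helper, and the sample data are hypothetical and are not part of QT.

```python
from dataclasses import dataclass, field


@dataclass
class UseCase:
    """Hypothetical record for one extracted use case."""
    name: str
    actor: str
    flow: list[str] = field(default_factory=list)


def query_by_actor(use_cases, actor):
    """Return all use cases whose actor matches, ignoring case and whitespace."""
    wanted = actor.strip().lower()
    return [uc for uc in use_cases if uc.actor.strip().lower() == wanted]


# Illustrative data only; not taken from any real specification.
extracted = [
    UseCase("Place order", "Customer", ["Select item", "Pay"]),
    UseCase("Restock shelf", "Clerk", ["Scan item"]),
    UseCase("Return item", "customer", ["Request refund"]),
]

matches = query_by_actor(extracted, "customer")
print([uc.name for uc in matches])
```

Because the query runs on extracted logical components rather than on raw text, it does not depend on how each document formats its actor field.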
Tracing
[Figure: tracing between Business Rules and Use Cases in two Spec Docs]
Template Conformance Checking
[Figure: Use Cases in a Spec Doc checked for conformance to the template]
Main Challenge: Logical and Physical Variation
Challenge – Variation
[Figure: instances of Use Case, annotated with logical components and component identifiers]
Variation Types
Two dimensions: Designed vs. Accidental; Logical vs. Physical
Designed Logical Variation
Optional component
Designed Logical Alternatives
Deeper decomposition
Different methodologies lead to logical variation
Designed Physical Variation
Different formatting
Accidental Variation
Logical: Missing components, e.g., actor
Physical: Spelling mistakes, e.g., “Actar”; style inconsistency, e.g., italics instead of bold
Solution
ET – Extraction Tool
Docs PSE
Physical components: Sections, lists, table cells
LSE
UC Template
Logical components: Actor, flow, extensions
Accidental variation: via match threshold
Designed variation: via template
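One way to picture how a match threshold can absorb accidental variation, such as the misspelled identifier “Actar”, is the Python sketch below. It assumes that matching compares a component identifier found in the document against the identifiers a template expects, using string similarity from the standard `difflib` module. The slides do not say how ET computes its match score, so the similarity measure, the `match_identifier` helper, and the threshold value of 0.75 are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_identifier(found, expected_ids, threshold=0.75):
    """Return the closest expected identifier, or None if it scores below the threshold."""
    best = max(expected_ids, key=lambda expected: similarity(found, expected))
    return best if similarity(found, best) >= threshold else None


# Identifiers a hypothetical use case template might expect.
expected = ["Actor", "Flow", "Extensions"]

print(match_identifier("Actar", expected))   # close enough to "Actor"
print(match_identifier("Budget", expected))  # no expected identifier is similar
```

A lower threshold tolerates more spelling mistakes but raises the risk of false matches, so the value would need tuning per document set.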
UC Template
[Figure: Metamodel with UC (Name : String) and Flow (Action : String), multiplicities *, 1, 1; physical elements Section, Heading, List, Paragraph; Mapping between the two]
Example Template
Logical Structure
Mapping
Regular Expressions
Lists
Component Nesting
Optional Components
Physical Alternatives
Templates with Tables
Logical Alternatives
ET – Extraction Tool
Docs PSE
Physical components
Basic: Paragraph, cell, graphic
Composite: Sections, lists, tables, …
LSE
UC Template
Logical components: Actor, flow, extensions
Physical Structure Extraction
Docs PSE
Physical components
Basic: Paragraph, cell, graphic
Composite: Sections, lists, tables, …
LSE
UC Template
Logical components: Actor, flow, extensions
PSE is the only part dependent on the document format
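The point that only PSE depends on the document format can be sketched in Python as an interface with one implementation per format. The `Section` record, the `PhysicalStructureExtractor` protocol, and the toy plain-text rule (a line ending in a colon opens a section) are assumptions made for illustration, not ET's actual design.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Section:
    """Hypothetical composite physical component: a heading plus its paragraphs."""
    heading: str
    paragraphs: list[str] = field(default_factory=list)


class PhysicalStructureExtractor(Protocol):
    """Anything that turns a raw document into physical components."""

    def extract(self, raw: str) -> list[Section]: ...


class PlainTextPSE:
    """Toy PSE for plain text: a line ending in ':' opens a new section."""

    def extract(self, raw: str) -> list[Section]:
        sections: list[Section] = []
        for line in raw.splitlines():
            text = line.strip()
            if not text:
                continue  # skip blank lines
            if text.endswith(":"):
                sections.append(Section(heading=text[:-1]))
            elif sections:
                sections[-1].paragraphs.append(text)
        return sections


def count_sections(pse: PhysicalStructureExtractor, raw: str) -> int:
    """Format-independent code sees only Section objects, never the raw format."""
    return len(pse.extract(raw))


sample = "Actor:\nCustomer\n\nFlow:\nSelect item\nPay\n"
sections = PlainTextPSE().extract(sample)
print([(s.heading, s.paragraphs) for s in sections])
```

Under this assumption, supporting another document format would mean adding another extractor class while the later stages stay unchanged.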
Performance
Can we extract logical structures from real-world documents?
Document Set
43 documents
24 from 3 companies
11 from public sources
6 student projects
2,000 to 23,000 words
Content: Use Cases, Data Objects, Business Rules, Functional Reqs, Non-Functional Reqs, …
Template Development
1) Write template manually
2) Verify extraction with ET
3) Refine template
[Figure: UC Template, ET, and instances UC1, UC2, ??]
Results
36 logical structures: use cases, data objects, business rules, …
Template sizes from 3 to 52 LOC
Total 942 instances
Nearly all instances perfectly recognized
100% recall for 33 templates; over 80% for remaining 3
100% precision for 35 templates; 87% for remaining 1
Error causes: severe formatting problems, e.g., manual line breaks; forgotten ids
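For readers unfamiliar with the two measures reported above, the Python sketch below shows how precision and recall are conventionally computed for a single template. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the evaluation data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of extracted instances that are correct.
    Recall: share of actual instances that were extracted."""
    extracted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 1.0
    recall = true_positives / actual if actual else 1.0
    return precision, recall


# Hypothetical counts for one template: 18 correct extractions,
# 2 spurious extractions, and 2 instances that were missed.
precision, recall = precision_recall(true_positives=18, false_positives=2, false_negatives=2)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

A template with 100% recall misses no instance in the documents, and one with 100% precision extracts nothing that is not a real instance.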
Other Questions
Amount & kind of template change in refinement: 1% – 25% of LOC affected during refinement; 81% of changes concern optionality (add ‘?’ or component)
Number of iterations: 1 instance (11 cases) to 50% of all instances (6 cases), e.g., 10 out of 20 (2 cases); mostly simple edits, add ‘?’
Implication: Start with few examples, then edit the template based on expert knowledge (e.g., add ‘?’)
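The refinement step of marking a component optional can be illustrated with a small Python sketch. The template language itself is not reproduced in this transcript, so the `Component` class, its `optional` flag (standing in for ‘?’), and the heading patterns are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical template component identified by a heading pattern."""
    name: str
    heading_pattern: str
    optional: bool = False  # plays the role of '?' in a template


def match_instance(template, headings):
    """Map component names to headings; return None if a required component is missing."""
    result = {}
    for comp in template:
        hit = next(
            (h for h in headings if re.fullmatch(comp.heading_pattern, h, re.IGNORECASE)),
            None,
        )
        if hit is not None:
            result[comp.name] = hit
        elif not comp.optional:
            return None
    return result


# First draft, written from a few examples: every component is required.
first_draft = [
    Component("actor", r"actors?"),
    Component("flow", r"(main )?flow"),
    Component("extensions", r"extensions"),
]

# A later instance has no extensions, so the first draft rejects it.
instance = ["Actor", "Main Flow"]
print(match_instance(first_draft, instance))

# Refinement: mark the extensions component as optional.
refined = first_draft[:2] + [Component("extensions", r"extensions", optional=True)]
print(match_instance(refined, instance))
```

The edit touches a single line of the template, in keeping with the observation that most refinements are simple optionality changes.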
Related Work
Import to Req Mgmt Tools: tools prescribe document structure; manual markup for fine-grained extraction
Wrapper induction: machine-generated docs (web pages); induced regex not human readable (no modeling language)
Natural language processing: can benefit from structure-induced semantic tags
Future: Template by Example
1) Mark up sample document
2) Extract template (TE)
3) Verify extraction (ET)
4) Refine template
[Figure: sample instances UC1 and UC2, and the resulting UC Template]
Summary
ET – Design
[Figure: Spec Docs pass through PSE (physical components) and LSE with a UC Template (logical components: Use Case, B. Rules, Functional Reqs)]
Application scenarios: Import, Query (QT), Tracing, Conformance
Template development
Evaluation results: Nearly all instances perfectly recognized; 43 real-world documents
Technology available at http://gensoftech.com/IntelligentET