Beyond Genes, Proteins, and Abstracts:
A Framework to Capture Scientific Claims
Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
http://www.ils.unc.edu/[email protected]
Motivation
• Relentless increase in electronically available text
  – Life Sciences
    • The NLM added the 17 millionth entry to PubMed in April 2007
    • 5,200 journals indexed
    • 12,000 new articles each week!
  – Chemistry – more than 110,000 articles in 1 year alone
• Consequences:
  – Hundreds of thousands of relevant articles
  – Implicit connections between literature go unnoticed
Shift from Retrieval to Synthesis
Entity Extraction
• Newspaper genre
  – People, places, and organizations
  – Message Understanding Conference (MUC)
• Biomedical genre
  – Genes and proteins
  – Diseases and treatments
  – Chemical compounds
  – Challenges: BioCreative, GENIA, JNLPBA
Relationship Extraction
• Newspaper genre
  – Person moving from one company to another
• Biomedicine genre
  – genes and proteins, e.g. binds, inhibits
  – ARBITER (Rindflesch, Rajan, & Hunter, 2000)
  – Geneways (Rzhetsky, et al, 2004)
  – relEx (Fundel, Kuffner, & Zimmer, 2007)
  – GENIA www-tsujii.is.s.u-tokyo.ac.jp/GENIA
Causal Relationships
• Newspaper genre
  – Causal relationships (Khoo, Chan, & Niu, 1998)
• Biomedical genre
  – Causes and treats (Price & Delcambre, 2005)
  – Causal knowledge (Khoo, Chan, & Niu, 2000)
• Universal Grammar
  – Causatives (Comrie, 1974, 1981)
  – Action verbs (Thomson, 1987)
Claim Definition
• “To assert in the face of possible contradiction”
• Example sentence reporting a claim
  – “This study showed that Tamoxifen reduces the breast cancer risk”
• Example Claim Framework
  – Tamoxifen (agent)
  – reduces (change)
  – [breast cancer risk] (object)
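As a minimal sketch, the example claim above could be held in a small record type. The class name `Claim`, its field names, and the `as_text` helper are assumptions chosen for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical container for the three facets in the example."""
    agent: str   # concept that brings about the change
    change: str  # nature of the change
    obj: str     # concept that is changed ("object" on the slide)

    def as_text(self) -> str:
        # Render the facets in agent, change, object order.
        return f"{self.agent} {self.change} {self.obj}"


tamoxifen_claim = Claim(agent="Tamoxifen", change="reduces",
                        obj="breast cancer risk")
print(tamoxifen_claim.as_text())
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch so that claims can be stored in sets and compared directly.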
Goals
• Create a Framework that reflects how claims are made in biomedical literature
• The Framework should
  – generalize beyond biomedicine
  – differentiate between different levels of confidence in the claim
  – consider claims made in the full text
• Populate the Framework automatically
The Claim Framework
• Information facets
  – concepts
  – change
  – basis of the claim
• Each information facet may have
  – modifiers
  – directionality
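One way to picture a facet that carries optional modifiers and directionality is the sketch below. The `Facet` class and its field names are hypothetical; the sample values are taken from the correlation and implicit-claim examples in this talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Facet:
    """Hypothetical record for one information facet."""
    text: Optional[str] = None       # the facet's own words, if any
    direction: Optional[str] = None  # directionality, e.g. "increases"
    modifiers: List[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        # A facet counts as empty when nothing was annotated for it.
        return not (self.text or self.direction or self.modifiers)


# Change facet with a modifier but no direction (correlation example).
correlation_change = Facet(text="correlation",
                           modifiers=["statistically significant"])
# Change facet that consists of a direction only (implicit-claim example).
implicit_change = Facet(direction="increases")
```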
The Claim Framework
| Category | Concept A | Concept B | Nature of change | Claim Basis |
|---|---|---|---|---|
| 1. Explicit Claim | Agent | Object | Required | Optional |
| 2. Implicit Claim | Agent | Object | Optional | Optional |
| 3. Correlation | Required | Required | Required | Optional |
| 4. Comparison | Required | Required | Required | Required |
| 5. Observation | N/A | Required | Required | Optional |
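The requirement table can be read as a lookup from claim category to the facets that must be present. The sketch below encodes it that way; the names `REQUIREMENTS`, `FACETS`, and `missing_facets` are hypothetical, and treating the "Agent" and "Object" cells as required concepts is an assumption of this sketch rather than something the slide states.

```python
REQUIRED, OPTIONAL, NOT_APPLICABLE = "required", "optional", "n/a"

# Column order of the table: concept A, concept B, nature of change, basis.
FACETS = ("concept_a", "concept_b", "change", "basis")

# Assumption: "Agent" / "Object" cells are treated as required concepts.
REQUIREMENTS = {
    "explicit":    (REQUIRED, REQUIRED, REQUIRED, OPTIONAL),
    "implicit":    (REQUIRED, REQUIRED, OPTIONAL, OPTIONAL),
    "correlation": (REQUIRED, REQUIRED, REQUIRED, OPTIONAL),
    "comparison":  (REQUIRED, REQUIRED, REQUIRED, REQUIRED),
    "observation": (NOT_APPLICABLE, REQUIRED, REQUIRED, OPTIONAL),
}


def missing_facets(category, facets):
    """Return the required facet names that are absent or empty."""
    rules = REQUIREMENTS[category]
    return [name for name, rule in zip(FACETS, rules)
            if rule == REQUIRED and not facets.get(name)]


# A comparison without a claim basis is incomplete under the table.
# The direction word "higher" is stored under "change" for simplicity.
comparison = {"concept_a": "Patients with AML",
              "concept_b": "normal controls",
              "change": "higher"}
print(missing_facets("comparison", comparison))
```

An observation needs no concept A, so a dictionary holding only `concept_b` and `change` passes the same check.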
Explicit Claims
Indeed, glycine prevented Wy-14643-stimulated superoxide production by Kupffer cells.
Claim 1
– glycine (agent)
– prevented (change)
– [Wy-14643-stimulated superoxide production] (object)
Claim 2
– [Kupffer cells] (agent)
– produces (change)
– [Wy-14643-stimulated superoxide] (object)
Implicit Claims
In liver the number of peroxisomes increases from about 500-600/cell to > 5000/cell after exposure to peroxisome proliferators.
Claim 1
– [Peroxisome proliferators] (agent)
– increases (changeDirection)
– Peroxisomes (object)
– [In the liver] (agentModifier)
– [number] (agentModifier)
Correlations
A weak but statistically significant correlation was observed between the plasma nm23-H1 level and the WBC count (Figure 1, n=102, r=0.437, P<0.0001)
– [plasma nm23-H1 level] (agent)
– [WBC count] (object)
– correlation (change)
– [statistically significant] (changeModifier)
Comparisons
The plasma concentration of nm23-H1 was higher in patients with AML than in normal controls (P = .0001)
Claim 1
– [plasma concentration of nm23-H1] (basis of claim)
– [Patients with AML] (agent)
– higher (changeDirection)
– [normal controls] (object)
Observations
However, the plasma nm23-H1 protein level was increased in AML-M3 patients (P=.0002)
Claim 1
– [nm23-H1 protein level] (object)
– Increased (changeDirection)
– [AML-M3 patients] (objectModifier)
Working Hypothesis 1
The Claim Framework reflects how a scientist communicates her findings
– Full text documents randomly selected from biomedical literature will report findings using constructs within the Claim Framework
– Human annotators will agree on facets within the Claim Framework
– The Claim Framework will generalize to a variety of scientific literatures
Working Hypothesis 2
Facets within the Claim Framework can be populated automatically
– The system will detect all claims identified by the human annotators (i.e. recall)
– The system will only identify claims that were identified by the human annotators (i.e. precision)
– The system design will generalize to new literatures by avoiding domain specific constructs
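The recall and precision criteria above can be sketched as a set comparison between system output and annotator output. The function name, the tuple representation of a claim, and the toy system output below are all assumptions made for illustration; they are not results from the study.

```python
def precision_recall(system_claims, annotator_claims):
    """Compare two collections of hashable claims by exact match."""
    system, gold = set(system_claims), set(annotator_claims)
    true_positives = len(system & gold)
    # Precision: share of system claims the annotators also identified.
    precision = true_positives / len(system) if system else 0.0
    # Recall: share of annotator claims the system detected.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Annotator claims taken from the explicit-claim example sentence.
gold = [
    ("glycine", "prevented", "Wy-14643-stimulated superoxide production"),
    ("Kupffer cells", "produces", "Wy-14643-stimulated superoxide"),
]
# Invented system output: one correct claim and one spurious claim.
system = [
    ("glycine", "prevented", "Wy-14643-stimulated superoxide production"),
    ("glycine", "produces", "superoxide"),
]
precision, recall = precision_recall(system, gold)
```

Exact matching on whole tuples is the strictest choice; a real evaluation might score each facet separately or allow partial span overlap.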
Validating the Claim Framework
• Draft Claim Framework given to two annotators
• Pilot Study: Identify every claim
  – Include claims that don’t conform to the framework
  – Don’t consider how this will be automated
Validating the Claim Framework
• Main study
  – 25 articles
• Verification
  – Random set of sentences annotated twice
  – Feedback provided daily
Results
• All documents
  – Total number of sentences: 5535
  – Sentences with >=1 claim: 1250 (22.6%)
  – Total number of claims: 3228
  – Average claims per sentence: 2.51
  – Claims that did not fit in the Framework: 31
• Per document
  – Average number of sentences: 191
  – Average number of sentences with >=1 claim: 43
Distribution of Claim Categories
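| Category | Total | Total (%) | Pilot | Pilot (%) | Main | Main (%) |
|---|---|---|---|---|---|---|
| Explicit | 2489 | 77.11 | 332 | 83.42 | 2157 | 76.63 |
| Implicit | 87 | 2.70 | 3 | 0.75 | 84 | 2.98 |
| Observation | 298 | 9.23 | 24 | 6.03 | 274 | 9.73 |
| Correlation | 174 | 5.39 | 12 | 3.02 | 162 | 5.75 |
| Comparison | 165 | 5.11 | 27 | 6.85 | 138 | 4.9 |
| Total | 3228 | 100 | 398 | 100 | 2830 | 100 |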
All Documents

| Annotation | Total | Total (%) | Words | Words (Avg) |
|---|---|---|---|---|
| Agent | 2894 | 89.65 | 5221 | 1.80 |
| Agent Direction | 285 | 8.83 | 291 | 1.02 |
| Agent Modifier | 1246 | 38.60 | 4448 | 3.57 |
| Object | 3197 | 99.04 | 6849 | 2.14 |
| Object Direction | 271 | 8.40 | 283 | 1.04 |
| Object Modifier | 1561 | 48.36 | 5383 | 3.44 |
| Change | 1897 | 58.77 | 1953 | 1.03 |
| Change Direction | 1337 | 41.42 | 1358 | 1.02 |
| Change Modifier | 1147 | 35.53 | 1618 | 1.41 |
| Claim Basis | 165 | 5.11 | 394 | 2.39 |
| Claim Basis Dir. | 42 | 1.30 | 43 | 1.02 |
| Claim Basis Mod. | 86 | 2.66 | 266 | 3.09 |
| Total | 3228 | | 28107 | 8.70 |
Inter Annotator Agreement
| Information Facet | Kappa | Agreement |
|---|---|---|
| Agent | 0.71 | substantial |
| Object | 0.77 | substantial |
| Change | 0.57 | moderate |
| Change+ChangeDir | 0.88 | almost perfect |
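The slides do not say which kappa statistic was used or how the agreement labels were assigned. As a hedged sketch, the code below assumes Cohen's kappa for two annotators and the Landis and Koch (1977) bands, which are consistent with the labels in the table. The function names and the toy label lists are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa):
    """Map a kappa value to a Landis and Koch (1977) band (assumed)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy annotations of four tokens by two annotators (not study data).
annotator_1 = ["agent", "agent", "object", "object"]
annotator_2 = ["agent", "object", "object", "object"]
toy_kappa = cohen_kappa(annotator_1, annotator_2)
```

Applied to the values in the table, `agreement_label` gives the same labels as the slide, for example "moderate" for 0.57.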
Location of Claims
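| Section | Sentences with claim | Total sentences | % of section | % of claim sentences |
|---|---|---|---|---|
| Abstract | 98 | 309 | 31.72 | 7.84 |
| Introduction | 357 | 979 | 36.47 | 28.56 |
| Method | 6 | 1121 | 0.54 | 0.48 |
| Result | 293 | 1829 | 16.02 | 23.44 |
| Discussion | 539 | 1406 | 38.34 | 43.12 |
| Total | 1250 | 5535 | 22.58 | 100.00 |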
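The two percentage columns can be reproduced from the counts. The sketch below assumes that the first percentage divides by the section's own sentence count and the second divides by the 1250 claim-bearing sentences reported in the Total row; the names `SECTION_COUNTS` and `section_percentages` are hypothetical.

```python
# (sentences with a claim, total sentences) per section, from the table.
SECTION_COUNTS = {
    "Abstract": (98, 309),
    "Introduction": (357, 979),
    "Method": (6, 1121),
    "Result": (293, 1829),
    "Discussion": (539, 1406),
}
# Denominator for the last column, taken as reported in the Total row.
TOTAL_CLAIM_SENTENCES = 1250


def section_percentages(with_claim, total, all_claim_sentences):
    """Return (% of the section's sentences, % of all claim sentences)."""
    pct_of_section = round(100 * with_claim / total, 2)
    pct_of_claims = round(100 * with_claim / all_claim_sentences, 2)
    return pct_of_section, pct_of_claims


percentages = {
    section: section_percentages(with_claim, total, TOTAL_CLAIM_SENTENCES)
    for section, (with_claim, total) in SECTION_COUNTS.items()
}
for section, (of_section, of_claims) in percentages.items():
    print(f"{section:<13}{of_section:>7.2f}{of_claims:>7.2f}")
```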
Findings thus far
• 99% of the claims made in these articles could be captured in the Claim Framework
• 22% of sentences report at least 1 claim
• 77% of the claims identified were explicit
• 8% of claims are made in the abstract
• Agreement
  – substantial for agents and objects
  – almost perfect for change and change direction
Acknowledgements
– This project was supported in part by
  – Renaissance Computing Institute (RENCI) Faculty Fellowship Program
  – NSF Center for Environmentally Responsible Solvents and Processes (CERSP CHE-9876674)
– This project used resources provided by the OSG, which is supported by the NSF & the U.S. Department of Energy's Office of Science
• The speaker thanks
  • Nassib Nassar and Mats Rynge (RENCI)
  • Amol Bapat and Ryan Jones (SILS)
Questions and Comments Welcome
Catherine Blake
[email protected]
http://www.ils.unc.edu/~cablake
School of Information and Library Science
University of North Carolina at Chapel Hill