Transcript
Page 1: Comparing Frequency of Content- Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001 James

Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study
September 4, 2001

James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries
CECS, HMI, Statistics, and SISLT

Page 2

Abstract
• Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared to that of the articles they represent. Even though many sources are now available in full-text, many still rely on the abstract for retrieval …

• … In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.

Page 3

Background
• Many retrieval systems still use abstracts as surrogates for full text.
• Abstracts are often indexed with respect to word occurrence by employing Zipf’s Law.
– The product of a word’s occurrence frequency and the rank of that frequency is roughly constant.
– The most frequent and least frequent words contribute little to article content.
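The rank-frequency relation above can be sketched in a few lines of Python. This is an illustration only: the function name and the toy word list are assumptions made for the example, not material from the study, and the corpus is constructed so that each product comes out the same.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Made-up toy corpus in which frequency is exactly 12 / rank.
toy_words = ["the"] * 12 + ["of"] * 6 + ["patient"] * 4 + ["trial"] * 3
rows = rank_frequency_products(toy_words)
for row in rows:
    print(row)
```

On real text the products only hover around a constant, which is why the relation is usually treated as an approximation.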

Page 4

Background (cont.)
• Previous studies have shown that abstracts are sometimes inconsistent with their corresponding articles. However, no study has previously shown that abstracts and articles are inconsistent in a statistical sense.

Page 5

Methods
• 4 medical journals (BMJ, JAMA, Lancet, and NEJM)
– Two different countries
– Many medical subdisciplines
– Regarded as top journals
– Available in electronic format
• Studied all articles published during 1999 that contained an abstract and were 2 pages or longer.
– 1,138 articles - 35 parsing problems = 1,103 articles

Page 6

Methods (cont.)
• The text of articles and abstracts was downloaded and stored as HTML.
• The HTML was parsed into separate abstract and article files via a custom C++ parsing program.
• References and figures were removed.

Page 7

Methods (cont.)
• “Content-bearing words” were extracted from abstracts and articles.
– Numerical values, special characters, and captions were excluded and used as word delimiters.
• Words contained in a home-grown “stop word list” (words with little or no medical meaning) were removed.
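A minimal sketch of this extraction step, assuming a simple rule that anything other than a letter acts as a word delimiter. The stop words below are hypothetical placeholders, since the study's home-grown list is not reproduced on the slides, and caption removal is not modelled here.

```python
import re

# Hypothetical stop words for illustration; the study's actual list is not given.
STOP_WORDS = {"the", "of", "in", "and", "was", "were", "with", "a"}

def content_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on any run of non-letters, and drop stop words."""
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    # Empty strings appear when the text starts or ends with a delimiter.
    return [t for t in tokens if t and t not in stop_words]

sample = "Risk of thrombosis was 3.5-fold higher in 140 women."
words = content_words(sample)
print(words)
```

Treating digits and punctuation as delimiters means a token such as "3.5-fold" contributes only "fold", which matches the slide's description of numerical values being excluded.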

Page 8

Methods (cont.)
• Remaining words were conflated using NLM’s LVG tools.
– E.g., “reading” -> “read”
• Frequencies of all conflated words were calculated for abstracts and articles.
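The conflate-then-count step might look like the sketch below. The lookup table is a hypothetical stand-in for the output of the LVG tools and does not reflect the LVG interface; only the "reading" to "read" mapping comes from the slide.

```python
from collections import Counter

# Hypothetical base-form table standing in for LVG output (not the real LVG API).
BASE_FORMS = {"reading": "read", "reads": "read", "contraceptives": "contraceptive"}

def conflate(word, base_forms=BASE_FORMS):
    """Map a word to its base form, leaving unknown words unchanged."""
    return base_forms.get(word, word)

def conflated_frequencies(words):
    """Count occurrences after conflating every word to its base form."""
    return Counter(conflate(w) for w in words)

freqs = conflated_frequencies(
    ["reading", "read", "reads", "contraceptive", "contraceptives"]
)
print(freqs)
```

Running the same function over an abstract's words and over the article's words gives the two frequency tables that the analysis compares.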

Page 9

Analysis
• Used a chi-squared test to determine whether discrepancies between observed occurrences in the abstract and occurrences in the article were due to sampling or were truly indicative of a difference in content.

Page 10

Analysis (cont.)
• Example: Rosing (Lancet)
– Abstract contained 140 content-bearing words.
– “contraceptive” appeared 6 times in the abstract and 35 times in the text of the article.
– Since text contained 1081 content-bearing words, expect 140/1081 * 35 = 3.35 occurrences of this term in the abstract.

Page 11

Analysis (cont.)
• Example: Rosing (Lancet)
– The actual number of occurrences was 6, so the square of the error divided by the expected count was added to the chi-squared statistic for this particular word (i.e., ((6-3.35)^2)/3.35 = 2.10).
– Every other content-bearing word in the article was compared to the abstract in this way, and the sum of all of these terms was the total chi-squared statistic for the given article.
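The per-article statistic described in this example can be sketched as follows. The function names and the dictionary-of-counts representation are assumptions for illustration, not the authors' program. One caution: with the figures as printed, 140/1081 * 35 evaluates to about 4.53 rather than 3.35, so one of the printed counts may differ from what was actually used; the sketch simply reuses the slide's expected value of 3.35 for the single-word term.

```python
def contribution(observed, expected):
    """One word's term in the statistic: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected

def chi_squared_statistic(abstract_counts, article_counts):
    """Sum the per-word terms over every word that occurs in the article."""
    abstract_total = sum(abstract_counts.values())
    article_total = sum(article_counts.values())
    # Scale article counts down to the size of the abstract.
    scale = abstract_total / article_total
    statistic = 0.0
    for word, article_count in article_counts.items():
        expected = scale * article_count
        observed = abstract_counts.get(word, 0)
        statistic += contribution(observed, expected)
    return statistic

# The slide's worked term: 6 observed occurrences against 3.35 expected.
term = contribution(6, 3.35)
print(round(term, 2))

# Made-up counts: the abstract over-uses "a" and under-uses "b".
toy = chi_squared_statistic({"a": 3, "b": 1}, {"a": 10, "b": 10})
print(toy)
```

The statistic is zero only when the abstract uses every word in exact proportion to the article, and it grows as the two distributions diverge.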

Page 12

Analysis (cont.)
• We reran our analysis using the Bonferroni Inequality measure to ensure that we would not obtain incorrect results simply by virtue of our large sample size.
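As usually stated, the Bonferroni adjustment divides the significance level by the number of tests performed. The slide gives neither the significance level nor the number of tests, so the 0.05 level and the p-values below are made-up assumptions used only to show the mechanics.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after the Bonferroni adjustment."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that stays below the adjusted threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

# Made-up p-values; alpha = 0.05 is an assumed level, not taken from the study.
p_values = [0.001, 0.02, 0.04, 0.30]
uncorrected = [p < 0.05 for p in p_values]
corrected = significant_after_bonferroni(p_values)
print(uncorrected)
print(corrected)
```

Because the adjusted threshold is stricter, fewer tests are flagged as significant, which is consistent with fewer abstract/article pairs being counted as disagreeing once the adjustment is applied.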

Page 13

Cumulative Results w/o Bonferroni

[Chart: Agreement, 0.00% to 100.00%, by journal (JAMA, NEJM, BMJ, Lancet)]

Page 14

Cumulative Results w/o Bonferroni

Journal  Chi-Sq Stat  Deg Freedom  Chi-Squared Dist  Error  Agree  Disagree  Agree Pct
JAMA     454.08       560.18       85.8518%          32     270    23        92.15%
NEJM     363.39       494.94       92.6729%          3      214    9         95.96%
BMJ      295.71       410.13       94.6461%          0      197    7         96.57%
Lancet   402.46       555.14       94.6339%          0      374    9         97.65%
Total                                                35     1055   48        95.65%
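The Agree Pct column appears to be the share of agreeing pairs among the pairs that parsed successfully, that is, agree / (agree + disagree). The short check below, with the counts copied from the table and variable names chosen for illustration, reproduces the printed percentages under that reading.

```python
# (agree, disagree) counts per journal, copied from the table above.
counts = {"JAMA": (270, 23), "NEJM": (214, 9), "BMJ": (197, 7), "Lancet": (374, 9)}

def agree_pct(agree, disagree):
    """Percentage of abstract/article pairs that agree."""
    return 100.0 * agree / (agree + disagree)

per_journal = {j: round(agree_pct(a, d), 2) for j, (a, d) in counts.items()}
total_agree = sum(a for a, _ in counts.values())
total_disagree = sum(d for _, d in counts.values())
total_pct = round(agree_pct(total_agree, total_disagree), 2)

print(per_journal)
print(total_agree, total_disagree, total_pct)
```

The totals of 1055 and 48 sum to 1,103, the number of articles remaining after the 35 parsing problems were set aside.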

Page 15

Cumulative Results w/ Bonferroni

[Chart: Agreement, 0.00% to 100.00%, by journal (JAMA, NEJM, BMJ, Lancet)]

Page 16

Cumulative Results w/ Bonferroni

Journal  Chi-Sq Stat  Deg Freedom  Chi-Squared Dist  Error  Agree  Disagree  Agree Pct
JAMA     454.08       560.18       85.851823405%     32     283    10        96.59%
NEJM     363.39       494.94       92.672927800%     3      220    3         98.65%
BMJ      295.71       410.13       94.646128358%     0      203    1         99.51%
Lancet   402.46       555.14       94.633930987%     0      378    5         98.69%
Total                                                35     1084   19        98.28%
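Comparing the two results tables shows how many pairs moved from "disagree" to "agree" once the Bonferroni adjustment was applied. The sketch below copies the agree and disagree counts from both tables; the function and variable names are illustrative assumptions.

```python
# (agree, disagree) counts copied from the two results tables.
without_bonferroni = {"JAMA": (270, 23), "NEJM": (214, 9), "BMJ": (197, 7), "Lancet": (374, 9)}
with_bonferroni = {"JAMA": (283, 10), "NEJM": (220, 3), "BMJ": (203, 1), "Lancet": (378, 5)}

def reclassified(before, after):
    """Pairs per journal that moved from 'disagree' to 'agree'."""
    return {journal: before[journal][1] - after[journal][1] for journal in before}

moved = reclassified(without_bonferroni, with_bonferroni)
total_moved = sum(moved.values())
total_agree = sum(agree for agree, _ in with_bonferroni.values())
total_pairs = sum(agree + disagree for agree, disagree in with_bonferroni.values())
overall_pct = round(100.0 * total_agree / total_pairs, 2)

print(moved, total_moved)
print(total_agree, total_pairs, overall_pct)
```

The drop in disagreeing pairs from 48 to 19 accounts for the rise in overall agreement from 95.65% to 98.28%.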

Page 17

Future Work
• Utilize a smaller, more standard stop word list (see Su K, et al., “Comparing Frequency of Word Occurrences in Abstracts and Texts Using Two Stop Word Lists” in Fall 2001 AMIA Proceedings).
• Explore “over agreement”.

Page 18

Future Work (cont.)
• Compare phrases (terms) rather than words.
• Utilize the UMLS to compare Concept Unique Identifiers (CUIs) via MetaMap rather than words or phrases.
– Changes in agreement/disagreement may indicate the use of synonyms which might still negatively affect retrieval.

Page 19

Conclusion
• In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.
• Our test was “conservative” in the sense that we can only strongly state that a small number of abstract/article pairs do “disagree”. However, the remaining articles can only be said to not conclusively disagree.

Page 20

Acknowledgements
• This research was supported in part by grant T15-089 LM0708-09 from the National Library of Medicine, United States of America.

Page 21

Questions

• http://riesj.hmi.missouri.edu

[email protected]

