Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study

September 4, 2001

James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries

CECS, HMI, Statistics, and SISLT

Abstract

• Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared to that of the articles they represent. Even though many sources are now available in full-text, many still rely on the abstract for retrieval …

• … In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.

Background

• Many retrieval systems still use abstracts as surrogates for full text.
• Abstracts are often indexed with respect to word occurrence by employing Zipf’s Law.
– The product of occurrence frequency and rank of occurrence frequency is constant.
– The most frequently and least frequently occurring words contribute little to article content.
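The rank-frequency relation above can be checked on any list of tokens. The following is a minimal Python sketch, not part of the study: the function name `rank_frequency_products` and the token list are illustrative assumptions, and the sample data was made up so that the products come out roughly constant.

```python
from collections import Counter

def rank_frequency_products(tokens):
    """Return (word, rank, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Made-up token list, chosen only to illustrate the calculation.
tokens = ["the"] * 6 + ["of"] * 3 + ["trial"] * 2 + ["placebo"]
for row in rank_frequency_products(tokens):
    print(row)
```

On real text the product is only approximately constant, and the relation tends to hold best in the middle ranks.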

Background (cont.)

• Previous studies have shown that abstracts are sometimes inconsistent with their corresponding articles.
• However, no study has previously shown that abstracts and articles are inconsistent in a statistical sense.

Methods

• 4 medical journals (BMJ, JAMA, Lancet, and NEJM)
– Two different countries
– Many medical subdisciplines
– Regarded as top journals
– Available in electronic format
• Studied all 1999 articles that contained an abstract and were 2 pages or longer.
– 1,138 articles − 35 parsing problems = 1,103 articles

Methods (cont.)

• Text of articles and abstracts was downloaded and stored in HTML.
• HTML was parsed into separate abstract and article files via a custom C++ parsing program.
• References and figures were removed.

Methods (cont.)

• “Content-bearing words” were extracted from abstracts and articles.
– Numerical values, special characters, and captions were excluded and used as word delimiters.
• Words contained in a home-grown “stop word list” (words with little or no medical meaning) were removed.
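The extraction step can be sketched as follows. This is an illustrative Python version, not the study's C++ program: the stop word set is a hypothetical stand-in for the home-grown list, only ASCII letters are treated as word characters, and caption removal is not handled.

```python
import re

# Hypothetical stop words for illustration; the study's home-grown list is not reproduced here.
STOP_WORDS = {"the", "of", "and", "in", "a", "was", "were", "with"}

def content_bearing_words(text, stop_words=STOP_WORDS):
    """Split on anything that is not a letter, lowercase, and drop stop words."""
    # Digits and special characters are never kept as words; they only separate words.
    candidates = re.split(r"[^A-Za-z]+", text)
    return [w.lower() for w in candidates if w and w.lower() not in stop_words]

print(content_bearing_words("Mortality was 12.5% in the placebo group, and 9% with treatment."))
```

Treating numbers and punctuation as delimiters means a token such as "12.5%" contributes no words at all, which matches the exclusion rule described above.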

Methods (cont.)

• Remaining words were conflated using NLM’s LVG tools.
– E.g., “reading” -> “read”
• Frequencies of all conflated words were calculated for abstracts and articles.
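Conflation followed by counting might look like the sketch below. The lookup table is a hypothetical stand-in for LVG, which is not called here; the function names and sample words are likewise assumptions made for illustration.

```python
from collections import Counter

# Hypothetical lookup table standing in for LVG; the real tool is not called here.
CONFLATION_TABLE = {"reading": "read", "reads": "read", "contraceptives": "contraceptive"}

def conflate(word, table=CONFLATION_TABLE):
    """Map a word to its base form, leaving unknown words unchanged."""
    return table.get(word, word)

def conflated_frequencies(words):
    """Count occurrences of each conflated word."""
    return Counter(conflate(w) for w in words)

freqs = conflated_frequencies(["reading", "read", "reads", "contraceptive", "contraceptives"])
print(freqs)
```

Counting after conflation means that variant forms of one word add to a single frequency rather than being spread over several entries.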

Analysis

• Used a chi-squared test to determine whether discrepancies between observed occurrences in the abstract and occurrences in the article were due to sampling or were truly indicative of a difference in content.

Analysis (cont.)

• Example: Rosing (Lancet)
– Abstract contained 140 content-bearing words.
– “contraceptive” appeared 6 times in the abstract and 35 times in the text of the article.
– Since the text contained 1081 content-bearing words, expect 140/1081 * 35 = 3.35 occurrences of this term in the abstract.

Analysis (cont.)

• Example: Rosing (Lancet)
– The actual number of occurrences was 6, so the square of the error divided by the expected count was added to the chi-squared statistic for this particular word (i.e., ((6-3.35)^2)/3.35 = 2.10).
– Every other content-bearing word in the article was compared to the abstract in this way, and the sum of all of these terms was the total chi-squared statistic for the given article.
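The per-word term and the per-article sum can be sketched in Python as below. This is a reading of the procedure described on the slides, not the authors' code: the function names and the small count dictionaries are assumptions. The sketch takes the slide's expected value of 3.35 as given for the per-word term. Evaluating 140/1081 * 35 directly gives about 4.53, so the worked example may rest on figures slightly different from those quoted.

```python
def expected_in_abstract(abstract_total, text_total, text_count):
    """Expected abstract occurrences if the abstract mirrored the text's proportions."""
    return abstract_total / text_total * text_count

def chi_squared_term(observed, expected):
    """One word's contribution: squared error divided by the expected count."""
    return (observed - expected) ** 2 / expected

def chi_squared_statistic(abstract_counts, text_counts):
    """Sum the per-word contributions over every word that occurs in the text."""
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    total = 0.0
    for word, text_count in text_counts.items():
        expected = expected_in_abstract(abstract_total, text_total, text_count)
        total += chi_squared_term(abstract_counts.get(word, 0), expected)
    return total

# The per-word term from the slide: observed 6, expected 3.35.
print(round(chi_squared_term(6, 3.35), 2))

# Made-up counts, used only to exercise the summation.
abstract_counts = {"contraceptive": 2, "thrombosis": 1}
text_counts = {"contraceptive": 10, "thrombosis": 5, "risk": 5}
print(chi_squared_statistic(abstract_counts, text_counts))
```

A word that appears in the text but never in the abstract still contributes, with an observed count of 0, so omissions raise the statistic just as over-use does.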

Analysis (cont.)

• We reran our analysis using the Bonferroni inequality to ensure that we would not get incorrect results simply by virtue of our large sample size.
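The slides do not say how the Bonferroni inequality was applied. A common form divides the overall significance level by the number of tests, which the Python sketch below illustrates. It assumes one test per article and a conventional level of 0.05; both are assumptions, as are the function names and the sample p-values.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

def disagrees(p_value, alpha, num_tests):
    """Flag a pair as disagreeing only if its p-value falls below the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, num_tests)

# Assumed values: alpha = 0.05 is a conventional choice, and 1,103 is the article count
# from the Methods slide; neither is stated as the study's actual Bonferroni setup.
print(bonferroni_threshold(0.05, 1103))
print(disagrees(0.001, 0.05, 1103))  # made-up p-value
```

Under this form the corrected threshold is stricter than the uncorrected one, so fewer pairs are flagged as disagreeing.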

Cumulative Results w/o Bonferroni

[Figure: bar chart of Agreement (0.00% to 100.00%) by journal: JAMA, NEJM, BMJ, Lancet]

Cumulative Results w/o Bonferroni

Journal  Chi-Sq Stat  Deg Freedom  Chi-Squared Dist  Error  Agree  Disagree  Agree Pct
JAMA     454.08       560.18       85.8518%          32     270    23        92.15%
NEJM     363.39       494.94       92.6729%          3      214    9         95.96%
BMJ      295.71       410.13       94.6461%          0      197    7         96.57%
Lancet   402.46       555.14       94.6339%          0      374    9         97.65%
Total                                                35     1055   48        95.65%

Cumulative Results w/ Bonferroni

[Figure: bar chart of Agreement (0.00% to 100.00%) by journal: JAMA, NEJM, BMJ, Lancet]

Cumulative Results w/ Bonferroni

Journal  Chi-Sq Stat  Deg Freedom  Chi-Squared Dist  Error  Agree  Disagree  Agree Pct
JAMA     454.08       560.18       85.851823405%     32     283    10        96.59%
NEJM     363.39       494.94       92.672927800%     3      220    3         98.65%
BMJ      295.71       410.13       94.646128358%     0      203    1         99.51%
Lancet   402.46       555.14       94.633930987%     0      378    5         98.69%
Total                                                35     1084   19        98.28%

Future Work

• Utilize a smaller, more standard stop word list (see Su K, et al., “Comparing Frequency of Word Occurrences in Abstracts and Texts Using Two Stop Word Lists” in Fall 2001 AMIA Proceedings).
• Explore “over agreement”.

Future Work (cont.)

• Compare phrases (terms) rather than words.
• Utilize the UMLS to compare Concept Unique Identifiers (CUIs) via MetaMap rather than words or phrases.
– Changes in agreement/disagreement may indicate the use of synonyms which might still negatively affect retrieval.

Conclusion

• In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.
• Our test was “conservative” in the sense that we can only strongly state that a small number of abstract/article pairs do “disagree”. However, the remaining articles can only be said to not conclusively disagree.

Acknowledgements

• This research was supported in part by grant T15-089 LM0708-09 from the National Library of Medicine, United States of America.

Questions

• http://riesj.hmi.missouri.edu
• [email protected]
