Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study
September 4, 2001
James E. Ries, Kuichun Su, Gabriel Peterson, MaryEllen C. Sievert, Timothy B. Patrick, David E. Moxley, Lawrence D. Ries
CECS, HMI, Statistics, and SISLT
Abstract
• Retrieval tests have assumed that the abstract is a true surrogate of the entire text. However, the frequency of terms in abstracts has never been compared to that of the articles they represent. Even though many sources are now available in full-text, many still rely on the abstract for retrieval …
• … In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.
Background
• Many retrieval systems still use abstracts as surrogates for full text.
• Abstracts are often indexed with respect to word occurrence by employing Zipf’s Law.
– The product of a word's occurrence frequency and the rank of that frequency is constant.
– The most frequently and least frequently occurring words contribute little to article content.
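The rank-times-frequency relationship stated above can be illustrated with a short sketch. The function name `rank_frequency_products` and the toy word list are hypothetical, chosen only so the product comes out exactly constant; on real text the product is only roughly constant.

```python
from collections import Counter


def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(words)
    # most_common() sorts by descending frequency; rank 1 is the most frequent word.
    ranked = counts.most_common()
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Toy input (hypothetical): frequencies 6, 3, 2 at ranks 1, 2, 3.
toy_words = ["the"] * 6 + ["patient"] * 3 + ["dose"] * 2
for row in rank_frequency_products(toy_words):
    print(row)
```

In this toy case every product equals 6, which is the idealized behavior the law describes.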
Background (cont.)
• Previous studies have shown that abstracts are sometimes inconsistent with their corresponding articles. However, no study has previously shown that abstracts and articles are inconsistent in a statistical sense.
Methods
• Four medical journals (BMJ, JAMA, Lancet, and NEJM)
– Two different countries
– Many medical subdisciplines
– Regarded as top journals
– Available in electronic format
• Studied all articles published during 1999 that contained an abstract and were 2 pages or longer.
– 1,138 articles, less 35 with parsing problems, left 1,103
Methods (cont.)
• Text of articles and abstracts was downloaded and stored in HTML.
• HTML was parsed into separate abstract and article files via a custom C++ parsing program.
• References and figures were removed.
Methods (cont.)
• “Content-bearing words” were extracted from abstracts and articles.
– Numerical values, special characters, and captions were excluded and used as word delimiters.
• Words contained in a home-grown “stop word list” (words with little or no medical meaning) were removed.
Methods (cont.)
• Remaining words were conflated using NLM’s LVG tools.
– E.g., “reading” -> “read”
• Frequencies of all conflated words were calculated for abstracts and articles.
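The extraction steps described in the Methods slides (split on delimiters, drop stop words, conflate, count) can be sketched as follows. This is a minimal illustration, not the study's code: `STOP_WORDS` and `CONFLATION` are tiny hypothetical stand-ins for the home-grown stop word list and for NLM's LVG tools, and the function name `content_word_counts` is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the study used a home-grown stop word list and
# NLM's LVG tools, neither of which is reproduced here.
STOP_WORDS = {"the", "of", "and", "a", "in", "was", "were", "to"}
CONFLATION = {"reading": "read", "reads": "read", "contraceptives": "contraceptive"}


def content_word_counts(text):
    """Count conflated content-bearing words in a block of text."""
    # Digits and special characters act as word delimiters.
    # Simplification: only ASCII letters are treated as word characters.
    tokens = re.split(r"[^a-z]+", text.lower())
    kept = [t for t in tokens if t and t not in STOP_WORDS]
    # Map each surviving word to its base form; unknown words pass through.
    return Counter(CONFLATION.get(t, t) for t in kept)


print(content_word_counts("Reading the 3 trials: contraceptives and reading."))
```

Running the same function over an abstract and over its article body would give the two frequency tables that the analysis compares.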
Analysis
• Used a chi-squared test to determine whether discrepancies between observed occurrences in the abstract and occurrences in the article were due to sampling or were truly indicative of a difference in content.
Analysis (cont.)
• Example: Rosing (Lancet)
– Abstract contained 140 content-bearing words.
– “contraceptive” appeared 6 times in the abstract and 35 times in the text of the article.
– Since the text contained 1081 content-bearing words, expect 140/1081 * 35 = 3.35 occurrences of this term in the abstract.
Analysis (cont.)
• Example: Rosing (Lancet)
– The actual number of occurrences was 6, so the square of the error divided by the expected count was added to the chi-squared statistic for this particular word (i.e., ((6-3.35)^2)/3.35 = 2.10).
– Every other content-bearing word in the article was compared to the abstract in this way, and the sum of all of these terms was the total chi-squared statistic for the given article.
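The per-word term and the per-article sum described in the Rosing example can be sketched as below. The function names are hypothetical, and the handling of words that appear only in the abstract (they are skipped, since their expected count would be zero) is an assumption the slides do not address. Note that evaluating 140/1081 * 35 directly gives about 4.53 rather than the 3.35 shown, so the sketch accepts the expected count as an input rather than asserting which figure was intended.

```python
def expected_in_abstract(abstract_total, text_total, text_count):
    """Expected abstract occurrences if the abstract mirrors the text's proportions."""
    return abstract_total / text_total * text_count


def chi_squared_term(observed, expected):
    """One word's contribution: squared error divided by the expected count."""
    return (observed - expected) ** 2 / expected


def chi_squared_statistic(abstract_counts, text_counts):
    """Sum the per-word terms over every content-bearing word in the text."""
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    total = 0.0
    # Assumption: only words present in the text are compared.
    for word, text_count in text_counts.items():
        expected = expected_in_abstract(abstract_total, text_total, text_count)
        total += chi_squared_term(abstract_counts.get(word, 0), expected)
    return total


# The slide's term for "contraceptive": observed 6, expected 3.35.
print(round(chi_squared_term(6, 3.35), 2))
```

An abstract whose word proportions exactly match the text yields a statistic of zero; larger values indicate a greater lexical mismatch.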
Analysis (cont.)
• We reran our analysis using the Bonferroni inequality to ensure that our results were not incorrect simply by virtue of our large sample size.
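As a rough illustration of the Bonferroni adjustment, the per-test significance level is the overall level divided by the number of tests. The slides do not state the significance level or the number of comparisons used, so the values below (0.05 and one test per analyzed article) are assumptions, as is the function name `bonferroni_alpha`.

```python
def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


# Assumed values: alpha = 0.05 and one test per analyzed article (1,103).
per_test_alpha = bonferroni_alpha(0.05, 1103)
print(f"{per_test_alpha:.2e}")
```

Under a stricter per-test level, an abstract/article pair needs a larger chi-squared statistic before it is declared to disagree, so fewer pairs are flagged.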
Cumulative Results w/o Bonferroni
[Chart: Agreement; categories JAMA, NEJM, BMJ, Lancet]
Cumulative Results w/o Bonferroni
        Chi-Squared Dist             Error  Agree  Disagree  Agree Pct
JAMA    454.08  560.18  85.8518%     32     270    23        92.15%
NEJM    363.39  494.94  92.6729%     3      214    9         95.96%
BMJ     295.71  410.13  94.6461%     0      197    7         96.57%
Lancet  402.46  555.14  94.6339%     0      374    9         97.65%
Total                                35     1055   48        95.65%
Cumulative Results w/ Bonferroni
[Chart: Agreement; categories JAMA, NEJM, BMJ, Lancet]
Cumulative Results w/ Bonferroni
        Chi-Squared Dist                  Error  Agree  Disagree  Agree Pct
JAMA    454.08  560.18  85.851823405%     32     283    10        96.59%
NEJM    363.39  494.94  92.672927800%     3      220    3         98.65%
BMJ     295.71  410.13  94.646128358%     0      203    1         99.51%
Lancet  402.46  555.14  94.633930987%     0      378    5         98.69%
Total                                     35     1084   19        98.28%
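The agreement percentages in the two results tables follow from the Agree and Disagree counts alone (articles with parsing errors are excluded from the denominator). The short check below recomputes them; the counts are copied from the tables, while the function and variable names are hypothetical.

```python
# (agree, disagree) counts copied from the two results tables.
WITHOUT_BONFERRONI = {"JAMA": (270, 23), "NEJM": (214, 9), "BMJ": (197, 7), "Lancet": (374, 9)}
WITH_BONFERRONI = {"JAMA": (283, 10), "NEJM": (220, 3), "BMJ": (203, 1), "Lancet": (378, 5)}


def agree_pct(agree, disagree):
    """Percentage of successfully parsed articles whose abstract agrees with the text."""
    return 100.0 * agree / (agree + disagree)


def summarize(table):
    """Per-journal percentages plus a pooled 'Total' row, rounded to two decimals."""
    result = {name: round(agree_pct(a, d), 2) for name, (a, d) in table.items()}
    total_agree = sum(a for a, _ in table.values())
    total_disagree = sum(d for _, d in table.values())
    result["Total"] = round(agree_pct(total_agree, total_disagree), 2)
    return result


print(summarize(WITHOUT_BONFERRONI))
print(summarize(WITH_BONFERRONI))
```

The pooled totals come to 95.65% without the Bonferroni adjustment and 98.28% with it, matching the Total rows.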
Future Work
• Utilize a smaller, more standard stop word list (see Su K, et al., “Comparing Frequency of Word Occurrences in Abstracts and Texts Using Two Stop Word Lists” in Fall 2001 AMIA Proceedings).
• Explore “over agreement”.
Future Work (cont.)
• Compare phrases (terms) rather than words.
• Utilize the UMLS to compare Concept Unique Identifiers (CUIs) via MetaMap rather than words or phrases.
– Changes in agreement/disagreement may indicate the use of synonyms which might still negatively affect retrieval.
Conclusion
• In these four journals, the abstracts are lexical, as well as intellectual, surrogates for the documents they represent.
• Our test was “conservative” in the sense that we can only strongly state that a small number of abstract/article pairs do “disagree”. However, the remaining articles can only be said to not conclusively disagree.
Acknowledgements
• This research was supported in part by grant T15-089 LM0708-09 from the National Library of Medicine, United States of America.