
A Literature-Based Approach to Scientific Discovery

Don R. Swanson, © 2003
Professor, Div. Humanities
The University of Chicago

A presentation at UIC Sep 4, 2003

2

Literature-based discovery? -- the very idea.

1. It means deriving, from the public record of science, new solutions to scientific problems.
2. The possibility arises, for example, when two articles considered together for the first time suggest new information of scientific interest not apparent from either article alone. This mode of discovery is the focus of the Arrowsmith project.

3

Undiscovered public knowledge

To speak of “new information of scientific interest” is a reference to the state of the literature, not to a state of mind. What someone may know or not know is a quite different concept from what is in the public record. It is public rather than private knowledge with which we are concerned in our quest for discovery.

5

A problem-oriented approach

Focusing on the professional and research literature of biology and medicine, begin with a specific problem of substantial scientific interest -- such as finding a previously unknown cause, cure, or treatment for a particular disease.

6

The growth and fragmentation of science

In response to overwhelming growth, science spontaneously, and somewhat mysteriously, divides itself into specialties. In this way, the labor of both producing and assimilating its literature is divided into more or less manageable chunks. But the inevitable consequence of this fragmentation is the mutual isolation of the chunks.

7

The growth of science literature can be seen as a whole -- but

Time line

8

-- is more cogently visualized as the growth of specialty literatures

Time line

10

Citation clusters -- specialization and fragmentation of science

Within a specialty, authors cite one another to a much greater extent than they cite authors outside of the specialty.

When specialty literatures become too large to assimilate, they divide into sub-specialties.

Connections and relationships between mutually-isolated specialties may go unnoticed.

11

The Connection Explosion

The number of potential connections between units of specialized literatures grows much faster than the number of units themselves. The number of pairwise connections, for example, increases as the square of the number of units that could be connected -- more accurately, as n(n-1)/2.

Units (n):     2    3    4    5    100     1000
Connections:   1    3    6    10   4950    ~500,000

12

The Connection Explosion in Medline

So, for 10,000,000 Medline records, there would be 50,000 billion possible 2-way connections between individual articles. But if we think of the literature as divided into 10,000 units or “chunks”, then the number of possible 2-way connections between chunks is much less -- only 50 million.
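The arithmetic on these two slides can be checked with a few lines of Python. This is only an illustrative sketch; the function name `pairwise_connections` is invented here and is not part of Arrowsmith.

```python
def pairwise_connections(n: int) -> int:
    """Number of unordered pairs among n units: n(n-1)/2."""
    return n * (n - 1) // 2


# Small cases, plus the two larger cases shown on the slide.
for n in (2, 3, 4, 5, 100, 1000):
    print(n, pairwise_connections(n))

# Medline viewed as individual records versus as "chunks".
articles = pairwise_connections(10_000_000)  # 49,999,995,000,000 (about 50,000 billion)
chunks = pairwise_connections(10_000)        # 49,995,000 (about 50 million)
print(f"{articles:,} pairs of articles")
print(f"{chunks:,} pairs of chunks")
```

Grouping records into chunks reduces the count by a factor of about a million, which is the point of the slide.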

13

What is meant by chunks and specialties

I have introduced the word “chunk” to stress the fact that “specialty” may have other meanings not necessarily intended here. A “chunk” is something within which there is good communication but which tends to have poor communication with other chunks. It is a set of articles that cite one another appropriately, but cite relatively few articles in other chunks.

14

Visualizing literatures as sets of articles -- using Venn diagrams

1. A “literature” or set of articles can be visualized as though in a “container” -- a closed figure encircling a set of points, each point representing a scientific article or a database record.

2. Two sets intersect if they contain some articles in common. Articles within the intersection are members of both sets -- S1 AND S2.

S1 S2

15

Disjoint, non-interactive sets

Disjoint means that sets have no articles in common -- that is, they do not intersect.

If there are no cross-citations -- from one set to the other -- the two sets will be called non-interactive.

16

Disjoint and non-interactive, but nevertheless related, sets.

Two literatures, A and C, even with no records in common and no citations from one to the other, might be related through shared attributes -- such as key words, phrases, index terms, concepts, or authors.

A C

17

If two non-interactive sets were related, who would know it?

Because there are no cross-citations, no one reading one literature will be led to the other -- at least not in the usual way of following chains of citations. Hence if two such literatures have scientifically interesting connections, it is possible that such connections are unintended and unnoticed.

???

18

Science literatures -- constantly changing worlds to be explored.

19

Online searching of bibliographic databases

1. Database searching consists of forming and combining sets of bibliographic records.

2. Each search statement creates a set and displays how many records are in it.

A9000

3. Records can be displayed, thus providing valuable relevance feedback.

S1 [search term - A] -- 9000

20

Online searching -- continued

4. In forming search statements, the three Boolean operators AND, OR, NOT correspond to the intersection, union, and complement of sets.

21

Online searching -- continued

5. If you form two sets, you can find the intersection, and the number of records in it.

S1 term A ---- 9000
S2 term C ---- 3000
S3 S1 AND S2 --- 90

[Venn diagrams: A (9000 records) and C (3000 records) overlapping in 90 records; A and C disjoint; C contained in A]

If S3 = 0, A and C are disjoint.

If S3 = S2, then C is a subset of A.
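The correspondence between Boolean search operators and set operations can be sketched with Python sets. The record IDs below are invented stand-ins for Medline identifiers, and `relation` is a hypothetical helper, not a Medline command.

```python
# Hypothetical record IDs standing in for the results of two search statements.
universe = set(range(1, 21))   # every record in a toy database
s1 = {1, 2, 3, 4, 5, 6}        # S1: records found for term A
s2 = {5, 6, 7, 8}              # S2: records found for term C

s3 = s1 & s2                   # S1 AND S2 -> intersection
either = s1 | s2               # S1 OR S2  -> union
a_not_c = s1 - s2              # S1 NOT S2 -> S1 minus S2
not_a = universe - s1          # complement of S1 within the database


def relation(a: set, c: set) -> str:
    """Describe how two sets of records relate, using the size of A AND C."""
    overlap = a & c
    if not overlap:
        return "disjoint"
    if overlap == c:
        return "C is a subset of A"
    return f"{len(overlap)} records in common"


print(len(s1), len(s2), len(s3))
print(relation(s1, s2))
print(relation({1, 2}, {3}))
print(relation({1, 2, 3}, {2, 3}))
```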

22

In sum, tools for interactive searching

1. Instant display of the sets you formed, showing the size of each set.
2. The 3 Boolean operators, and additional operators such as truncation and proximity.
3. Commands that permit display of any records found -- either the full record or specified fields. (Inspection of how relevant and non-relevant records were indexed can provide invaluable clues for revising and refining a search.)

23

Introduction to Medline Searching -- handout

Part I Algebra of Sets and Venn Diagrams
Part II Rules of the Game -- main types of search commands: find - combine - display
Part III Search Strategy: online searching is the art of forming and combining sets
Part IV PubMed Puzzles: some surprises and some lessons

24

Explaining Literature-Based Discovery by an example

Reprint handout: Migraine and magnesium: eleven neglected connections. Perspect. Biol. Med. 1988 [referenced here as MigMag88]

MigMag88 is a pre-Arrowsmith study. Arrowsmith is not only an aid to LBD, it is a result of LBD, and evolved from the search techniques and strategies that MigMag88 describes. Arguably, the best way to learn how to use Arrowsmith is to begin a process of LBD without it, substituting for it a Medline exploration.

25

LBD without Arrowsmith

In brief, the process described in MigMag88 that led from migraine to magnesium is this:
1. Search title-words in Medline for “migraine”.
2. Examine a few dozen or more records looking for potential intermediate links in the chain of events that might lead to migraine.
3. Then start a new title-word search for these links.
4. Examine the new titles looking for links that might be still earlier in the same chain of events.

26

-- not just an exercise

The Medline search process described is not just an exercise for learning about Arrowsmith; it is also a useful preparation for any Arrowsmith search. It can help you create a better input and so, plausibly, a better output.

27

The two stages of Literature-Based Discovery

Stage 1: Getting from the problem (migraine) to a conjectural solution (magnesium) [hypothesis generation]
Stage 2: Exploring in depth the connections between the two. [hypothesis testing]
In MigMag88, Stage 1 is described in the section “A systematic trial-and-error search strategy”; the rest of the paper is devoted to Stage 2.

28

Assembling other people’s ideas

Stage 2 of MigMag88 resembles a literature review, but its author neither has, nor claims, any expertise in the two subjects under review. The result is virtually a cut and paste of what the real experts have said in print about migraine and magnesium separately.

29

The ABC Model of Complementarity

If one area of literature shows that A is related to B and a different area shows that B is related to C, then bringing together these two areas for the first time may suggest a novel hypothesis that connects A with C, an implicit but not explicit connection.

30

Venn Diagram -- ABC Model

[Venn diagram: sets A, B, and C; A overlaps B in region AB, and B overlaps C in region BC]

AB: Articles about an AB relationship.

BC: Articles about a BC relationship.

AB and BC are complementary but disjoint: they can reveal an implicit relationship between A and C in the absence of any explicit relation.

31

An ABC example based on title words in Medline

AB title: Magnesium-deficient rat as a model of epilepsy. Lab Animal Sci 28:680-5, 1978

BC title: The relation of migraine and epilepsy. Brain 92: 285-300, 1969

[Venn diagram: sets of Medline records. A = magnesium (8011 records), C = migraine (2756 records), B = epilepsy, an unintended link; A and C are disjoint. The two overlaps with B are labeled 22 and 45.]

32

Complementarity

Two sets of articles are defined as complementary if, considered together, they suggest new information (in this case a possible migraine-magnesium connection) not apparent in either set taken alone. Complementarity does not necessarily imply logical transitivity, but rather is used in the looser sense of suggestibility.

33

Introducing Arrowsmith

Arrowsmith is software that finds words, phrases, subject headings, authors, and other attributes common to two downloaded sets of database records -- the purpose of which is to help the user see new connections within the scientific literature that lead to novel plausible hypotheses that are worth testing. Medline cannot do this.

34

What Arrowsmith can do that Medline can’t

Arrowsmith finds all “interesting” words, phrases, or subject headings (B-list terms) common to two sets of records (A, C).

[Diagram: A = magnesium and C = migraine, linked through B-list terms Bi (i = 1, 2, ...): calcium blocker, epilepsy, vascular reactivity, platelet aggregation]

35

Arrowsmith extends power of Boolean “AND”, whether or not A,C are disjoint

[Venn diagram: A, C, and the linking sets Bi]

Arrowsmith finds A AND Bi as well as Bi AND C for all Bi; i = 1, 2, 3, ...

Medline finds A AND C but not an unknown Bi.

(Bi within the AC intersection are presumed to be known.)

36

Filtering the B-list

A large (8000-word) pre-compiled stoplist (words to be excluded) is built into Arrowsmith and applied automatically.

The user may delete entries from the B-list. Terms that remain are “interesting”.

B-list editing by the user is optional; terms with rank-0 are now automatically removed.

Ranking is based on subject headings.
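A toy version of forming and filtering a B-list might look like the sketch below. It handles single title words only (Arrowsmith also handles phrases and subject headings), and the eight-word `STOPLIST` is an invented stand-in for the 8000-word list; the function names are assumptions made for illustration.

```python
import re

# Tiny invented stand-in for Arrowsmith's large pre-compiled stoplist.
STOPLIST = {"a", "as", "of", "the", "and", "in", "model", "relation"}


def title_terms(titles):
    """Collect the lowercase words that occur in a list of titles."""
    terms = set()
    for title in titles:
        terms.update(re.findall(r"[a-z]+", title.lower()))
    return terms


def b_list(a_titles, c_titles, stoplist=STOPLIST):
    """Words common to the A titles and the C titles, minus stoplisted words."""
    common = title_terms(a_titles) & title_terms(c_titles)
    return sorted(common - stoplist)


# The two titles quoted earlier in the talk, one from each literature.
a_titles = ["Magnesium-deficient rat as a model of epilepsy."]
c_titles = ["The relation of migraine and epilepsy."]
print(b_list(a_titles, c_titles))  # ['epilepsy']
```

Without the stoplist the word "of" would also survive, which is why filtering matters even in this two-title case.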

37

A,C input; B-list + titles as output

The first output is the B-list. For each term on the B-list, all titles (from files A, C) containing that term are brought together and displayed.

The title display is vital, for it provides the contexts in which the B-term occurred and may suggest a complementary relationship.

38

Using the output

The purpose of the output is to create suggestive juxtapositions of titles.

For each Bi-term, the ABi and BiC title-displays (+ abstracts & full text) may help the user construct a plausible testable hypothesis that connects A with C.

If ABi and BiC are disjoint, the hypothesis may be novel.
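The juxtaposition of titles for one B-term can be sketched as follows. The record IDs and the second title in each file are invented for illustration, and the matching is a plain substring test rather than whatever Arrowsmith actually does.

```python
def juxtapose(b_term, file_a, file_c):
    """Gather the A titles and C titles that contain b_term.

    file_a and file_c map record IDs to titles. The third value returned
    says whether the two groups share no records (ABi and BiC disjoint).
    """
    ab = {rid: t for rid, t in file_a.items() if b_term in t.lower()}
    bc = {rid: t for rid, t in file_c.items() if b_term in t.lower()}
    return ab, bc, ab.keys().isdisjoint(bc.keys())


# Hypothetical IDs; titles a2 and c2 are made up, a1 and c1 are from the talk.
file_a = {
    "a1": "Magnesium-deficient rat as a model of epilepsy.",
    "a2": "Magnesium and vascular reactivity (invented title)",
}
file_c = {
    "c1": "The relation of migraine and epilepsy.",
    "c2": "Platelet aggregation in migraine (invented title)",
}

ab, bc, disjoint = juxtapose("epilepsy", file_a, file_c)
print("AB titles:", list(ab.values()))
print("BC titles:", list(bc.values()))
print("disjoint:", disjoint)
```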

39

LBD goal is a testable hypothesis

Assuming that a plausible, novel, testable hypothesis has been developed, the next goal then is to stimulate a clinical or laboratory test of it, or simply stimulate more research. One can ask, did MigMag88 stimulate more research?

40

What Arrowsmith can do that Medline can’t: a 2nd example

Arrowsmith finds all “interesting” words, phrases, or subject headings (the B-list) common to two sets of records (A, C).

[Diagram: A = virulence (exp viruses) and C = stability (exp viruses), linked through the B-list: Hantaan, Marburg, Aphthovirus, Semliki Forest, Lassa]

41

ABC model -- a new interpretation

The five B-viruses have each been investigated in the context of both A and C (virulence and stability). This fact may be of interest because A and C together have more implications for the threat of viruses as weapons of warfare or terrorism than either set taken alone. (Ref: JASIST August 2001 p. 797-812)

42

Arrowsmith and Term extraction

Because Arrowsmith begins by extracting terms from downloaded Medline records, the next 2 slides illustrate the Medline record and the extraction process --

43

Fields and “Terms” in a Medline Record

UI 89317153
AU LeDuc
IN U.S. Army Medical Research Institute..
TI Epidemiology of hemorrhagic fever ….
SO Reviews of Infectious Diseases 1989…
MH Arenaviridae Infections/ep
MH Ebola Virus
MH Flavivirus
MH Marburg Virus Disease/ep
AB Twelve distinct viruses associated with hemorrhagic fever in humans….

1st 2 letters at left mark field.

“Terms” are subject headings (MH) or words and phrases from the title (TI) or abstract (AB) fields.
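Reading such a record into fields can be sketched in a few lines. The record below is abridged from the slide, with the elided portions simply left out. The layout assumed here (a two-letter tag, a space, then the value) follows the slide; real Medline downloads vary by vendor, so `parse_record` is an assumption-laden illustration rather than Arrowsmith's parser.

```python
from collections import defaultdict

# Abridged from the slide; text that the slide elides is omitted, not filled in.
RECORD = """\
UI 89317153
AU LeDuc
TI Epidemiology of hemorrhagic fever
MH Arenaviridae Infections/ep
MH Ebola Virus
MH Flavivirus
MH Marburg Virus Disease/ep
AB Twelve distinct viruses associated with hemorrhagic fever in humans
"""


def parse_record(text):
    """Group the lines of one record by their leading field tag."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        fields[tag].append(value.strip())
    return dict(fields)


record = parse_record(RECORD)

# Drop the "/ep" style qualifier to keep only the main heading.
subject_headings = [mh.split("/")[0] for mh in record["MH"]]
print(record["TI"][0])
print(subject_headings)
```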

45

An integrated picture of a Medline and Arrowsmith search

[Diagram: Medline -- sets of records A and C]

First use Medline to create two disjoint sets A, C.

46

Arrowsmith forms sets of terms extracted from Medline records.

[Diagram: Medline holds the sets of records A and C; Arrowsmith holds the sets of extracted terms (terms from A, terms from C), whose intersection is the B-list, together with the B-list records]

Terms are extracted from records to form “2nd order sets”; the intersection (called the “B-list”) is the first Arrowsmith output.

47

But suppose the two sets are not disjoint -- i.e. they intersect?

[Diagram: Medline sets of records A and C, overlapping]

All title words and phrases within the AC intersection will be on the B-list because Arrowsmith in that case will match two identical sets of titles, AB and BC. If the intersection is large, then it may dominate the B-list. A conventional Medline search will provide all AC titles, so Arrowsmith is not needed for this purpose.

48

Visualizing how the intersection of A&C may dominate the B-list

[Diagram: Medline sets of records A and C, overlapping; Arrowsmith sets of extracted terms, with the direct B-list shown inside the B-list]

A “direct” (A AND C) Medline search yields a subset of the B-list (“direct B-list”), which presumably is known.

49

Arrowsmith removes direct-AC

Arrowsmith is designed on the assumption that the direct-AC articles are already well known and separately explored by the user. Accordingly, all direct-AC records are removed from file A at the outset, to prevent their unnecessary contribution to the B-list.

50

Implications of A-C intersection: first, understand what it contains!

[Venn diagrams of A and C]

Disjoint: Best opportunity for finding previously unknown connections.

Small overlap: May be as good, or better.

Large overlap: Not promising for novelty. Use conventional Medline search for A.C. However, try to find subsets of A, C which are disjoint, then apply Arrowsmith.

51

Preparation of A,C input files: Arrowsmith search strategies 1

Refer to handout on Medline Searching for discussion of recall, precision, and search strategies. High recall means you are trying to get everything, which inevitably brings with it a lot of junk, whereas high precision means you settle for less because you want everything you do get to be very relevant. Arrowsmith requires a high-recall search for finding the direct-AC literature, but requires high precision for the two A,C input files in order to minimize junk in the B-list.
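For readers who want the two measures in concrete form, here is a small sketch using the standard definitions: precision is the fraction of retrieved records that are relevant, and recall is the fraction of relevant records that are retrieved. The record IDs are invented.

```python
def precision_recall(retrieved: set, relevant: set):
    """Return (precision, recall) for one search result."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = set(range(1, 11))   # 10 relevant records (hypothetical)
broad = set(range(1, 41))      # broad search: 40 records, all 10 relevant ones included
narrow = {1, 2, 3, 4}          # narrow search: 4 records, all relevant

print(precision_recall(broad, relevant))   # (0.25, 1.0): high recall, much junk
print(precision_recall(narrow, relevant))  # (1.0, 0.4): high precision, less found
```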

52

Preparation of A,C input files: Arrowsmith search strategies 2

The previous discussion of LBD without Arrowsmith has new cogency, for it is the same exploratory process you should now follow in preparing the A,C input files. The reprint handout MigMag88 provides an example of that process.

53

Preparation of A,C input files: Arrowsmith search strategies 3

A focus on searching title words and phrases may be important, for three reasons:
1. -- it tends to improve precision
2. -- it improves the odds that B-terms are meaningfully linked to their corresponding A and C terms, because they are closer to them in titles than in abstracts.
3. -- complementarity is easier to recognize and the amount of text to be examined is much less than in scanning through complete records.

54

Preparation of A,C input files: Arrowsmith search strategies 4

Precision may be further improved by searching subject headings in conjunction with title words -- that is, forming an intersection, as follows:

migraine[TI] AND migraine[MH]

Forming similar intersections with other subject headings is also an option, and may be used as a means of improving the rank of B-list terms.

55

Preparation of A,C input files: Arrowsmith search strategies 5

Date of Publication time limits may be a powerful precision tool. If you know that connections before a certain date cannot be relevant, then search only the later literature. Use caution and apply separately to A and C literatures because you may want new ideas in A to connect to old C-literature or vice versa.

56

Preparation of A,C input files: Arrowsmith search strategies 6

Search strategy determines, among other things, the size of the input files A, C. Size matters! A presupposition that underlies connecting mutually isolated or noninteractive literatures is that, within themselves, each of the separate literatures is highly interactive or highly inter-communicative -- that is, articles within File A cite each other extensively and similarly for File C.

57

Preparation of A,C input files: Arrowsmith search strategies 7

How large is too large? That is difficult to answer; we need more data on the typical size of citation clusters. But experience with excessively long Arrowsmith B-lists suggests that the optimal size may be in the range of 100 to 5000 articles.

58

Preparation of A,C input files: Arrowsmith search strategies 8

If Precision is too high, then Recall may be too low, and good terms might be lost from the B-list. (A, C are too small.)
If Recall is too high, then Precision may be too low, and good terms will be buried. (A, C are too large.)
Best compromise is to start with high Precision and gradually increase Recall in repeated searches.

59

UMLS vs MeSH: Meaning vs Use

The meaning of a word is often imprecise; what counts is how words are used. Hence it is always important to examine the output of a search. UMLS and MeSH are addressed to different problems. MeSH is designed to index the medical literature -- assign terms to articles -- and so is primarily about use. UMLS is a massive thesaurus-like compilation, primarily about meaning.

60

Preparation of A,C input files: The sublanguage effect.

Sublanguages have been investigated by linguists since the 1960s. They entail restricted lexicons and grammatical operations. A set of all articles with, say, “migraine” in the title may create a restricted sublanguage and so a more cohesive literature -- quite possibly a literature with a strong internal communication pattern. Arrowsmith probably works best when it seeks connections across cohesive clusters. This idea may give further support to title searching.

61

Arrowsmith - U of Chicago and Arrowsmith-UIC

Arrowsmith originated on the UC-website and was installed at UIC in mid-2001. UIC and Marc Weeber then developed an ingenious and more user-friendly interface that imported and incorporated PubMed and the Medline database, for “one-stop shopping”. Semantic filtering using the UMLS was also installed at UIC. Both sites continue to be available, but there are some differences. The presentation up to this point applies to either system. The next slide outlines how the University of Chicago site differs from UIC.

62

Arrowsmith at The University of Chicago: http://kiwi.uchicago.edu

1. Continues to be developed and maintained by Swanson.
2. Medline searching is independent; downloaded files are transmitted as input to Arrowsmith-on-kiwi --
3. -- which can accept input files from most versions of Medline, including particularly PubMed and Ovid.
4. It integrates hypothesis-generation (pseudo 1-node) with hypothesis-testing (2-node).
5. For a 2-node search, the B-list is automatically ranked using subject headings in a process to be described next.
6. For the “1-node” search, the A-list is ranked.

63

A method for evaluating a B-list

The work outlined briefly here is covered in some detail in Progress Reports 2 and 3, submitted to UIC by Swanson, which are available on request. Central to the idea of ranking a B-list is some way to evaluate that ranking. The MigMag88 paper is taken as a model testbed for Arrowsmith output. It has the principal advantage that a fairly large number of apparently valid connections were found, as evidenced by the arguments in MigMag88.

64

Using Medical Subject Headings (MH) to rank the B-list

The input files, in MEDLINE format (also called the FIELDTAG format), contain subject headings with an MH tag in the leftmost field. All MH fields are extracted from both Files A and C, and terms common to the two files are identified and filtered through a MeSH stoplist. These are called MHB terms. The original input is then converted to abbreviated records that contain only the identifier, title, abstract, authors, source journal, and MHB terms, for further processing.

65

Automatic ranking of the B-list

After the title B-list has been created, all MHB terms common to the corresponding A and C records for each title B-term are highlighted in blue. A rank number is assigned to each title B-term based on the number of blue-highlighted MHB terms with which it is associated. The effectiveness of the ranking was tested on previously run problems for which the most valuable B-links were already known, as in MigMag88.
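One plausible reading of this ranking step is sketched below: for a given title B-term, collect the subject headings of the A records and of the C records whose titles contain it, and count the MHB terms found on both sides. The titles and heading assignments are invented for illustration, and `rank_b_term` is a guess at the logic, not Arrowsmith's actual code.

```python
def rank_b_term(b_term, file_a, file_c, mhb_terms):
    """Count MHB terms shared by the A and C records whose titles contain b_term.

    file_a and file_c are lists of (title, set_of_subject_headings) pairs.
    Returns the rank number and the shared headings.
    """
    def headings(records):
        found = set()
        for title, mh in records:
            if b_term in title.lower():
                found |= mh
        return found

    shared = headings(file_a) & headings(file_c) & mhb_terms
    return len(shared), sorted(shared)


# Invented records: titles and headings are made up, not actual Medline indexing.
file_a = [
    ("magnesium and epilepsy in rats", {"Epilepsy", "Magnesium Deficiency", "Rats"}),
    ("magnesium and platelet function", {"Blood Platelets", "Magnesium"}),
]
file_c = [
    ("migraine and epilepsy", {"Epilepsy", "Migraine"}),
    ("platelet changes in migraine", {"Blood Platelets", "Migraine"}),
]
mhb_terms = {"Epilepsy", "Blood Platelets"}  # headings common to both files

print(rank_b_term("epilepsy", file_a, file_c, mhb_terms))  # (1, ['Epilepsy'])
print(rank_b_term("rats", file_a, file_c, mhb_terms))      # (0, [])
```

A term such as "rats" that appears in only one file's titles gets rank 0 under this reading, matching the idea that rank-0 terms are dropped.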

66

Ranking results for MigMag88

Rank #   Total B   Target B   Precis%
>=3      43        16         37
2        36        6          13
1        135       15         11
0        131       9          7

Chi-squared test
Rank #   T     O    E
>0       214   37   28.5
0        131   9    17.5
chi2=12.6 p<.005

Precision tends to increase as the rank # increases.
For 0 rank vs. all higher, results strongly significant.
Two other studies showed similar trend for precision.
Raynaud/EPA study n.s. because numbers too small.
Arg/SmC for rank 0 vs all higher, significant p<.005
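The expected counts (E) in the test above can be recovered from the totals (T) and observed target counts (O) by assuming the same overall target rate in both rank groups. The sketch below does only that; it stops at E and does not attempt the chi-squared statistic itself, whose exact form the slide does not spell out.

```python
# Values copied from the chi-squared table: total B-terms (T) and target B-terms (O).
totals = {">0": 214, "0": 131}
observed = {">0": 37, "0": 9}

# Overall rate of target terms if rank made no difference: 46 / 345.
overall_rate = sum(observed.values()) / sum(totals.values())

# Expected target counts under that assumption, and the observed precision per group.
expected = {group: totals[group] * overall_rate for group in totals}
precision = {group: observed[group] / totals[group] for group in totals}

for group in totals:
    print(
        f"rank {group}: T={totals[group]} O={observed[group]} "
        f"E={expected[group]:.1f} precision={100 * precision[group]:.1f}%"
    )
```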

67

Two Kinds of Arrowsmith Searching

1. Hypothesis generation. [pseudo 1-node] Output is list of A-candidates, ranked by B-terms.
2. Hypothesis testing. [2-node] Output is B-list and associated titles in AB, BC.

[Diagram: A and C linked through B]

Thus far, the discussion has been about hypothesis testing.

68

Arrowsmith finds words and phrases common to titles (AA, C) -- the BB-list.

[Diagram: AA (dietary deficits, toxins, drugs) linked to C through Bi, i = 1, 2, ...]

Hypothesis generation

69

Explanation of AA notation and development of ranked A-list

AA denotes a broad category within which more specific terms, A, will be sought.

Arrowsmith will decompose the titles in the AABB intersection into component words and phrases called the A-list and will then rank the A-list according to the number of B-terms with which each A-term is associated.
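The A-list ranking can be sketched on single words as follows. The titles, BB-terms, and stoplist are all invented, and the function `ranked_a_list` is one hypothetical way to count B-term associations, not the Arrowsmith implementation (which also handles phrases).

```python
import re
from collections import defaultdict


def ranked_a_list(aa_titles, bb_terms, stoplist=frozenset()):
    """Rank candidate A-terms by how many distinct BB-terms share a title with them."""
    links = defaultdict(set)
    for title in aa_titles:
        words = set(re.findall(r"[a-z]+", title.lower()))
        b_here = words & bb_terms
        if not b_here:
            continue  # title lies outside the AA-BB intersection
        for word in words - bb_terms - stoplist:
            links[word] |= b_here
    # Highest count first; ties broken alphabetically.
    return sorted(((a, len(bs)) for a, bs in links.items()),
                  key=lambda pair: (-pair[1], pair[0]))


# Invented AA titles and BB-terms, for illustration only.
aa_titles = [
    "zinc and platelet aggregation",
    "zinc in epilepsy",
    "copper and epilepsy",
]
bb_terms = {"platelet", "epilepsy"}
stoplist = {"and", "in", "aggregation"}

print(ranked_a_list(aa_titles, bb_terms, stoplist))  # [('zinc', 2), ('copper', 1)]
```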

70

Venn diagram for ranked A-list

[Venn diagram: C, BB, and AA, with individual terms such as A2 and A14 marked inside AA]

Ai are ranked by no. BB.

Choose Ai, then re-do as 2-node.

71

Selecting from A-list initiates hypothesis testing

Users select term from A-list for hypothesis-testing. A new B-list is generated as a subset of the BB list applicable specifically to A. AB and BC titles can be explored online -- user clicks on B-term to see titles in which the term is used, in both A and C. File A is a subset of AA, and so these results are more restrictive -- and perhaps of higher precision -- than a new 2-node A-C search.

72

Human & Machine functions in hypothesis generation mode

User selects problem C and conjectural set AA; conducts Medline search.
Arrowsmith produces BB-list from AA, C.
Arrowsmith removes rank-0 from BB-list.
Arrowsmith produces ranked and grouped A-list.
User optionally may edit A-list and form groups, then let Arrowsmith try again.

73

The Arrowsmith low-frequency list

Arrowsmith creates word-frequency lists for both the A and C literatures. Low-frequency words may reveal earliest indications of the novel relationship that is sought. Thus Arrowsmith can call attention to a discovery already made, but perhaps not widely known.
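A word-frequency list and its low-frequency tail can be sketched with `collections.Counter`. The titles are invented, and the cutoff `max_count=1` is an arbitrary assumption; the talk does not say what threshold Arrowsmith uses.

```python
import re
from collections import Counter


def word_frequencies(titles):
    """Count how often each lowercase word occurs across a list of titles."""
    return Counter(w for t in titles for w in re.findall(r"[a-z]+", t.lower()))


def low_frequency(titles, max_count=1):
    """Words occurring at most max_count times, in alphabetical order."""
    freq = word_frequencies(titles)
    return sorted(w for w, n in freq.items() if n <= max_count)


# Invented titles standing in for one literature.
titles = [
    "migraine and serotonin",
    "migraine and platelets",
    "magnesium in migraine",
]

print(word_frequencies(titles).most_common(2))  # [('migraine', 3), ('and', 2)]
print(low_frequency(titles))  # ['in', 'magnesium', 'platelets', 'serotonin']
```

In practice a stoplist would remove words like "in" before the rare words are inspected.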

74

Arrowsmith as a guide to the literature

By processing downloaded database records, Arrowsmith can help the user decide what to read, and by so doing can stimulate new medical hypotheses --

-- the plausibility of which can be assessed by reading the literature.

Finally, the hypotheses can be tested only through clinical or laboratory investigation.

75

Published studies of CBD Literatures

1986 Dietary fish oil -- Raynaud’s Disease
1988 Magnesium deficiency -- Migraine
1990 Arginine -- Somatomedin C
1994 Mg deficiency -- Neurologic Disease
1996 Indomethacin -- Alzheimer’s Disease
1996 Estrogen -- Alzheimer’s Disease
1998 Phospholipase A2 -- Schizophrenia
2001 Viruses as potential weapons

76

--continued

2001 Genetic packaging technologies --potential for virus warfare. (Smalheiser)

2001 Five potentially new therapeutic applications of thalidomide. (Weeber: Ch 6)

77

Purpose of publishing a studyof complementary disjoint literatures

Place in refereed biomedical journal.

Purpose is to present a convincing argument that the literature-derived hypothesis (A--C via B) is novel, plausible, and worth testing.

3 measures of success: acceptance for publication, stimulation of a test, and corroboration of hypothesis.

More details of each of the 8 studies follow:

78

Fish oil, Raynaud’s Syndrome, and undiscovered public knowledge

Perspect. Biol. & Med. 30(1): 7-18, 1986
Reference sources: 25 fish oil, 34 Raynaud
Connections: blood viscosity, platelet function, vascular reactivity, red-cell deformability, prostaglandins, serotonin, thromboxane

A pre-Arrowsmith study. Arrowsmith applied later to 353 fish oil titles and 585 Raynaud titles yielded B-list of 31 terms that included all of the 7 above.

79

Fish-oil / Raynaud literatures - continued

The two literatures taken together, but not separately, suggested a novel medical hypothesis: -- that dietary fish-oil may be beneficial for (at least some) Raynaud patients.

Corroborated in a controlled clinical trial 2 years later: B.B. Chang et al., Surgical Forum 39: 324-326, 1988

80

Migraine and Magnesium: eleven neglected connections

Perspect. Biol. & Med. 31(4): 526-557, 1988
Reference sources: 63 magnesium, 65 migraine
Connections: Type A pers., vascular reactivity, calcium blockers, platelet activity, spreading depression, epilepsy, serotonin, inflammation, prostaglandins, substance P, brain hypoxia

Arrowsmith applied later to 8011 magnesium titles and 2756 migraine titles yielded a B-list of 103 terms that included 9 of the above 11.

81

Migraine and Mg -- cont.hypothesis and corroborations

Hypothesis: Mg deficiency may be implicated in migraine. About 4 articles published during 20-year period before 1988.

Between 1989 and 1997, more than 12 different groups of medical researchers reported a systemic or local magnesium deficiency in migraine or a favorable response (in 2-6 mo.) of migraine patients to dietary supplementation with magnesium. 1 report of negative results.

82

Migraine AND Magnesium -- before and after 1988

[Bar chart: number of articles per year, with years from 1970 to 2002 on the horizontal axis and counts from 1 to 11 on the vertical axis]

As of mid-03, total number of articles in medical literature indexed with both terms is about 60.

MigMag88, a stimulus?

83

The natural history of intersections

The preceding slide suggests that a similar time-line plot might be of value for any relatively small intersection -- to determine primarily if there is a key year before which articles were few in number and scattered and after which a substantial increase took place. Such a pattern could represent a new and relatively unknown scientific discovery -- literature based or not.
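A time-line of this kind amounts to counting the records in an intersection by publication year and comparing the totals on either side of a candidate key year. The sketch below uses invented years, not the actual migraine-magnesium counts, and the helper name `before_and_after` is an assumption.

```python
from collections import Counter


def before_and_after(years, key_year):
    """Count records per year, plus totals before and from key_year onward."""
    counts = Counter(years)
    before = sum(n for y, n in counts.items() if y < key_year)
    after = sum(n for y, n in counts.items() if y >= key_year)
    return counts, before, after


# Invented publication years for the records in a small intersection.
years = [1975, 1981, 1985] + [1989] * 4 + [1990] * 4 + [1991] * 4

counts, before, after = before_and_after(years, key_year=1988)

# Crude text version of the time-line plot.
for year in sorted(counts):
    print(year, "#" * counts[year])
print("before 1988:", before, "| 1988 onward:", after)
```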

84

Somatomedin C and arginine: implicit connections between mutually isolated literatures

Perspect. Biol. & Med. 33(2): 157-186, 1990
Ref sources: 85 somatomedin C, 51 arginine
Connections: growth hormone, malnutrition, acromegaly, protein synthesis, lean body mass anabolic effects, immune function, wound healing.

Arrowsmith applied later to 3244 arginine titles and 1162 SmC titles yielded B-list of 160 terms, that included 4 of above 7.

85

Somatomedin C and arginine --continued

Hypothesis: Anabolic effects of arginine are accompanied by and possibly due to systemic or local release of SmC.

Tested 3, 5, 8 years later: 1 neg., 3 pos.
Corpas, Endocrine Revs 14: 20-39, 1993 (-)
Kirk et al, Surgery 114: 155-160, 1993 (+)
Hurson et al, JPEN: 227-230, 1995 (+)
Chevalley et al, Bone 23(2):103-9, 1998 (+)

86

Assessing a gap in the biomedical literature: magnesium deficiency and neurologic disease.

Authors: Smalheiser and Swanson
Neuroscience Res Comm. 15(1):1-8, 1994

A: Graded dietary manipulation of Mg.
C: Neurologic diseases.
B: NMDA-receptor-mediated excitotoxicity

[Diagram: A, B, and C, with Mg++]

87

Indomethacin and Alzheimer’s disease

Authors: Smalheiser and Swanson
Neurology 46: 583, 1996.

A: 5008 titles with “indomethacin”
C: 7002 titles with “Alzheimer”

A,C not disjoint; A may be protective in C
B: 103 terms in edited B-list; connections include fluidity, killer, muscarinic, peroxidation, TRH, acetylcholine.

Latter is unexpected, potentially adverse

88

Linking estrogen to Alzheimer’s disease: An informatics approach

Authors: Smalheiser and Swanson
Neurology 47: 809-810, 1996

A: 16300 titles with “(o)estrogen()”
C: 8200 titles with “Alzheimer”

A,C not disjoint; 70 articles on both
Edited B-list: 194 terms.
Antioxidant activity of estrogen merits attention; free radicals implicated in AD.

89

Calcium-independent phospholipase-A2 and schizophrenia

Authors: Smalheiser and Swanson
Archives of General Psychiatry 55: 752-3, 1998

A: 54 titles with “calcium-independent phospholipase A2”
C: 21,000 titles with “schizophrenia”
A AND C: Ross reports that A elevated in C
B-list: 38 terms, including vitamin E
Oxidative stress from vit E/Se deficiency increases Ca-iPLA2 in lung, liver of rats (Kuo).

90

Ca-iPLA2 in schizophrenia -- continued

Chronic oxidative stress may occur in schizophrenia.

Proposed hypothesis: Rats treated as in Kuo may have elevated Ca-iPLA2 in serum when assayed as in Ross.

If confirmed, this would provide animal model for studying mechanisms and consequences of elevated Ca-iPLA2.

91

Current R&D at UChicago

1. The ranked A-list with associated B-terms offers the possibility of revealing hidden relationships among A-terms based on B-terms in common.
2. Substitution of essentially random sets of articles for either File A or File C or both is a mode of investigation that appears to be fruitful. Investigation of the role of sublanguages is a related area.

92

Logical inconsistency in (static) probabilistic approach

1. Lexical statistics works as follows: find words that co-occur with “migraine” significantly more often than one would expect by chance. These are taken as interesting “bridge” terms.
2. Thus words that co-occur very few times are discarded as not interesting. Clearly, “magnesium” co-occurs rarely with “migraine”. But “magnesium” is known to be the prime target word, hence most interesting of all. So which are interesting -- abnormally frequent, or abnormally rare words?
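The inconsistency can be made concrete with a rough calculation of chance co-occurrence under independence. The set sizes are the title counts quoted earlier in the talk (8011 magnesium, 2756 migraine), and the database size is the round 10,000,000 used on the Medline slide. Combining them this way, and taking the observed overlap as zero because the two sets were described as disjoint, are simplifying assumptions made only for illustration.

```python
def chance_cooccurrence(n_total, n_a, n_c, n_both):
    """Expected overlap if A and C were independent, and observed/expected ratio."""
    expected = n_a * n_c / n_total
    ratio = n_both / expected if expected else float("nan")
    return expected, ratio


expected, ratio = chance_cooccurrence(
    n_total=10_000_000,  # round Medline size used earlier in the talk
    n_a=8011,            # "magnesium" titles
    n_c=2756,            # "migraine" titles
    n_both=0,            # the two title sets were described as disjoint
)

print(f"expected by chance: {expected:.2f} records")
print(f"observed / expected: {ratio:.2f}")
```

A filter that keeps only words co-occurring more often than chance would drop "magnesium" here, even though it is the target of the search.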

93

Back to the future -- worlds in collision

94

Sublanguages and changing distributions

Each of the worlds to be explored is a cluster of papers that intercommunicate and develop their own sublanguage. As worlds collide, sublanguages invade each other and bring about changing frequency distributions.

Acknowledgment of Support
1 R01 LM07292-01
A collaborative grant: Univ of Chicago and Univ of Illinois-Chicago
Arrowsmith Data Mining Techniques in neuro-informatics.
Co-sponsored by NLM and NIMH
6/15/01 - 5/31/06