grandrounds2004.ppt
![Page 1: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/1.jpg)
Mining Medical Mountains: How Bioinformatics Can Help
Medical Science
David Wishart
University of Alberta
![Page 2: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/2.jpg)
The Library of Congress
• 120 million items in storage
• 54 million manuscripts
• 18 million books
• 12 million photographs
• 4.5 million maps
• 4.4 million technical reports
• 1.1 million PhD dissertations
• ~20 terabytes of data
![Page 3: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/3.jpg)
Some Numbers…
• 3 scientific journals in 1750
• 120,000 scientific journals today
• 500,000 medical articles/year
• 4,000,000 scientific articles/year
• 14,000,000 abstracts in PubMed derived from 4600 journals
• 3,307,998,701 web pages on Google
• 500,000,000,000,000 bytes on the Web
![Page 4: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/4.jpg)
Some Numbers…
• A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer.
• Baasiri, R.A., Glasser, S.R., Steffen, D.L. & Wheeler, D.A. Oncogene 18, 7958-7965 (1999)
![Page 5: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/5.jpg)
Some Graphs:
![Page 6: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/6.jpg)
Multiplexed CE with Fluorescent detection
ABI 3700 96x700 bases
![Page 7: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/7.jpg)
Genomes
• 5 vertebrates (human, mouse, rat, fugu)
• 2 plants (Arabidopsis, rice)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (Plasmodium, Guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 140 bacteria and archaebacteria
• 1000+ viruses
![Page 8: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/8.jpg)
The Human Genome
• 3.2 billion bases on 24 chromosomes
• 3,201,762,515 bases sequenced (99%)
• 23,531 - 31,609 genes (predicted)
• 50,000+ named genes (synonyms)
• 4000+ human diseases
• 850-1039 disease-causing genes (identified)
![Page 9: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/9.jpg)
A Tidal Wave of Data
Made worse by….
![Page 10: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/10.jpg)
The Language of Biology
• The EGF receptor binds epidermal growth factor which triggers the phosphorylation of PLC-gamma followed by the binding and subsequent phosphorylation of Grb2 and SOS which leads to the formation of a Raf1-MEK complex which, in turn, leads to a p21ras auto-phosphorylation cascade. The complex then phosphorylates a MAP kinase which is transported to the nucleus via a nuclear transport signal which triggers the transcription of c-Fos, c-Myc and c-Jun which upon release in the rough ER are transported to…
![Page 11: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/11.jpg)
How To Make Sense of This?
• How to acquire biological or medical knowledge from English text?
• How to build facts and relationships from scientific/medical articles?
• How to put 100+ years of useful data into readily accessible electronic repositories (the back fill problem)?
![Page 12: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/12.jpg)
Some Solutions
• Text Mining…
• Create electronic repositories of abstracts and articles (PubMed/Entrez)
• Create glossaries & thesauri of terms
• Employ machine learning methods to parse electronic text to extract or interpret key pieces of “atomic” information (SVM, Naïve Bayes, Reference Point Logistics, etc.)
![Page 13: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/13.jpg)
PubMed
http://www.ncbi.nlm.nih.gov/PubMed/
![Page 14: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/14.jpg)
PubMed
• Allows users to search by journal, keywords, titles, etc.
• Uses MeSH (Medical Subject Headings) to allow automated search of synonyms (renal transplant = kidney transplantation)
• API available to query PubMed automatically and remotely
• Few users know how to use PubMed properly or to its full extent
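The remote-query bullet above can be sketched in a few lines. The slide does not name the API, so this example assumes NCBI's E-utilities `esearch` endpoint; the function name `build_pubmed_search_url` and its defaults are illustrative only. The sketch builds the request URL and deliberately stops short of sending it.

```python
from urllib.parse import urlencode

# Assumption: the "API" on the slide is taken to be NCBI's E-utilities;
# this is the esearch endpoint, which returns IDs of matching records.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20):
    """Return an esearch URL asking PubMed for record IDs that match `term`."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # same query syntax as the web search box
        "retmax": max_results,   # cap on the number of IDs returned
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pubmed_search_url('"ouellette bf"[au] AND yeast', max_results=5)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return the matching PubMed IDs; NCBI's usage guidelines on request rates apply to any automated client.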
![Page 15: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/15.jpg)
"ouellette bf"[au] AND yeast
Details
![Page 16: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/16.jpg)
![Page 17: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/17.jpg)
MeSH: Medical Subject Heading
("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR "saccharomyces cerevisiae"[MeSH Terms]) OR yeast[Text Word]))
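The expansion shown above, where the plain word "yeast" becomes a set of MeSH-term clauses plus the original text word, can be mimicked with a small lookup. This is only a sketch: the one-entry table `MESH_SYNONYMS` and the function `expand_term` are hypothetical stand-ins, and PubMed's real mapping is far larger and maintained by NLM.

```python
# Hypothetical one-entry synonym table, just enough to reproduce the slide.
MESH_SYNONYMS = {
    "yeast": ["yeasts", "saccharomyces cerevisiae"],
}


def expand_term(word):
    """Expand a plain word into MeSH-term clauses OR the original text word."""
    headings = MESH_SYNONYMS.get(word.lower())
    if not headings:
        # No known heading: search the word as plain text only.
        return f"{word}[Text Word]"
    mesh = " OR ".join(f'"{h}"[MeSH Terms]' for h in headings)
    return f"(({mesh}) OR {word}[Text Word])"


query = f'("ouellette bf"[au] AND {expand_term("yeast")})'
print(query)
```

With this table the printed query is character-for-character the expanded query on the slide.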
![Page 18: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/18.jpg)
Integrated Text/Sequence Searching with Entrez
![Page 19: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/19.jpg)
PubCrawler
http://www.pubcrawler.ie/
![Page 20: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/20.jpg)
PubCrawler
• Free "alerting" service that scans daily updates to the NCBI Medline (PubMed) and GenBank databases
• Lists new database entries that match search parameters (keywords, author names, etc.) specified by the user
• Results are presented as an HTML Web page (Entrez-like format)
• Can be downloaded or run as a service
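The alerting idea on this slide (match each day's new entries against a user's saved keywords and author names, then show the hits as a web page) can be sketched as follows. This is not PubCrawler's code; the `Record` class, the matching rule, and the sample records are all invented for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Record:
    title: str
    authors: list


def matches(record, keywords=(), authors=()):
    """True if the title contains any keyword or any watched author is listed."""
    title = record.title.lower()
    keyword_hit = any(k.lower() in title for k in keywords)
    names = {a.lower() for a in record.authors}
    author_hit = any(a.lower() in names for a in authors)
    return keyword_hit or author_hit


def render_html(records):
    """Format the matching records as a minimal HTML list."""
    items = "".join(f"<li>{escape(r.title)}</li>" for r in records)
    return f"<ul>{items}</ul>"


# Invented sample of one day's database updates.
todays_updates = [
    Record("Yeast two-hybrid screening revisited", ["Example A"]),
    Record("Renal transplant outcomes", ["Sample B"]),
    Record("Unrelated crystallography note", ["Other C"]),
]
hits = [r for r in todays_updates
        if matches(r, keywords=["yeast"], authors=["sample b"])]
print(render_html(hits))
```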
![Page 21: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/21.jpg)
![Page 22: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/22.jpg)
![Page 23: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/23.jpg)
MedMiner
http://discover.nci.nih.gov/textmining/filters.html
![Page 24: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/24.jpg)
MedMiner
• A text miner that filters, extracts and organizes relevant sentences in the literature based on a gene, gene-gene or gene-drug query
• Combines GeneCards and PubMed searches with an integrated text filter
• L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein, (1999) BioTechniques 27:1210-1217.
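The filtering step described above, keeping only sentences relevant to a gene-drug (or gene-gene) query, can be approximated by simple co-occurrence. The sketch below is an assumption about the general approach, not MedMiner's actual filter; the sentence splitter is deliberately naive and `GENE1`, `GENE2`, and `DRUGX` are placeholder names.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_sentences(text, term_a, term_b):
    """Return the sentences that mention both terms (case-insensitive)."""
    a, b = term_a.lower(), term_b.lower()
    return [s for s in split_sentences(text)
            if a in s.lower() and b in s.lower()]


# Invented abstract with placeholder gene and drug names.
abstract = ("GENE1 expression rose after DRUGX treatment. "
            "GENE2 was unchanged. "
            "DRUGX had no effect on cell size.")
print(cooccurring_sentences(abstract, "GENE1", "DRUGX"))
```

Only the first sentence survives the filter, because it is the only one naming both the gene and the drug.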
![Page 25: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/25.jpg)
![Page 26: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/26.jpg)
MedGene
http://hipseq.med.harvard.edu/MEDGENE/login.jsp
![Page 27: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/27.jpg)
MedGene
• A list of human genes associated with a particular human disease in ranking order
• A list of human genes associated with multiple human diseases in ranking order
• A list of human diseases associated with a particular human gene in ranking order
• A list of human genes associated with a particular human gene in ranking order
• The sorted gene list from other disease related high-throughput experiments, (i.e. micro-array
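A ranked gene list for a disease, as in the first bullet above, can be sketched by counting how many abstracts mention a gene together with the disease. The plain co-occurrence count here is a stand-in for whatever statistic MedGene actually uses, and the abstracts and gene names (`GENE_A`, `GENE_B`) are invented.

```python
from collections import Counter


def rank_genes_for_disease(abstracts, disease, genes):
    """Rank genes by the number of abstracts mentioning both gene and disease."""
    counts = Counter()
    disease = disease.lower()
    for text in abstracts:
        lowered = text.lower()
        if disease not in lowered:
            continue  # abstract is not about this disease
        for gene in genes:
            if gene.lower() in lowered:
                counts[gene] += 1
    return counts.most_common()  # highest count first


# Invented mini-corpus with placeholder gene names.
abstracts = [
    "GENE_A is overexpressed in breast cancer.",
    "Breast cancer risk and GENE_A variants; GENE_B was also tested.",
    "GENE_B function in yeast.",
]
ranking = rank_genes_for_disease(abstracts, "breast cancer", ["GENE_A", "GENE_B"])
print(ranking)
```

Substring matching is the weak point of this sketch: short or ambiguous gene names would need a synonym list and word-boundary checks.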
![Page 28: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/28.jpg)
![Page 29: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/29.jpg)
MedGene Performance
• Was able to identify >2400 genes associated with breast cancer in the literature
• Existing databases only list 260 genes (of which MedGene found 240)
• Could save hundreds of hours of literature searching & combing
![Page 30: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/30.jpg)
PolySearch
![Page 31: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/31.jpg)
PolySearch
• Searches over 14 million PubMed Records
• Searches against 1622 diseases (and synonyms)
• Searches using 9300 genes with 42,500 synonyms
• Assesses quality using SCI list of impact factors for 8600+ journals
![Page 32: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/32.jpg)
PolySearch
• Supports PubMed text searching for gene & disease associations (user provides disease name)
• Automatically scores & identifies genes and searches for known SNPs or mutations against standard databases
• Grabs gene sequences and generates primers around SNPs
• Archives (MySQL database) or sends results as HTML page to user
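The "primers around SNPs" step can be illustrated with a very naive sketch: take a fixed-length window upstream of the SNP as the forward primer and the reverse complement of a window downstream as the reverse primer. This is an assumption about the general idea, not PolySearch's method; real primer design also weighs melting temperature, GC content, and specificity. The function names, window sizes, and toy sequence are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_primers(seq, snp_index, primer_len=5, offset=2):
    """Pick primers on both sides of the SNP at 0-based position `snp_index`.

    The forward primer ends `offset` bases before the SNP; the reverse primer
    is the reverse complement of the window starting `offset` bases after it.
    """
    fwd_end = snp_index - offset
    rev_start = snp_index + 1 + offset
    if fwd_end - primer_len < 0 or rev_start + primer_len > len(seq):
        raise ValueError("SNP is too close to the end of the sequence")
    forward = seq[fwd_end - primer_len:fwd_end]
    reverse = reverse_complement(seq[rev_start:rev_start + primer_len])
    return forward, reverse


toy_sequence = "AACCGGTTAACCGGTTAACC"  # invented 20-base sequence
forward, reverse = flanking_primers(toy_sequence, snp_index=10)
print(forward, reverse)
```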
![Page 33: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/33.jpg)
Other Examples of Text or Web Mining
![Page 34: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/34.jpg)
http://textomy.iit.nrc.ca/
![Page 35: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/35.jpg)
Pre-BIND
• Donaldson et al. BMC Bioinformatics 2003 4:11
• Used Support Vector Machine (SVM) to scan literature for protein interactions
• Precision, accuracy and recall of 92% for correctly classifying PI abstracts
• Estimated to capture 60% of all abstracted protein interactions for a given organism
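For readers unfamiliar with the three measures quoted above, the sketch below computes precision, recall, and accuracy from a classifier's confusion counts. The counts in the example are invented so that all three come out at 92%; they are not the Pre-BIND data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts.

    tp: protein-interaction abstracts correctly flagged
    fp: other abstracts wrongly flagged
    fn: protein-interaction abstracts missed
    tn: other abstracts correctly ignored
    """
    precision = tp / (tp + fp)                  # how many flagged were right
    recall = tp / (tp + fn)                     # how many real ones were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall fraction correct
    return precision, recall, accuracy


# Invented counts chosen to give 92% on every measure.
precision, recall, accuracy = classification_metrics(tp=92, fp=8, fn=8, tn=92)
print(f"{precision:.0%} {recall:.0%} {accuracy:.0%}")
```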
![Page 36: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/36.jpg)
Proteome Analyst
• Uses Naïve Bayes methods in combination with sequence homology to identify “tokens” or nuggets of important information from text (titles, keywords, InterPro numbers and other data)
• Produces quantitative estimates (queryable reliability scores) of protein function, location, etc.
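A Naïve Bayes classifier over text tokens, the technique named on this slide, fits in a few dozen lines. The sketch below is a generic multinomial Naïve Bayes with add-one smoothing, not Proteome Analyst's implementation, and it leaves out the sequence-homology step entirely; the class name `TinyNaiveBayes` and the training tokens are invented.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, examples):
        """Train on a list of (tokens, label) pairs."""
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for tokens, label in examples:
            self.token_counts[label].update(tokens)
        self.vocab = {t for tokens, _ in examples for t in tokens}
        return self

    def predict_proba(self, tokens):
        """Return {label: probability} for a new list of tokens."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for t in tokens:
                if t in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


# Invented training data: annotation tokens paired with a location label.
training = [
    (["kinase", "membrane"], "membrane"),
    (["transporter", "membrane"], "membrane"),
    (["histone", "nucleus"], "nucleus"),
    (["transcription", "nucleus"], "nucleus"),
]
model = TinyNaiveBayes().fit(training)
probs = model.predict_proba(["histone", "transcription"])
print(probs)
```

The returned probabilities play the role of the reliability scores mentioned on the slide: each prediction comes with a number that can be thresholded or queried.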
![Page 37: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/37.jpg)
GenePublisher
• Processes raw genechip data and produces a publishable report in 1-2 hours of processor time
• Mines existing databases to build up or extract relationships
• Learns from previous analyses and remembers previous associations
http://www.cbs.dtu.dk/services/GenePublisher/
![Page 38: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/38.jpg)
GenePublisher Output
![Page 39: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/39.jpg)
Continuing Problems in Text Mining Biomedical Literature are…
![Page 40: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/40.jpg)
A Serious Naming Problem
• Sonic Hedgehog
• Draculin
• Profilactin
• Knobhead
• Lunatic Fringe
• Fidgetin
• Mortalin
• Antiquitin
• Accelerin
• Cockeye
• Clootie Dumpling
• SnaFu
• Gleeful
• Bang Senseless
• Bride of Sevenless
• Crack
• Christmas Factor
• Orphanin
![Page 41: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/41.jpg)
And Exotic Terminology…
• J. Med. Genetics 10, 1962-6 (1973) "Mobius Syndrome with Poland’s Anomaly."
• Heavy use of eponyms (Werner’s syndrome, Down’s syndrome, Angelman’s syndrome, Creutzfeldt-Jakob disease, etc.)
![Page 42: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/42.jpg)
Some Challenges
• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?
• How to ascribe and name a function, process or location consistently?
• How to describe interactions, partners, reactions and complexes?
• How to classify genes & proteins (a universal taxonomy of sequences and structures)?
![Page 43: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/43.jpg)
Some Solutions
• Develop controlled or restricted vocabularies (IUPAC-like naming conventions)
• Create thesauri, central repositories or synonym lists (MeSH terms in PubMed)
• Work towards synoptic reporting and structured abstracting
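A synonym list of the kind proposed above boils down to mapping every variant of a name onto one canonical term. The sketch below uses the renal transplant example from the PubMed slide; the table `CANONICAL` is a hypothetical three-entry stand-in for a curated thesaurus such as MeSH.

```python
# Hypothetical synonym table; a real one would come from a curated thesaurus.
CANONICAL = {
    "renal transplant": "kidney transplantation",
    "kidney transplant": "kidney transplantation",
    "kidney transplantation": "kidney transplantation",
}


def normalize(term):
    """Map a term to its canonical form, ignoring case and extra spaces."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key, key)  # unknown terms pass through unchanged


print(normalize("Renal  Transplant"))
```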
![Page 44: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/44.jpg)
Synoptic or Structured Abstract
J Am Acad Dermatol. 2004 Mar;50(3):431-4.
Demand outstrips supply of US pediatric dermatologists: Results from a national survey.
Hester EJ, McNealy KM, Kelloff JN, Diaz PH, Weston WL, Morelli JG, Dellavalle RP.
BACKGROUND: The US pediatric dermatology workforce was last examined in 1986 when limited employment
opportunity was found. OBJECTIVE: We sought to re-examine pediatric dermatology workforce issues. METHODS:
US dermatology chairpersons and residency program directors were surveyed for: (1) agreement with pediatric
dermatology workforce statements; and (2) pediatric dermatology faculty and fellow numbers. RESULTS: Respondents
agreed that having a pediatric dermatologist or dermatologists on faculty is important, and that a shortage of pediatric
dermatologists exists, but did not agree that increasing pediatric dermatology training requirements will increase this
shortage. Almost half of the programs (45/94) employed a full-time pediatric dermatologist, and 24 programs had
currently been recruiting a pediatric dermatologist for more than 1 year. Only 6 pediatric dermatology fellows were
in training. CONCLUSION: Given that open pediatric dermatology faculty positions greatly exceed the number of
fellows in training and that formal training requirements will be increasing, the shortage of pediatric dermatologists
will likely continue.
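One practical benefit of a structured abstract like the one above is that its labelled sections can be pulled apart mechanically. The sketch below splits on the five labels used in this abstract; the label list and the function name are assumptions for illustration, and other journals use other headings. The sample text is the first two sections of the abstract, abridged.

```python
import re

# Labels taken from the abstract on this slide; other journals differ.
SECTION = re.compile(r"\b(BACKGROUND|OBJECTIVE|METHODS|RESULTS|CONCLUSION):\s*")


def parse_structured_abstract(text):
    """Split an abstract into {LABEL: body} using its upper-case labels."""
    parts = SECTION.split(text)
    # With one capturing group, split() yields:
    # [preamble, label1, body1, label2, body2, ...]
    labels, bodies = parts[1::2], parts[2::2]
    return {label: " ".join(body.split()) for label, body in zip(labels, bodies)}


abstract = (
    "BACKGROUND: The US pediatric dermatology workforce was last examined "
    "in 1986 when limited employment opportunity was found. "
    "OBJECTIVE: We sought to re-examine pediatric dermatology workforce issues."
)
sections = parse_structured_abstract(abstract)
print(sections["OBJECTIVE"])
```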
![Page 45: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/45.jpg)
GO-Gene Ontology
• To produce a controlled vocabulary that changes as biological knowledge changes
• Categorizes according to 1) molecular function; 2) biological process; and 3) cellular component
• Represents contributions and consensus opinions from multiple experts in various fields
• Aim is to have every known protein and gene annotated consistently
http://www.geneontology.org/
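The three-way categorisation above suggests a simple record shape for an annotation: a gene or protein, a term, and which of the three categories the term belongs to. The sketch below is one possible representation, not the Gene Ontology project's own data format; the class `GOAnnotation` and the protein name `PROTEIN_X` are hypothetical, and the terms are shown for illustration only.

```python
from dataclasses import dataclass

# The three categories listed on the slide.
NAMESPACES = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class GOAnnotation:
    gene: str
    term_id: str
    term_name: str
    namespace: str

    def __post_init__(self):
        # Reject anything outside the controlled set of categories.
        if self.namespace not in NAMESPACES:
            raise ValueError(f"unknown GO namespace: {self.namespace}")


def group_by_namespace(annotations):
    """Collect term names under each of the three categories."""
    grouped = {ns: [] for ns in NAMESPACES}
    for a in annotations:
        grouped[a.namespace].append(a.term_name)
    return grouped


# Illustrative annotations for a hypothetical protein.
annotations = [
    GOAnnotation("PROTEIN_X", "GO:0003677", "DNA binding", "molecular_function"),
    GOAnnotation("PROTEIN_X", "GO:0005634", "nucleus", "cellular_component"),
]
print(group_by_namespace(annotations))
```

Validating the category at construction time is what makes the vocabulary "controlled": a misspelt or invented category fails immediately instead of entering the database.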
![Page 46: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/46.jpg)
NIH’s Medical Ontology Research Program
http://lhncbc.nlm.nih.gov/lhc/servlet/Turbine/template/home%2CHome.vm
![Page 47: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/47.jpg)
MeSH
![Page 48: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/48.jpg)
OMIM
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
![Page 49: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/49.jpg)
DrugBank
http://redpoll.pharmacy.ualberta.ca/drugbank/
![Page 50: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/50.jpg)
Bioinformatics
Medinformatics
![Page 51: Grandrounds2004.ppt](https://reader033.vdocument.in/reader033/viewer/2022060108/554e8510b4c90573338b463c/html5/thumbnails/51.jpg)
Conquering the Mountain