Detecting the Knowledge Structure of Bioinformatics
with Text Mining and Citation Analysis
Min Song, PhD
Associate Professor
Department of Library and Information Science
Yonsei University
Outline
• Introduction and Background
• Research Problem
• Methods
  • Data Processing
  • Topic Modeling
  • Citation Analysis
  • Identification of Important Articles by PageRank
  • Visualization
• Results & Discussion
• Summary & Future Work
Introduction
• Bioinformatics has grown into a cross-disciplinary field and proliferated into new areas of the life sciences
  • 400,000 biological researchers worldwide
  • The sequencing industry is projected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011)
  • Increasing number of biological databases, including PubMed and PubMed Central
• Understanding the trends in and the structure of Bioinformatics is increasingly important
• Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2010)
Research Problem
• Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996)
• Problems of Current Approaches
  • Current bibliometric analysis relies primarily on Thomson’s Web of Science product, which results in the following problems:
    • Citation data are processed manually
    • Incomplete coverage
    • Only citation analysis is used
    • Cannot handle big data
Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
Goal
• Detecting the trends in and the structure of the field of Bioinformatics
• We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics using text mining techniques and automated citation analysis
  • Mining PubMed Central full text with
    • topic modeling
    • word co-occurrence
    • named entity recognition
    • MeSH
  • Novel author co-citation analysis
  • Visualization
What is PubMed Central?
• PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature
• Provides free and unrestricted access (XML format)
• Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein)
• Launched in February 2000
• 383 journals, 1,512,652 articles, 4.3m unique visitors in April 2008
Citation Analysis
• Citation Graphs
• Link-based algorithms
  • HITS
  • PageRank
[Figure: similarity pipeline diagram. Labels: Representative Publications; Documents; Text-based; Citation-based; Co-citation; Bibliographic coupling (BC); Boolean Input Vectors; Cosine; QUANTIFY SIMILARITIES; Combine]
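The figure labels on this slide mention cosine similarity over Boolean input vectors for bibliographic coupling. As a minimal sketch of what that measure could look like (the function name and the toy reference identifiers are hypothetical, and the slides do not say how the cited work implements it), two papers can be compared by the overlap of their reference lists:

```python
import math


def cosine_boolean(refs_a, refs_b):
    """Cosine similarity of two Boolean vectors given as sets of cited references.

    For 0/1 vectors the dot product is the number of shared references and
    each vector norm is the square root of the reference-list size.
    """
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0  # a paper with no references is coupled to nothing
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical reference lists for two papers.
paper_x = {"ref1", "ref2", "ref3"}
paper_y = {"ref2", "ref3", "ref4", "ref5"}
coupling = cosine_boolean(paper_x, paper_y)
```

The same function gives a co-citation style measure if each set holds the papers that cite a document instead of the papers it cites.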
Methods – Data Collections
1. Advanced Bioinformatics
2. Algorithms for Molecular Biology
3. Biochemistry
4. BioData Mining
5. Bioinformatics
6. Bioinformation
7. BMC Bioinformatics
8. BMC Genomics
9. BMC Systems Biology
10. Briefings in Functional Genomics & Proteomics
11. BMC Research Notes
12. Bulletin of Mathematical Biology
13. Cancer Informatics
14. Comparative and Functional Genomics
15. EURASIP Journal on Bioinformatics and Systems Biology
16. The EMBO Journal
17. Evolutionary Bioinformatics
18. Genome Biology
19. Genome Medicine
20. Genomics
21. Genome Integration
22. Journal of Biotechnology
23. Journal of Biomedical Semantics
24. Journal of Proteome Research
25. Journal of Proteomics
26. Journal of Computer-Aided Molecular Design
27. Journal of Computational Neuroscience
28. Journal of Molecular Biology
29. Journal of Molecular Modelling
30. Journal of Theoretical Biology
31. Mammalian Genome
32. Molecular & Cellular Proteomics
33. Molecular Systems Biology
34. Neuroinformatics
35. Pharmacogenetics and Genomics
36. Physiological Genomics
37. PLoS Computational Biology
38. PLoS Biology
39. PLoS Genetics
40. Protein Science
41. Proteomics
42. Source Code for Biology and Medicine
43. Statistical Methods in Medical Research
44. Theoretical Biology and Medical Modelling
45. Trends in Biochemical Sciences
46. Trends in Biotechnology
47. Trends in Genetics
Total: 20,869 articles from 47 journals
Overall Procedure of Our Approach
[Figure: flow diagram. Boxes: Parse PubMed Central; Citation Relational DB; Text Relational DB; Text Analysis; Word co-occurrence; MeSH term frequency; Topic Modeling with LDA; Link Analysis; Ranking important articles by PageRank; Detect Organization and Country with NER; Bibliometric Analysis]
MeSH = Medical Subject Headings
Word co-occurrence analysis and MeSH term frequency
• Important concept identification by word co-occurrence
  • The most widely used measure of co-occurrence is mutual information (MI)
  • We use the log-likelihood ratio (LLR) because it is more appropriate than MI for a mixture of high-frequency and low-frequency bigrams
• Important concept identification by MeSH term
  • Count the MeSH terms assigned to each article
  • MeSH terms are not assigned to PubMed Central records
  • Map each PubMed Central article to its PubMed record and then extract the MeSH terms
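The slide does not give the LLR formula, so the sketch below assumes the common form of the statistic for bigrams: Dunning's G² over a 2x2 contingency table of observed versus expected counts. The function names and the toy token list are hypothetical.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)) over a 2x2 contingency table.

    k11: count of (a, b); k12: (a, not b); k21: (not a, b); k22: (not a, not b).
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def score_bigrams(tokens):
    """Score every adjacent word pair in a token list by LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), count in bigrams.items():
        first[a] += count   # bigrams that start with a
        second[b] += count  # bigrams that end with b
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


tokens = "gene expression data gene expression profiles gene ontology".split()
ranked = sorted(score_bigrams(tokens).items(), key=lambda kv: kv[1], reverse=True)
```

Unlike MI, this statistic does not blow up for pairs seen only once or twice, which is the property the slide appeals to. It is two-sided, so a high score can also mean that two words avoid each other.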
Topic Modeling
• Topic Modeling by LDA
  • We explore the salient topics in the core literature of Bioinformatics.
  • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), for topic model generation
  • LDA is a generative model in which sets of observations are accounted for by unobserved groups, which explains the similarity of documents in the collection
  • In LDA, each group is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection
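To make the generative description concrete, here is a minimal sketch that runs the LDA story forward: it draws topics as distributions over a vocabulary, draws a topic mixture per document, and samples words. It does not fit a model to real text; fitting inverts this process. The vocabulary, the hyperparameters `alpha` and `beta`, and all function names are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, beta=0.5, seed=0):
    rng = random.Random(seed)
    # Each topic is a discrete distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    mixtures, docs = [], []
    for _ in range(n_docs):
        # Each document is a random mixture over the latent topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)      # pick a topic for this word
            w = sample_index(topics[z], rng)  # pick a word from that topic
            words.append(vocab[w])
        mixtures.append(theta)
        docs.append(words)
    return topics, mixtures, docs


vocab = ["gene", "expression", "protein", "network", "genome", "sequence"]
topics, mixtures, docs = generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8)
```

Documents whose mixtures put weight on the same topics share vocabulary, which is how the model accounts for document similarity.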
Detection of Organization and Country
• We apply a Named Entity Recognition (NER) technique to identify countries and organizations in the text
Citation Analysis
• Build a citation network from the datasets
  • 990,000 citation nodes from about 20,000 papers
• Apply the PageRank algorithm to the network to identify the important articles
Citation Network (Complexity and Social Networks, 2012)
PageRank - definition
• u: a web page
• F_u: set of pages u points to
• B_u: set of pages that point to u
• N_u = |F_u|: the number of links from u
• c: a factor used for normalization

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

• The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
• The definition corresponds to the probability distribution of a random walk on the web graph.
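The iteration described above can be sketched in a few lines. This is a minimal illustration of the simplified definition only: the normalization factor c is applied by rescaling the ranks to sum to 1 after each pass, and there is no damping or special handling of pages without outgoing links, which a production ranking of a real citation network would need. The function name and the three-node graph are hypothetical.

```python
def pagerank_simple(links, iterations=100, tol=1e-10):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v until it settles.

    links maps each node v to the list of nodes it points to (F_v).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    rank = {u: 1.0 / len(nodes) for u in nodes}  # any starting ranks will do
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for v, targets in links.items():
            if targets:
                share = rank[v] / len(targets)  # R(v) / N_v
                for u in targets:
                    new_rank[u] += share
        total = sum(new_rank.values())
        if total == 0:
            break
        # Rescaling to sum to 1 plays the role of the factor c.
        new_rank = {u: r / total for u, r in new_rank.items()}
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: A cites B and C, B cites C, C cites A.
ranks = pagerank_simple({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In a citation network the nodes are papers and a link from v to u means that v cites u, so highly ranked nodes are papers cited by other well-cited papers.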
Results and Discussion
• Term Co-location Analysis
Keyword | Word Co-occurrence and LLR Score
Gene | gene expression - 36947.5, gene ontology - 4729.7, expressed genes - 4115.5, genes involved - 3423.9, gene regulation - 1314.1
Genome | genome wide - 15485.4, whole genome - 5401.7, human genome - 2950.3, genome sequence - 1821.2, functional genomics - 1805.4
Expression | expression patterns - 4231.7, expression profiles - 6517.0, expression data - 3546.1, expression levels - 3187.4
Data | data sets - 6593.5, microarray data - 6305.9, expression data - 3546.1
Protein | protein interaction - 4824.8, protein interactions - 3186.5, protein coding - 2841.8, protein protein - 2719.8
Algorithm | clustering algorithm - 676.0, clustering algorithms - 585.0, new algorithm - 502.0, proposed algorithm - 416.7, alignment algorithms - 266.3
Database | public databases - 1309.8, relational database - 1296.7, database search - 363.4
Computer | computer simulations - 538.7, computer program - 317.2, computer aided - 278.6, computer science - 223.1, computational model - 221.9
Keywords with High Ranked Word Co-occurrence
Results and Discussion
• Term Relationship based on Latent Semantic Indexing
Term Weight (based on LSI)
unlocking 0.546030797332329
musings 0.5412877797836922
cataloguing 0.4929569570474993
obama 0.48770037665119187
note 0.46722318828979653
interchanges 0.44477015248681573
pufferfish 0.4341307613774561
korea 0.4116133344695534
loving 0.40199584778314623
egenomics 0.3861955559200091
rickettsia 0.36268544163299155
ddra 0.3380806646805069
methane 0.33630023219143507
parahaemolyticus 0.335803654780268
mitochondrion 0.3299703122927084
natto 0.2974579417003239
Results and Discussion (Cont’d)
Word Pair LLR Score
gene - expression 36947.5
amino - acid 16483.9
genome - wide 15485.4
high - throughput 14185.2
large - scale 10554
binding - sites 9450.1
factor - transcription 8580.7
saccharomyces - cerevisiae 7867.8
E - coli 6849.4
expression - profiles 6517
microarray - data 6305.9
expression - patterns 4231.7
expression - levels 3187.4
Top Ranked Word Pairs by LLR
Results and Discussion (Cont’d)
• Of the 20,869 documents, 19,954 have a corresponding MEDLINE record (95.6% matching). Of these 19,954 documents, 8,412 have MeSH terms (42.2%).
MeSH Term Frequency
Animals 5178
Humans 4883
Computational Biology 3070
Algorithms 2980
Gene Expression Profiling 2702
Oligonucleotide Array Sequence Analysis 2192
Software 2154
Molecular Sequence Data 1868
Models, Biological 1579
Computer Simulation 1568
Mice 1511
Sequence Analysis, DNA 1489
Base Sequence 1374
Genomics 1344
Evolution, Molecular 1336
Databases, Genetic 1325
Models, Genetic 1289
Sequence Alignment 1278
Proteins 1135
Results and Discussion (Cont’d)
• Topic Modeling
Topic 1: identification, signaling, using, cerevisiae, saccharomyces, genes, small, non, alignment, dna, cancer, yeast, network, genome, system, expression, genomic, activity, screening, specific
Topic 2: model, gene, mapping, protein, human, structural, between, computational, binding, based, genomes, structure, biology, tool, interactions, domains, role, cell, new, length
Topic 3: human, cells, detection, pathway, protein, from, dna, analysis, stem, elegans, recognition, structure, caenorhabditis, evolution, complex, gene, 1, induced, nuclear, strand
Topic 4: expression, profiling, regulatory, specific, mouse, transcriptional, molecular, dynamic, regulation, evolution, cancer, genes, comparative, sequence, early, support, discovery, proteins, machine, during
Topic 5: data, time, information, protein, classification, from, high, analysis, mass, microarray, throughput, based, algorithm, sequence, expressed, spectrometry, database, differentially, identifying, pcr
Results and Discussion (Cont’d)
• Topic Modeling
Topic 6: transcription, genomic, evolutionary, prediction, factor, sites, analysis, dna, coli, gene, genome, escherichia, acid, copy, number, binding, organization, evolution, estimation, arabidopsis
Topic 7: expression, gene, analysis, data, using, genes, microarray, control, from, human, cell, C, case, assessment, size, network, multiple, quality, transcriptional, cells
Topic 8: analysis, protein, networks, interaction, methods, based, genomics, web, genome, biology, genetic, biological, hiv, systems, sequence, data, bayesian, tool, structure, approach
Topic 9: gene, genetic, new, system, metabolism, chromosome, zebrafish, functional, annotation, open, associated, integrating, reveals, life, bacteria, transcriptome, among, loss, mammalian, microarrays
Topic 10: genome, using, gene, from, wide, sequences, data, method, large, whole, rna, disease, networks, pathways, short, drosophila, scale, alternative, regions, development
Results and Discussion (Cont’d)
Relationship between a paper and its citation
Results and Discussion (Cont’d)
Publication productivity by year
Results and Discussion (Cont’d)
Relationship between an author and the number of citations received
Results and Discussion (Cont’d)
• Important Articles Identified by PageRank
Rank | Title | Journal Title
1 | Gapped BLAST and PSI-BLAST: A new generation of protein database search programs | Nucleic Acids Res
2 | Basic local alignment search tool | J Mol Biol
3 | Gene ontology: tool for the unification of biology. The Gene Ontology Consortium | Nat Genet
4 | CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | Nucleic Acids Res
5 | R: A language and environment for statistical computing | Book
6 | Initial sequencing and analysis of the human genome | Nature
7 | Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring | Science
8 | The Protein Data Bank | Nucleic Acids Res
9 | Bioconductor: open software development for computational biology and bioinformatics | Genome Biology
10 | Exploration, normalization, and summaries of high density oligonucleotide array probe level data | Biostatistics
Results and Discussion (Cont’d)
Research productivity by country
Results and Discussion (Cont’d)
Research Productivity by Institute
Institute Frequency
University of California 1678
Harvard Medical School 811
Stanford 768
National Institutes of Health 430
University of Washington 400
Yale University 373
University College London 329
Massachusetts Institute of Technology 310
Washington University 290
University of Toronto 287
Wellcome Trust Genome Campus 256
University of Illinois 252
University of Oxford 248
University of Michigan 240
University of Cambridge 236
University of North Carolina 235
Princeton University 234
Baylor College of Medicine 230
Columbia University 229
Cornell University 227
Results and Discussion (Cont’d)
Visualization of Citation Graph
Visualization of citation network that shows highly cited papers and papers that cite them
Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, p. 367, KDD '07. ACM, San Jose, CA, August 2007.
Summary and Future Work
• We have analyzed the field of Bioinformatics with text mining techniques and citation analysis
• We proposed several novel approaches to detect the knowledge structure of the field of Bioinformatics
• We found that Bioinformatics has grown very fast and that collaboration among authors spreads widely across disciplines.
• We also found that Bradford's law does not apply to Bioinformatics. Further analysis is required to explain why Bioinformatics is a unique field to which Bradford's law is not applicable.
• Future work
  • Fine-tune the visualization
  • Compare with Web of Science data
References
• Nagarajan M., Mohamed Idhris L., Chellappandi P., Kumaravel J.P.S. and Premalatha V. Information Use by Scholars in Bioinformatics: A Bibliometric View, 2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011)
• Church, K., and Hanks, P., Word Association Norms, Mutual Information and Lexicography, Computational Linguistics, Vol 16:1, pp. 22-29, (1991).
• Patra, S K, Mishra, S. (2006), Bibliometric study of bioinformatics literature, Scientometrics, 67: 477–489.
• Zhao, D. (2006) Towards All-Author Co-Citation Analysis, Information Processing and Management, 42: 1578-1591
• Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007.
• Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005.
• Brusic, V. (2007) The growth of bioinformatics, Briefings in Bioinformatics, Vol. 8, No. 2, 69-70
References
• Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D., Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE Transactions on Information Technology in Biomedicine: a publication of the IEEE Engineering in Medicine and Biology Society 2007, 11(3): 237-243
• Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95.
• Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79:109-129.
• Blei, D., Ng A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
• Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society of Information Science and Technology 2011, doi: 10.1002/asi.21707
Questions?
• Thank you!