
Detecting the Knowledge Structure of Bioinformatics with Text Mining and Citation Analysis

Min Song, PhD
Associate Professor
Department of Library and Information Science
Yonsei University

Outline

• Introduction and Background
• Research Problem
• Methods
  • Data Processing
  • Topic Modeling
  • Citation Analysis
  • Identification of Important Articles by PageRank
  • Visualization
• Results & Discussion
• Summary & Future Work

Introduction

• Bioinformatics has grown into a cross-disciplinary field and has spread into new areas of the life sciences
  • 400,000 biological researchers worldwide
  • The sequencing industry is projected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011)
  • A growing number of biological databases, including PubMed and PubMed Central
• Understanding the trends in and the structure of Bioinformatics is increasingly important
• Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2010)

Research Problem

• Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996)
• Problems of current approaches
  • Current bibliometric analysis relies primarily on Thomson’s Web of Science product, which leads to the following problems:
    • Citation data are processed manually
    • Coverage is incomplete
    • Only citation analysis is used
    • Big data cannot be handled

Image Reference: Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Goal

• Detect the trends in and the structure of the field of Bioinformatics
• We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics using text mining techniques and automated citation analysis
  • Mining PubMed Central full text with
    • topic modeling
    • word co-occurrence
    • named entity recognition
    • MeSH
  • Novel author co-citation analysis
  • Visualization

What is PubMed Central?

• PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature
• Provides free and unrestricted access (XML format)
• Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein)
• Launched in February 2000
• 383 journals, 1,512,652 articles, 4.3m unique visitors in April 2008

Citation Analysis

• Citation Graphs
• Link-based algorithms
  • HITS
  • PageRank

[Figure labels: Representative Publications; Documents; Text-based; Citation-based; Boolean Input Vectors; Quantify Similarities; Cosine; Co-citation; Bibliographic coupling (BC); Combine]

Methods – Data Collections

1. Advanced Bioinformatics
2. Algorithms for Molecular Biology
3. Biochemistry
4. BioData Mining
5. Bioinformatics
6. Bioinformation
7. BMC Bioinformatics
8. BMC Genomics
9. BMC Systems Biology
10. Briefings in Functional Genomics & Proteomics
11. BMC Research Notes
12. Bulletin of Mathematical Biology
13. Cancer Informatics
14. Comparative and Functional Genomics
15. EURASIP Journal on Bioinformatics and Systems Biology
16. The EMBO Journal
17. Evolutionary Bioinformatics
18. Genome Biology
19. Genome Medicine
20. Genomics
21. Genome Integration
22. Journal of Biotechnology
23. Journal of Biomedical Semantics
24. Journal of Proteome Research
25. Journal of Proteomics
26. Journal of Computer-Aided Molecular Design
27. Journal of Computational Neuroscience
28. Journal of Molecular Biology
29. Journal of Molecular Modelling
30. Journal of Theoretical Biology
31. Mammalian Genome
32. Molecular & Cellular Proteomics
33. Molecular Systems Biology
34. Neuroinformatics
35. Pharmacogenetics and Genomics
36. Physiological Genomics
37. PLoS Computational Biology
38. PLoS Biology
39. PLoS Genetics
40. Protein Science
41. Proteomics
42. Source Code for Biology and Medicine
43. Statistical Methods in Medical Research
44. Theoretical Biology and Medical Modelling
45. Trends in Biochemical Sciences
46. Trends in Biotechnology
47. Trends in Genetics

Total: 20,869 articles from 47 journals

Overall Procedure of Our Approach

[Figure: workflow diagram with the following boxes]

• Parse PubMed Central
• Citation Relational DB
• Text Relational DB
• Text Analysis
• Word co-occurrence
• MeSH term frequency
• Topic Modeling with LDA
• Link Analysis
• Ranking important articles by PageRank
• Detect Organization and Country with NER
• Bibliometric Analysis

MeSH = Medical Subject Headings

Word co-occurrence analysis and MeSH term frequency

• Identifying important concepts by word co-occurrence
  • The most widely used measure of co-occurrence is mutual information (MI)
  • We use the log-likelihood ratio (LLR) because it is more appropriate than MI for a mixture of high-frequency and low-frequency bigrams
• Identifying important concepts by MeSH term
  • Count the MeSH terms assigned to each article
  • MeSH terms are not assigned to PubMed Central articles
  • Map each PubMed Central article to its PubMed record and then extract the MeSH terms
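The LLR scoring of bigrams can be illustrated with a minimal Python sketch. It assumes the standard log-likelihood ratio statistic computed from a 2x2 contingency table of bigram counts; the slides do not say which implementation was used, and the function names and the toy token list are hypothetical.

```python
import math
from collections import Counter

def _xlogx(x):
    # Convention: 0 * log(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def _entropy_term(*counts):
    # Unnormalized entropy: xlogx(total) - sum of xlogx(each count)
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table.
    k11: count of (w1, w2); k12: (w1, not w2);
    k21: (not w1, w2);      k22: (not w1, not w2)."""
    row = _entropy_term(k11 + k12, k21 + k22)
    col = _entropy_term(k11 + k21, k12 + k22)
    mat = _entropy_term(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def score_bigrams(tokens):
    """Score every adjacent word pair in a token list by LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (w1, w2), c in bigrams.items():
        first[w1] += c
        second[w2] += c
    n = sum(bigrams.values())
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11    # w1 followed by another word
        k21 = second[w2] - k11   # another word followed by w2
        k22 = n - k11 - k12 - k21
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return scores

# Toy input (hypothetical), not the PubMed Central data
tokens = "gene expression data gene expression profiles gene ontology".split()
scores = score_bigrams(tokens)
```

Sorting `scores` by value in descending order gives a ranked list of word pairs of the kind reported in the results slides.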

Topic Modeling

• Topic modeling by LDA
  • We explore the salient topics in the core literature of Bioinformatics
  • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), to generate the topic model
  • LDA is a generative model in which sets of observations are explained by unobserved groups, which accounts for the similarity of documents in the collection
  • In LDA, each document is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection
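The generative story behind LDA can be made concrete with a small sketch. It only samples a document from fixed topics; it does not fit a model, and fitting (inferring topics from text) is the inverse problem that an LDA implementation solves. The two toy topics, the `alpha` values, and all function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw is a vector of normalized Gamma(alpha_i, 1) draws
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """topics: list of {word: probability} dicts, one per latent topic."""
    # 1. Draw the document's mixture over topics
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Draw a topic for this word position from the mixture
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Draw a word from that topic's distribution over the vocabulary
        vocab = list(topics[z])
        weights = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical toy topics, not topics learned from the corpus
TOY_TOPICS = [
    {"gene": 0.5, "expression": 0.3, "microarray": 0.2},
    {"protein": 0.5, "structure": 0.3, "binding": 0.2},
]
rng = random.Random(0)
theta, doc = generate_document(TOY_TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
```

Here `theta` is the per-document topic mixture and each entry of `TOY_TOPICS` plays the role of a discrete distribution over the vocabulary.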

Detection of Organization and Country

• We apply a Named Entity Recognition (NER) technique to identify countries and organizations in the text

Citation Analysis

• Build a citation network from the datasets
  • 990,000 citation nodes from about 20,000 papers
• Apply the PageRank algorithm to the network to identify the important articles

Citation Network (Complexity and Social Networks, 2012)
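As a rough illustration of the first step, a citation network can be held as two adjacency maps built from (citing, cited) pairs. This is a sketch under assumptions: the slides store citations in a relational database, and the pair format, function name, and paper identifiers below are hypothetical.

```python
from collections import defaultdict

def build_citation_network(citations):
    """citations: iterable of (citing_id, cited_id) pairs."""
    cites = defaultdict(set)      # paper -> papers it cites
    cited_by = defaultdict(set)   # paper -> papers that cite it
    for citing, cited in citations:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
        # Make sure every paper appears as a node in both maps
        cites.setdefault(cited, set())
        cited_by.setdefault(citing, set())
    return dict(cites), dict(cited_by)

# Hypothetical identifiers for illustration only
edges = [("P1", "BLAST"), ("P2", "BLAST"), ("P2", "GO"), ("P3", "P1")]
cites, cited_by = build_citation_network(edges)

# Raw citation count: number of incoming edges per paper
most_cited = max(cited_by, key=lambda p: len(cited_by[p]))
```

Counting incoming edges gives raw citation counts; PageRank goes further by also weighting who is doing the citing.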

PageRank - definition

• u: a web page
• F_u: set of pages u points to
• B_u: set of pages that point to u
• N_u = |F_u|: the number of links from u
• c: a factor used for normalization

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

• The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
• The definition corresponds to the probability distribution of a random walk on the web graph.
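The iteration described on this slide can be sketched in Python as follows. Two things are assumptions added here and do not appear on the slide: a damping factor (0.85) and the uniform redistribution of rank from papers with no outgoing citations. Both are common in practice and keep the iteration well defined on citation graphs, which contain many such papers. The graph and identifiers are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links: dict mapping each node u to the list of nodes it points to (F_u)."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start from any set of ranks; here, the uniform distribution
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread uniformly (assumption)
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        # Each node v passes R(v) / N_v to every node it points to
        for v in nodes:
            targets = links.get(v, ())
            for u in targets:
                new[u] += damping * rank[v] / len(targets)
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:  # stop once the ranks no longer change
            break
    return rank

# Hypothetical toy citation graph: citing paper -> cited papers
links = {"P1": ["BLAST"], "P2": ["BLAST", "GO"], "P3": ["P1"]}
ranks = pagerank(links)
top = max(ranks, key=ranks.get)
```

Because the ranks always sum to 1, they can be read as the stationary probabilities of the random walk mentioned above.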

Results and Discussion

• Term Co-location Analysis

Keywords with High Ranked Word Co-occurrence

Keyword      Word co-occurrence and LLR score
Gene         gene expression - 36947.5, gene ontology - 4729.7, expressed genes - 4115.5, genes involved - 3423.9, gene regulation - 1314.1
Genome       genome wide - 15485.4, whole genome - 5401.7, human genome - 2950.3, genome sequence - 1821.2, functional genomics - 1805.4
Expression   expression patterns - 4231.7, expression profiles - 6517.0, expression data - 3546.1, expression levels - 3187.4
Data         data sets - 6593.5, microarray data - 6305.9, expression data - 3546.1
Protein      protein interaction - 4824.8, protein interactions - 3186.5, protein coding - 2841.8, protein protein - 2719.8
Algorithm    clustering algorithm - 676.0, clustering algorithms - 585.0, new algorithm - 502.0, proposed algorithm - 416.7, alignment algorithms - 266.3
Database     public databases - 1309.8, relational database - 1296.7, database search - 363.4
Computer     computer simulations - 538.7, computer program - 317.2, computer aided - 278.6, computer science - 223.1, computational model - 221.9

Results and Discussion

• Term Relationship based on Latent Semantic Indexing

Term    Weight (based on LSI)

unlocking 0.546030797332329

musings 0.5412877797836922

cataloguing 0.4929569570474993

obama 0.48770037665119187

note 0.46722318828979653

interchanges 0.44477015248681573

pufferfish 0.4341307613774561

korea 0.4116133344695534

loving 0.40199584778314623

egenomics 0.3861955559200091

rickettsia 0.36268544163299155

ddra 0.3380806646805069

methane 0.33630023219143507

parahaemolyticus 0.335803654780268

mitochondrion 0.3299703122927084

natto 0.2974579417003239

Results and Discussion (Cont’d)

Top Ranked Word Pairs by LLR

Word pair                     LLR score
gene - expression             36947.5
amino - acid                  16483.9
genome - wide                 15485.4
high - throughput             14185.2
large - scale                 10554
binding - sites               9450.1
factor - transcription        8580.7
saccharomyces - cerevisiae    7867.8
E - coli                      6849.4
expression - profiles         6517
microarray - data             6305.9
expression - patterns         4231.7
expression - levels           3187.4

Results and Discussion (Cont’d)

• Of the 20,869 documents, 19,954 have corresponding MEDLINE records (95.6% matching). Of these 19,954 documents, 8,412 have MeSH terms (42.2%)

MeSH Term                                  Frequency
Animals                                    5178
Humans                                     4883
Computational Biology                      3070
Algorithms                                 2980
Gene Expression Profiling                  2702
Oligonucleotide Array Sequence Analysis    2192
Software                                   2154
Molecular Sequence Data                    1868
Models, Biological                         1579
Computer Simulation                        1568
Mice                                       1511
Sequence Analysis, DNA                     1489
Base Sequence                              1374
Genomics                                   1344
Evolution, Molecular                       1336
Databases, Genetic                         1325
Models, Genetic                            1289
Sequence Alignment                         1278
Proteins                                   1135

Results and Discussion (Cont’d)

• Topic Modeling

Topic 1: identification, signaling, using, cerevisiae, saccharomyces, genes, small, non, alignment, dna, cancer, yeast, network, genome, system, expression, genomic, activity, screening, specific
Topic 2: model, gene, mapping, protein, human, structural, between, computational, binding, based, genomes, structure, biology, tool, interactions, domains, role, cell, new, length
Topic 3: human, cells, detection, pathway, protein, from, dna, analysis, stem, elegans, recognition, structure, caenorhabditis, evolution, complex, gene, 1, induced, nuclear, strand
Topic 4: expression, profiling, regulatory, specific, mouse, transcriptional, molecular, dynamic, regulation, evolution, cancer, genes, comparative, sequence, early, support, discovery, proteins, machine, during
Topic 5: data, time, information, protein, classification, from, high, analysis, mass, microarray, throughput, based, algorithm, sequence, expressed, spectrometry, database, differentially, identifying, pcr

Results and Discussion (Cont’d)

• Topic Modeling

Topic 6: transcription, genomic, evolutionary, prediction, factor, sites, analysis, dna, coli, gene, genome, escherichia, acid, copy, number, binding, organization, evolution, estimation, arabidopsis
Topic 7: expression, gene, analysis, data, using, genes, microarray, control, from, human, cell, C, case, assessment, size, network, multiple, quality, transcriptional, cells
Topic 8: analysis, protein, networks, interaction, methods, based, genomics, web, genome, biology, genetic, biological, hiv, systems, sequence, data, bayesian, tool, structure, approach
Topic 9: gene, genetic, new, system, metabolism, chromosome, zebrafish, functional, annotation, open, associated, integrating, reveals, life, bacteria, transcriptome, among, loss, mammalian, microarrays
Topic 10: genome, using, gene, from, wide, sequences, data, method, large, whole, rna, disease, networks, pathways, short, drosophila, scale, alternative, regions, development

Results and Discussion (Cont’d)

Relationship between a paper and its citation


Publication productivity by year


Relationship between an author and the number of citations received

Results and Discussion (Cont’d)

• Important Articles Identified by PageRank

Rank. Title (Journal)

1. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs (Nucleic Acids Res)
2. Basic local alignment search tool (J Mol Biol)
3. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium (Nat Genet)
4. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice (Nucleic Acids Res)
5. R: A language and environment for statistical computing (Book)
6. Initial sequencing and analysis of the human genome (Nature)
7. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring (Science)
8. The Protein Data Bank (Nucleic Acids Res)
9. Bioconductor: open software development for computational biology and bioinformatics (Genome Biology)
10. Exploration, normalization, and summaries of high density oligonucleotide array probe level data (Biostatistics)


Research productivity by country


Research Productivity by Institute

Institute                                Frequency
University of California                 1678
Harvard Medical School                   811
Stanford                                 768
National Institutes of Health            430
University of Washington                 400
Yale University                          373
University College London                329
Massachusetts Institute of Technology    310
Washington University                    290
University of Toronto                    287
Wellcome Trust Genome Campus             256
University of Illinois                   252
University of Oxford                     248
University of Michigan                   240
University of Cambridge                  236
University of North Carolina             235
Princeton University                     234
Baylor College of Medicine               230
Columbia University                      229
Cornell University                       227


Visualization of Citation Graph

Visualization of citation network that shows highly cited papers and papers that cite them

Image Reference: Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

Summary and Future Work

• We have analyzed the field of Bioinformatics with text mining techniques and citation analysis
• We proposed several novel approaches to detect the knowledge structure of Bioinformatics
• We found that Bioinformatics has grown very fast and that collaboration among authors spreads widely across disciplines
• We also found that Bradford's law does not hold for Bioinformatics. Further analysis is needed to explain why Bioinformatics is a field to which Bradford's law is not applicable
• Future work
  • Fine-tune the visualization
  • Compare with Web of Science data

References

• Nagarajan M., Mohamed Idhris L., Chellappandi P., Kumaravel J.P.S. and Premalatha V. Information Use by Scholars in Bioinformatics: A Bibliometric View, 2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011)
• Church, K., and Hanks, P., Word Association Norms, Mutual Information and Lexicography, Computational Linguistics, Vol 16:1, pp. 22-29, (1991).
• Patra, S K, Mishra, S. (2006), Bibliometric study of bioinformatics literature, Scientometrics, 67: 477–489.
• Zhao, D. (2006) Towards All-Author Co-Citation Analysis, Information Processing and Management, 42: 1578-1591
• Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007.
• Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005.
• Brusic, V. (2007) The growth of bioinformatics, Briefings in Bioinformatics. VOL 8. NO 2. 69-70

References

• Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D., Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE transactions on information technology in biomedicine : a publication of the IEEE Engineering in Medicine and Biology Society 2007, 11(3): 237-243
• Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95.
• Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79:109-129.
• Blei, D., Ng A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
• Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society of Information Science and Technology 2011, doi: 10.1002/asi.21707

Questions?

• Thank you!

