How to carry out bioinformatic studies from a computational point of view
Qin Ma, Ph.D.
Bioinformatics and Mathematical Biosciences Lab
04/29/2016
Bioinformatics and Mathematical Biosciences Lab @ SDSU
Bioinformatics
• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.
– Temple Smith, Current Topics in Computational Molecular Biology (2002)
Bioinformatics
Genomics, Transcriptomics, Metabolomics, Metagenomics, Epigenomics, Proteomics, Interactomics, …
Omics data
Systems biology
Some Concepts in Bioinformatics
• Models
• Algorithms
• Programs/Tools
Seven Bridges of Königsberg
The problem was to find a walk through the city that would cross each bridge once and only once.
Its negative solution by Leonhard Euler in 1735 laid the foundations of graph theory.
[Figure: the city redrawn as a graph with nodes A, B, C, D (land masses) and edges (bridges)]
Problem: to find a path through the nodes that would cross each edge once and only once.
The answer is NO!
Model: Euler tour ⇔ no nodes of odd degree
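The Euler-tour model can be checked mechanically. The sketch below counts node degrees in the Königsberg multigraph and tests the "no nodes of odd degree" condition, together with connectivity. Which land mass gets which letter is an assumption made for this example, as are the function names.

```python
from collections import defaultdict


def degrees(edges):
    """Count edge ends per node; parallel edges (bridges) count separately."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def is_connected(edges):
    """Depth-first search over the nodes that appear in at least one edge."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adj)


def has_euler_tour(edges):
    """A connected graph has a closed Euler tour iff every degree is even."""
    return is_connected(edges) and all(d % 2 == 0 for d in degrees(edges).values())


# Seven bridges as edges. The assignment of letters to land masses is assumed:
# A is the island, B and C are the river banks, D is the remaining land mass.
KONIGSBERG = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

# A simple four-node cycle, for contrast.
SQUARE = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

if __name__ == "__main__":
    print(degrees(KONIGSBERG))
    print(has_euler_tour(KONIGSBERG), has_euler_tour(SQUARE))
```

All four nodes of the Königsberg graph have odd degree, so the check fails, while the four-node cycle passes.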
Mathematical models in Bioinformatics
Algorithm Design
I + XI = X
[Figure: the formula laid out in matchsticks, with the ten matches labeled a to j]
Problem: to move the smallest number of matches to make the formula correct
Famous algorithms in Bioinformatics
Smith–Waterman algorithm
• Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems.
--Temple Smith
BLAST
Michael Waterman
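As an illustration of the Smith–Waterman algorithm named above, here is a minimal sketch of local alignment by dynamic programming. It uses a linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions chosen for the example, not values from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                          # start a new local alignment
                H[i - 1][j - 1] + sub(i, j),
                H[i - 1][j] + gap,          # gap in b
                H[i][j - 1] + gap,          # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGGG"))
```

The zero in the recurrence is what makes the alignment local: a poorly scoring prefix is dropped rather than carried along.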
1. C/C++, Perl, R, Python, PHP, Java, etc.
Development of bioinformatic tools in BMBL
1. RNA-seq analysis pipeline
Funded by Scholarly Excellence Funds of SDSU
Pipeline steps: RNA-sequencing reads → read quality check → qualified reads mapping → replicate sample quality check → gene assembly → differential expression analysis → pathway enrichment analysis → cis-regulatory motif identification
Tools shown: FastQC, FastX-FastQ, HISAT2, Cufflinks, EdgeR, DMINDA
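The first command-line steps of such a pipeline could be scripted roughly as follows. This is a sketch under stated assumptions, not the lab's actual pipeline: the file names, the output layout, the function name build_commands, and the intermediate samtools sorting step are all hypothetical. The function only builds the command lines; nothing is executed.

```python
def build_commands(reads, index_prefix, out_dir):
    """Build (but do not run) commands for quality check, mapping, and assembly.

    reads        -- path to a single-end FASTQ file (hypothetical)
    index_prefix -- prefix of a prebuilt HISAT2 index (hypothetical)
    out_dir      -- existing output directory (hypothetical)
    """
    sam = f"{out_dir}/aligned.sam"
    bam = f"{out_dir}/aligned.sorted.bam"
    return [
        # Read quality check: FastQC writes its report into out_dir.
        ["fastqc", "-o", out_dir, reads],
        # Read mapping: --dta-cufflinks asks HISAT2 for Cufflinks-friendly output.
        ["hisat2", "--dta-cufflinks", "-x", index_prefix, "-U", reads, "-S", sam],
        # Assumed extra step: sort alignments by coordinate before assembly.
        ["samtools", "sort", "-o", bam, sam],
        # Gene assembly with Cufflinks.
        ["cufflinks", "-o", f"{out_dir}/cufflinks", bam],
    ]


if __name__ == "__main__":
    for cmd in build_commands("sample.fastq", "genome_index", "results"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run once the tools are installed. The later steps (differential expression with EdgeR, motif identification with DMINDA) are not command-line calls of this kind and are left out.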
2. Bi-clustering for identification of co-expressed genes under some conditions
[Figure: genes-by-conditions expression matrix before and after bi-clustering]
QUBIC: an R/Bioconductor package of qualitative biclustering for gene co-expression analysis
Juan Xie1, Qin Ma1,2,3
1 Department of Mathematics and Statistics, South Dakota State University, Brookings, SD; 2 Department of Plant Science, South Dakota State University, Brookings, SD; 3 BioSNTR, Brookings, SD.
1. Introduction
• Biclustering can discover the underlying structure of gene expression data (Fig.1), and it is a successful approach to gene co-expression analysis.
• QUBIC has been reviewed as one of the best biclustering algorithms (Eren et al., 2013).
• A web server was developed to make it accessible to general users (Zhou et al., 2012).
• This R package (QUBIC-R) provides an efficient, optimized implementation of QUBIC, with significantly improved efficiency and comprehensive functions.
• QUBIC-R is freely available online at http://bioconductor.org/packages/QUBIC/
2. Implementation
• The original QUBIC program was written in GNU C, which limits its portability. Memory leaks are another concern.
• In QUBIC-R, the C code was refactored into C++, the data structures were changed, and C pointers were replaced by STL containers.
• The structure of the core functions was optimized to facilitate package updates and further development.
• Consequently, the efficiency of the program has been significantly increased (Fig.2A).
• The output format of QUBIC-R can be used by other network analysis software, such as Cytoscape (Smoot et al., 2011).
3. Functions
Six functions are included in QUBIC-R:
• qudiscretize creates a discrete matrix for a given matrix.
• BCQU and BCQUD perform biclustering for continuous and discretized gene expression data, respectively.
• quheatmap draws a heatmap for any single predicted bicluster or for two biclusters (Fig.2 B, C).
• qunetwork creates co-expression networks based on the identified biclusters (Fig.2 D, E).
• qunet2xml converts the constructed networks into XGMML format for further analysis in Cytoscape (Fig.2 F, G), Biomax and Jnets.
4. Conclusion
• QUBIC-R implements the well-cited biclustering algorithm, QUBIC.
• Its source code has been optimized, improving on the efficiency of the original by 44%.
• It also provides integrated functions to visualize the identified biclusters and the corresponding co-expression networks.
• It offers output for further advanced analysis.
• QUBIC-R can be a powerful tool for gene expression data mining and co-expression network modeling.
5. References
• Eren, K., M. Deveci, O. Küçüktunç and Ü. V. Çatalyürek (2013). "A comparative analysis of biclustering algorithms for gene expression data." Briefings in Bioinformatics 14(3): 279-292.
• Li, G., Q. Ma, H. Tang, A. H. Paterson and Y. Xu (2009). "QUBIC: a qualitative biclustering algorithm for analyses of gene expression data." Nucleic Acids Research 37(15): e101.
• Zhou, F., Q. Ma, G. Li and Y. Xu (2012). "QServer: a biclustering server for prediction and assessment of co-expressed gene clusters."
• Smoot, M. E., K. Ono, J. Ruscheinski, P.-L. Wang and T. Ideker (2011). "Cytoscape 2.8: new features for data integration and network visualization." Bioinformatics 27(3): 431-432.
Acknowledgement
This work is supported by the State of South Dakota Research Innovation Center and the Agriculture Experiment Station of South Dakota State University.
Fig.1. Heatmap visualization of a gene expression matrix (genes by conditions) before and after biclustering.
Fig.2. (A) Comparison of CPU running time between QUBIC-R and QUBIC; (B) heatmap visualization for a single bicluster; (C) heatmap visualization for two biclusters; (D) co-expression network for a single bicluster; (E) co-expression network for two biclusters; (F) network for a single bicluster regenerated by Cytoscape; and (G) networks for two biclusters regenerated by Cytoscape.
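As an illustration of the kind of step qudiscretize performs, the sketch below turns each gene's expression values into -1 (low), 0 (unchanged), or 1 (high). The quantile rule, the function names discretize_row and discretize_matrix, and the parameter q are assumptions made for this example; they are not taken from the QUBIC-R source.

```python
def discretize_row(values, q=0.25):
    """Label the lowest q fraction of a gene's values -1 and the highest q fraction 1.

    Assumed rule for illustration only. q should be at most 0.5 so that the
    low and high groups do not overlap.
    """
    n = len(values)
    if n < 2:
        return [0] * n
    k = max(1, int(n * q))  # number of conditions labeled low (and high)
    # Condition indices ordered from lowest to highest expression.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for i in order[:k]:
        labels[i] = -1
    for i in order[-k:]:
        labels[i] = 1
    return labels


def discretize_matrix(matrix, q=0.25):
    """Apply discretize_row to every gene (row) of a genes-by-conditions matrix."""
    return [discretize_row(row, q) for row in matrix]


if __name__ == "__main__":
    expression = [
        [5.0, 1.0, 3.0, 9.0],   # gene 1 across four conditions
        [0.2, 0.9, 0.4, 0.1],   # gene 2
    ]
    for row in discretize_matrix(expression):
        print(row)
```

Working on ranks within each gene, rather than on raw values, makes genes with different expression scales comparable, which is the point of a qualitative representation.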
3. DNA motif identification and analyses
Development of Computational Tools in DNA motif identification and analyses
Jinyu Yang1, Qin Ma1,2,3
1 Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA; 2 Department of Plant Science, South Dakota State University, Brookings, SD, USA; 3 BioSNTR, Brookings, SD, USA
Background
Computational Identification of cis-regulatory motif
Transcription initiation is regulated through interactions between transcription factors (TFs) and their binding sites (motifs).
[Figure: Regulation of gene transcription. Labels: TF (key), cis-regulatory elements (keyholes), RNA polymerase]
The essence of our algorithms: assessing the possibility for each nucleotide in a given promoter to be in a motif.
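One simple way to picture a per-nucleotide assessment is a voting scheme: each motif-finding tool reports candidate sites, and every position in the promoter is scored by how many tools cover it. The sketch below is an illustration under that assumption, not the authors' scoring method; the function names, the tool names, and the interval format are hypothetical.

```python
def nucleotide_votes(promoter_length, predictions):
    """Count, for each position, how many tools predict a site covering it.

    predictions maps a tool name to a list of (start, end) intervals,
    0-based and half-open. Each tool votes at most once per position.
    """
    votes = [0] * promoter_length
    for sites in predictions.values():
        covered = set()
        for start, end in sites:
            # Clip intervals to the promoter boundaries.
            covered.update(range(max(0, start), min(promoter_length, end)))
        for pos in covered:
            votes[pos] += 1
    return votes


def candidate_regions(votes, min_votes):
    """Return maximal runs of positions whose vote count is at least min_votes."""
    regions, start = [], None
    for pos, v in enumerate(votes):
        if v >= min_votes and start is None:
            start = pos
        elif v < min_votes and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(votes)))
    return regions


if __name__ == "__main__":
    predicted = {
        "toolA": [(2, 6)],
        "toolB": [(4, 8)],
        "toolC": [(4, 5), (9, 12)],
    }
    votes = nucleotide_votes(10, predicted)
    print(votes)
    print(candidate_regions(votes, min_votes=2))
```

Plotting the votes along the promoter gives a curve of scores on each nucleotide, and the peaks of that curve mark candidate motif positions.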
DOOR 2.0: operon database
A complete and reliable operon database covering 2,072 bacterial genomes, with an overall accuracy of ~90% as evaluated by Brouwer (2008) in Briefings in Bioinformatics.
Regulon Prediction
A new computational framework and a novel graph model integrating the motif comparison and clustering for regulon prediction.
[Figure: regulon prediction workflow. Panel labels: Operon; Orthologous operons; Orthologous promoters; Phylogenetic footprinting motif finding; Motifs of operon A and motifs of operon B; Motif Similarity Evaluation; Construction of Co-regulation Graph; Vertex blow-up and Clustering (Cluster1, Cluster2, Meta-Cluster).]
Motif similarity scores between operons A and B: [ω_1,1, ω_1,2 (ω_max), …, ω_m,n]; in the example shown, ω_max = ω_1,2.
Co-regulation score: CRS(A, B) = (ω_max − ω) / σ
An outline of the regulon prediction framework
MP3: phylogenetic footprinting
A phylogenetic footprinting framework (MP3) for prokaryotes based on a new orthologous data preparation procedure and a novel promoter scoring and pruning method.
[Figure: MP3 workflow. Panels: collection of ortholog promoters (promoters of orthologous operons); motif voting by MDscan, MEME, CUBIC, CONSENSUS, BioProspector and BOBRO; predicted motifs; curve of scores on each nucleotide; curve fitting; graph model to cluster binding sites.]
References
1. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics, under review.
2. Bacterial regulon modeling and prediction based on systematic cis-regulatory motif analyses, B Liu, C Zhou, H Zhang, G Li, Q Liu, Q Ma, Scientific Reports, 2016.
3. DMINDA: an integrated web server for DNA motif identification and analyses, Q Ma, H Zhang, X Mao, C Zhou, B Liu, X Chen, Y Xu, Nucleic acids research, 42, W12-19, 2014.
4. DOOR 2.0: presenting operons and their functions through dynamic and integrated views, X Mao, Q Ma, C Zhou, X Chen, H Zhang, J Yang, F Mao, W Lai, Y Xu, Nucleic acids research 42 (D1), D654-D659, 2014.
5. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale, Q Ma*, B Liu*, C Zhou, Y Yin, G Li, Y Xu, Bioinformatics 29 (18), 2261-2268, 2012.