How to carry out bioinformatic studies from a computational point of view. Qin Ma, Ph.D., Bioinformatics and Mathematical Biosciences Lab @ SDSU, 04/29/2016


Page 1: How to carry outbioinformaticstudies from ... · Transcriptomics Metabolomics Metagenomics ... cis-regulatorymotifs identification in prokaryotic genomes. BMC Genomics, underreview


How to carry out bioinformatic studies from a computational point of view

Qin Ma, Ph.D.

Bioinformatics and Mathematical Biosciences Lab @ SDSU

04/29/2016


Bioinformatics

• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.

– Temple Smith, Current Topics in Computational Molecular Biology (2002)

[Figure: Bioinformatics, omics data (Genomics, Transcriptomics, Metabolomics, Metagenomics, Epigenomics, Proteomics, Interactomics, …) and systems biology]


Some Concepts in Bioinformatics

• Models

• Algorithms

• Programs/Tools


Problem: to find a path through the nodes that would cross each edge once and only once.

The answer is NO!

[Figure: graph with four nodes A, B, C and D]


Seven Bridges of Königsberg

The problem was to find a walk through the city that would cross each bridge once and only once.

Its negative solution by Leonhard Euler in 1735 laid the foundations of graph theory.

[Figure: map of the city with land masses A, B, C and D, and the corresponding graph with nodes A, B, C and D]

Model: Euler tour ⇔ no nodes of odd degree
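The degree criterion in the model can be checked in a few lines. The sketch below is illustrative only: the assignment of the labels A to D to particular land masses is an assumption (A is taken to be the island), and the functions assume the graph is connected. It also uses the standard companion result that an open walk crossing every edge once exists only when zero or two nodes have odd degree.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# Assumption: A is the island, joined to B and C by two bridges each
# and to D by one bridge; B and C are each joined to D by one bridge.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def degrees(edges):
    """Count how many edge ends touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def odd_nodes(edges):
    """Return the nodes of odd degree, sorted by name."""
    return sorted(n for n, d in degrees(edges).items() if d % 2 == 1)


def has_euler_tour(edges):
    """Closed walk using every edge once (graph assumed connected)."""
    return len(odd_nodes(edges)) == 0


def has_euler_path(edges):
    """Open or closed walk using every edge once (graph assumed connected)."""
    return len(odd_nodes(edges)) in (0, 2)


if __name__ == "__main__":
    print(dict(degrees(BRIDGES)))
    print(odd_nodes(BRIDGES))
    print(has_euler_tour(BRIDGES), has_euler_path(BRIDGES))
```

With this bridge layout all four nodes have odd degree, so neither a closed tour nor an open walk exists, which matches the negative answer on the slide.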


Mathematical models in Bioinformatics


Some Concepts in Bioinformatics

• Models

• Algorithms

• Programs/Tools


Algorithm Design

Problem: to move the smallest number of matches to make the formula correct

I + XI = X

[Figure: the formula laid out in matchsticks, with the individual matches labelled a to j]


Famous algorithms in Bioinformatics

Smith–Waterman algorithm

• Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems.

– Temple Smith

BLAST
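As an illustration of the first algorithm named on this slide, the following is a minimal sketch of Smith–Waterman local alignment with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -1) are arbitrary assumptions chosen for the example, not values taken from the talk, and the sketch returns only one best-scoring local alignment.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTT", "GGACGGG"))
```

The zero in the recurrence is what makes the alignment local: a poorly matching prefix is dropped rather than carried into the score.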


Michael Waterman


Some Concepts in Bioinformatics

• Models

• Algorithms

• Programs/Tools

1. C/C++, Perl, R, Python, PHP, Java, etc.


Development of bioinformatic tools in BMBL


1. RNA-seq analysis pipeline

Funded by Scholarly Excellence Funds of SDSU

Pipeline steps:

• RNA-sequencing reads

• Read quality check

• Qualified reads mapping

• Replicate sample quality check

• Gene assembly

• Differential expression analysis

• Pathway enrichment analysis

• cis-regulatory motif identification

Tools shown in the pipeline: FastQC, FastX-FastQ, HISAT2, Cufflinks, edgeR, DMINDA
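To make the read quality check step concrete, here is a minimal sketch of the kind of per-read statistic such a check is built on. It is not how FastQC or any other tool in the pipeline is implemented. It assumes four-line FASTQ records with Phred+33 quality encoding, and the cutoff of 20 on the mean quality is a hypothetical threshold chosen for the example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for k in range(0, len(lines) - 3, 4):
        # Line k is "@name", k+1 the bases, k+2 the "+" separator, k+3 the qualities.
        yield lines[k][1:], lines[k + 1], lines[k + 3]


def passing_reads(text, min_mean_q=20):
    """Return names of reads whose mean Phred score reaches min_mean_q."""
    kept = []
    for name, _seq, qual in parse_fastq(text):
        scores = phred_scores(qual)
        if scores and sum(scores) / len(scores) >= min_mean_q:
            kept.append(name)
    return kept


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
EXAMPLE = """@read1
ACGT
+
IIII
@read2
ACGT
+
!!!!
"""

if __name__ == "__main__":
    print(passing_reads(EXAMPLE))
```

Only reads that pass such a filter would move on to the mapping step as "qualified reads".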


2. Bi-clustering for identification of co-expressed genes under some conditions

[Figure: gene expression matrix (genes × conditions) before and after bi-clustering]


QUBIC: an R/Bioconductor package of qualitative biclustering for gene co-expression analysis

Juan Xie1, Qin Ma1,2,3

[email protected], [email protected]

1 Department of Mathematics and Statistics, South Dakota State University, Brookings, SD; 2 Department of Plant Science, South Dakota State University, Brookings, SD; 3 BioSNTR, Brookings, SD.

1. Introduction

• Biclustering can discover the underlying structure of gene expression data (Fig. 1); it is a successful approach to gene co-expression analysis.

• QUBIC has been reviewed as one of the best biclustering algorithms (Eren et al., 2013).

• A web server was developed to make QUBIC accessible to general users (Zhou et al., 2012).

• This R package (QUBIC-R) provides an efficient and optimized implementation of QUBIC, with significantly improved efficiency and comprehensive functions.

• QUBIC-R is freely available online at http://bioconductor.org/packages/QUBIC/

Fig. 1. Heatmap visualization of a gene expression matrix (genes × conditions) before and after biclustering.

2. Implementation

• The original QUBIC program was written in GNU C, which limits its portability; memory leaks are another concern.

• In QUBIC-R, the C code was refactored into C++, the data structures were changed, and C pointers were replaced by STL containers.

• The core function structures were optimized to facilitate package updates and further development.

• Consequently, the efficiency of the program has been significantly increased (Fig. 2A).

• The output format of QUBIC-R can be used by other network analysis software, such as Cytoscape (Smoot et al., 2011).

3. Functions

Six functions are included in QUBIC-R:

• qudiscretize creates a discrete matrix for a given matrix.

• BCQU and BCQUD perform biclustering for continuous and discretized gene expression data, respectively.

• quheatmap draws a heatmap for any single predicted bicluster or for two biclusters (Fig. 2B, C).

• qunetwork creates co-expression networks based on the identified biclusters (Fig. 2D, E).

• qunet2xml converts the constructed networks into XGMML format for further analysis in Cytoscape (Fig. 2F, G), Biomax and JNets.

Fig. 2. (A) Comparison of CPU running time between QUBIC-R and QUBIC; (B) heatmap visualization for a single bicluster; (C) heatmap visualization for two biclusters; (D) co-expression network for a single bicluster; (E) co-expression network for two biclusters; (F) network for a single bicluster regenerated by Cytoscape; and (G) networks for two biclusters regenerated by Cytoscape.

4. Conclusion

• QUBIC-R implements the well-cited biclustering algorithm, QUBIC.

• Its optimized source code improves on the efficiency of the original program by 44%.

• It also provides integrated functions to visualize the identified biclusters and the corresponding co-expression networks.

• It offers output for further advanced analysis.

• QUBIC-R can be a powerful tool for gene expression data mining and co-expression network modeling.

5. References

• Eren, K., M. Deveci, O. Küçüktunç and Ü. V. Çatalyürek (2013). "A comparative analysis of biclustering algorithms for gene expression data." Briefings in Bioinformatics 14(3): 279-292.

• Li, G., Q. Ma, H. Tang, A. H. Paterson and Y. Xu (2009). "QUBIC: a qualitative biclustering algorithm for analyses of gene expression data." Nucleic Acids Research 37(15): e101.

• Zhou, F., Q. Ma, G. Li and Y. Xu (2012). "QServer: a biclustering server for prediction and assessment of co-expressed gene clusters."

• Smoot, M. E., K. Ono, J. Ruscheinski, P.-L. Wang and T. Ideker (2011). "Cytoscape 2.8: new features for data integration and network visualization." Bioinformatics 27(3): 431-432.

Acknowledgement

This work was supported by the State of South Dakota Research Innovation Center and the Agriculture Experiment Station of South Dakota State University.
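The idea of turning a continuous expression matrix into a discrete one, which the poster attributes to qudiscretize, can be illustrated with a deliberately simplified sketch. This is not the QUBIC-R implementation or its API: the function names and the `tail_fraction` parameter are hypothetical, and the rule used here (label the lowest and highest values in each gene's row as -1 and +1, everything else as 0) is only an assumption about what a qualitative representation might look like.

```python
def discretize_row(values, tail_fraction=0.25):
    """Label the lowest values -1, the highest +1 and the rest 0.

    tail_fraction is a hypothetical parameter: the share of conditions
    placed in each tail (at least one condition per tail).
    """
    n = len(values)
    k = max(1, int(n * tail_fraction))
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for i in order[:k]:
        labels[i] = -1
    for i in order[-k:]:
        labels[i] = 1
    return labels


def discretize_matrix(matrix, tail_fraction=0.25):
    """Discretize each gene (row) independently across conditions (columns)."""
    return [discretize_row(row, tail_fraction) for row in matrix]


if __name__ == "__main__":
    # Two made-up genes measured under four conditions; the second gene is
    # ten times the first, so both share the same qualitative pattern.
    expression = [
        [5.0, 1.0, 3.0, 9.0],
        [50.0, 10.0, 30.0, 90.0],
    ]
    for row in discretize_matrix(expression):
        print(row)
```

Genes whose rows become identical after discretization over a subset of conditions are natural candidates for the same bicluster, even when their absolute expression levels differ.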


3. DNA motif identification and analyses


Development of Computational Tools in DNA motif identification and analyses

Jinyu Yang1, Qin Ma1,2,3

1 Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA; 2 Department of Plant Science, South Dakota State University, Brookings, SD, USA; 3 BioSNTR, Brookings, SD, USA

Background

Computational identification of cis-regulatory motifs

Transcription initiation is regulated through interactions between transcription factors (TFs) and their binding sites (motifs).

[Figure: Regulation of gene transcription. Key: TF; key holes: cis-regulatory elements; RNA polymerase.]

The essence of our algorithms: assessing the possibility for each nucleotide in a given promoter to be in a motif.

DOOR 2.0: operon database

A complete and reliable operon database covering 2,072 bacterial genomes, with an overall accuracy of ~90% as evaluated by Brouwer (2008) in Briefings in Bioinformatics.

Regulon Prediction

A new computational framework and a novel graph model that integrate motif comparison and clustering for regulon prediction.

[Figure: An outline of the regulon prediction framework. Panels: orthologous operons; orthologous promoters; phylogenetic footprinting motif finding; motif similarity evaluation between the motifs of operon A and the motifs of operon B (similarity scores ω_{1,1}, ω_{1,2}, …, ω_{m,n}, with maximum ω_max); construction of the co-regulation graph using the score CRS(A, B); vertex blow-up and clustering (Cluster 1, Cluster 2, meta-cluster).]

MP3: phylogenetic footprinting

A phylogenetic footprinting framework (MP3) for prokaryotes based on a new orthologous data preparation procedure and a novel promoter scoring and pruning method.

[Figure: MP3 workflow. Panel titles: orthologous operons and promoters; collection of orthologous promoters; motif prediction with MDscan, MEME, CUBIC, CONSENSUS, BioProspector and BOBRO; predicted motifs; motif voting; curve of scores on each nucleotide; curve fitting; graph model to cluster binding sites.]
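The motif voting and per-nucleotide scoring idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the MP3 scoring or pruning method: each tool contributes one vote to every promoter position covered by any of its predicted sites, and positions with enough votes are merged into candidate regions. The tool names, coordinates and the vote threshold are all hypothetical.

```python
def vote_per_nucleotide(promoter_length, predictions):
    """Count, for each position, how many tools predict a motif covering it.

    predictions maps a tool name to a list of (start, end) sites,
    0-based with an exclusive end.
    """
    votes = [0] * promoter_length
    for sites in predictions.values():
        covered = set()
        for start, end in sites:
            covered.update(range(max(0, start), min(promoter_length, end)))
        # One vote per tool per position, however many of its sites overlap there.
        for pos in covered:
            votes[pos] += 1
    return votes


def candidate_regions(votes, min_votes):
    """Merge consecutive positions with at least min_votes into (start, end) regions."""
    regions, start = [], None
    for pos, v in enumerate(votes):
        if v >= min_votes and start is None:
            start = pos
        elif v < min_votes and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(votes)))
    return regions


if __name__ == "__main__":
    # Hypothetical predictions from three tools on a 12-nucleotide promoter.
    predictions = {
        "tool_a": [(2, 7)],
        "tool_b": [(4, 9)],
        "tool_c": [(5, 8), (10, 12)],
    }
    votes = vote_per_nucleotide(12, predictions)
    print(votes)
    print(candidate_regions(votes, min_votes=2))
```

The vote profile plays the role of the "scores on each nucleotide" curve in the figure; a real implementation would weight tools and smooth or fit the curve rather than apply a fixed threshold.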

References

1. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics, under review.

2. Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses, B Liu, C Zhou, H Zhang, G Li, Q Liu, Q Ma, Scientific Reports, 2016.

3. DMINDA: an integrated web server for DNA motif identification and analyses, Q Ma, H Zhang, X Mao, C Zhou, B Liu, X Chen, Y Xu, Nucleic Acids Research, 42, W12-19, 2014.

4. DOOR 2.0: presenting operons and their functions through dynamic and integrated views, X Mao, Q Ma, C Zhou, X Chen, H Zhang, J Yang, F Mao, W Lai, Y Xu, Nucleic Acids Research 42 (D1), D654-D659, 2014.

5. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale, Q Ma*, B Liu*, C Zhou, Y Yin, G Li, Y Xu, Bioinformatics 29 (18), 2261-2268, 2012.