Analyzing digital gene expression data in Galaxy Supervisors: Peter-Bram A.C. ’t Hoen Kostas Karasavvas Students: Ilya Kurochkin Ivan Rusinov

Upload: andrea-lawrence

Post on 19-Jan-2016

Analyzing digital gene expression data in Galaxy

Supervisors:

Peter-Bram A.C. ’t Hoen

Kostas Karasavvas

Students:

Ilya Kurochkin

Ivan Rusinov

Galaxy

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

Adding a new tool to Galaxy

To add a new tool to Galaxy you need:

• A tool definition file in XML format

• The tool script
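As a rough illustration of the first item, the sketch below holds a minimal tool definition in a Python string and parses it to list the declared inputs and outputs. The element names (`tool`, `command`, `inputs`, `param`, `outputs`, `data`) follow the Galaxy tool XML schema as commonly documented; the tool id, the script name `count_tags.py` and the parameter names are hypothetical and are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool definition; ids, labels and the script name are illustrative only.
TOOL_XML = """\
<tool id="count_tags_example" name="Count tags (example)" version="0.1.0">
  <description>hypothetical example tool</description>
  <command>python '$__tool_directory__/count_tags.py' '$biomart' '$sam' '$output'</command>
  <inputs>
    <param name="biomart" type="data" format="tabular" label="BioMart format file"/>
    <param name="sam" type="data" format="sam" label="SAM format file"/>
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <help>Counts tags per annotated exon (illustrative only).</help>
</tool>
"""


def summarize_tool(xml_text):
    """Return the tool id plus the names of its declared inputs and outputs."""
    root = ET.fromstring(xml_text)
    inputs = [p.get("name") for p in root.findall("./inputs/param")]
    outputs = [d.get("name") for d in root.findall("./outputs/data")]
    return root.get("id"), inputs, outputs
```

The `command` element is what ties the definition file to the tool script: Galaxy substitutes the dataset paths for the `$`-prefixed names before running it.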

...

SAGE

• Sequence and count short tags representative of a transcript

• Absolute abundance of the transcript

Existing pipeline for analyzing DeepSAGE data

GAPSS: General analysis pipeline for second generation sequencers

Implemented in Galaxy

Some final steps were missing:

- Gene annotation (ENSEMBL/BioMart) and summarization

- Statistical analysis of differential gene expression

Existing workflow

Gene annotation and summarization

Tool for counting DeepSAGE tags in ENSEMBL-annotated exons.

Tool for automatically obtaining a BioMart format file.

Obtain BioMart format file

Count DeepSAGE tags in annotated exons

Input files:

1) BioMart format file:

2) SAM format file:
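For orientation, here is a minimal sketch of how the chromosome, strand and position of a tag could be pulled out of one SAM alignment line. It relies only on the standard SAM columns (FLAG in column 2, RNAME in column 3, POS in column 4) and the FLAG bits 0x4 (unmapped) and 0x10 (reverse strand); the function name `parse_sam_line` is hypothetical and this is not the authors' parser.

```python
def parse_sam_line(line):
    """Return (chromosome, strand, position) for a mapped read, else None."""
    if line.startswith("@"):
        return None  # header line, no alignment
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        return None  # read is unmapped
    chrom = fields[2]
    pos = int(fields[3])  # 1-based leftmost mapping position
    strand = "-" if flag & 0x10 else "+"
    return chrom, strand, pos
```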

Count DeepSAGE tags in annotated exons

Output file:

Count DeepSAGE tags in annotated exons

1. For each line in the SAM file, the whole BioMart file is read. (~1 second/line)

2. The BioMart file is loaded into a dictionary; the data are split by chromosome name and strand. (50 seconds for 10,000 lines)

3. The SAM file is loaded into a dictionary; the data are split by chromosome name, strand and genomic position. (16 seconds for 10,000 lines)

4. Support for working with several SAM files.

5. Both files are loaded into dictionaries. (16 seconds for 10,000 lines; ~16 minutes for 7,768,787 lines)

6. The BioMart dictionary is sorted by exon coordinates; overlapping and repeated exons cause problems.

7. A binary search for each SAM position in the sorted list of exon coordinates was implemented. (77 seconds for 7,768,787 lines)
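The last steps above can be sketched roughly as follows: exons are grouped by chromosome and strand, sorted by start coordinate, and each tag position is located with a binary search. This is an illustration under simplifying assumptions, not the authors' code: the function names and the tuple layouts are hypothetical, and exons within one chromosome/strand group are assumed not to overlap (the overlapping-exon problem from step 6 is not handled).

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(exons):
    """exons: iterable of (gene_id, chrom, strand, start, end), coordinates inclusive."""
    by_key = defaultdict(list)
    for gene_id, chrom, strand, start, end in exons:
        by_key[(chrom, strand)].append((start, end, gene_id))
    index = {}
    for key, items in by_key.items():
        items.sort()  # sort exons by start coordinate
        starts = [start for start, _, _ in items]
        index[key] = (starts, items)
    return index


def find_gene(index, chrom, strand, pos):
    """Binary search for the exon containing pos; assumes non-overlapping exons."""
    entry = index.get((chrom, strand))
    if entry is None:
        return None
    starts, items = entry
    i = bisect_right(starts, pos) - 1  # last exon starting at or before pos
    if i < 0:
        return None
    start, end, gene_id = items[i]
    return gene_id if start <= pos <= end else None


def count_tags(index, tags):
    """tags: iterable of (chrom, strand, pos); returns tag counts per gene."""
    counts = defaultdict(int)
    for chrom, strand, pos in tags:
        gene = find_gene(index, chrom, strand, pos)
        if gene is not None:
            counts[gene] += 1
    return dict(counts)
```

Each lookup costs O(log n) in the number of exons of one chromosome/strand group instead of a scan over the whole annotation, which is consistent with the drop in running time reported above.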

About R/Bioconductor

• R is a language and environment for statistical computing and graphics.

• Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.

Statistical analysis of differential gene expression

Tool for examining differential expression of replicated count data using the edgeR package from Bioconductor

Tool for estimating the variance in count data and testing for differential expression using the DESeq package from Bioconductor

Analysis of differentially expressed genes (edgeR)

Input files:

1. Output file of the DeepSAGE tags in annotated exons counter

2. Metadata file

Design matrix

Contrast vector: (1, -1, 0)

Generalized linear model
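To make the roles of the design matrix and the contrast vector concrete, the sketch below builds a one-column-per-group design matrix from sample labels and applies a contrast such as (1, -1, 0) to a vector of fitted group coefficients. It is written in plain Python for illustration and is not edgeR code; the group labels and coefficient values are invented, and the function names are hypothetical.

```python
def design_matrix(groups):
    """One indicator column per group level, one row per sample."""
    levels = sorted(set(groups))
    rows = [[1 if g == level else 0 for level in levels] for g in groups]
    return levels, rows


def apply_contrast(coefficients, contrast):
    """Linear combination of the fitted coefficients selected by the contrast."""
    return sum(b * c for b, c in zip(coefficients, contrast))


# Hypothetical example: six samples in three groups.
levels, rows = design_matrix(["A", "A", "B", "B", "C", "C"])
# Made-up coefficients for groups A, B, C on the log scale.
difference = apply_contrast([5.0, 3.0, 4.0], [1, -1, 0])
```

With this layout, the contrast (1, -1, 0) compares the first group against the second and ignores the third; the test for differential expression asks whether that combination differs from zero.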

Analysis of differentially expressed genes (edgeR)

Output file:

Analysis of differentially expressed genes (DESeq)

Test for differences between the base means of two levels

Input files:

1. Output file of the DeepSAGE tags in annotated exons counter

2. Metadata file

Create a CountDataSet object

Estimate the effective library size for a CountDataSet

Estimate the variance functions for a CountDataSet
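The effective library size step can be illustrated with the median-of-ratios idea that DESeq uses for its size factors: each sample's factor is the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. The sketch below is a plain-Python illustration of that idea, not the DESeq implementation; the function name and the input layout are assumptions, and genes containing a zero count are skipped.

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """counts: list of per-gene rows, each a list with one count per sample."""
    n_samples = len(counts[0])
    usable = []
    log_geo_means = []
    for row in counts:
        if all(c > 0 for c in row):  # skip genes with a zero count
            usable.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its factor puts libraries of different sequencing depth on a common scale before variances are estimated and the test is run.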

Analysis of differentially expressed genes (DESeq)

Output file:

Comparison of results obtained by edgeR and DESeq

Full workflow

Thank you for your attention

Any questions?