Differential Gene Expression in RNA-Seq Data for Oral Squamous Cell Carcinoma Using Bioconductor
19-Apr-15 1
By:
Kasturi P Chandwadkar
BBI 8th sem
BI-12
Overview
• Introduction
• Methodology
• Results
• Conclusion
• References
INTRODUCTION
• Oral squamous cell carcinoma (OSCC) accounts for about 90% of oral cancers, and its incidence increases with age.
• Techniques for assessing and quantifying RNA by high-throughput sequencing are collectively known as "RNA-Seq".
• RNA-Seq has been applied to profile the complex transcriptomes of mammalian samples, including human embryonic kidney cells and B-cells, mouse embryonic stem cells, blastomeres, and different mouse tissues.
ADVANTAGES OF RNA-SEQ
• One advantage of RNA-Seq over other profiling technologies such as microarrays is the ability to query all transcripts without prior knowledge of the location and structure of genes.
• RNA-Seq is not limited to detecting transcripts that correspond to existing genomic sequence.
• RNA-Seq has very low background signal because DNA sequences can be unambiguously mapped to unique regions of the genome.
R AND BIOCONDUCTOR PACKAGES
• R (http://cran.at.r-project.org) is a comprehensive statistical environment and programming language for professional data analysis and graphical display.
• Bioconductor (http://www.bioconductor.org/) provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, sequence and genome analysis.
• Packages used for differential gene expression:
• Biostrings
• biomaRt
• baySeq
• DESeq
• edgeR
Methodology
• RETRIEVAL OF NGS DATA
• The RNA-Seq data (FASTQ files) for oral squamous cell carcinoma were taken from Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo), accession number GSE20116.
• MAPPING OF GENOMIC READS
• The short reads are mapped (aligned) to the reference genome using Bowtie.
• GENERATING COUNT FILE
• A count file is a matrix in which each count is the number of reads mapped to a genomic region of the reference genome, and each Id is the annotation of that genomic region.
• GETTING DIFFERENTIALLY EXPRESSED GENES
• edgeR
• DESeq
• baySeq
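The count-file step can be sketched as follows. This is a minimal illustration in Python, not the script used in this work: it assumes SAM-formatted alignment lines and counts mapped reads per reference name (the third SAM column), treating that name as the region Id. The function name count_reads and the toy records are hypothetical.

```python
from collections import Counter

def count_reads(sam_lines):
    """Count mapped reads per reference name in SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # bitwise FLAG column
        ref_name = fields[2]    # reference sequence name
        if flag & 0x4 or ref_name == "*":
            continue  # read is unmapped
        counts[ref_name] += 1
    return counts

# Toy SAM records (hypothetical reads and region names).
toy_sam = [
    "@SQ\tSN:geneA\tLN:1000",
    "read1\t0\tgeneA\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tgeneA\t50\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tgeneB\t5\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
counts = count_reads(toy_sam)
for region_id, n in sorted(counts.items()):
    print(region_id, n)
```

Running one such count per sample and joining the results by Id would give the matrix layout described above.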
RNA-Seq analysis pipeline for detecting DGE
• Short reads
• Align reads to reference genome
• Prepare count file from SAM file
• Get differential gene expression with edgeR, DESeq and baySeq
• List of DEG from each package
• Venn diagram of DEG from the three packages
Results
• DATA EXPLORATION
Outlier in the data
edgeR
TOP 5 DIFFERENTIALLY EXPRESSED GENES
Gene id    logFC        P-value
KRT36      -8.103353    7.842049e-15
SFTPB      -8.120520    2.535246e-14
CA3        -6.443105    1.804193e-13
TNNC2      -6.431288    3.040273e-13
MAGEA11     8.881312    1.124744e-12
DESeq
TOP 5 UPREGULATED GENES
Gene id    logFC        P-value
FBP2       Infinite     1.576300e-05
TUSC5      Infinite     9.160142e-04
UTS2R      Infinite     1.520430e-03
ADIPOQ     7.231394     1.444721e-03
C6         7.162190     9.311805e-05
TOP 5 DOWNREGULATED GENES
Gene id    logFC        P-value
EMX1       -Infinite    1.765941e-03
VTCN1      7.289467     4.408178e-07
HOXD11     5.504204     2.041803e-04
HOXC8      5.503361     1.621344e-04
C5orf38    5.428227     9.407919e-05
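The logFC column is a log2 ratio of expression between the two conditions. The sketch below assumes a plain ratio of mean normalized counts and is not DESeq's internal code. It shows how a mean of zero in one condition gives an infinite fold change, which is one plausible reading of the "Infinite" entries above. The function name and the counts are hypothetical.

```python
import math

def log2_fold_change(mean_a, mean_b):
    """Return log2(mean_b / mean_a), handling zero means explicitly."""
    if mean_a == 0 and mean_b == 0:
        return float("nan")   # no expression in either condition
    if mean_a == 0:
        return math.inf       # expressed only in condition B
    if mean_b == 0:
        return -math.inf      # expressed only in condition A
    return math.log2(mean_b / mean_a)

print(log2_fold_change(10.0, 80.0))  # eightfold increase
print(log2_fold_change(0.0, 25.0))   # zero baseline
print(log2_fold_change(12.0, 0.0))   # zero in the second condition
```

Genes with an infinite logFC have no finite effect size, so they are usually ranked by p-value or inspected by their raw counts.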
baySeq
TOP 5 DIFFERENTIALLY EXPRESSED GENES
Gene id     Likelihood    FDR
RRAGD       0.9987850     0.001214965
TGFBR3      0.9981198     0.001547566
PYGM        0.9973711     0.001908003
SH3BGRL2    0.9973000     0.002106007
PLA2G2A     0.9972789     0.002229018
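The FDR values in the baySeq table are consistent with a running mean of (1 - likelihood) over the genes ranked by likelihood. The sketch below reproduces that calculation from the five likelihoods listed. It illustrates the relationship only and is not baySeq's own code; the function and variable names are hypothetical.

```python
# Likelihoods of the top five genes, as listed in the baySeq table.
likelihoods = [0.9987850, 0.9981198, 0.9973711, 0.9973000, 0.9972789]

def running_fdr(likelihoods):
    """Running mean of (1 - likelihood) down a list ranked by likelihood."""
    ranked = sorted(likelihoods, reverse=True)
    fdrs = []
    total = 0.0
    for rank, lik in enumerate(ranked, start=1):
        total += 1.0 - lik          # expected false discoveries so far
        fdrs.append(total / rank)   # divided by number of genes called
    return fdrs

for lik, fdr in zip(likelihoods, running_fdr(likelihoods)):
    print(f"{lik:.7f} {fdr:.6f}")
```

The computed values agree with the tabulated FDR column up to the rounding of the printed likelihoods.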
Venn Diagram Of DGE With P-value Less Than 0.01
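A Venn diagram of this kind can be built by keeping, for each package, the genes that pass the cutoff and then intersecting the resulting sets. The sketch below assumes each package's results are available as a mapping from gene id to p-value. The gene ids, values and function names are hypothetical, and since the baySeq results here are given as likelihood and FDR, that package would need an analogous cutoff on one of those columns.

```python
def significant(results, cutoff=0.01):
    """Return the set of gene ids whose value is below the cutoff."""
    return {gene for gene, p in results.items() if p < cutoff}

def venn_regions(a, b, c):
    """Split three gene sets into the seven regions of a Venn diagram."""
    return {
        "only_a": a - b - c,
        "only_b": b - a - c,
        "only_c": c - a - b,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "all_three": a & b & c,
    }

# Hypothetical per-package results (gene id -> p-value).
edger = {"g1": 0.001, "g2": 0.004, "g3": 0.2, "g4": 0.009}
deseq = {"g1": 0.002, "g2": 0.03, "g4": 0.005, "g5": 0.001}
bayseq = {"g1": 0.0005, "g5": 0.008, "g6": 0.5}

regions = venn_regions(significant(edger), significant(deseq), significant(bayseq))
for name, genes in regions.items():
    print(name, sorted(genes))
```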
Conclusion
• We have demonstrated that our DGE method can be successfully applied to RNA-Seq samples from tumor and matched normal tissues.
• Using three different statistical methods to infer differential gene expression in oral squamous cell carcinoma (OSCC), we found 215 genes common to all three packages.
• 1054 genes are common to edgeR and DESeq, 217 to DESeq and baySeq, and 278 to edgeR and baySeq.
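The pairwise overlaps can be related to the 215 genes shared by all three packages. Assuming each reported pairwise figure is the full size of that intersection, and so includes the 215 genes, the number of genes shared by exactly two packages follows by subtraction, as in this short sketch (variable names are hypothetical).

```python
# Overlap sizes as reported in the conclusion.
all_three = 215
pairwise = {
    ("edgeR", "DESeq"): 1054,
    ("DESeq", "baySeq"): 217,
    ("edgeR", "baySeq"): 278,
}

# Genes found by exactly two packages: remove the three-way overlap.
exactly_two = {pair: n - all_three for pair, n in pairwise.items()}
for (a, b), n in exactly_two.items():
    print(f"{a} and {b} only: {n}")
```

Under that assumption, 839 genes are shared only by edgeR and DESeq, 2 only by DESeq and baySeq, and 63 only by edgeR and baySeq.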
The table below lists some of the differentially expressed genes in the cancer sample that may be related to cancer.
Gene id    Description
KRT36      Keratin, type I cuticular
ADIPOQ     Adiponectin, C1Q and collagen domain containing
PLA2G2A    Phospholipase A2, group IIA (platelets, synovial fluid)
CEACAM7    Carcinoembryonic antigen-related cell adhesion molecule
SPINK7     Serine peptidase inhibitor, Kazal type 7 (putative); esophagus cancer related gene 22
ALDH1A2    Aldehyde dehydrogenase 1 family, member
ENDOU      Endonuclease, polyU-specific
ANGPTL1    Angiopoietins
GDF10      Growth differentiation factor 10
TUSC5      Tumor suppressor candidate 5