1 cse dept., university of connecticut 2 immunology dept., uchc

1
A nalysis Pipeline G enom e sequence mRNA reads Consensus coding sequences CCDS M apping G enom e M apping Read merging Epitope Prediction M erged m apped reads Expressed SN Ps CCDS m apped reads G enom e m apped reads Imm unogenic mutations K now n SN Ps SN P calling A pproaches to Phasing W e m axim ize the num berofaccurately m apped reads by using M aq to m ap them againstboth the reference genom e and reference transcripts (from C C D S database) Forcom bining read m apping results w e im plem ented tw o approaches called hard m erging and softm erging H ard m erging throw s aw ay reads thatare m apped uniquely by one procedure and to m ultiple places by the otherw hile softm erging keeps the unique alignm entforthese reads Both m erging m ethods keep reads m apped uniquely to the sam e place by both m apping procedures and reads m apped by one procedure and notm apped by the other Both m ethods throw aw ay reads m apped uniquely to differentplaces by both m apping procedures and reads m apped m ultiple tim es by both procedures. M apping R eads SN P C alling M ethods Binom ial: binom ialtest used in [3,7] for calling SN Ps from genom ic D NA Binom ial probability based on the tw o highestallele counts underthe null hypothesis thatthe genotype is heterozygous Posterior: uses base quality scores and read mapping probabilities C onditionalprobability of observing read data given each possible genotype G is computed as a product of read contributions, assuming independence between reads. A base b mapped with errorprobability e b contributes as follow s: ForG =XX,1-e b ifX = b,e b /3 otherw ise ForG =XY,X≠Y,(1-e b )/2+e b /6 ifX=b orY=b,e b /3 otherw ise M aq mapping probabilities taken into account by raising the corresponding term to the probability that the read is mapped correctly atthis location The posterior probability of each genotype is then evaluated assum ing uniform priors. A variant is called in this approach if the genotype with highest posterior probability is different than hom ozygous reference and exceeds a userspecified threshold R esults on C ancerC ellLine R eads 2591 3859 19874 29561 102567 softmerged 2204 3393 19045 28532 100215 hardmerged 1879 2924 16712 25333 93499 genom e 2777 4121 20006 29623 102874 transcripts 0.999 0.99 0.95 0.9 0.1 Threshold 2351 3186 4447 5114 7371 softmerged 2001 2773 3973 4609 6775 hardmerged 1727 2422 3505 4093 6108 genom e 2508 3376 4661 5351 7638 transcripts 0.999 0.99 0.95 0.9 0.1 Threshold Alt.Coverage 1 Alt.Coverage 3 R esults on H apm ap R eads 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 0 200 400 600 800 1000 1200 False Positives True Positives Posterior Binomial Exact Binomial Cumulative M aq 0 500 1000 1500 2000 2500 3000 3500 0 20 40 60 80 False Positives True Positives Posterior Binomial Exact Binomial Cum ulative 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 0 100 200 300 400 False Positives True Positives Transcripts G enom e Hard M erge SoftM erge 0 500 1000 1500 2000 2500 3000 3500 0 5 10 15 20 25 30 35 False Positives True Positives Transcripts G enom e Hard M erge SoftM erge Alt.coverage 1 Alt.coverage 3 Validation R esults •Predicted im m unogenic m utations are currently being validated by Sangersequencing •Forconfirm ed m utations,peptide im m unogenicitywill be validated experim entally Proposed pipeline has im proved accuracy fordetecting m utations com pared to previous m ethods O ngoing w ork includes refining the posteriorm ethod to increase m utation detection robustness in the presence ofdifferential allelic expression and detecting short im m unogenic indels C onclusions & O ngoing W ork ExperimentalSetup •W e tested the perform ance ofim plem ented m ethods on 63 m illion Illum ina m R N A reads generated from blood cell tissue of H apm ap individual N A12878 [2](N C BISR A database accession num berSR X000566) •W e included in evaluation H apm ap SNPs in know n exons for w hich there w as atleastone m apped read by any m ethod •H apm ap genotypes forthese SN Ps:22,362 hom ozygous reference,7,893 heterozygous orhom ozygous variant •True positives:called SN Ps for w hich H apm ap genotype is heterozygous orhom ozygous variant •False positives:called SN Ps forw hich H apm ap genotype is hom ozygous reference •W e also ran ouranalysis pipeline on 6.75 m illion Illum inareads from m R N A isolated from a m ouse cancertum orcell line Im m unotherapy is a prom ising cancertreatm entapproach that relies on aw akening the im m une system to the presence of antigens associated w ith tum orcells •The success ofthis approach depends on the ability to reliably detectim m unogenic cancerm utations,the vastm ajority ofw hich are expected to be tum or-specific [6] •In this posterw e presenta bioinform atics pipeline fordetecting im m unogenic cancerm utations from high throughputm RNA sequencing data •Im m unogenic m utations predicted by ourpipeline from Illum ina m R N A reads generated from a m ouse cancertum orcell line are currently underexperim ental validation Introduction Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing Jorge Duitama 1 , Ion Mandoiu 1 , and Pramod Srivastava 2 1 CSE Dept., University of Connecticut 2 Immunology Dept., UCHC

Upload: dragon

Post on 11-Feb-2016

27 views

Category:

Documents


1 download

DESCRIPTION

Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing Jorge Duitama 1 , Ion Mandoiu 1 , and Pramod Srivastava 2. 1 CSE Dept., University of Connecticut 2 Immunology Dept., UCHC. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: 1 CSE Dept., University of Connecticut 2 Immunology Dept., UCHC

Analysis Pipeline

Genome sequence

mRNA reads

Consensus coding

sequences

CCDSMapping

Genome Mapping

Read merging

EpitopePrediction

Merged mapped

reads

Expressed SNPsCCDS

mapped reads

Genome mapped

reads

Immunogenic mutations

Known SNPs

SNP calling

Approaches to Phasing

• We maximize the number of accurately mapped reads by using Maq to map them against both the reference genome and reference transcripts (from CCDS database)

• For combining read mapping results we implemented two approaches called hard merging and soft merging– Hard merging throws away reads that are mapped uniquely by one

procedure and to multiple places by the other while soft merging keeps the unique alignment for these reads

– Both merging methods keep reads mapped uniquely to the same place by both mapping procedures and reads mapped by one procedure andnot mapped by the other

– Both methods throw away reads mapped uniquely to different places by both mapping procedures and reads mapped multiple times by both procedures.

Mapping Reads

SNP Calling Methods•Binomial: binomial test used in [3,7] for calling SNPs from genomic DNA

– Binomial probability based on the two highest allele counts under the null hypothesis that the genotype is heterozygous

•Posterior: uses base quality scores and read mapping probabilities

– Conditional probability of observing read data given each possible genotype G is computed as a product of read contributions, assuming independence between reads. A base b mapped with error probability eb contributes as follows:• For G=XX, 1-eb if X = b, eb/3 otherwise• For G=XY, X≠Y, (1-eb)/2+eb/6 if X=b or Y=b, eb/3 otherwise

– Maq mapping probabilities taken into account by raising the corresponding term to the probability that the read is mapped correctly at this location

– The posterior probability of each genotype is then evaluated assuming uniform priors. A variant is called in this approach if the genotype with highest posterior probability is different than homozygous reference and exceeds a user specified threshold

Results on Cancer Cell Line Reads

259138591987429561102567softmerged

220433931904528532100215hardmerged

18792924167122533393499genome

277741212000629623102874transcripts

0.9990.990.950.90.1Threshold

23513186444751147371softmerged

20012773397346096775hardmerged

17272422350540936108genome

25083376466153517638transcripts

0.9990.990.950.90.1Threshold

Alt. Coverage 1

Alt. Coverage 3

Results on Hapmap Reads

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 200 400 600 800 1000 1200

False Positives

Tru

e P

osi

tive

s

Posterior

Binomial Exact

Binomial Cumulative

Maq

0

500

1000

1500

2000

2500

3000

3500

0 20 40 60 80

False Positives

Tru

e P

osi

tive

s

Posterior

Binomial Exact

Binomial Cumulative

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 100 200 300 400

False Positives

Tru

e P

osi

tive

s

Transcripts

Genome

Hard Merge

Soft Merge

0

500

1000

1500

2000

2500

3000

3500

0 5 10 15 20 25 30 35

False Positives

Tru

e P

osi

tive

s

Transcripts

Genome

Hard Merge

Soft Merge

Alt. coverage 1 Alt. coverage 3

Validation Results

•Predicted immunogenic mutations are currently being validated by Sanger sequencing

•For confirmed mutations, peptide immunogenicity will be validated experimentally

• Proposed pipeline has improved accuracy for detecting mutations compared to previous methods

• Ongoing work includes refining the posterior method to increase mutation detection robustness in the presence of differential allelic expression and detecting short immunogenic indels

Conclusions & Ongoing Work

Experimental Setup•We tested the performance of implemented methods on 63 million Illumina mRNA reads generated from blood cell tissue of Hapmap individual NA12878 [2] (NCBI SRA database accession number SRX000566)

•We included in evaluation Hapmap SNPs in known exons for which there was at least one mapped read by any method

•Hapmap genotypes for these SNPs: 22,362 homozygous reference, 7,893 heterozygous or homozygous variant

•True positives: called SNPs for which Hapmap genotype is heterozygous or homozygous variant

•False positives: called SNPs for which Hapmap genotype is homozygous reference

•We also ran our analysis pipeline on 6.75 million Illumina reads from mRNA isolated from a mouse cancer tumor cell line

• Immunotherapy is a promising cancer treatment approach that relies on awakening the immune system to the presence of antigens associated with tumor cells

• The success of this approach depends on the ability to reliablydetect immunogenic cancer mutations, the vast majority of which are expected to be tumor-specific [6]

• In this poster we present a bioinformatics pipeline for detecting immunogenic cancer mutations from high throughput mRNA sequencing data

• Immunogenic mutations predicted by our pipeline from IlluminamRNA reads generated from a mouse cancer tumor cell line are currently under experimental validation

Introduction

Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing

Jorge Duitama1, Ion Mandoiu1, and Pramod Srivastava2

1CSE Dept., University of Connecticut2Immunology Dept., UCHC