r for clinical breast cancer research · 2015-12-04 · ki biobank, institutionen för medicinsk...
TRANSCRIPT
R FOR CLINICAL BREAST CANCER RESEARCH
R in Genomics, 2015-03-23, Daniel Klevebring, PhD
1
DATA FLOW AND BIOINFORMATICS IN CLINICAL CANCER RESEARCH… AND SOME R STUFF.
2
CLINICAL SEQUENCING OF CANCER
Most overused slide in genomics
3
4
ClinSeq overall objectives and short term goals
Overall objectives
• Perform disease subtyping at primary diagnosis with the goal to replace and improve standard diagnostics
• Ensure rapid adoption of new research findings
• Ensure rapid inclusion of patients in clinical trials
Short term goals
• Include first prospective cases in breast cancer and AML by Q1 2015
• Develop a clinical pipeline for cancer genomics with capacity to handle 5000+ cancers yearly
ClinSeq pipeline
5
Patient value
Research opportunities!
6
ClinSeq clinical collaborations and funding
Clinical collaborations
• Cancer types: AML, breast cancer, ovarian cancer, colorectal cancer, pancreatic cancer
– Sören Lehmann, hematologist
– Christer Nilsson, hematologist
– Jonas Bergh, oncologist
– Johan Hartman, pathologist
– Kamila Czene, epidemiologist
– Lorand Kis, pathologist
– Jan Frisell, surgeon
– Irma Fredriksson, surgeon
– Henrik Falconer, surgeon
– Hanna Dahlstrand, oncologist
– Joseph Carlson, pathologist
– Anna Martling, surgeon
– Maria Gustavsson Liljefors, oncologist
– Sam Ghazi, pathologist
– Matthias Löhr, professor
– Caroline Verbeke, pathologist
– Marco Del Chiaro, surgeon
Funding
[Funder logos, not recoverable from the slide]
This can happen within one patient
[Figure: tumor (frozen piece / RNAlater / FFPE), normal tissue and blood samples, with sampling sites labeled 1 to 5]
Examples
• Multifocal tumor
• Contralateral BC
• Primary tumor and local met (node)
• Primary tumor and recurrence
• Primary tumor and metastasis
[Scanned form 1: KI Biobank referral for sampling, "ClinSeq Bröst - Tumörpreparat" (ClinSeq Breast, tumor specimen). Ver: 2015-03-03, ClinSeq-Bröst_tumör]
• Fields: participant's personal identity number (YYMMDD-NNNN), sampling date (YYYY-MM-DD), time of sample handling at pathology (HH:MM), side (left/right), sex (female/male), number of tissue pieces, Kombikakod, referral ID, sampler's signature, comment
• Barcode labels: tissue1 to tissue4 (Tissue 1 to 4, Sample 1 to 4). Take the right label for the right sample.
• Filled in by the sampler and the pathologist. Machine-read, so write clearly.
• Sampling instructions: Check the identity and fill in the details on the referral. Do not forget the date and time of surgery and of the pathology work. Label 1-4 tubes with barcode labels. NB: use the right tube for the right tumor. In the pieces field, enter the number of tissue pieces taken (1-4).
• Tumor specimen:
1. Take a scraping from the tumor and wipe it off on the inside of the rim of an RNAlater tube.
2. Screw on the lid and gently invert the tube about 10 times.
3. Place in a refrigerator (+4 °C).
• Contact person: Carin Cavalli-Björkman, tel: 08-524 867 97
• Receiving biobank: KI Biobank (reg. no. 222), Karolinska Institutet
• Approved by the Ethical Review Board (Etikprövningsnämnden), dnr: 2011/1020-32 and 2013/1833-31/2
• KI Biobank, Institutionen för Medicinsk Epidemiologi och Biostatistik, Karolinska Institutet, Box 281, 171 77 Stockholm. Tel: 08-524 823 77, ki.se/kibiobank

[Scanned form 2: KI Biobank referral for sampling, "ClinSeq Bröst - Blodprov" (ClinSeq Breast, blood sample). Ver. 2015-03-03, ClinSeq-Bröst_blod_Karolinska Solna]
• Fields: participant's personal identity number (YYMMDD-NNNN), sampling date (YYYY-MM-DD), sampling time (24-hour clock), sex (female/male), Kombikakod, referral ID, sampler's signature, sampler's telephone number, comment
• Barcode label: blood1. Place the label straight on the tube.
• Filled in by the sampler. Machine-read, so write clearly.
• Instructions for blood sampling, Karolinska University Hospital Solna:
1. Check the identity and fill in the details on the referral. Do not forget the date and time.
2. Label one tube with a barcode label.
Tube: one 4 mL EDTA tube, purple cap.
4. Draw the sample as for an ordinary venipuncture.
5. Gently invert the tube about 10 times.
6. Send the sample to KI Biobank according to the usual routines.
• For questions, contact Carin Cavalli-Björkman, tel: 08-524 867 97
• Receiving biobank, ethics approval and KI Biobank address are the same as on the tumor referral.
Referrals ClinSeq Breast
[Diagram components: path referral, surgeon referral and blood referral; KI Biobank; nightly push from KI Biobank via cFTP through the MEB firewall; local MSSQL db with tables BloodRefTbl, TissueRefTbl and ResultsTbl; SciLifeLab (seq + bioinfo), also connected via cFTP]
10
Bioinformatics preprocessing
• ≈1000 patients analyzed to date
• 15 Gb raw data / TN pair
• Analysis ≈ 100 CPUh / TN pair
• Uses accepted best practices for each data type
• Open source tools
• Based on GATK Queue
• Generation of reference files and tool-specific indices
GRCh37 aka 1000kg aka GATK bundle ref
• Included chromosomes
  – GRCh37 1-22
  – X
  – Y
    • Masked pseudoautosomal regions (PARs)
      – PAR1 chrY:10001-2649520 ↔ chrX:60001-2699520
      – PAR2 chrY:59034050-59363566 ↔ chrX:154931044-155260560
    • X-PARs will look diploid for men
  – MT - NC_012920.1
  – GL000191-249 (unplaced contigs)
  – Decoy sequences (d5, incl NC_007605 (EBV)) from
    • ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/
  – Source file that we use: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37_decoy.fasta.gz
track name='PARs' description='PARs'
chrX 60001 2699520
chrX 154931044 155260560
chrY 10001 2649520
chrY 59034050 59363566
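The PAR coordinates in the track can be used to flag variants that fall inside a pseudoautosomal region. The following is a minimal sketch in base R, not the ClinSeq implementation: `in_par()` is a hypothetical helper, and the coordinates are treated as 1-based and inclusive to match the chrY:10001-2649520 notation on the slide (an assumption, since strict BED is 0-based and half-open).

```r
# PAR coordinates copied from the track on the slide (GRCh37).
pars <- data.frame(
  chrom = c("chrX", "chrX", "chrY", "chrY"),
  start = c(60001, 154931044, 10001, 59034050),
  end   = c(2699520, 155260560, 2649520, 59363566),
  name  = c("PAR1", "PAR2", "PAR1", "PAR2"),
  stringsAsFactors = FALSE
)

# in_par() is a hypothetical helper: TRUE when a position lies inside a PAR.
# Assumption: start and end are 1-based and inclusive.
in_par <- function(chrom, pos, regions = pars) {
  mapply(function(ch, p) {
    any(regions$chrom == ch & regions$start <= p & regions$end >= p)
  }, chrom, pos, USE.NAMES = FALSE)
}

# Two positions on Y and one on X; the second lies outside both Y-PARs.
hits <- in_par(c("chrY", "chrY", "chrX"), c(15000, 3000000, 155000000))
print(hits)
```

A check like this is one way to pre-filter Y-PAR variants before tools that compare REF against the masked (N) reference sequence.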
Parsing variation resource files
• dbSNP, Cosmic, ClinVar, ExAC are all great resources
• None of them follow the VCF4.1 spec, which they all claim to do
• The problem:
12
[Diagram: dbSNP VCF, ClinVar VCF, Cosmic VCF and ExAC VCF → "Use these files to annotate this file" → Sample VCF]
"How hard could this possibly be?"
Issues with annotating variants
• Multiallelic variants
  – Multiple ALT alleles on a single VCF line ("T,A")
    • (1) Some tools check the whole ALT string for identity
      – Does "A" equal "A,T" → no → move on
      – WRONG
    • (2) Some tools ignore multiallelic variants
      – Some GATK tools
    • (3) Some tools only check chr/pos for identity, without regarding REF and ALT alleles
      – Some GATK tools here as well
• Solution: Split multiallelic variants into multiple lines and avoid tools in category (2)
  – Must learn how each tool operates
• Måns Magnusson's vcf_parser correctly handles FORMAT, INFO and genotypes when splitting multiallelic variants
  – https://github.com/moonso/vcf_parser
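To make the failure mode in (1) concrete, here is a small base R sketch of splitting multiallelic records so that each row carries one ALT allele. It only handles the ALT column; unlike vcf_parser it does not touch INFO, FORMAT or genotypes. The variant table and the helper names `split_multiallelic()` and `key()` are made up for illustration.

```r
# Toy variant table; the rows are made up for illustration.
variants <- data.frame(
  chrom = c("1", "1"),
  pos   = c(1000L, 2000L),
  ref   = c("G", "C"),
  alt   = c("T,A", "G"),
  stringsAsFactors = FALSE
)

# split_multiallelic() is a hypothetical helper: one output row per ALT allele.
split_multiallelic <- function(v) {
  alts <- strsplit(v$alt, ",", fixed = TRUE)
  idx  <- rep(seq_len(nrow(v)), lengths(alts))  # repeat each row once per allele
  out  <- v[idx, , drop = FALSE]
  out$alt <- unlist(alts)
  rownames(out) <- NULL
  out
}

biallelic <- split_multiallelic(variants)

# With one ALT per row, an exact match on chrom/pos/ref/alt is meaningful.
key <- function(v) paste(v$chrom, v$pos, v$ref, v$alt, sep = ":")
found_after_split  <- "1:1000:G:A" %in% key(biallelic)
found_before_split <- "1:1000:G:A" %in% key(variants)  # compares against "T,A"
print(c(found_after_split, found_before_split))
```

Matching on the full chrom/pos/ref/alt key after splitting avoids both the whole-string comparison in (1) and the position-only comparison in (3).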
13
Left alignment of indels
• Indels can have multiple correct representations
• Consensus strategy is to "left align"
• GATK LeftAlignAndTrimVariants
  – When splitMultiallelics is set to true, the INFO field is not correctly split, and genotypes are dropped (set to ./.)
  – Breaks if any indel > 200 bps is found
  – Doesn't check that REF allele matches the reference sequence (by default)
  – (tested with GATK 3.3-0; bugs reported, so can change)
• bcftools norm
  – Requires that REF matches the reference sequence
  – Cosmic has some variants that map to the Y-PARs, which have reference sequence NNNN in the build we use
    • (Related https://github.com/arq5x/gemini/issues/346 )
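The idea of left alignment can be sketched in a few lines of base R. This is a simplified illustration for a single biallelic variant on a short toy sequence, not the bcftools or GATK implementation. `left_align()` and the `genome` string are hypothetical, positions are 1-based, and the REF check at the top mirrors the requirement that REF matches the reference sequence.

```r
# Toy reference with a CA repeat (made up for illustration).
genome <- "GGGCACACACAGGG"

# left_align() is a hypothetical helper for one biallelic variant.
left_align <- function(pos, ref, alt, genome) {
  # REF must match the reference, otherwise shifting makes no sense.
  stopifnot(substr(genome, pos, pos + nchar(ref) - 1) == ref)
  repeat {
    changed <- FALSE
    # Drop the last base while both alleles end with the same base.
    if (nchar(ref) > 0 && nchar(alt) > 0 &&
        substr(ref, nchar(ref), nchar(ref)) == substr(alt, nchar(alt), nchar(alt))) {
      ref <- substr(ref, 1, nchar(ref) - 1)
      alt <- substr(alt, 1, nchar(alt) - 1)
      changed <- TRUE
    }
    # If an allele became empty, extend both one base to the left.
    if (nchar(ref) == 0 || nchar(alt) == 0) {
      stopifnot(pos > 1)  # the sketch does not handle variants at the contig start
      pos  <- pos - 1
      base <- substr(genome, pos, pos)
      ref  <- paste0(base, ref)
      alt  <- paste0(base, alt)
      changed <- TRUE
    }
    if (!changed) break
  }
  # Trim shared leading bases while both alleles keep at least one base.
  while (nchar(ref) > 1 && nchar(alt) > 1 &&
         substr(ref, 1, 1) == substr(alt, 1, 1)) {
    ref <- substr(ref, 2, nchar(ref))
    alt <- substr(alt, 2, nchar(alt))
    pos <- pos + 1
  }
  list(pos = pos, ref = ref, alt = alt)
}

# A CA deletion written in the middle of the repeat...
res <- left_align(8, "CAC", "C", genome)
# ...moves to the leftmost equivalent representation.
str(res)
```

On this toy input the deletion at position 8 (CAC → C) is shifted to position 3 (GCA → G), which describes the same resulting sequence.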
14
External VCF prep pipeline
• vcfsorter.pl: sort in dict order
• vcf_parser: split multiallelics
• bcftools norm: left align
• bedtools intersect: filter Y-PAR variants
vcfsorter.pl by German Leparc from https://code.google.com/p/vcfsorter/
Adrian Tan et al, Bioinformatics (2015) doi: 10.1093/bioinformatics/btv112
Panel
Low-‐pass WGS
RNAseq
Sharing is caring
– [Ongoing work]
– We want our data to be accessible to others
– Two tiers required
  • Open access for non-personal data
    – Tumor-specific alterations, gene expression levels, some phenotypes
  • Controlled access
    – Genetic data (considered personal data by Swedish law)
    – Certain sensitive phenotypes
– Let's not reinvent the wheel. Others do this well
  • ICGC
– Can we submit our data to EGA?
– Legal and/or consent issues?
17
R in ClinSeq
• Bioconductor
  – CNAnorm, QDNAseq for CNV analyses
  – Rmarkdown and Sweave for report generation
• RStudio
  – Rmarkdown/knitr + git to version control project-specific analyses
• Hadleyverse
  – ggplot2/devtools/dplyr/reshape/tidyr/data.table
18
Hadleyverse
• Use cases:
  – data.table reduces reading time of RNAseq data from 30 min to <2 min
    • 400 files, each a matrix of 2 x 50000
  – tidyr and reshape enable rapid ...
  – ggplot2 makes beautiful plots with powerful syntax
  – magrittr enables piping in R
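The data.table use case, many small two-column files combined into one table, could look roughly like the sketch below. It assumes the data.table package is installed. The file layout (a `gene` and a `count` column, one file per sample) is an assumption for illustration; the slides do not show the real RNAseq files.

```r
library(data.table)

# Create a few small per-sample files in a temp dir (made-up data,
# standing in for the 400 two-column files mentioned on the slide).
tmp <- tempfile("counts_")
dir.create(tmp)
for (s in c("S1", "S2", "S3")) {
  fwrite(data.table(gene = paste0("gene", 1:5), count = 1:5),
         file.path(tmp, paste0(s, ".tsv")), sep = "\t")
}

files <- list.files(tmp, pattern = "\\.tsv$", full.names = TRUE)
names(files) <- sub("\\.tsv$", "", basename(files))

# fread() each file, then stack them; idcol records the source file name.
long <- rbindlist(lapply(files, fread), idcol = "sample")

# Reshape to one row per gene and one column per sample.
wide <- dcast(long, gene ~ sample, value.var = "count")
print(wide)
```

Reading with `fread()` and stacking once with `rbindlist()` avoids growing a data frame file by file, which is where much of the time usually goes with many small inputs.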
19
Bioconductor
• Repo for HT-biology-related R packages
  – Statistical and graphical methods
  – Genome annotation
• bioconductor.org
• Updated twice per year
• Focus on packages with vignettes
• In a pipeline structure, runnable scripts are needed
  – Wouldn't it be nice if packages shipped with a #!Rscript included?
20
Time to befriend getopt
library(getopt)
# set variables
# format is c(long, short, argmask, datatype, desc)
# argmask 0=no arg, 1=req, 2=optional
args <- rbind(
  c("bam", "b", 1, "character", "Input bam file"),
  c("output", "o", 1, "character", "Output tsv"),
  c("background", "x", 1, "character", "Background set to use, as a RData file"))
opts <- getopt(args)
# opts$bam
# opts$output
# opts$background
21
getopt, check parameters
# check cli parameters
if (is.null(opts$bam)) {
  stop("Must specify input bam file --bam/-b.")
}
if (is.null(opts$output)) {
  stop("Must specify output tsv file name --output/-o.")
}
if (is.null(opts$background)) {
  stop("Must specify background file --background/-x.")
}
22
Write to file, sense .gz suffix
## Write to outfile, gzip if the outfile name ends with .gz.
## dat is the data frame produced earlier in the script.
cat("Writing outfile...\n")
is_gz <- grepl("\\.gz$", opts$output)
# Use a gzip connection for .gz names, otherwise the plain file name.
ofile <- if (is_gz) gzfile(opts$output, "w") else opts$output
write.table(dat, ofile, col.names=TRUE, dec=".", quote=FALSE, sep="\t", row.names=FALSE)
if (is_gz) {
  close(ofile)
}
23
oncoprints in R
24
Pretty print matrices of genomics data
As seen on cbioportal.org
https://github.com/dakl/oncoprint
oncoprints in R
library(devtools)
install_github("dakl/oncoprint")
library(oncoprint)
library(ggplot2)  # for theme(), element_text() and coord_fixed()
data(tcga_brca)  # load example data
# vertical x-labels
vert_x <- theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
oncoprint(tcga_brca) + coord_fixed() + vert_x
25