NGS data processing: Bioinformatics tips, tools of the trade and pipeline writing. Na Cai, 4th year DPhil in Clinical Medicine. Supervisor: Jonathan Flint

Post on 22-Dec-2015

TRANSCRIPT

  • Slide 1
  • NGS data processing: Bioinformatics tips, tools of the trade and pipeline writing
      - Na Cai, 4th year DPhil in Clinical Medicine
      - Supervisor: Jonathan Flint
  • Slide 2
  • Example projects
      - CONVERGE
          - 1.7x whole genome sequencing in 12,000 Han Chinese women
          - 6,000 cases of MD, 6,000 controls
          - Detailed questionnaire
          - 45 TB of sequencing data
      - Commercial Outbred Mice
          - 0.1x whole genome sequencing in 2,000 mice
          - Known breeding history
          - Extensive phenotyping
          - 2 TB of sequencing data
  • Slide 3
  • NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
  • Slide 4
  • Large-scale sequencing projects
      - Lots of data: terabytes!
          - Storage problems, I/O problems, RAM problems
          - Time consuming to process
      - Errors! Lots of them!
          - Contamination
          - Duplication
          - Missing data
          - Difficult regions/features of the genome
  • Slide 5
  • Approach to NGS data
      - Explore the data before processing at large scale
      - Pilot your experiments with small subsets
      - Try the default parameters of a piece of software before altering them
      - Check the output
          - Right number of lines?
          - Did anything fail silently?
          - Are different classes of input handled differently?
          - How are missing values coded?
          - What percentage of records failed?
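The output checks listed above can be automated for tab-delimited files. The following is a minimal sketch, not a general validator: the function name `check_output`, the report keys, and the set of missing-value codes are all assumptions made for illustration, and the codes in particular should be replaced with whatever the tool in question actually emits.

```python
from collections import Counter

# Hypothetical set of strings a tool might use to code missing values.
MISSING_CODES = {"", "NA", "nan", "NaN", "."}


def check_output(path, expected_lines, sep="\t"):
    """Summarise a delimited text file so silent failures are easier to spot."""
    n_lines = 0
    field_counts = Counter()  # number of fields on a line -> lines with that count
    missing = Counter()       # missing-value code -> times seen
    with open(path) as handle:
        for line in handle:
            n_lines += 1
            fields = line.rstrip("\n").split(sep)
            field_counts[len(fields)] += 1
            for value in fields:
                if value in MISSING_CODES:
                    missing[value] += 1
    return {
        "right_number_of_lines": n_lines == expected_lines,
        "n_lines": n_lines,
        # More than one distinct field count suggests truncated or malformed lines.
        "ragged": len(field_counts) > 1,
        "missing_codes": dict(missing),
    }
```

Running this on a small pilot subset first makes it cheap to learn how a tool codes missing values before the full dataset is processed.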
  • Slide 6
  • Exploratory work in R
      - read.table(<file>, as.is=TRUE, na.strings=c("NA", "nan"))
      - dim(), str(), mode(), complete.cases()
      - head(), tail()
      - table(), summary()
      - order(), rank()
      - plot(), library(ggplot2)
      - library(plyr)
  • Slide 7
  • Pipeline writing
      - Arguments/options for different inputs
      - Arguments/options for parameters and auxiliary files
      - Reusable functions
      - Reasonably flexible recognition of input formats
      - Set up for parallelizing
      - Write to stderr for debugging and checking progress, but beware of its size and I/O cost!
      - Create new directories as you go along
      - Create flag files to indicate successful completion of each step
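Several of these points (command-line options, creating directories as you go, flag files) fit into a very small skeleton. This is a sketch under stated assumptions: the option names `--input`, `--outdir` and `--min-length`, the `.done` flag suffix, and the helper names are hypothetical and do not come from the talk.

```python
import argparse
import os


def parse_args(argv=None):
    """Options for inputs and parameters, so one script serves many datasets."""
    parser = argparse.ArgumentParser(description="Toy pipeline step runner")
    parser.add_argument("--input", required=True, nargs="+", help="input files")
    parser.add_argument("--outdir", default="pipeline_out", help="output directory")
    parser.add_argument("--min-length", type=int, default=0, help="example parameter")
    return parser.parse_args(argv)


def flag_path(outdir, step):
    """Location of the flag file that marks a step as successfully completed."""
    return os.path.join(outdir, step + ".done")


def run_step(outdir, step, func, *args):
    """Run func unless its flag file exists; write the flag only on success."""
    os.makedirs(outdir, exist_ok=True)  # create directories as you go along
    flag = flag_path(outdir, step)
    if os.path.exists(flag):
        return "skipped"
    func(*args)  # an exception here propagates, so no flag file is written
    with open(flag, "w") as handle:
        handle.write("ok\n")
    return "ran"
```

Because the flag is written only after `func` returns, a crashed or killed step is rerun on the next invocation, while completed steps are skipped.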
  • Slide 8
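  • Make
      - Specify the input file and the output file
      - Specify the command that turns the input into the output
      - Make checks whether the output file already exists (and is newer than its inputs) before running the command
      - Make deletes the output of a command that was interrupted before it finished (and, with .DELETE_ON_ERROR, of a command that failed)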
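The two Make behaviours described here, skipping work whose output already exists and removing output from a command that did not finish, can be imitated in a few lines of Python. This is only an illustration of the idea for a single rule, not a reimplementation of Make; `make_target` and its arguments are hypothetical names, and the "command" is assumed to be a Python callable taking the input and output paths.

```python
import os


def make_target(output_path, input_path, command):
    """Build output_path from input_path, mimicking two behaviours of Make."""
    # Skip the command if the output exists and is at least as new as the input.
    if os.path.exists(output_path) and (
        os.path.getmtime(output_path) >= os.path.getmtime(input_path)
    ):
        return "up to date"
    try:
        command(input_path, output_path)
    except BaseException:
        # Remove partial output so a later run cannot mistake it for a finished file.
        if os.path.exists(output_path):
            os.remove(output_path)
        raise
    return "built"
```

Catching `BaseException` rather than `Exception` means that an interrupt (Ctrl-C) also cleans up the partial file before the error is re-raised.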
  • Slide 9
  • Ruffus
      - http://www.ruffus.org.uk
      - Flexible: one-to-many and many-to-one processes
      - Fully integrated with Python programming
      - You need to specify only the maximum number of cores allowed for parallelisation
      - Useful printout options to check the pipeline
  • Slide 10
  • Setting up Ruffus
  • Slide 11
  • Once Ruffus is set up - Help
  • Slide 12
  • Once Ruffus is set up just print
  • Slide 13
  • NGS data processing Taken from: http://www.broadinstitute.org/gatk/guide/best-practices
  • Slide 14
  • Processing a raw BAM file
      - Practical concerns
          - Number of samples
          - Size of files
          - Run time
          - Server/cluster usage: how the jobs can be parallelized
      - Scientific concerns
          - Ploidy of genome
          - Source of DNA
          - Features of genome
          - Variation between samples
          - Genome coverage
          - Error rates
  • Slide 15
  • Manipulating a BAM file
      - Converting between BAMs and FASTQs
      - Indexing
      - Coordinate sorting
      - Splitting or merging
      - Filtering out reads using bitwise flags or other criteria
      - Masking entire regions
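Filtering on bitwise flags works by testing the FLAG column (the second column of a SAM record) against a mask. The sketch below operates on SAM text lines using only the standard library; the bit values are those defined in the SAM specification, while the function names and the default mask are assumptions chosen for illustration. The effect is comparable to `samtools view -F <mask>`.

```python
# Bit values as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

# Hypothetical default: drop a read if any of these bits is set.
DEFAULT_EXCLUDE = (
    FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_QC_FAIL | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY
)


def keep_read(sam_line, exclude_mask=DEFAULT_EXCLUDE):
    """Return True if none of the bits in exclude_mask is set in the read's FLAG."""
    flag = int(sam_line.split("\t")[1])
    return (flag & exclude_mask) == 0


def filter_sam(lines, exclude_mask=DEFAULT_EXCLUDE):
    """Yield header lines unchanged and only those alignments that pass the mask."""
    for line in lines:
        if line.startswith("@") or keep_read(line, exclude_mask):
            yield line
```

For example, a FLAG of 1123 is 99 + 1024, so the duplicate bit (0x400) is set and the read is dropped under the default mask.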
  • Slide 16
  • Example: Contaminants
  • Slide 17
  • Slide 18
  • Useful Resource: Harvard Sysbio
      - Remove duplicate sequences in FASTA
      - Remove short sequences in FASTA
      - Format FASTA
      - http://archive.sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Protocols/Sequences.html
  • Slide 19
  • Useful Resource: NGSUtils
      - Tools (in Python) for FASTA, BAM, BED and GTF file processing
      - e.g. bamutils filter can filter out reads with more than x mismatches
      - http://ngsutils.org
  • Slide 20
  • Useful Resource: PicardTools
      - Tools (in Java) for BAM and FASTA processing
      - Cool tools: SamToFastq, MergeSamFiles, ValidateSamFile, ReplaceSamHeader, MarkDuplicates
      - Cool options: SORT_ORDER, CREATE_INDEX, CREATE_MD5_FILE, VALIDATION_STRINGENCY
      - http://broadinstitute.github.io/picard
  • Slide 21
  • Useful Resource: GATK
      - Tools (in Java) for NGS processing and analysis
      - Cool things about it: Best Practices page, Forum, Tutorials, Presentations
      - https://www.broadinstitute.org/gatk/
  • Slide 22
  • Useful Resource: GATK http://www.broadinstitute.org/gatk/guide/best-practices
  • Slide 23
  • Indel Realignment http://www.broadinstitute.org/gatk/guide/best-practices
  • Slide 24
  • Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
  • Slide 25
  • Why Realign Around Indels? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
  • Slide 26
  • How does it work?
      - Identified intervals:
          - Known indels
          - Indels discovered in the original alignments (in the CIGAR strings of reads in BAM files)
          - Reads where there is evidence of possible misalignment
      - http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
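To make "indels discovered in the CIGAR strings" concrete, the sketch below walks a CIGAR string and reports where insertions and deletions fall on the reference. It illustrates the CIGAR encoding only and is not GATK's interval-finding code; the function name and the convention for reporting an insertion's position (the next reference base after it) are assumptions.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def indels_from_cigar(pos, cigar):
    """Return (reference_position, op, length) for each I or D in the CIGAR.

    pos is the 1-based leftmost mapping position (the SAM POS field). A deletion
    is reported at its first deleted base; an insertion at the reference base
    that follows it.
    """
    indels = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("I", "D"):
            indels.append((ref_pos, op, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return indels
```

For a read with POS 100 and CIGAR `10M2I5M3D20M`, this returns a 2-base insertion before reference position 110 and a 3-base deletion starting at 115.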
  • Slide 27
  • The Indel Realigner Workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-2-Realignment.pdf
  • Slide 28
  • Implementing RealignerTargetCreator
      - [Figure: reads from samples 1-7 across sites 1-8]
      - The RealignerTargetCreator needs as many reads as possible from all the samples at a particular site to determine whether reads tend to get misaligned there
      - So you need to pass in the data for all samples at the same time
  • Slide 29
  • Slide 30
  • Implementing IndelRealigner
      - [Figure: reads from samples 1-7 across sites 1-8]
      - Once the intervals are identified, reads from any single sample can be realigned individually, based on the sample's own insertion/deletion lengths
      - So you only need to pass in one sample's data at a time
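The difference between the two realignment steps determines how jobs can be split across a cluster: RealignerTargetCreator jobs each see every sample, so they can be split by region, whereas IndelRealigner jobs can be split by sample. The following sketch shows the two job layouts as plain data; `plan_jobs` and the job dictionary keys are hypothetical, and no actual GATK command lines are constructed.

```python
def plan_jobs(samples, regions, split_by):
    """Return a list of jobs, each naming the samples and regions it reads.

    split_by="region": one job per region, each reading all samples
                       (suits steps that pool evidence across samples).
    split_by="sample": one job per sample, each covering all regions
                       (suits steps that treat samples independently).
    """
    if split_by == "region":
        return [{"samples": list(samples), "regions": [r]} for r in regions]
    if split_by == "sample":
        return [{"samples": [s], "regions": list(regions)} for s in samples]
    raise ValueError("split_by must be 'region' or 'sample'")
```

With seven samples and eight sites, as in the slide's diagram, splitting by region gives eight jobs of seven samples each, and splitting by sample gives seven jobs of one sample each.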
  • Slide 31
  • Slide 32
  • Base Quality Score Recalibration (BQSR) http://www.broadinstitute.org/gatk/guide/best-practices
  • Slide 33
  • Why BQSR? http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
  • Slide 34
  • The BQSR workflow http://www.broadinstitute.org/gatk/events/2038/GATKwh0-BP-3-Base_recalibration.pdf
  • Slide 35
  • Implementing BaseRecalibrator
      - [Figure: reads from samples 1-7 across sites 1-8]
      - The BaseRecalibrator needs all reads from each sample at all unmasked sites to come up with the recalibration table for the dataset
      - So you need to pass in all of the data for each sample
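As a rough picture of what a recalibration table holds, the sketch below compares the quality a sequencer reported with the mismatch rate actually observed, using the Phred relation Q = -10 log10(p). It is a deliberately simplified illustration and not GATK's model, which bins on further covariates; the function names, the input format, and the add-one smoothing are assumptions.

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def empirical_quality_table(observations):
    """Map each reported quality to the quality implied by observed mismatches.

    observations: iterable of (reported_quality, is_mismatch) pairs for bases
    at sites not known to vary, so mismatches are assumed to be errors.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [bases seen, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (n_bases, n_mismatch) in counts.items():
        # Add-one smoothing avoids log10(0) when no mismatches were seen.
        error_rate = (n_mismatch + 1) / (n_bases + 2)
        table[reported_q] = round(phred(error_rate), 1)
    return table
```

The table is only reliable when every bin has many observations, which is one way to see why the step needs all of a sample's data rather than a slice of it.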
  • Slide 36
  • Slide 37
  • Variant Calling http://www.broadinstitute.org/gatk/guide/best-practices
  • Slide 38
  • Variant Calling http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5-Variant_calling.pdf
  • Slide 39
  • Implementing Variant Calling
      - [Figure: reads from samples 1-7 across sites 1-8]
      - The UnifiedGenotyper (and many other callers) needs as many reads as possible from all the samples at a particular site to determine whether there is a variant at that site
      - So you need to pass in the data for all samples at a particular site at the same time
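Because calling needs all samples at a site but sites can be processed independently, a common way to parallelize is to cut the genome into windows and run one multi-sample job per window. The sketch below generates such windows; the function names and the example chromosome lengths are hypothetical, and the `chrom:start-end` string is assumed to be the interval format expected by the caller in use.

```python
def genome_windows(chrom_lengths, window_size):
    """Yield (chrom, start, end) windows, 1-based and inclusive, covering each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, window_size):
            end = min(start + window_size - 1, length)
            yield chrom, start, end


def as_interval(chrom, start, end):
    """Format a window as a chrom:start-end interval string."""
    return "{}:{}-{}".format(chrom, start, end)
```

For instance, `[as_interval(*w) for w in genome_windows({"chr1": 250}, 100)]` gives `chr1:1-100`, `chr1:101-200` and `chr1:201-250`; the last window is shorter so that no window runs past the end of the chromosome.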
  • Slide 40
  • Slide 41
  • Useful Resource: Variant Callers
  • Slide 42
  • Acknowledgements
      - Jonathan Flint, Richard Mott
      - Robbie Davies, Winni Kretzschmar
      - Kiran Garimella (GATK)
      - Leo Goodstadt (Ruffus)
      - Gerton Lunter (Stampy)
      - Andy Rimmer (Platypus)
      - Zam Iqbal (Cortex)
      - John Broxholme (all software help and maintenance)
      - Jon Diprose, Robert Esnouf (Clusters)
      - Tim Bardsley, Mark Gibbons, Ruth Porter (IT support)