
Next-Generation Sequencing: Quality Control and Mapping

BaRC Hot Topics – January 2015

Bioinformatics and Research Computing, Whitehead Institute

http://barc.wi.mit.edu/hot_topics/

Outline

•  Quality control
•  Preprocessing
•  Read mapping
    –  Non-spliced alignment
    –  Spliced alignment
•  Post-process the mapped read files
    –  Remove unmapped reads, sort, index, etc.
    –  Mapping statistics

Illumina data format

•  FASTQ format: each read is a four-line record

    @ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
    GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG
    +ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
    hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh

    Line 1: @ followed by the sequence identifier (/1 or /2 marks the mate in paired-end data)
    Line 2: sequence
    Line 3: + followed by any description
    Line 4: quality values, one character per base

Input qualities    Illumina versions
--solexa-quals     <= 1.2
--phred64          1.3-1.7
--phred33          >= 1.8

http://en.wikipedia.org/wiki/FASTQ_format
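
The table above can be turned into a rough check. The following Python sketch guesses the encoding from the range of quality characters. The function name and the returned labels are illustrative assumptions, and the thresholds are the ASCII ranges described on the FASTQ format page linked above (Phred+33 starts at '!', Solexa+64 at ';', Phred+64 at '@'). FastQC reports the encoding as well, so this is only meant to show the reasoning.

```python
def guess_quality_encoding(quality_strings):
    """Guess the quality encoding from the observed ASCII range (heuristic)."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < ord(";"):
        # Characters below ';' (59) only occur with an offset of 33.
        return "phred33"
    if highest <= ord("J"):
        # Everything fits in the overlap of both ranges: cannot decide.
        return "ambiguous"
    # Characters above 'J' (74) imply an offset of 64.
    # Solexa scores can be negative, so they may dip below '@' (64).
    return "solexa" if lowest < ord("@") else "phred64"


example = "hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh"
print(guess_quality_encoding([example]))
```

For the read shown above, every quality character lies between 'd' and 'h', so the sketch reports phred64, which corresponds to the --phred64 row of the table.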

Check read quality with fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

1.  Run fastqc to check read quality

    bsub fastqc sample.fastq

2.  Open the output file: "fastqc_report.html"

Fastqc report

We have to know the quality encoding to use the appropriate parameter in the mapping step.

FastQC: per base sequence quality

[Figure: box plot of quality per read position, with background bands for very good quality calls, reasonable quality, and poor quality]

Red: median; blue: mean; yellow: 25%, 75%; whiskers: 10%, 90%
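
As a rough illustration of what the plot summarizes, this Python sketch computes the median and mean quality at each read position. It is not how FastQC is implemented; the function name and the default Phred+33 offset are assumptions made for the example.

```python
from statistics import mean, median


def per_base_quality(quality_strings, offset=33):
    """Return (position, median, mean) for each 1-based read position."""
    longest = max((len(quality) for quality in quality_strings), default=0)
    summary = []
    for position in range(longest):
        # Only reads that are long enough contribute to this position.
        scores = [ord(quality[position]) - offset
                  for quality in quality_strings
                  if len(quality) > position]
        summary.append((position + 1, median(scores), mean(scores)))
    return summary


for row in per_base_quality(["II5", "I5"]):
    print(row)
```

A drop in the median toward the end of the reads is the usual sign that quality trimming or filtering is worth considering.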

Preprocessing tools

•  Fastx Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
    –  FASTQ/A Trimmer: shortens reads in a FASTQ or FASTA file (removing barcodes or noise).
    –  FASTQ Quality Filter: filters sequences based on quality.
    –  FASTQ Quality Trimmer: trims (cuts) sequences based on quality.
    –  FASTQ Masker: masks nucleotides with 'N' (or another character) based on quality.
    (For a complete list, go to the link above.)
•  cutadapt to remove adapters (https://code.google.com/p/cutadapt/)

What preprocessing do we need?

•  Flagged Kmer Content: about 100% of the first six bases are the same sequence -> use "FASTQ Trimmer"
•  Bad quality -> use "FASTQ Quality Filter" and/or "FASTQ Quality Trimmer"
•  Overrepresented sequences -> if the overrepresented sequence is an adapter, use "cutadapt"

Sequence                                  Count    Percentage          Possible Source
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCA  7360116  82.88507591015895   RNA PCR Primer, Index 3 (100% over 40bp)
GCGAGTGCGGTAGAGGGTAGTGGAATTCTCGGGTGCCAAG  541189   6.094535921273932   No Hit
TCGAATTGCCTTTGGGACTGCGAGGCTTTGAGGACGGAAG  291330   3.2807783416601866  No Hit
CCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGG  210051   2.365464495397192   RNA PCR Primer, Index 3 (100% over 38bp)

Examples of preprocessing I (hands-on exercise)

Remove reads with lower quality

    bsub fastq_quality_filter -v -q 20 -p 75 -i sample.fastq -o sample_filtered.fastq

    -i: input file
    -o: output file
    -v: report number of sequences
    -q 20: the quality value required
    -p 75: the percentage of bases that have to have that quality value

Trim the reads

    # Delete the first 6 nt from the 5' end
    bsub fastx_trimmer -v -f 7 -l 36 -i sample.fastq -o sample_trimmed.fastq

    -f: first base to keep
    -l: last base to keep
    -i: input file
    -o: output file
    -v: report number of sequences
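
To make the two rules concrete, here is a minimal Python sketch of what "-q 20 -p 75" and "-f 7 -l 36" mean for a single read. It is not the Fastx Toolkit code; the function names and the Phred+64 default offset are assumptions chosen for illustration.

```python
def passes_quality_filter(quality, min_quality=20, min_percent=75, offset=64):
    """True if at least min_percent of the bases have quality >= min_quality."""
    if not quality:
        return False
    scores = [ord(ch) - offset for ch in quality]
    good = sum(1 for score in scores if score >= min_quality)
    # Integer arithmetic avoids rounding problems at the threshold.
    return good * 100 >= min_percent * len(scores)


def trim_read(sequence, quality, first=7, last=36):
    """Keep bases first..last (1-based, inclusive), like -f and -l."""
    return sequence[first - 1:last], quality[first - 1:last]


sequence = "GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG"
quality = "h" * 30 + "B" * 6  # 30 good bases, 6 poor ones (Phred+64)
print(passes_quality_filter(quality))
print(trim_read(sequence, quality))
```

With -f 7 the first six bases are dropped, so a 36 nt read becomes 30 nt long.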

Examples of preprocessing II (hands-on exercise)

•  Remove adapter/linker

    # cutadapt usage
    bsub "cutadapt -m 20 -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG sample2.fastq | fastx_artifacts_filter > sample2_trimFilt.fastq"

    -a: sequence of an adapter that was ligated to the 3' end
    -b: sequence of an adapter that was ligated to the 5' or 3' end
    -g: sequence of an adapter that was ligated to the 5' end
    -m: minimum read length; shorter reads are discarded after trimming
    -o: output file name
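
The idea behind removing a 3' adapter can be sketched in a few lines of Python. This is a deliberately simplified version that only accepts exact matches, whereas cutadapt tolerates sequencing errors; the function name and the min_overlap parameter are hypothetical.

```python
def trim_3prime_adapter(sequence, quality, adapter, min_overlap=3):
    """Remove an exactly matching 3' adapter, complete or cut off at the read end."""
    cut = sequence.find(adapter)
    if cut == -1:
        cut = len(sequence)
        # Look for the longest adapter prefix that ends the read.
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    return sequence[:cut], quality[:cut]


adapter = "GATCGGAAGAGC"
print(trim_3prime_adapter("ACGTACGTGATCGGAAG", "I" * 17, adapter))
```

Reads that end up shorter than a chosen minimum after trimming would then be discarded, which is what -m 20 does in the command above.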

Recommendations for preprocessing

•  Treat all the samples the same way.
•  Watch out for preprocessing that results in very different read lengths across samples, as that can affect mapping.
•  If you have paired-end reads, make sure you still have both reads of each pair after the processing is done.
•  Run fastqc on the processed samples to see if the problem has been removed.
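
The paired-end recommendation can be checked by comparing read names between the two files. The sketch below assumes the /1 and /2 naming convention from the FASTQ example earlier in these slides; the function names are hypothetical.

```python
def mate_key(header):
    """Read name without the leading '@' and the /1 or /2 mate suffix."""
    name = header.split()[0].lstrip("@")
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def split_paired(headers_1, headers_2):
    """Return (names present in both files, only in file 1, only in file 2)."""
    keys_1 = {mate_key(header) for header in headers_1}
    keys_2 = {mate_key(header) for header in headers_2}
    return keys_1 & keys_2, keys_1 - keys_2, keys_2 - keys_1


paired, only_1, only_2 = split_paired(
    ["@r1/1", "@r2/1", "@r3/1"],
    ["@r1/2", "@r3/2", "@r4/2"],
)
print(sorted(paired), sorted(only_1), sorted(only_2))
```

Reads that appear in only one file have lost their mate during filtering and need to be removed or handled as single reads before paired-end mapping.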

Local genomic files needed for mapping

tak: /nfs/genomes/
    –  Human, mouse, zebrafish, C. elegans, fly, yeast, etc.
    –  Different genome builds
        •  mm9: mouse_gp_jul_07
        •  mm10: mouse_mm10_dec_11
    –  human_gp_feb_09 vs human_gp_feb_09_no_random?
        •  human_gp_feb_09 includes *_random.fa, *hap*.fa, etc.
    –  Subdirectories:
        •  bowtie
            –  Bowtie1: *.ebwt
            –  Bowtie2: *.bt2
        •  fasta:
        •  fasta_whole_genome: all sequences in one file
        •  gtf: gene models from RefSeq, Ensembl, etc.

Mapping I: non-spliced alignment software

§  Used for mapping DNA fragments, e.g. ChIP-seq, SNP calling
§  Bowtie: bowtie 1 vs bowtie 2
    §  For reads >50 bp, Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1.
    §  Bowtie 2 supports gapped alignment, which makes it better for SNP calling. Bowtie 1 only finds ungapped alignments.
    §  Bowtie 2 supports a "local" alignment mode, in addition to the "end-to-end" alignment mode supported by Bowtie 1.
§  BWA: refer to the BaRC SOP for detailed information

Mapping reads with bowtie2

•  Mapping single reads:

    bowtie2 [options]* -x <bt2-index> -U <r> [-S <output.sam>]

    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam

•  Mapping paired-end reads:

    bowtie2 [options]* -x <bt2-index> -1 <m1> -2 <m2> [-S <output.sam>]

    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -1 Reads1.fastq -2 Reads2.fastq -S DNA.sam

Some important parameters in bowtie2

•  Reporting
    (default)       look for multiple alignments, report best, with MAPQ
    OR
    -k <int>        report up to <int> alns per read; MAPQ not meaningful
    OR
    -a/--all        report all alignments; very slow, MAPQ not meaningful
•  Alignment mode
    --end-to-end    entire read must align; no clipping (on)
    OR
    --local         local alignment; ends might be soft clipped (off)
•  -L <int>         length of seed substrings; must be >3 and <32 (default=22)
•  -N <int>         max # mismatches in seed alignment; can be 0 or 1 (default=0)

Input qualities       Illumina versions
--solexa-quals        <= 1.2
--phred64             1.3-1.7
--phred33 (default)   >= 1.8
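
When many samples have to be mapped, it can help to assemble the bowtie2 command in a script. The sketch below only uses options that appear on these slides (--solexa-quals, --phred64, --phred33, -x, -U, -1, -2, -S); the helper name, its arguments, and the encoding labels are hypothetical, and the command is returned as a list rather than submitted with bsub.

```python
QUALITY_FLAG = {
    "solexa": "--solexa-quals",  # Illumina <= 1.2
    "phred64": "--phred64",      # Illumina 1.3-1.7
    "phred33": "--phred33",      # Illumina >= 1.8 (bowtie2 default)
}


def bowtie2_command(index, output_sam, reads, mates=None, encoding="phred33"):
    """Build the bowtie2 argument list for single or paired-end reads."""
    command = ["bowtie2", QUALITY_FLAG[encoding], "-x", index]
    if mates is None:
        command += ["-U", reads]
    else:
        command += ["-1", reads, "-2", mates]
    return command + ["-S", output_sam]


print(" ".join(bowtie2_command(
    "/nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10",
    "DNA.sam", "DNA.fastq", encoding="phred64")))
```

Keeping the arguments in a list makes it straightforward to pass them to a job submission wrapper without quoting problems.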

Mapping II: spliced alignment software

§  Used for mapping RNA fragments
§  Tophat2 (uses bowtie2)
§  STAR: maps >60 times faster than Tophat2, tends to align more reads to pseudogenes. See BaRC SOPs

Spliced alignment with tophat2

Tophat2 uses bowtie2 to map the reads.

    # single-end reads
    bsub tophat --solexa1.3-quals --segment-length 20 --no-novel-juncs -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq

    # paired-end reads: add the second fastq file to the end of the above command.

Input qualities          Refer to the bowtie2 mapping slide
--segment-length         Shortest length of a spliced read that can map to one side of the junction. default: 25
--no-novel-juncs         Only look at reads across junctions in the supplied GFF file
-G <GTF file>            Map reads to virtual transcriptome (from gtf file) first.
-N                       max. number of mismatches in a read, default is 2
-o/--output-dir          default = tophat_out
--library-type           (fr-unstranded, fr-firststrand, fr-secondstrand)
-I/--max-intron-length   default: 500000

Optimize mapping across introns

•  Tophat default parameters are designed for mammalian RNA-seq data.
•  Reduce the maximum intron length for non-mammalian organisms
    -I/--max-intron-length: default is 500,000

Species        Max_intron_length
yeast          2,484
arabidopsis    11,603
C. elegans     100,913
fly            141,628

Hands on Mapping

•  bowtie2

    bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam

•  tophat

    bsub tophat --solexa1.3-quals --segment-length 20 -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq

Note: the tophat output file will be tophat_out/accepted_hits.bam

Mapped reads file formats: SAM/BAM

•  SAM: Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. Each alignment line has 11 mandatory fields for essential alignment information.
•  BAM: the binary version of SAM. It is much smaller than the SAM file.
•  A BAM file is needed for viewing in a genome browser. It has to be sorted and indexed.
•  To save space, convert mapped files to .bam format and delete the .sam file.
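
To show what the 11 mandatory fields look like in practice, here is a small Python sketch that splits one alignment line. The field names follow the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the record type, the function name, and the example line are made up for illustration.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, fields[11:])


line = "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
record = parse_sam_line(line)
print(record.rname, record.pos, record.mapq, record.tags)
```

For real analyses a dedicated tool such as samtools is the better choice; the sketch is only meant to make the file layout tangible.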

SAM tools: set of tools for manipulating mapped read files

TOOL                DESCRIPTION
samtools view       conversion between SAM and BAM files
samtools flagstat   simple statistics on the mapped reads
samtools sort       sort alignment file
samtools index      index alignment
samtools rmdup      remove PCR duplicates
samtools            displays all the tools available

Hands on

Convert .sam to .bam format, sort and index.

    bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl DNA.sam

1.  Convert .sam to .bam
2.  Sort the bam file
3.  Index the bam file, creating a .bai file

Delete the .sam file

How to get the number of reads mapped

•  Bowtie2 prints the number of reads mapped to STDERR, so you will see it in the email that you receive.
•  Tophat makes a summary file in the tophat output directory.

    head tophat_out/align_summary.txt

•  Tools:
    –  bam_stat.py -i accepted_hits.bam
    –  samtools flagstat mapped_unmapped.bam
•  See BaRC SOPs: http://barcwiki.wi.mit.edu/wiki/SOPs/miningSAMBAM
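
The counts reported by these tools come from the FLAG field of each alignment. As a minimal sketch, assuming uncompressed SAM lines as input, the following Python function counts mapped and unmapped reads using the flag bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name is hypothetical and the output is far less detailed than samtools flagstat.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) read counts from SAM text lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the header section
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # secondary or supplementary: same read, already counted
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr2\t50\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr3\t70\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped(lines))
```

Skipping secondary and supplementary records matters when the mapper reports several alignments per read, for example with bowtie2 -k.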

What to look for when few reads mapped?

•  Reads are not perfectly paired *
    –  Usually occurs after the QC step. Removing low-quality reads or adapters can leave the two files with unequal sets of reads.

    bsub "/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_filtered.fastq s_8_2_filtered.fastq"

•  Reads may have adapter sequences
    –  BLAST the top overrepresented sequences in the FastQC output
    –  Refer to the preprocessing steps
•  Mapping parameters are too stringent *
    –  Increase the number of mismatches
    –  Adjust the insert size of paired-end reads?

*  Refer to BaRC SOP for more information

Summary

•  Quality control
    –  fastqc
•  Clean up reads:
    –  Fastx Toolkit: fastq_quality_filter, fastx_trimmer
    –  cutadapt
•  Map reads:
    –  Bowtie2
    –  Tophat2
•  Understand the mapped files, and check mapping quality:
    –  Samtools
    –  RSeQC: bam_stat.py

BaRC Standard Operating Procedures

http://barcwiki.wi.mit.edu/wiki/SOPs

References

•  Fastqc: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
•  Fastx Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
•  cutadapt: https://code.google.com/p/cutadapt
•  Bowtie: Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25.
•  TopHat: Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013, 14:R36.
•  Engstrom et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods 10, 1185–1191 (2013).