sequencing run grief counseling: counting kmers at mg-rast

Sequencing run grief counseling: counting kmers at MG-RAST. Will Trimble, metagenomic annotation group, Argonne National Laboratory. April 29, 2014, UIC

Upload: wltrimbl

Post on 02-Dec-2014


DESCRIPTION

Talk by Will Trimble of Argonne National Laboratory on April 29, 2014, at UIC's department of Ecology & Evolution on visualizing and interpreting the redundancy spectrum of long kmers in high-throughput sequence data.

TRANSCRIPT

Page 1: Sequencing run grief counseling: counting kmers at MG-RAST

Sequencing run grief counseling: counting kmers at MG-RAST

Will Trimble, metagenomic annotation group, Argonne National Laboratory

April 29, 2014, UIC

Page 2: Sequencing run grief counseling: counting kmers at MG-RAST

Apology: I speak biology with an accent

• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to use ambiguous data to answer life’s persistent questions.

 

Page 3: Sequencing run grief counseling: counting kmers at MG-RAST

Apology: I speak biology with an accent

• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to use ambiguous data to answer life’s persistent questions.
• Shoveling data from the data-producing machine into the data-consuming furnace.

 

Page 4: Sequencing run grief counseling: counting kmers at MG-RAST

Outline

• Sequences are different
• Sequencing is like photography
• Sequencing is beautiful: thumbnailpolish
• How diverse are my shotgun sequences? nonpareil-k, kmerspectrumanalyzer

Page 5: Sequencing run grief counseling: counting kmers at MG-RAST

Outline

• Sequences are different (math)
• Sequencing is like photography (pictures)
• Sequencing is beautiful: thumbnailpolish (micrographs)
• How diverse are my shotgun sequences? nonpareil-k (graphs), kmerspectrumanalyzer (graphs)

Page 6: Sequencing run grief counseling: counting kmers at MG-RAST

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.

Low-throughput categorical data: categories are sound.

Page 7: Sequencing run grief counseling: counting kmers at MG-RAST

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.

Instrument readings, spectra, micrographs: not categorical.

Low-throughput categorical data: categories are sound.

Page 8: Sequencing run grief counseling: counting kmers at MG-RAST

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD

Instrument readings, spectra, micrographs: not categorical.

Low-throughput categorical data: categories are sound.

High-throughput sequence data: categories uncertain.

Page 9: Sequencing run grief counseling: counting kmers at MG-RAST

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD

Instrument readings, spectra, micrographs: not categorical.

Low-throughput categorical data: categories are sound.

High-throughput sequence data: categories uncertain.

10^0-10^2   10^2-10^7   10^12-10^80

Page 10: Sequencing run grief counseling: counting kmers at MG-RAST

Experiment design   Sequencing run   Sequence data

Assembly, Annotation

SEED  M5NR

489  Sensory box/GGDEF family
470  hyphothetical protein
241  Co-Zn-Cd resistance CzcA
202  Transposase
200  homocysteine methyltransferase (EC 2.1.1.13)
175  cyclase/phosphodiesterase
164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
156  Methyl-accepting chemotaxis protein
149  ABC transporter, ATP-binding protein
147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
133  Ferrous iron transport protein B

So we reduce sequence data to categorical data.

Page 11: Sequencing run grief counseling: counting kmers at MG-RAST

Forward-backward problem

Experiment design   Sequencing run   Sequence data

Assembly, Annotation

SEED  M5NR

489  Sensory box/GGDEF family
470  hyphothetical protein
241  Co-Zn-Cd resistance CzcA
202  Transposase
200  homocysteine methyltransferase (EC 2.1.1.13)
175  cyclase/phosphodiesterase
164  Long-chain-fatty-acid--CoA ligase (EC 6.2.1.3)
156  Methyl-accepting chemotaxis protein
149  ABC transporter, ATP-binding protein
147  Pb, Cd, Zn, and Hg transporting ATPase (EC 3.6.3.3)
133  Ferrous iron transport protein B

10^12

10^3-10^5   10^0-10^1

So we reduce sequence data to categorical data.

Page 12: Sequencing run grief counseling: counting kmers at MG-RAST

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

Page 13: Sequencing run grief counseling: counting kmers at MG-RAST

Searching

What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

Page 14: Sequencing run grief counseling: counting kmers at MG-RAST

Searching

What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

Same answer for both puzzles: you go to this website…

Page 15: Sequencing run grief counseling: counting kmers at MG-RAST

Searching

What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

How long do reads need to be to recognize them?

How long do phrases need to be to recognize them?

Page 16: Sequencing run grief counseling: counting kmers at MG-RAST

How long do reads need to be?

Information (Shannon, 1948, BSTJ) is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data. It has profound applicability in machine learning and probabilistic modeling.

$$H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$
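As a small illustration of the entropy formula on this slide, the sketch below computes H in bits for an explicit probability list and for the empirical letter frequencies of a string. The function names and the toy inputs are made up for this example and are not part of the talk.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H = sum_i p_i * log2(1 / p_i) in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


def empirical_entropy(symbols):
    """Entropy of the observed symbol frequencies in a sequence (hypothetical helper)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


# Four equally likely bases carry 2 bits per base.
uniform_dna = shannon_entropy([0.25] * 4)

# A strongly skewed composition carries much less information per base.
skewed_dna = empirical_entropy("AAAAAAAT")

print(f"uniform: {uniform_dna:.3f} bits/base, skewed: {skewed_dna:.3f} bits/base")
```

The uniform case gives the 2 bits per base used as the upper bound later in the talk.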

Page 17: Sequencing run grief counseling: counting kmers at MG-RAST

How long do phrases need to be?

Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

Page 18: Sequencing run grief counseling: counting kmers at MG-RAST

How long do phrases need to be?

• Information content of English words: H_word is ca. 12 bits per word.
• Size of google books? Big libraries have a few 10^7 books, and each one has 10^5 indexed words, so a database size of 10^12 words. log2(database size) = log2(10^12) = log2(2^39.9), or about 40 bits.
• So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in google’s index.

Try it.
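The arithmetic on this slide can be checked in a few lines. This is only a sketch of the back-of-envelope estimate, using the slide's own round numbers (12 bits per word, 10^7 books, 10^5 indexed words per book); the variable names are chosen for this example.

```python
import math

bits_per_word = 12             # the slide's rough figure for English words
books = 10**7                  # "a few 10^7 books", taken here as exactly 10^7
words_per_book = 10**5         # indexed words per book

database_words = books * words_per_book        # 10^12 word positions
bits_needed = math.log2(database_words)        # bits needed to address one position
words_needed = math.ceil(bits_needed / bits_per_word)

print(f"{bits_needed:.1f} bits to address the index, about {words_needed} words")
```

The estimate treats words as independent, so common phrases will need more than four words and rare ones fewer.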

Page 19: Sequencing run grief counseling: counting kmers at MG-RAST


Page 20: Sequencing run grief counseling: counting kmers at MG-RAST

How long do phrases need to be?

Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

Usually nails your source in four words.

Page 21: Sequencing run grief counseling: counting kmers at MG-RAST

How long do reads need to be?

• Maximum information content of ℓ base pairs: H_read is at most 2ℓ bits per length-ℓ sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = log2(2^33.2), or about 34 bits.
• So we expect that when 2ℓ ≥ 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.

Page 22: Sequencing run grief counseling: counting kmers at MG-RAST

How long do reads need to be?

• Maximum information content of ℓ base pairs: H_read is at most 2ℓ bits per length-ℓ sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = log2(2^33.2), or about 34 bits.
• So we expect that when 2ℓ ≥ 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.

Short sequences end up being very distinctive, even fingerprint-like.
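The same estimate can be written as a one-line function of genome size. This is a sketch of the slide's idealized argument only: it assumes 2 bits per base and ignores repeats, sequencing errors, and strand ambiguity, all of which push the practical length higher. The function name and the 5 Mbp example are assumptions added for illustration.

```python
import math


def min_read_length(genome_size_bp, bits_per_base=2.0):
    """Smallest length whose maximum information content covers log2(genome size)."""
    return math.ceil(math.log2(genome_size_bp) / bits_per_base)


# The slide's example: a catalog of about 10^10 bp needs about 33.2 bits.
large_catalog = min_read_length(10**10)

# A hypothetical 5 Mbp bacterial genome needs fewer bases.
small_genome = min_read_length(5 * 10**6)

print(large_catalog, small_genome)
```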

Page 23: Sequencing run grief counseling: counting kmers at MG-RAST

Check: Human reference genome

Page 24: Sequencing run grief counseling: counting kmers at MG-RAST

The data deluge

• There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of tens of Gbytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST).

Page 25: Sequencing run grief counseling: counting kmers at MG-RAST

http://www.mcs.anl.gov/~trimble/flowcell/

thumbnailpolish

Page 26: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

A camera records the number of photons that land on each of millions of pixels. A sequencer records the number of reads that land on each possible sequence.

I actually think of a sequencer like a multichannel genetic spectrometer.

Page 27: Sequencing run grief counseling: counting kmers at MG-RAST


Page 28: Sequencing run grief counseling: counting kmers at MG-RAST

The genetic spectrometer

With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.

Species diversity

ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6

Page 29: Sequencing run grief counseling: counting kmers at MG-RAST

The genetic spectrometer

With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.

Species diversity
Gene diversity

ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6

Page 30: Sequencing run grief counseling: counting kmers at MG-RAST

The genetic spectrometer

With my 10^12-channel genetic spectrometer, I am trying to articulate the diversity of what the sequencer sees.

Species diversity
Gene diversity
Sequence diversity

ATCGCGAAAAGTCCC 2
AAAAAAAAAAAAAAA 459
AAAAAAAAAAAAAAC 71
AAAATAAAAAAAATA 1
AAAAAAAAAAAAAAG 36
ACATGAAAAACAACT 1
AAAAAAAAAAAAAAT 23
AAAAAAAAAAAAACA 95
GTAGGAAAAGCCCAC 1
AAAAAAAAAAAAACC 7
AAAAAAAAAAAAACG 8
AAAAAAAAAAAAACT 9
AAAAAAAAAAAAAGA 36
AACAAGAAAAACAAA 1
AAAAAAAAAAAAAGC 10
AAATAAAAAAAATAG 1
AACAGAAAAAACACG 1
AAAAAAAAAAAAAGG 2
AAAAAAAAAAAAAGT 6
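A table of 15-mers and counts like the one on this slide can be produced, for small inputs, with a dictionary-based counter. The sketch below is a naive in-memory illustration, not the counter used for the talk; production tools use far more memory-efficient structures and usually merge each kmer with its reverse complement, which this version does not. The function name and toy reads are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring of each read, skipping kmers with non-ACGT characters."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


# Toy input: a 17-base homopolymer-like read and one 15-base read.
reads = ["A" * 16 + "C", "ATCGCGAAAAGTCCC"]
counts = count_kmers(reads, 15)

for kmer, n in counts.most_common():
    print(kmer, n)
```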

Page 31: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

Sampling only a few sequences is like exposing the camera for too short a time. Not enough photons to make out the picture.

Page 32: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

Some parts seem to be dark.

Page 33: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

Page 34: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

This looks like a portrait

Page 35: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

Page 36: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

Start to see the mood

Page 37: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

Page 38: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

A tiny bit of graininess left

Page 39: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

“shot noise” in electrical engineering

Page 40: Sequencing run grief counseling: counting kmers at MG-RAST

Rarefaction of a photograph

A studio portrait of Jane Goodall

Page 41: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

This is a famous scientific image.

Anybody recognize it?

Page 42: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

Does this help?

Page 43: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

There are small patches of brightness

Page 44: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

Were you expecting x-ray diffraction?

Page 45: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

At longer exposures

Page 46: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

more objects, smaller and dimmer, appear.

Page 47: Sequencing run grief counseling: counting kmers at MG-RAST

A scientific image

This is a part of the Hubble Deep Field image

Page 48: Sequencing run grief counseling: counting kmers at MG-RAST

Image / sequencing analogy

Analogy to sequencing:
• Most of field is black
• Bright objects have halos
• Contains camera artifacts
• We can’t know what we didn’t see without longer exposures.

Page 49: Sequencing run grief counseling: counting kmers at MG-RAST

Opportunity cost of deep sequencing

This took two weeks to acquire on a one-of-a-kind telescope. Consider the opportunity cost of studying a single sample for two weeks.

STScI did only four long exposures like this in 23 years.

Page 50: Sequencing run grief counseling: counting kmers at MG-RAST

Image / sequencing analogy

Analogy to sequencing:
• Most of field is black
• Bright objects have halos
• Contains camera artifacts
• We can’t know what we didn’t see without longer exposures.

Sampling effort interacts with sequence diversity to produce a “horizon”.
Inferences are supported on the bright parts first, on the dim parts only at higher depth.
Not all the sequences, abundant or rare, are real.
Dim targets come at great cost in sample number.

Page 51: Sequencing run grief counseling: counting kmers at MG-RAST

How much novelty is in my dataset?

How many sequences do you need to see before you start seeing the same ones over and over again?

Page 52: Sequencing run grief counseling: counting kmers at MG-RAST

How much novelty is in my dataset?

How many sequences do you need to see before you start seeing the same ones over and over again?

Initially, everything is novel, but there will come a point at which fewer than half of your new observations are novel, because the rest are already in the catalog.
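The crossover point described here can be seen in a toy simulation. The sketch below draws items uniformly at random from a finite pool (a strong simplification, since real communities are far from uniform), records whether each draw is novel, and reports the first point where fewer than half of the most recent draws were novel. The pool size, window, seed, and function names are arbitrary choices for this example.

```python
import random


def novelty_curve(pool_size, draws, seed=0):
    """Return a list of booleans: True where the draw had not been seen before."""
    rng = random.Random(seed)
    seen = set()
    novel_flags = []
    for _ in range(draws):
        item = rng.randrange(pool_size)
        novel_flags.append(item not in seen)
        seen.add(item)
    return novel_flags


def draws_until_mostly_redundant(novel_flags, window=200):
    """First draw count at which fewer than half of the last `window` draws were novel."""
    for end in range(window, len(novel_flags) + 1):
        if sum(novel_flags[end - window:end]) < window / 2:
            return end
    return None


flags = novelty_curve(pool_size=1000, draws=5000)
crossover = draws_until_mostly_redundant(flags)
print("mostly redundant after about", crossover, "draws")
```

For a uniform pool of N items the chance that draw t is novel is roughly exp(-t/N), so the crossover is expected somewhat after N ln 2 draws; skewed abundance distributions reach it much sooner for the abundant members.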

Page 53: Sequencing run grief counseling: counting kmers at MG-RAST

How much novelty is in my dataset?

Nonpareil: Luis Rodriguez-Rojas and Kostas Konstantinidis developed a subset-against-all alignment approach to address the question “how quickly do we encounter novelty in shotgun datasets?”

Nonpareil-k: I found a way to answer almost the same question 300x faster.

Page 54: Sequencing run grief counseling: counting kmers at MG-RAST

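How much novelty is in my dataset?

Nonpareil-k

$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poisscdf}(\epsilon \cdot r_i, 0)}$$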
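One possible reading of the formula on this slide is sketched below. It assumes that r_i is a kmer abundance, n_i is the number of distinct kmers with that abundance, ε is the fraction of the data retained in a subsample, and Poisscdf(λ, k) is the Poisson probability of observing at most k events; the ratio is then the chance of seeing a kmer at least twice given that it was seen at least once. These interpretations, the function names, and the toy inputs are assumptions, and this is not the Nonpareil-k implementation.

```python
import math


def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def nonunique_fraction(eps, r, n):
    """Abundance-weighted probability of being seen >= 2 times given seen >= 1 time."""
    total = sum(ni * ri for ni, ri in zip(n, r))
    acc = 0.0
    for ni, ri in zip(n, r):
        lam = eps * ri
        seen_at_least_once = 1.0 - poisson_cdf(lam, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(lam, 1)
        if seen_at_least_once > 0:          # skip classes that cannot be observed at all
            acc += (ni * ri / total) * seen_at_least_twice / seen_at_least_once
    return acc


# Toy spectrum: many singleton-depth kmers plus a smaller high-coverage class.
abundances = [1, 50]
distinct_kmers = [1000, 200]
for eps in (0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, abundances, distinct_kmers), 3))
```

Sweeping ε from small to 1 traces out a rarefaction-style curve from the kmer spectrum alone, without any alignment.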

Page 55: Sequencing run grief counseling: counting kmers at MG-RAST

Nonpareil: model of sequence coverage Georgia Tech

Page 56: Sequencing run grief counseling: counting kmers at MG-RAST

Nonpareil: model of sequence coverage Georgia Tech

Nonpareil-k: kmer rarefaction Argonne + Georgia Tech

summary of sequence diversity

Page 57: Sequencing run grief counseling: counting kmers at MG-RAST

Nonpareil-k: stratify datasets by coverage distribution

Most of the dataset is likely contained in the assembly.

Assembly is likely to miss or attenuate the large unique fraction of the dataset.

Page 58: Sequencing run grief counseling: counting kmers at MG-RAST

Looking for abundance patterns

Page 59: Sequencing run grief counseling: counting kmers at MG-RAST

Looking for abundance patterns

Let’s look at the greyscale histogram

Page 60: Sequencing run grief counseling: counting kmers at MG-RAST

Looking for abundance patterns

Page 61: Sequencing run grief counseling: counting kmers at MG-RAST

Looking for abundance patterns

Shadows

Background   Jacket   Face and hands

We can even tease out a few patterns in the histogram

Page 62: Sequencing run grief counseling: counting kmers at MG-RAST

Kmers  can  tell  you  genome  size  and  coverage  depth  

Page 63: Sequencing run grief counseling: counting kmers at MG-RAST

Kmers  can  tell  you  genome  size  and  coverage  depth  
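A common textbook heuristic for reading genome size and coverage depth off a kmer spectrum is sketched below: take the modal abundance among non-error kmers as the coverage depth, and divide the total number of kmer observations by that depth. This is an illustration under simplifying assumptions (a single isolate genome, one clean peak, low-abundance kmers treated as errors); it is not the fitting procedure used by kmerspectrumanalyzer, and the function names, the abundance cutoff, and the toy spectrum are made up for the example.

```python
from collections import Counter


def kmer_spectrum(kmer_counts):
    """Map abundance -> number of distinct kmers observed with that abundance."""
    return Counter(kmer_counts.values())


def estimate_coverage_and_size(spectrum, min_abundance=2):
    """Return (modal coverage depth, estimated genome size in kmers), ignoring rare kmers."""
    solid = {a: n for a, n in spectrum.items() if a >= min_abundance}
    if not solid:
        return None, None
    peak = max(solid, key=lambda a: solid[a])            # most common abundance
    total_observations = sum(a * n for a, n in solid.items())
    return peak, total_observations / peak


# Toy spectrum: 500 singleton (likely erroneous) kmers and a peak near 10x.
toy_spectrum = {1: 500, 9: 100, 10: 300, 11: 100}
coverage, genome_size = estimate_coverage_and_size(toy_spectrum)
print(f"coverage about {coverage}x, genome size about {genome_size:.0f} kmers")
```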

Page 64: Sequencing run grief counseling: counting kmers at MG-RAST

Redundancy  is  good  

• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction / clustering / assembly works on subsets of the data with high sequence depth.

Page 65: Sequencing run grief counseling: counting kmers at MG-RAST

Redundancy  is  good  

• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction / clustering / assembly works on subsets of the data with high sequence depth.

Abundance-based inferences are better in the high-abundance part of the data.

Page 66: Sequencing run grief counseling: counting kmers at MG-RAST

But I want to sequence everything! Ok, we can count kmers in everything too.

kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth, … but what it’s really good at

Page 67: Sequencing run grief counseling: counting kmers at MG-RAST

Kmers show problems in datasets

• Amok PCR: seemingly random sequences
• Amok MDA: 10 Gbases of sequence, one gene
• PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
• Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
• Contaminated samples: 95% E. coli, 5% E. faecalis
• Many datasets have as much as 5-45% of the sequence yield in adapters.

Page 68: Sequencing run grief counseling: counting kmers at MG-RAST

Generalities from the kmer counting mines

• FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find)
• Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance (but I’m not about to write a paper fitting kmers to the Yule, Mandelbrot, Levy, or Pareto distributions)

Page 69: Sequencing run grief counseling: counting kmers at MG-RAST

[Figure 1c: PCO2 vs Alpha Diversity. x-axis: eigen_vectors[, "PCO2"]; y-axis: color_matrix[, "alpha-diversity"]. Linear fits: All: y = -259839.54*x + 209.62, R^2 = 0.29; Gut: y = -275950.37*x + 118.73, R^2 = 0.78; Oral: y = -369610.24*x + 298.39, R^2 = 0.7]

[Figure 1d]

HMP / quantile norm / euclidean / colored by alpha

MG-RAST API, R package matR

Hey kid, you want some unlabeled data? Kevin Keegan, Argonne National Laboratory

Page 70: Sequencing run grief counseling: counting kmers at MG-RAST

[Figure 1c: PCO2 vs Alpha Diversity. x-axis: eigen_vectors[, "PCO2"]; y-axis: color_matrix[, "alpha-diversity"]. Linear fits: All: y = -259839.54*x + 209.62, R^2 = 0.29; Gut: y = -275950.37*x + 118.73, R^2 = 0.78; Oral: y = -369610.24*x + 298.39, R^2 = 0.7]

[Figure 1d]

HMP / quantile norm / euclidean / colored by alpha

MG-RAST API, R package matR

Hey kid, you want some unlabeled data? Kevin Keegan, Argonne National Laboratory

I’m not sure how to do science with an unlabeled pile of datasets.

Page 71: Sequencing run grief counseling: counting kmers at MG-RAST

[Figure 2a]

[Figure 2b]

Hey kid, you want some pretty ordinations? Kevin Keegan, Argonne National Laboratory

Page 72: Sequencing run grief counseling: counting kmers at MG-RAST

Observation: Most scientists seem to be self-taught in computing.

Observation: Most scientists waste a lot of time using computers inefficiently.

Rachel and I volunteer with

Page 73: Sequencing run grief counseling: counting kmers at MG-RAST

We teach scientists how to get more done

Woods Hole

Tufts

U. Chicago

U. Chicago

UIC

Page 74: Sequencing run grief counseling: counting kmers at MG-RAST
Page 75: Sequencing run grief counseling: counting kmers at MG-RAST

Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens

Formerly of Yale: Howard Ochman, David Williams

Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas