The bioinformatics challenges and approaches to analyze NGS data


Upload: yogita-sharma

Post on 22-Jan-2017


The bioinformatics challenges and approaches to analyze NGS data

Yogita Sharma1, Elisa Fiorito2, Siv Gilfillan3 & Toni Hurtado*1

1,2,3 Nordic EMBL Partnership, Center for Molecular Medicine Norway (NCMM), University of Oslo, Norway.

* Department of Genetics, Institute for Cancer Research, The Norwegian Radium Hospital, University of Oslo, Norway

ABSTRACT

Next-generation sequencing (NGS) technologies have revolutionized the speed and detail with which genomic and transcriptomic information can be obtained, opening novel research possibilities in regulatory, developmental and cancer biology. However, these advances also make it a great challenge to analyze and interpret the huge amount of data that the experiments generate. The continuous development of new algorithms has facilitated data analysis, but it also makes it harder to choose among them. A comparative analysis of the available algorithms is therefore important for choosing the best method. Here we used different bioinformatics algorithms to prepare a standard pipeline for the steps involved in ChIP-Seq data analysis. We compared different algorithms for alignment, duplicate removal and peak calling.

RESULTS

1) Comparison of different alignment software

Selective inhibition of HER2 signalling pathway reveals novel functions of FOXA1 in breast cancer

Madhumohan R. Katika1,2, Siv Gilfillan1, Yogita Sharma1, Anne-Lise Børresen-Dale2 and Antoni Hurtado1,2

1 Nordic EMBL Partnership, Center for Molecular Medicine Norway (NCMM), University of Oslo, Norway
2 Department of Genetics, Institute for Cancer Research, The Norwegian Radium Hospital, University of Oslo, Norway

ABSTRACT

Breast cancer cell proliferation results from many different factors that activate multiple cellular signaling pathways, which are currently target therapies. Yet, how genomic pathways are influenced upon cell-signaling disruption has been poorly assessed. We explored how the inhibition of HER2 signaling pathway influences the function of the transcription factor FOXA1. By means of specific inhibitors targeting the kinase activity of HER2 (Lapatinib) we have identified HER2 signaling pathway as a key supervisor of FOXA1 function in HER2+ breast cancer subtype. The HER2 pathway controls the binding of FOXA1 towards key genes required for the proliferation in HER2+ cancer cells. Importantly, the same mechanisms of FOXA1 regulation are needed to induce cell proliferation in ER+ breast cancer cells that do not overexpress HER2. We also explored the function of FOXA1 upon Herceptin treatment, which is a monoclonal antibody that binds to the extracellular domain of HER2. Here, we demonstrate that treatment of breast cancer cells with Herceptin produces the activation of cytokine signaling pathways. Importantly, the cytokine pathway activation reprograms the chromatin interactions of FOXA1 towards genes playing a key role in the initiation of the cell-mediated cytotoxicity. All together, these findings support the idea that FOXA1 integrates input signals originating from multiple cell-signaling pathways to generate output responses that culminate in control of proliferation and the response to anti-cancer therapies.

TEAM AND RESOURCES

NCMM-EMBL Breast Cancer Group: Siv Gilfillan (Engineer), Elisa Fiorito (PhD student), Madhu Katika (Post-doc), Yogita Sharma (Bioinformatician, starting in June 2013), Elena González (Research assistant, starting in August 2013), Baoyan Bai (Post-doc, starting in August 2013)

RESULTS

Panel titles recovered from the figures:

Lapatinib reduces FOXA1 binding interaction in HER2+/ER+ breast cancer cells
Lapatinib inhibits FOXA1 binding towards genes crucial for proliferation and migration
FOXA1 integrates the signal of different growth factors to induce cell proliferation in HER+/ER+ breast cancer cell lines
The growth factors EGF and Heregulin induce proliferation in ER+/HER2+ cells activating different genomic pathways
Herceptin increases FOXA1 binding towards genes playing a key role in the activation of cytokine signaling pathways (HER2+/ER+ breast cancer cells)
Lapatinib inhibits the binding of FOXA1 towards regions enriched with GATA3 and ERE motifs in endometrial cancer cells (Ishikawa cell line)

[Figure residue: binding intensity (sum of reads) against distance from center of binding site (bp), cell confluence curves, Western blots of chromatin fraction (FOXA1, H3), pathway analysis and motif discovery (FOXA1 sites). Axis values are not recoverable from the transcript.]


Evaluation Type                      Genome Size(s)    Read Length   Read Count
Accuracy: Varying Error Rate         3 Gbp, 500 Mbp    50 bp         500,000
Accuracy: Varying Indel Size         3 Gbp, 500 Mbp    50 bp         500,000
Accuracy: Varying Indel Frequency    3 Gbp, 500 Mbp    50 bp         500,000

Table 1. Experimental setup for each simulation type: genome size(s), read length, and read count.

[Figure 1, three panels: (a) and (b) Accuracy against Error Rate (0.001, 0.010, 0.100); (c) Used Read Ratio against Error Rate. Tools: Bowtie, BWA, MrFast-R, MrFast-S, MrsFast-R, MrsFast-S, Novoalign, SOAP.]

Fig. 1: Human genome: accuracy with varying error rate. (a) shows mapping quality threshold 0, (b) shows threshold 10 and (c) shows the proportion of reads that have mapping quality of at least 10. -R and -S suffixes denote relaxed and strict accuracy, respectively.

4.1.1 Varying Error Rate

The accuracy of all algorithms on the human genome for varying error rate is compared in Figure 1. The results for quality threshold 0 (accepting all reads) are shown in Figure 1a, whereas 1b shows the mapping accuracy when considering reads of quality ≥ 10. We can see that Bowtie, BWA and Novoalign are the most sensitive to mapping quality threshold at high error rates; their accuracy significantly increases as reads of mapping quality 0 are discarded. SOAP's mapping accuracy is quite high even at quality threshold 0, which is consistent with its intended usage for genotyping SNPs. Figure 1c shows the proportion of
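The effect of a mapping quality threshold described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: accuracy is taken to be the fraction of retained reads that map to their simulated origin, the "used read ratio" is the fraction of all reads that pass the threshold, and the function name and toy data are hypothetical. The transcript does not give the exact definitions behind the relaxed and strict accuracy variants.

```python
from typing import Iterable, Tuple


def accuracy_at_threshold(reads: Iterable[Tuple[int, bool]],
                          threshold: int) -> Tuple[float, float]:
    """Return (accuracy, used_read_ratio) for reads with MAPQ >= threshold.

    Each read is a (mapq, is_correct) pair. is_correct is known only
    because the reads are simulated from a known genomic position.
    """
    reads = list(reads)
    # Keep the correctness flag of every read that passes the threshold.
    kept = [correct for mapq, correct in reads if mapq >= threshold]
    if not reads or not kept:
        # No read survives the filter: accuracy is undefined.
        return float("nan"), 0.0
    accuracy = sum(kept) / len(kept)
    used_ratio = len(kept) / len(reads)
    return accuracy, used_ratio


# Hypothetical toy data: (mapping quality, mapped to the true location?)
toy_reads = [(0, False), (0, True), (12, True), (37, True), (37, True), (3, False)]

for t in (0, 10):
    print(t, accuracy_at_threshold(toy_reads, t))
```

In this toy set, raising the threshold discards the low-quality reads, so accuracy rises while the used read ratio falls. That is the trade-off panels (b) and (c) of Figure 1 show side by side.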

[Figure 2, two panels: (a) Accuracy against Threshold (0, 4, 10, 20, 30, 40) for <Theoretical>, Bowtie, BWA, Novoalign and SOAP; (b) Used Read Ratio against Threshold for Bowtie, BWA, Novoalign and SOAP.]

Fig. 2: Human genome: comparison of reported accuracy vs. theoretical accuracy for 0.1% base call error rate (only tools that report meaningful quality scores are included). (a) shows a comparison of the theoretical accuracy for each mapping quality score vs. each tool's accuracy at that quality threshold. (b) shows the proportion of reads with a mapping quality greater than or equal to each threshold value.
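The caption does not say how the theoretical curve in Figure 2a was computed. A reasonable assumption is the standard Phred-scaled definition of mapping quality, Q = -10 log10(P), where P is the probability that the mapping position is wrong, so the expected accuracy at quality Q is 1 - 10^(-Q/10). A minimal sketch under that assumption (the function name is hypothetical):

```python
def theoretical_accuracy(mapq: float) -> float:
    """Expected fraction of correct mappings at a Phred-scaled mapping quality.

    Assumes MAPQ = -10 * log10(P(mapping is wrong)).
    """
    return 1.0 - 10.0 ** (-mapq / 10.0)


# Thresholds shown on the x-axis of Figure 2.
for q in (0, 4, 10, 20, 30, 40):
    print(q, round(theoretical_accuracy(q), 4))
```

A tool whose measured accuracy at a given quality falls below this value is reporting over-confident scores; one above it is being conservative.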

Downloaded from http://bioinformatics.oxfordjournals.org/ at Odontologisk Fakultetsbibliothek on July 30, 2013

[Figure 3, three panels: (a) and (b) Accuracy against Indel Size (mean) (2, 4, 7, 10, 16); (c) Used Read Ratio against Indel Size. Tools: Bowtie, BWA, MrFast-R, MrFast-S, MrsFast-R, MrsFast-S, Novoalign, SOAP.]

Fig. 3: Human genome: accuracy with varying indel sizes. (a) shows mapping quality threshold 0, (b) shows threshold 10 and (c) shows the proportion of reads that have mapping quality of at least 10. -R and -S suffixes denote relaxed and strict accuracy, respectively. At indel sizes 10 and 16, SOAP discards all reads, producing missing values in (c).

[Figure 4, three panels: (a) and (b) Accuracy against Indel Frequency (1e-05, 1e-04, 0.001, 0.01); (c) Used Read Ratio against Indel Frequency. Tools: Bowtie, BWA, MrFast-R, MrFast-S, MrsFast-R, MrsFast-S, Novoalign, SOAP.]

Fig. 4: Human genome: accuracy with varying indel frequencies. (a) shows mapping quality threshold 0, (b) shows threshold 10 and (c) shows the proportion of reads that have mapping quality of at least 10. -R and -S suffixes denote relaxed and strict accuracy, respectively.


2) Removal of Duplicates

3) Peak Calling

                      MACS                      CCAT
Total no. of peaks    6031 (3h), 17047 (12h)    8764 (3h), 19208 (12h)
Unique peaks (3h)     3094                      5660
Unique peaks (12h)    4767                      5879
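The poster does not state how a "unique" peak was defined. One common reading is that a peak from one caller is unique if it overlaps no peak from the other caller. The sketch below counts such peaks under that assumption; the function name, the half-open (chrom, start, end) representation and the toy coordinates are all hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def count_unique_peaks(peaks_a, peaks_b):
    """Count peaks in peaks_a that overlap no peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates [start, end).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    # Per chromosome: sorted start positions and the running maximum of ends.
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        max_ends = list(accumulate((e for _, e in intervals), max))
        index[chrom] = (starts, max_ends)

    unique = 0
    for chrom, start, end in peaks_a:
        if chrom not in index:
            unique += 1
            continue
        starts, max_ends = index[chrom]
        # Number of peaks in B that start before this peak ends.
        i = bisect_left(starts, end)
        # None of them reaches past this peak's start: no overlap.
        if i == 0 or max_ends[i - 1] <= start:
            unique += 1
    return unique


# Hypothetical toy peak sets standing in for MACS and CCAT output.
macs_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
ccat_peaks = [("chr1", 150, 250), ("chr1", 700, 800)]

print(count_unique_peaks(macs_peaks, ccat_peaks))
print(count_unique_peaks(ccat_peaks, macs_peaks))
```

Note that the count is not symmetric. One wide peak from one caller can overlap several narrow peaks from the other, which is one possible reason the shared counts implied by the table differ between MACS and CCAT.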

[Figure, two panels (3h and 12h), each titled "ChIP Regions (Peaks) over Chromosomes": Chromosome (1 to 22, M, X, Y) against Chromosome Size (bp), with an inset "Distribution of Peak Heights".]

ChIP regions over chromosome (3h & 12h)

CONCLUSIONS

1) We found that Bowtie performs well for ChIP-Seq experiments, provided that the indels are short.
2) Samtools and Picard both removed the same number of duplicates, in the same regions.
3) CCAT and MACS differ from other .
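Conclusion 2 concerns duplicate removal. As a rough illustration of the idea only, the sketch below treats reads that share chromosome, start position and strand as duplicates and keeps the first one seen. This is a simplification and an assumption: Samtools and Picard apply their own, more detailed rules, and the function name and read tuples here are hypothetical.

```python
def remove_duplicates(reads):
    """Keep the first read seen at each (chrom, pos, strand).

    reads is an iterable of (chrom, pos, strand, name) tuples.
    """
    seen = set()
    kept = []
    for chrom, pos, strand, name in reads:
        key = (chrom, pos, strand)
        if key in seen:
            # Same position and strand as an earlier read: treat as duplicate.
            continue
        seen.add(key)
        kept.append((chrom, pos, strand, name))
    return kept


# Hypothetical aligned reads.
toy_alignments = [
    ("chr1", 1000, "+", "read1"),
    ("chr1", 1000, "+", "read2"),  # duplicate of read1
    ("chr1", 1000, "-", "read3"),  # same position, other strand: kept
    ("chr2", 1000, "+", "read4"),
]

deduplicated = remove_duplicates(toy_alignments)
print([name for _, _, _, name in deduplicated])
```

Comparing two tools then amounts to checking that they keep the same number of reads at the same positions, for example by comparing the sets of (chrom, pos, strand) keys in their outputs.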