the human reference assembly
DESCRIPTION
The Human Reference Assembly. How the sequence is made. Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch. Valerie Schneider, NCBI. http://genomereference.org. HGP Goals. Throughput: 500 Mb/year. Cost: < $0.25 per base.
TRANSCRIPT
The Human Reference Assembly
Deanna M. Church, Staff Scientist, NCBI
@deannachurch
Short Course in Medical Genetics 2013
How the sequence is made
http://genomereference.org
Valerie Schneider, NCBI
HGP Goals
Throughput: 500 Mb/year
Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
Collins FS et al, 1998
[Figure: NCBI dbSNP database growth, 1999 to 2011 (dbSNP build 135, November 2011). Panel "human variations": millions of rs-ids (non-redundant annotations), split into SNP, STR & indel, and ambiguous mapping. Panel "submissions by project": millions of submissions from TSC, HapMap, 1000 Genomes, and other projects. Steve Sherry, NCBI.]
Kidd et al, 2007 APOBEC cluster
Black: deletion. White: insertion.
Genome Research, May, 1997
Reference assembly history
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
End-sequence all clones and retain pairing information (“mate-pairs”)
Find sequence overlaps
Each end sequence is referred to as a read
WGS contig
tails
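The "find sequence overlaps" step can be illustrated with a toy suffix-prefix check. This is only a minimal sketch, not how any production WGS assembler works: it assumes error-free reads and exact matches and ignores mate-pair information. The function names and the `min_overlap` parameter are hypothetical, and the two reads are made-up fragments of the example sequence shown on a later slide.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    longest = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into one contig if they overlap; otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Two toy reads (assumed, exact-match overlap of 5 bases: TTCTG).
read_1 = "ATTTTCCCTTCTG"
read_2 = "TTCTGAAATGATG"
contig = merge_reads(read_1, read_2)
print(contig)
```

Real reads contain sequencing errors and repeats, so real overlap detection relies on alignment and on the retained pairing information rather than exact string matching.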
WGS: Sanger Reads
Reference assembly history
Scaffold
Reference assembly history
A T T T T C C C T T C T G A A A T G A T G A A A G A G T C
Schatz et al, 2010
Reference assembly history
BAC insert
BAC vector
Shotgun sequence
Assemble
Fold sequence
Gaps
Deeper sequence coverage rarely resolves all gaps
“Finishers” go in to manually fill the gaps, often by PCR
Clone based assemblies
Reference assembly history
Build sequence contigs based on contigs defined in TPF.
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
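To make the idea of a switch point concrete, here is a small sketch of joining two overlapping clone sequences. It assumes the overlap length is already known and that both clones are in the same orientation; the function name and its parameters are hypothetical and are not taken from the GRC software.

```python
def join_at_switch_point(first: str, second: str, overlap: int, switch_offset: int) -> str:
    """Build one sequence from two overlapping clones.

    `overlap` is the number of bases shared by the end of `first` and the
    start of `second`. `switch_offset` (0..overlap) is the position inside
    the overlap where the result stops using `first` and starts using `second`.
    """
    if overlap > min(len(first), len(second)):
        raise ValueError("overlap is longer than one of the clones")
    if not 0 <= switch_offset <= overlap:
        raise ValueError("switch point must lie inside the overlap")
    cut_first = len(first) - overlap + switch_offset
    return first[:cut_first] + second[switch_offset:]


# Toy clones whose 4-base overlap differs at one position (ACGT vs ACTT).
clone_1 = "GGGACGT"
clone_2 = "ACTTAAA"

early_switch = join_at_switch_point(clone_1, clone_2, overlap=4, switch_offset=0)
late_switch = join_at_switch_point(clone_1, clone_2, overlap=4, switch_offset=4)
print(early_switch)  # overlap bases come from clone_2
print(late_switch)   # overlap bases come from clone_1
```

When the clones differ inside the overlap, the choice of switch point decides which clone's bases end up in the instantiated sequence, which is one reason the switch points are recorded explicitly.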
Reference assembly history
NCBI36
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321
[Figure: "Ideally…" A non-sequence based map. Labels: A, BCD, EFGH, IJKLMNO; ABCD, FGH, KL, ON, with one piece marked "(flip)".]
Reference assembly history
[Figure: "More like…" Labels: A, BCD, EFGH, IJKLMNO; A, BC, ZYX, W, HJ, M, V, N, O; AB, HIJ, CDY, LMNO; AB, HIJ, LMNO; "?".]
Reference assembly history
Sequence vs. non-sequence based maps: Mmu7
WI Genetic
WI/MRC RH
An assembly is a MODEL of the genome
Fragmented genomes tend to have fewer frame shifts
Alexander Souvorov, NCBI
Reference assembly history
Fragmented genomes tend to have more partial models
Alexander Souvorov, NCBI
Reference assembly history
[Figure: Human PANTHER classifications (biological process). Axes: enrichment (-5 to 5); observed and expected (0 to 60). Categories: Major histocompatibility complex antigen; Chemokine; Tumor necrosis factor receptor; Other cytokine receptor; Cysteine protease inhibitor; CAM family adhesion molecule; Apolipoprotein; KRAB box transcription factor; Intermediate filament; Immunoglobulin receptor family member; Other cell adhesion molecule; Zinc finger transcription factor; Defense/immunity protein; Structural protein; Cysteine protease; Cytokine receptor; Oxygenase; Cell adhesion molecule; Transcription factor; Miscellaneous function; Signaling molecule; Oxidoreductase; Unclassified; Nucleic acid binding; Select regulatory molecule; Kinase; Hydrolase; Ribosomal protein; Protein kinase; G-protein modulator; Extracellular matrix; Other transcription factor.]
Evan Eichler, University of Washington
Reference assembly history
Center sequence distribution: NCBI36
Finding the data
http://genomereference.org
Church et al., 2011 PLoS
Issue tracking system (JIRA) publicly available
Finding the data
HG-110 AC021180.6 AC149643.1
Finding the data
Tiling Path File (TPF)
ACC        NAME          CTG
GAP        Telomere      10000
AP006221   XX-190A2      Hschr1_ctg1
AL627309   RP11-34P13    Hschr1_ctg1
GAP        type-3
AC114498   RP5-857K21    Hschr1_ctg3
AL669831   RP11-206L10   Hschr1_ctg
AL645608   RP11-54O7     Hschr1_ctg3
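A TPF excerpt like the one above can be read with a few lines of code. This sketch only assumes what the slide shows: whitespace-separated rows, an "ACC NAME CTG" header, clone rows with three fields, and GAP rows with a type and an optional length. The class and function names are hypothetical, and the full TPF specification has details that are not handled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Clone:
    accession: str
    name: str
    contig: str


@dataclass
class Gap:
    kind: str
    length: Optional[int] = None  # the slide gives a length for only some gaps


def parse_tpf(text: str) -> List[Union[Clone, Gap]]:
    """Parse slide-style TPF rows into Clone and Gap records, in tiling order."""
    rows: List[Union[Clone, Gap]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "ACC":  # skip blank lines and the header
            continue
        if fields[0] == "GAP":
            length = int(fields[2]) if len(fields) > 2 else None
            rows.append(Gap(kind=fields[1], length=length))
        else:
            accession, name, contig = fields[:3]
            rows.append(Clone(accession, name, contig))
    return rows


# Rows copied from the slide, including the truncated contig name as shown.
tpf_text = """ACC NAME CTG
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg
AL645608 RP11-54O7 Hschr1_ctg3
"""

tiling_path = parse_tpf(tpf_text)
clones = [row for row in tiling_path if isinstance(row, Clone)]
print(len(tiling_path), "rows,", len(clones), "clones")
```

Keeping the rows in file order matters, because the order of clones and gaps is the tiling path itself.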
Putting the genome together
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/overlap/
Serious alignment problem requiring review
Minor alignment problem
Excellent alignment
Certificate submitted, not yet approved
Certificate submitted and approved
Join not evaluated
Valid, contained clones
Putting the genome together
AGP: A Golden Path
Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types
GRC produces:
GenBank components -> Scaffolds
GenBank components -> Chromosome
Scaffolds -> Chromosome
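As an illustration of how an AGP file acts as build instructions, the sketch below assembles an object from component sequences and gap lines. It assumes the tab-separated nine-column AGP layout (object, begin, end, part number, component type, then either component id, begin, end, orientation, or gap length, gap type, linkage, evidence). The object name `chrToy`, the component ids, and the sequences are invented for the example, and validation of coordinates is left out.

```python
from typing import Dict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_text: str, components: Dict[str, str]) -> Dict[str, str]:
    """Follow AGP rows in order and return the sequence of each object."""
    objects: Dict[str, str] = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        obj, comp_type = cols[0], cols[4]
        if comp_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(cols[5])
        else:
            # Component line: take the 1-based, inclusive span of the component.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
        objects[obj] = objects.get(obj, "") + piece
    return objects


# Hypothetical components and AGP rows, for illustration only.
toy_components = {"cloneA.1": "ACGTACGTAC", "cloneB.1": "GGGTTTCCCA"}
toy_rows = [
    ("chrToy", "1", "6", "1", "F", "cloneA.1", "1", "6", "+"),
    ("chrToy", "7", "11", "2", "N", "5", "scaffold", "yes", "paired-ends"),
    ("chrToy", "12", "15", "3", "F", "cloneB.1", "3", "6", "-"),
]
toy_agp = "\n".join("\t".join(row) for row in toy_rows)

built = build_from_agp(toy_agp, toy_components)
print(built["chrToy"])
```

The component begin and end columns are where switch points show up in practice: they say how much of each component is used before the next one takes over.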
Putting the genome together
Large-Scale Variation Complicates Genome Assembly
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
GRCh37 (hg19)
http://genomereference.org
7 alternate haplotypes at the MHC
Alternate loci released as: FASTA, AGP, alignment to chromosome
UGT2B17 MHC MAPT
Data Model
Assembly (e.g. GRCh37)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9
PAR
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Take home messages
Assemblies are not genomes, they are models of genomes
All eukaryotic assemblies have some issues
• Mis-assemblies
• Missing variation
Assembly evidence is important
Assemblies are not static (if you are lucky!)