The Human Reference Assembly


Post on 22-Feb-2016


Page 1: The Human Reference Assembly

The Human Reference Assembly

Deanna M. Church, Staff Scientist, NCBI

@deannachurch Short Course in Medical Genetics 2013

How the sequence is made

Page 2: The Human Reference Assembly

http://genomereference.org

Valerie Schneider, NCBI

Page 3: The Human Reference Assembly
Page 4: The Human Reference Assembly

HGP Goals

Throughput: 500 Mb/year

Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

Collins FS et al., 1998

Page 5: The Human Reference Assembly
Page 6: The Human Reference Assembly

Figure: NCBI dbSNP database growth (dbSNP build 135, November 2011). One panel shows human variations as non-redundant annotations, in millions of rs-ids (axis 10 to 60), split into STR & indel, SNP, and ambiguous mapping. The other panel shows submissions by project, in millions of submissions (axis 25 to 175), split into 1000 Genomes, HapMap, TSC, and other projects. Year labels 1999, 2000, 2005, 2010 and 2011 appear on the chart.

Steve Sherry, NCBI

Page 7: The Human Reference Assembly

Kidd et al., 2007: APOBEC cluster

Black: deletion
White: insertion

Page 8: The Human Reference Assembly
Page 9: The Human Reference Assembly
Page 10: The Human Reference Assembly

Genome Research, May, 1997

Reference assembly history

Page 11: The Human Reference Assembly
Page 12: The Human Reference Assembly

Reference assembly history

WGS: Sanger reads

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information ("mate-pairs")

Each end sequence is referred to as a read

Find sequence overlaps

Figure labels: WGS contig, tails, scaffold
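The "find sequence overlaps" step can be illustrated with a minimal sketch. It assumes error-free reads and exact suffix-to-prefix matches, which real Sanger assemblers do not assume; the function names and the toy reads are hypothetical and are not taken from the slides.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap, otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep the shared bases once: all of `left`, then the new part of `right`.
    return left + right[size:]


# Hypothetical toy reads that share the six bases "CTTCTG".
read_a = "ATTTTCCCTTCTG"
read_b = "CTTCTGAAATGATG"
contig = merge_reads(read_a, read_b)
print(contig)
```

Mate-pair information would then be used to order and orient such contigs into scaffolds, which this sketch does not attempt.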

Page 13: The Human Reference Assembly

Reference assembly history

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Page 14: The Human Reference Assembly

Schatz et al., 2010

Reference assembly history

Page 15: The Human Reference Assembly

Reference assembly history

Clone based assemblies

BAC insert, BAC vector

Shotgun sequence

Assemble

Gaps: deeper sequence coverage rarely resolves all gaps

"Finishers" go in to manually fill the gaps, often by PCR

Figure labels: fold sequence, gaps
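To make "fold sequence coverage" and "gaps" concrete, here is a small sketch that computes both from read placements on a clone. The coordinates are invented for illustration, and the half-open interval convention is an assumption of this example.

```python
def coverage_and_gaps(clone_length, reads):
    """Return (fold coverage, list of uncovered intervals).

    `reads` is a list of (start, end) half-open intervals on the clone.
    """
    depth = [0] * clone_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, clone_length)):
            depth[pos] += 1

    gaps = []
    gap_start = None
    for pos, d in enumerate(depth):
        if d == 0 and gap_start is None:
            gap_start = pos                  # a gap opens here
        elif d > 0 and gap_start is not None:
            gaps.append((gap_start, pos))    # the gap closes here
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, clone_length))

    fold = sum(depth) / clone_length
    return fold, gaps


# Hypothetical placements of five shotgun reads on a 100 bp toy clone.
reads = [(0, 40), (10, 50), (30, 60), (75, 100), (80, 100)]
fold, gaps = coverage_and_gaps(100, reads)
print(fold, gaps)
```

Average coverage above 1x does not guarantee that every base is covered: here positions 60 to 74 have no reads at all, which is the kind of region a finisher would target.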

Page 16: The Human Reference Assembly

Build sequence contigs based on contigs defined in TPF.

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence

Reference assembly history
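A switch point is the position inside an overlap where the consensus stops following one component and starts following the next. The sketch below assumes the two components are already in the same orientation and that the overlap aligns base for base without indels; the function name and sequences are hypothetical.

```python
def join_at_switch_point(seq_a, seq_b, overlap_start, switch_offset):
    """Build one sequence from two overlapping components.

    seq_a[overlap_start:] is assumed to align, base for base, with the
    start of seq_b. The result follows seq_a up to the switch point and
    seq_b from the switch point onward.
    """
    overlap_len = len(seq_a) - overlap_start
    if not 0 <= switch_offset <= overlap_len:
        raise ValueError("switch point must fall inside the overlap")
    return seq_a[:overlap_start + switch_offset] + seq_b[switch_offset:]


# Hypothetical components. They overlap by 7 bases and differ at one
# position inside the overlap (G in seq_a, C in seq_b).
seq_a = "ACGTACGTTTGA"
seq_b = "CGTTTCACCGGA"

early = join_at_switch_point(seq_a, seq_b, overlap_start=5, switch_offset=3)
late = join_at_switch_point(seq_a, seq_b, overlap_start=5, switch_offset=7)
print(early)
print(late)
```

The two results differ at the mismatching base, so the choice of switch point decides which clone's allele ends up in the consensus sequence.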

Page 17: The Human Reference Assembly

NCBI36

Page 18: The Human Reference Assembly
Page 19: The Human Reference Assembly

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

Page 20: The Human Reference Assembly

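Reference assembly history

Ideally…

Figure: a non-sequence based map carrying markers A through O (drawn as A, BCD, EFGH, IJKLMNO) is compared with sequence contigs ABCD, FGH, KL and ON. One contig is marked "(flip)", after which the contigs line up with the map.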
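The ideal case, where a non-sequence based map orders and orients the contigs, can be sketched as follows. Representing a contig as a string of marker letters is an assumption made for this example, and the function name is hypothetical.

```python
def order_and_orient(contigs, map_order):
    """Place contigs along a marker map, flipping those that run backwards.

    Each contig is a string of marker names in the order they occur in
    the contig sequence. Returns a list of (oriented_contig, was_flipped).
    """
    position = {marker: i for i, marker in enumerate(map_order)}
    placed = []
    for contig in contigs:
        hits = [position[m] for m in contig if m in position]
        if not hits:
            continue  # no shared markers, so the map cannot place this contig
        flipped = hits[0] > hits[-1]
        oriented = contig[::-1] if flipped else contig
        placed.append((min(hits), oriented, flipped))
    placed.sort(key=lambda item: item[0])
    return [(oriented, flipped) for _, oriented, flipped in placed]


# Marker letters from the slide; the input order is shuffled on purpose.
result = order_and_orient(["KL", "ON", "ABCD", "FGH"], "ABCDEFGHIJKLMNO")
print(result)
```

This only works when every contig agrees with the map. Contigs whose markers conflict with the map order need manual review rather than a simple flip.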

Page 21: The Human Reference Assembly

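Reference assembly history

More like…

Figure: the same map markers A through O (drawn as A, BCD, EFGH, IJKLMNO), but the sequence contigs disagree with the map. Contig labels include A, BC, ZYX, W, HJ, M, V, N, O, then AB, HIJ, CDY, LMNO, and finally AB, HIJ, LMNO with an unresolved "?".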

Page 22: The Human Reference Assembly

Sequence vs. non-sequence based maps: Mmu7

WI Genetic, WI/MRC RH

Page 23: The Human Reference Assembly

An assembly is a MODEL of the genome

Page 24: The Human Reference Assembly

Fragmented genomes tend to have fewer frameshifts

Alexander Souvorov, NCBI

Reference assembly history

Page 25: The Human Reference Assembly

Fragmented genomes tend to have more partial models

Alexander Souvorov, NCBI

Reference assembly history

Page 26: The Human Reference Assembly

Reference assembly history

Page 27: The Human Reference Assembly

Figure: Human, PANTHER classifications (biological process). Axes: enrichment (-5 to 5); observed and expected (0 to 60 in each direction). Categories shown: major histocompatibility complex antigen; chemokine; tumor necrosis factor receptor; other cytokine receptor; cysteine protease inhibitor; CAM family adhesion molecule; apolipoprotein; KRAB box transcription factor; intermediate filament; immunoglobulin receptor family member; other cell adhesion molecule; zinc finger transcription factor; defense/immunity protein; structural protein; cysteine protease; cytokine receptor; oxygenase; cell adhesion molecule; transcription factor; miscellaneous function; signaling molecule; oxidoreductase; unclassified; nucleic acid binding; select regulatory molecule; kinase; hydrolase; ribosomal protein; protein kinase; G-protein modulator; extracellular matrix; other transcription factor.

Evan Eichler, University of Washington

Reference assembly history

Page 28: The Human Reference Assembly

Center sequence distribution: NCBI36

Page 29: The Human Reference Assembly

Finding the data

Page 30: The Human Reference Assembly

http://genomereference.org

Church et al., 2011 PLoS

Page 31: The Human Reference Assembly
Page 32: The Human Reference Assembly

Issue tracking system (JIRA) publicly available

Finding the data

Page 33: The Human Reference Assembly

Finding the data

Page 34: The Human Reference Assembly

HG-110 AC021180.6 AC149643.1

Finding the data

Page 35: The Human Reference Assembly

Finding the data

Page 36: The Human Reference Assembly

Tiling Path File (TPF)

ACC        NAME          CTG
GAP        Telomere      10000
AP006221   XX-190A2      Hschr1_ctg1
AL627309   RP11-34P13    Hschr1_ctg1
GAP        type-3
AC114498   RP5-857K21    Hschr1_ctg3
AL669831   RP11-206L10   Hschr1_ctg
AL645608   RP11-54O7     Hschr1_ctg3

Putting the genome together
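As a rough illustration of how a tiling path can be read programmatically, the sketch below parses the rows shown on the TPF slide. It assumes whitespace-delimited columns (accession, clone name, contig) and treats any row starting with GAP as a gap; that layout is inferred from the slide, not from the TPF specification. The row whose contig name is cut off on the slide is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Rows copied from the slide; the row with a truncated contig name is omitted.
TPF_TEXT = """\
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""


@dataclass
class TpfRow:
    is_gap: bool
    fields: Tuple[str, ...]  # (accession, clone, contig) or the gap details


def parse_tpf(text: str) -> List[TpfRow]:
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "GAP":
            rows.append(TpfRow(True, tuple(parts[1:])))
        else:
            rows.append(TpfRow(False, tuple(parts)))
    return rows


def clones_by_contig(rows: List[TpfRow]) -> Dict[str, List[str]]:
    """Group accessions by contig, keeping tiling path order."""
    grouped: Dict[str, List[str]] = {}
    for row in rows:
        if row.is_gap:
            continue
        accession, _clone, contig = row.fields
        grouped.setdefault(contig, []).append(accession)
    return grouped


rows = parse_tpf(TPF_TEXT)
grouped = clones_by_contig(rows)
print(grouped)
```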

Page 37: The Human Reference Assembly

Putting the genome together

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/overlap/

Page 38: The Human Reference Assembly

Serious alignment problem requiring review

Minor alignment problem

Excellent alignment

Certificate submitted, not yet approved

Certificate submitted and approved

Join not evaluated

Valid, contained clones

Putting the genome together

Page 39: The Human Reference Assembly

Putting the genome together

Page 40: The Human Reference Assembly

Putting the genome together

Page 41: The Human Reference Assembly

AGP: A Golden Path

Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosome
• Switch points
• Defines gaps and types

GRC produces:
GenBank components -> Scaffolds
GenBank components -> Chromosome
Scaffolds -> Chromosome

Putting the genome together
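The way an AGP file acts as build instructions can be sketched in a few lines. This example assumes tab-delimited AGP rows with 1-based inclusive coordinates, component rows with an orientation in the ninth column, and gap rows of type N or U with the gap length in the sixth column. The object name, component identifiers and sequences are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_text, components):
    """Concatenate component slices and gaps into one sequence per object."""
    objects = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        obj, comp_type = fields[0], fields[4]
        if comp_type in ("N", "U"):
            piece = "N" * int(fields[5])          # gap row: column 6 is the length
        else:
            comp_id = fields[5]
            beg, end = int(fields[6]), int(fields[7])
            piece = components[comp_id][beg - 1:end]  # 1-based inclusive slice
            if fields[8] == "-":                  # only "-" is reverse complemented here
                piece = reverse_complement(piece)
        objects[obj] = objects.get(obj, "") + piece
    return objects


# Hypothetical components and AGP rows, not real accessions.
components = {"COMP_A.1": "ACGTACGTAC", "COMP_B.1": "GGGTTTCCCA"}
agp_rows = [
    ("chrT", "1", "8", "1", "F", "COMP_A.1", "1", "8", "+"),
    ("chrT", "9", "13", "2", "N", "5", "contig", "no", "na"),
    ("chrT", "14", "19", "3", "F", "COMP_B.1", "3", "8", "-"),
]
agp_text = "\n".join("\t".join(row) for row in agp_rows)

built = build_from_agp(agp_text, components)
print(built["chrT"])
```

The component start and end columns are where switch points are recorded: they say how much of each clone contributes to the final sequence.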

Page 42: The Human Reference Assembly

Putting the genome together

Page 43: The Human Reference Assembly

Large-Scale Variation Complicates Genome Assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 44: The Human Reference Assembly

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:
FASTA
AGP
Alignment to chromosome

Regions shown: UGT2B17, MHC, MAPT

Page 45: The Human Reference Assembly

Data Model

Figure: an Assembly (e.g. GRCh37) contains a Primary Assembly, a non-nuclear assembly unit (e.g. MT), and alternate loci units ALT 1 through ALT 9. Also labeled: PAR, Genomic Region (MHC), Genomic Region (UGT2B17), Genomic Region (MAPT).
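One way to picture this data model in code is a small set of dataclasses. This is a sketch, not the GRC's or NCBI's actual schema: the class and field names are hypothetical, and assigning ALT 8 and ALT 9 to UGT2B17 and MAPT is an assumption (the slides only state that 7 alternate haplotypes are at the MHC). The PAR label is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblyUnit:
    name: str
    kind: str  # "primary", "alt", or "non-nuclear"


@dataclass
class GenomicRegion:
    name: str
    alt_units: List[AssemblyUnit] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    primary: AssemblyUnit
    non_nuclear: AssemblyUnit
    regions: List[GenomicRegion] = field(default_factory=list)

    def alt_loci(self) -> List[AssemblyUnit]:
        """All alternate locus units across every genomic region."""
        return [unit for region in self.regions for unit in region.alt_units]


# ALT 1 to ALT 7 at the MHC follows the slides; ALT 8 and ALT 9 are assumed.
mhc = GenomicRegion("MHC", [AssemblyUnit(f"ALT {i}", "alt") for i in range(1, 8)])
ugt2b17 = GenomicRegion("UGT2B17", [AssemblyUnit("ALT 8", "alt")])
mapt = GenomicRegion("MAPT", [AssemblyUnit("ALT 9", "alt")])

grch37 = Assembly(
    name="GRCh37",
    primary=AssemblyUnit("Primary Assembly", "primary"),
    non_nuclear=AssemblyUnit("MT", "non-nuclear"),
    regions=[mhc, ugt2b17, mapt],
)
print(len(grch37.alt_loci()))
```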

Page 46: The Human Reference Assembly

Take home messages

Assemblies are not genomes, they are models of genomes

All eukaryotic assemblies have some issues
• Mis-assemblies
• Missing variation

Assembly evidence is important

Assemblies are not static (if you are lucky!)