The Human Reference Assembly

Post on 22-Feb-2016

The Human Reference Assembly: How the sequence is made

Deanna M. Church, Staff Scientist, NCBI (@deannachurch)

Valerie Schneider, NCBI

Short Course in Medical Genetics 2013

http://genomereference.org

HGP Goals (Collins FS et al., 1998)

Throughput: 500 Mb/year

Cost: < $0.25 per base

Variation: 100,000 SNPs mapped

[Figure: NCBI dbSNP database growth, 1999 to 2011 (dbSNP build 135, November 2011; Steve Sherry, NCBI). One panel plots human variations as millions of rs-ids (non-redundant annotations), split into SNP, STR & indel, and ambiguous mapping. The other plots millions of submissions by project: TSC, HapMap, 1000 Genomes, and other projects.]

[Figure: APOBEC cluster (Kidd et al., 2007). Black: deletion; white: insertion.]

Genome Research, May 1997

Reference assembly history

WGS: Sanger reads

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Each end sequence is referred to as a read

Find sequence overlaps

WGS contig (tails)

Scaffold
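The "find sequence overlaps" step can be illustrated with a minimal sketch. It assumes error-free reads on the same strand and merges them greedily in a fixed order; real assemblers handle sequencing errors, both strands, repeats, and mate-pair constraints. The three toy reads and the function names are hypothetical and chosen only for illustration.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical toy reads (assumption: already ordered left to right).
reads = ["ATTTTCCCTTCTG", "TTCTGAAATGATG", "GATGAAAGAGTC"]

contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is None:
        raise ValueError(f"read {read!r} does not overlap the growing contig")
    contig = merged

print(contig)
```

Exact suffix/prefix matching keeps the idea visible; a practical tool would score approximate overlaps instead.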

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Schatz et al, 2010

Clone based assemblies

BAC insert, BAC vector

Shotgun sequence

Assemble

[Figure: axes labeled "Fold sequence" and "Gaps"]

Gaps: deeper sequence coverage rarely resolves all gaps

“Finishers” go in to manually fill the gaps, often by PCR
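One way to see why extra coverage gives diminishing returns is the idealized Lander-Waterman model, in which reads land uniformly at random. The sketch below is not taken from the slides. It ignores the minimum detectable overlap, and the BAC insert size and read length are assumed values. Under this model the expected gap count falls exponentially with coverage but never reaches zero, and real projects do worse because repeats and hard-to-clone regions are not sampled at random.

```python
import math


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of the target left unsequenced at a given fold coverage."""
    return math.exp(-coverage)


def expected_contigs(num_reads: int, read_length: int, target_length: int) -> float:
    """Expected number of contigs (islands) for uniformly random reads.

    Simplified Lander-Waterman estimate: N * exp(-c), with c = N * L / G.
    """
    coverage = num_reads * read_length / target_length
    return num_reads * math.exp(-coverage)


TARGET_LENGTH = 150_000  # assumption: BAC insert size in bases
READ_LENGTH = 500        # assumption: Sanger read length in bases

for fold in (1, 2, 4, 8):
    num_reads = fold * TARGET_LENGTH // READ_LENGTH
    contigs = expected_contigs(num_reads, READ_LENGTH, TARGET_LENGTH)
    print(f"{fold}x coverage: ~{contigs:.1f} contigs, "
          f"{uncovered_fraction(fold):.4%} of bases uncovered")
```

The number of gaps between contigs is roughly the contig count minus one, which is where finishers take over.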

Build sequence contigs based on contigs defined in the TPF.

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

[Figure labels: switch point; consensus sequence]

NCBI36

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321

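Ideally…

[Figure: sequence contigs (A, BCD, EFGH, IJKLMNO) compared with a non-sequence based map (ABCD, FGH, KL, ON), with one piece marked "(flip)".]

More like…

[Figure: sequence contigs (A, BCD, EFGH, IJKLMNO) compared with conflicting map evidence (A, BC, ZYX, W, HJ, M, V, N, O; AB, HIJ, CDY, LMNO), leaving a placement unresolved ("?").]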

Sequence vs. non-sequence based maps: Mmu7

[Figure: Mmu7 sequence compared with the WI Genetic and WI/MRC RH maps]

An assembly is a MODEL of the genome

Fragmented genomes tend to have fewer frame shifts (Alexander Souvorov, NCBI)

Fragmented genomes tend to have more partial models (Alexander Souvorov, NCBI)

[Figure: Human PANTHER classifications (biological process), showing observed vs. expected enrichment per category (Evan Eichler, University of Washington). Categories: Major histocompatibility complex antigen; Chemokine; Tumor necrosis factor receptor; Other cytokine receptor; Cysteine protease inhibitor; CAM family adhesion molecule; Apolipoprotein; KRAB box transcription factor; Intermediate filament; Immunoglobulin receptor family member; Other cell adhesion molecule; Zinc finger transcription factor; Defense/immunity protein; Structural protein; Cysteine protease; Cytokine receptor; Oxygenase; Cell adhesion molecule; Transcription factor; Miscellaneous function; Signaling molecule; Oxidoreductase; Unclassified; Nucleic acid binding; Select regulatory molecule; Kinase; Hydrolase; Ribosomal protein; Protein kinase; G-protein modulator; Extracellular matrix; Other transcription factor.]

Center sequence distribution: NCBI36

Finding the data

http://genomereference.org

Church et al., 2011 PLoS

Issue tracking system (JIRA) publicly available

HG-110 AC021180.6 AC149643.1

Tiling Path File (TPF)

ACC        NAME          CTG
GAP        Telomere      10000
AP006221   XX-190A2      Hschr1_ctg1
AL627309   RP11-34P13    Hschr1_ctg1
GAP        type-3
AC114498   RP5-857K21    Hschr1_ctg3
AL669831   RP11-206L10   Hschr1_ctg
AL645608   RP11-54O7     Hschr1_ctg3
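A TPF excerpt like this one can be read into simple records, as in the sketch below. It assumes whitespace-separated rows exactly as they appear on the slide, and it treats the number on a GAP row as a gap length; both are assumptions, and the official TPF specification defines more fields than are handled here. The class and function names are hypothetical. The row whose contig name appears cut off on the slide is left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Component:
    accession: str
    clone_name: str
    contig: str


@dataclass
class Gap:
    gap_type: str
    length: Optional[int] = None  # assumption: third field on a GAP row is a length


TPF_TEXT = """\
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""


def parse_tpf(text: str) -> List[Union[Component, Gap]]:
    """Turn slide-style TPF rows into Component and Gap records, in tiling order."""
    rows: List[Union[Component, Gap]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "GAP":
            has_length = len(fields) > 2 and fields[2].isdigit()
            rows.append(Gap(fields[1], int(fields[2]) if has_length else None))
        elif len(fields) >= 3:
            rows.append(Component(fields[0], fields[1], fields[2]))
        else:
            raise ValueError(f"unrecognized TPF row: {line!r}")
    return rows


def clones_by_contig(rows: List[Union[Component, Gap]]) -> Dict[str, List[str]]:
    """Group clone names by the contig they are assigned to, keeping path order."""
    grouped: Dict[str, List[str]] = {}
    for row in rows:
        if isinstance(row, Component):
            grouped.setdefault(row.contig, []).append(row.clone_name)
    return grouped


rows = parse_tpf(TPF_TEXT)
for contig, clones in clones_by_contig(rows).items():
    print(contig, clones)
```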

Putting the genome together

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/overlap/

Serious alignment problem requiring review

Minor alignment problem

Excellent alignment

Certificate submitted, not yet approved

Certificate submitted and approved

Join not evaluated

Valid, contained clones

AGP: A Golden Path

Provides instructions for building a sequence
• Defines component sequences used to build scaffolds/chromosomes
• Switch points
• Defines gaps and types

GRC produces
• GenBank components -> Scaffolds
• GenBank components -> Chromosome
• Scaffolds -> Chromosome
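As a rough illustration of how an AGP file acts as build instructions, the sketch below assembles an object from tab-separated AGP rows using the standard column layout (object, object start, object end, part number, component type, then either component ID/start/end/orientation or gap length and gap details). The component IDs, sequences, and coordinates are made up for the example. Where one component's range ends and the next begins plays the role of a switch point.

```python
from typing import Dict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_text: str, components: Dict[str, str]) -> Dict[str, str]:
    """Build each AGP object by concatenating component slices and gap Ns."""
    objects: Dict[str, str] = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        obj, obj_beg, obj_end, _part, comp_type = fields[:5]
        if comp_type in ("N", "U"):  # gap row: column 6 is the gap length
            piece = "N" * int(fields[5])
        else:  # sequence row: 1-based, inclusive component coordinates
            comp_id, beg, end = fields[5], int(fields[6]), int(fields[7])
            piece = components[comp_id][beg - 1:end]
            if fields[8] == "-":
                piece = reverse_complement(piece)
        built = objects.get(obj, "")
        expected_length = int(obj_end) - int(obj_beg) + 1
        if len(built) != int(obj_beg) - 1 or len(piece) != expected_length:
            raise ValueError(f"inconsistent coordinates in row: {line!r}")
        objects[obj] = built + piece
    return objects


# Hypothetical components and AGP rows, for illustration only.
COMPONENTS = {"CLONE_A.1": "ACGTACGTAC", "CLONE_B.1": "TTGACCAATT"}
AGP_ROWS = [
    ("scaffold_1", "1", "6", "1", "F", "CLONE_A.1", "1", "6", "+"),
    ("scaffold_1", "7", "11", "2", "N", "5", "scaffold", "yes", "paired-ends"),
    ("scaffold_1", "12", "15", "3", "F", "CLONE_B.1", "3", "6", "-"),
]
AGP_TEXT = "\n".join("\t".join(row) for row in AGP_ROWS)

scaffolds = build_from_agp(AGP_TEXT, COMPONENTS)
print(scaffolds["scaffold_1"])
```

The same function covers all three products listed above, since a chromosome-level AGP can name either GenBank components or scaffolds as its components.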

Large-Scale Variation Complicates Genome Assembly

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome

UGT2B17, MHC, MAPT

Data Model

Assembly (e.g. GRCh37)
• Primary Assembly
• Non-nuclear assembly unit (e.g. MT)
• ALT 1 through ALT 9
• PAR

Genomic regions: MHC, UGT2B17, MAPT
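The data model can be sketched as a few nested records. This is an illustrative structure, not the GRC's actual schema: the class names are hypothetical, the PAR box is left out because the slide text does not show how it attaches, and the split of the nine ALT units across regions (seven at the MHC, as stated earlier, and one each for UGT2B17 and MAPT) is an assumption, as is the order in which ALT numbers are assigned.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblyUnit:
    name: str
    kind: str  # "primary", "alt", or "non-nuclear"


@dataclass
class GenomicRegion:
    name: str
    alt_units: List[AssemblyUnit] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)
    regions: List[GenomicRegion] = field(default_factory=list)

    def alt_units(self) -> List[AssemblyUnit]:
        """All alternate-locus assembly units in this assembly."""
        return [unit for unit in self.units if unit.kind == "alt"]


# Assumption: how many ALT units each region carries (7 + 1 + 1 = 9).
ASSUMED_ALT_COUNTS = {"MHC": 7, "UGT2B17": 1, "MAPT": 1}

alts = [AssemblyUnit(f"ALT {number}", "alt") for number in range(1, 10)]
remaining = iter(alts)
regions = [
    GenomicRegion(region_name, [next(remaining) for _ in range(count)])
    for region_name, count in ASSUMED_ALT_COUNTS.items()
]

grch37 = Assembly(
    name="GRCh37",
    units=[AssemblyUnit("Primary Assembly", "primary"),
           AssemblyUnit("MT", "non-nuclear")] + alts,
    regions=regions,
)

for region in grch37.regions:
    print(region.name, [unit.name for unit in region.alt_units])
```

Keeping alternate loci as separate units tied to a named region is what lets the assembly represent both haplotypes instead of compressing them into one consensus.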

Take home messages

• Assemblies are not genomes; they are models of genomes
• All eukaryotic assemblies have some issues: mis-assemblies, missing variation
• Assembly evidence is important
• Assemblies are not static (if you are lucky!)
