Understanding the reference assembly: CSHL Hackathon



Page 1: Understanding the reference assembly: CSHL Hackathon

Understanding the reference assembly

Valerie Schneider, NCBI

26 October 2016

http://www.biorxiv.org/content/early/2016/08/30/072116

Page 2: Understanding the reference assembly: CSHL Hackathon

Dilthey et al.; Paten et al.

Scientific Models

Page 3: Understanding the reference assembly: CSHL Hackathon

• Distinguishing features of the human reference assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?

Outline

Page 4: Understanding the reference assembly: CSHL Hackathon
Page 5: Understanding the reference assembly: CSHL Hackathon

Assembly Basics

Sanger-sequenced, clone-based assembly

• BAC insert, BAC vector
• Shotgun sequence clone
• Assemble (gaps)
• Finish
• Minimal tiling path: define switch points for adjacent components (haploid mosaic)

Most contiguous, highest sequence quality
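The tiling-path step can be illustrated with a toy Python sketch. This is not the GRC's pipeline: the `Component` class, its field names, and the sequences are hypothetical, chosen only to show how switch points decide which part of each overlapping clone ends up in the haploid mosaic.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One clone on the tiling path (hypothetical structure for illustration)."""
    name: str
    sequence: str
    start: int  # 0-based offset where this clone starts contributing
    end: int    # exclusive offset where the path switches to the next clone


def stitch_tiling_path(path):
    """Concatenate the contributing slice of each component, in path order."""
    pieces = []
    for comp in path:
        if not (0 <= comp.start <= comp.end <= len(comp.sequence)):
            raise ValueError(f"bad switch points for {comp.name}")
        pieces.append(comp.sequence[comp.start:comp.end])
    return "".join(pieces)


# Two made-up clones that overlap by "GGTT"; the switch point falls inside the overlap.
path = [
    Component("clone_A", "ACGTACGGTT", 0, 8),  # contributes ACGTACGG
    Component("clone_B", "GGTTCCAA", 2, 8),    # contributes TTCCAA
]
mosaic = stitch_tiling_path(path)
```

Because adjacent clones may come from different haplotypes, moving a switch point within an overlap changes which haplotype's bases appear in the final sequence.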

Page 6: Understanding the reference assembly: CSHL Hackathon

Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele

Assembly Basics

It represents the sequence available from the HGP

Page 7: Understanding the reference assembly: CSHL Hackathon

GRC Assembly Model

Sequences from haplotype 1
Sequences from haplotype 2

Old Assembly model: compress into a consensus

Current Assembly model: represent both (many) haplotypes

Page 8: Understanding the reference assembly: CSHL Hackathon

GRC Assembly Model

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

Assembly (e.g. GRCh38)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1 to ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)

Page 9: Understanding the reference assembly: CSHL Hackathon

The alignments of the alternate loci scaffolds to the chromosomes are an integral part of the assembly and can be downloaded from GenBank with the assembly sequences
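As a rough sketch of how such an alignment record might be read, the Python below parses a single GFF3 feature line. It assumes the files follow the generic GFF3 conventions for the `Target` and `Gap` attributes; the sequence names and coordinates in the example line are made up, and percent-encoded attribute values are not handled.

```python
def parse_gff3_alignment(line):
    """Parse one GFF3 feature line that carries Target and (optionally) Gap."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("GFF3 feature lines have 9 tab-separated columns")
    seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
    attributes = dict(field.split("=", 1) for field in attrs.split(";") if field)
    # Target is "<id> <start> <end> [strand]"
    target_fields = attributes["Target"].split()
    # Gap is a space-separated list of operations such as "M8 D3 I1"
    gap_ops = [(op[0], int(op[1:])) for op in attributes.get("Gap", "").split()]
    return {
        "seqid": seqid,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "target_id": target_fields[0],
        "target_start": int(target_fields[1]),
        "target_end": int(target_fields[2]),
        "gap_ops": gap_ops,
    }


def matched_columns(gap_ops):
    """Count alignment columns where both sequences have a base (M operations)."""
    return sum(length for op, length in gap_ops if op == "M")


# Hypothetical record: an alt scaffold aligned to 23 bases of a chromosome.
example = "\t".join([
    "chrom_example", "demo", "match", "1001", "1023", ".", "+", ".",
    "Target=alt_example 1 21 +;Gap=M8 D3 M6 I1 M6",
])
record = parse_gff3_alignment(example)
aligned = matched_columns(record["gap_ops"])
```

In the `Gap` string, `D` marks bases present only on the chromosome and `I` marks bases present only on the alt scaffold, which is where an alt's novel sequence shows up.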

Page 10: Understanding the reference assembly: CSHL Hackathon

GRC Assembly Model

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

Assembly (e.g. GRCh38.p1)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1 to ALT 7
• Patches
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
• Genomic Region (ABO)
• Genomic Region (FOXO6)
• Genomic Region (FCGBP)

Patches
• FIX: scaffold status at next major assembly release: -- (integrated). Treat as: Preferred
• NOVEL: scaffold status at next major assembly release: ALT LOCI. Treat as: Allelic
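For tool authors, the two patch types can be captured in a small lookup. The sketch below is illustrative only: the `PATCH_RULES` table and the `handling_for` helper are hypothetical names, and the rules simply restate the slide (FIX patches are integrated at the next major release and treated as preferred; NOVEL patches become alt loci and are treated as allelic).

```python
# Hypothetical lookup restating the patch rules from the slide.
PATCH_RULES = {
    "FIX": {"next_major_release": "integrated", "treat_as": "preferred"},
    "NOVEL": {"next_major_release": "alt locus", "treat_as": "allelic"},
}


def handling_for(patch_type):
    """Return how a patch scaffold of the given type should be handled."""
    try:
        return PATCH_RULES[patch_type.upper()]
    except KeyError:
        raise ValueError(f"unknown patch type: {patch_type!r}") from None


fix_rule = handling_for("fix")
novel_rule = handling_for("NOVEL")
```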

Page 11: Understanding the reference assembly: CSHL Hackathon

1q32 1q21 1p21

Dennis et al., 2012

GRC Assembly Model

Page 12: Understanding the reference assembly: CSHL Hackathon

GRC: Assembly Model

GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes

Page 13: Understanding the reference assembly: CSHL Hackathon

GRCh38.p9
• 96 Patches: >1 Mb novel sequence
• 48 FIX
• 48 NOVEL

GRC: Assembly Model

Page 14: Understanding the reference assembly: CSHL Hackathon

GRCh38: Alt Loci

Alignment legend: no alignment, mismatch, deletion

Page 15: Understanding the reference assembly: CSHL Hackathon

GRCh38: Alt Loci

[Figure labels: chromosome; alt/patch; reads; on-target alignment; off-target alignments (n=122,922)]

PLoS Biol. 2011 Jul;9(7):e1001091

Page 16: Understanding the reference assembly: CSHL Hackathon

Anatomy of an alt

Page 17: Understanding the reference assembly: CSHL Hackathon

Anatomy of an alt

AC012314.8

CU151838.1

ALT LOCI

AC012314.8

AC245052.3 CHR. 19

Due to anchor components, alternate loci contain some sequence that is redundant to the primary assembly unit

Page 18: Understanding the reference assembly: CSHL Hackathon

GRCh38 Model Centromeres

Karen Miga (Kent Lab, UCSC)

Page 19: Understanding the reference assembly: CSHL Hackathon

GRCh38 Model Centromeres

WGS WGS WGS

Page 20: Understanding the reference assembly: CSHL Hackathon

GRCh38 Centromeres

Miga et al., Genome Res. 2014 Apr;24(4):697-707

Page 21: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Page 22: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

GRCh38 Sequences for alignment pipelines

Page 23: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Page 24: Understanding the reference assembly: CSHL Hackathon

Assembly Sequence and Statistics Reports

GRCh38: Where’s the data?

Page 25: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Page 26: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Assembly Regions Report: Alts, Patches and Centromeres

Page 27: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Page 28: Understanding the reference assembly: CSHL Hackathon
Page 29: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Page 30: Understanding the reference assembly: CSHL Hackathon

GRCh38: Where’s the data?

Page 31: Understanding the reference assembly: CSHL Hackathon

Accessing the Data

https://genomereference.org

Page 32: Understanding the reference assembly: CSHL Hackathon

Accessing the Data

https://genomereference.org

Page 33: Understanding the reference assembly: CSHL Hackathon

Dumped daily

Frozen mappings to prior assembly versions in GFF3

Accessing the Data

https://genomereference.org

Page 34: Understanding the reference assembly: CSHL Hackathon

Mapped to latest GRCh38 and GRCh37.p13

Accessing the Data

https://genomereference.org

Page 35: Understanding the reference assembly: CSHL Hackathon

GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

GRC Credits

https://genomereference.org

Page 36: Understanding the reference assembly: CSHL Hackathon

Alt Loci: Informatics Challenges

Page 37: Understanding the reference assembly: CSHL Hackathon

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly

• Mask1: mask chromosome for fix patches, scaffold for novel/alts
• Mask2: mask only on scaffolds

Simulated Reads

GRCh38: Alt Loci
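The masking idea can be sketched in Python as replacing chosen intervals with N so that an aligner has only one place to put a read. This is a minimal illustration, not the procedure used to build the masks on the slide: the `mask_intervals` helper, the sequence, and the coordinates are hypothetical, and real masks would be derived from the alt and patch alignments.

```python
def mask_intervals(sequence, intervals):
    """Return `sequence` with each 0-based, half-open (start, end) interval set to N."""
    bases = list(sequence)
    for start, end in intervals:
        if not (0 <= start <= end <= len(bases)):
            raise ValueError(f"interval out of range: {(start, end)}")
        bases[start:end] = "N" * (end - start)
    return "".join(bases)


# Made-up alt scaffold whose middle four bases duplicate chromosome sequence.
alt_scaffold = "TTACGTGG"
masked_scaffold = mask_intervals(alt_scaffold, [(2, 6)])
```

Masking on the scaffold keeps chromosome coordinates untouched, while masking on the chromosome (as Mask1 does for fix patches) steers reads to the corrected patch sequence instead.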

Page 38: Understanding the reference assembly: CSHL Hackathon

The Changing Reference

Page 39: Understanding the reference assembly: CSHL Hackathon

The Changing Reference

Page 40: Understanding the reference assembly: CSHL Hackathon

Dilthey et al.; Paten et al.

The Changing Reference

Page 41: Understanding the reference assembly: CSHL Hackathon

• Distinguishing features of the human reference assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?

Outline