Understanding the reference assembly: CSHL hackathon

Post on 17-Jan-2017

Understanding the reference assembly

Valerie Schneider, NCBI

26 October 2016

http://www.biorxiv.org/content/early/2016/08/30/072116

Dilthey et al.; Paten et al.

Scientific Models

Outline

• Distinguishing features of the human reference assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?

Assembly Basics

Sanger-sequenced, clone-based assembly

BAC insert

BAC vector

Shotgun sequence clone

Assemble

GAPS

Finish

Minimal Tiling Path

Define switch points for adjacent components (haploid mosaic)

Most contiguous

Highest sequence quality

Today’s reference assembly does not represent:
1. The most common allele
2. The longest allele
3. The ancestral allele

Assembly Basics

It represents the sequence available from the HGP

GRC Assembly Model

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

Current Assembly model: represent both (now many) haplotypes

Assembly (e.g. GRCh38)

Primary Assembly

Unit

Non-nuclear assembly unit

(e.g. MT)

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

GRC Assembly Model

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 1

The alignments of the alternate loci scaffolds to the chromosomes are an integral part of the assembly and can be downloaded from GenBank with the assembly sequences

Assembly (e.g. GRCh38.p1)

Primary Assembly

Unit

Non-nuclear assembly unit

(e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

PAR

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

Patches

Genomic Region (ABO)

Genomic Region (FOXO6)

Genomic Region (FCGBP)

GRC Assembly Model

Patches

Patch type: FIX
Scaffold status at next major assembly release: -- (integrated)
Treat as: Preferred

Patch type: NOVEL
Scaffold status at next major assembly release: ALT LOCI
Treat as: Allelic
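
The patch rules on this slide (FIX patches are integrated at the next major release and treated as preferred; NOVEL patches become alt loci and are treated as allelic) can be sketched as a small lookup table. The names `PATCH_RULES` and `treat_as` are hypothetical and used only for illustration.

```python
# Patch type -> scaffold status at the next major release and how to treat it now.
# The dictionary keys and value strings are illustrative labels, not an official vocabulary.
PATCH_RULES = {
    "FIX": {"next_major_release": "integrated", "treat_as": "preferred"},
    "NOVEL": {"next_major_release": "alt locus", "treat_as": "allelic"},
}


def treat_as(patch_type: str) -> str:
    """Return how a patch scaffold should be treated relative to the chromosome."""
    return PATCH_RULES[patch_type.upper()]["treat_as"]


print(treat_as("fix"))
print(treat_as("novel"))
```

A lookup table keeps the two patch types and their consequences in one place, which is convenient when a pipeline has to branch on scaffold type.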

1q32 1q21 1p21

Dennis et al., 2012

GRC Assembly Model

GRC: Assembly Model

GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes

GRCh38.p9
• 96 Patches: >1 Mb novel sequence
• 48 FIX
• 48 NOVEL

GRCh38: Alt Loci

Alignment Legend: no alignment, mismatch, deletion

chromosome

alt/patch

reads

On-target alignment

Off-target alignments

(n=122,922)

PLoS Biol. 2011 Jul;9(7):e1001091

Anatomy of an alt

AC012314.8

CU151838.1

ALT LOCI

AC012314.8

AC245052.3 CHR. 19

Due to anchor components, alternate loci contain some sequence that is redundant to the primary assembly unit
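
To make the redundancy point concrete, here is a minimal sketch that estimates how much of an alt scaffold is covered by alignments to the chromosome and how much is left unaligned. The scaffold length and intervals are made-up numbers, and `merged_length` is a hypothetical helper. Aligned sequence may still contain mismatches, so "aligned" is only a proxy for "redundant".

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous run (if any) and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current run.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Hypothetical alt scaffold: aligned anchors at both ends, unaligned sequence in the middle.
alt_length = 1000
aligned_to_chromosome = [(0, 200), (150, 300), (800, 1000)]

aligned_bases = merged_length(aligned_to_chromosome)
unaligned_bases = alt_length - aligned_bases
print(aligned_bases, unaligned_bases)
```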

GRCh38 Model Centromeres

Karen Miga (Kent Lab, UCSC)

WGS WGS WGS

GRCh38 Centromeres

Miga et al., Genome Res. 2014 Apr;24(4):697-707

GRCh38: Where’s the data?

GRCh38 Sequences for alignment pipelines

Assembly Sequence and Statistics Reports

Assembly Regions Report: Alts, Patches and Centromeres
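
As a sketch of how such a regions report might be consumed, the snippet below counts scaffolds by role in a tab-delimited table. The column layout (region name, chromosome, start, stop, scaffold role, accession), the role labels, and every row in `SAMPLE_REPORT` are assumptions made for illustration; check the header of the actual report before relying on column positions.

```python
from collections import Counter

# Hypothetical excerpt. Column order and role labels are assumptions, not copied from a real file.
SAMPLE_REPORT = """\
# Region-Name\tChromosome\tChromosome-Start\tChromosome-Stop\tScaffold-Role\tScaffold-Accn
REGION_A\t1\t1000\t5000\talt-scaffold\tXX000001.1
REGION_A\t1\t1000\t5000\talt-scaffold\tXX000002.1
REGION_B\t9\t200\t900\tfix-patch\tXX000003.1
REGION_C\t19\t50\t700\tnovel-patch\tXX000004.1
"""

ROLE_COLUMN = 4  # zero-based index of the scaffold role in the assumed layout


def count_scaffold_roles(report_text):
    """Count rows per scaffold role, skipping blank lines and '#' comment lines."""
    roles = Counter()
    for line in report_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        roles[fields[ROLE_COLUMN]] += 1
    return roles


print(count_scaffold_roles(SAMPLE_REPORT))
```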

Accessing the Data

https://genomereference.org

Dumped daily

Frozen mappings to prior assembly versions in GFF3
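
Since the mappings are provided in GFF3, a minimal parser for one GFF3 feature line may be a useful starting point. This sketch relies only on the general GFF3 layout (nine tab-separated columns, 1-based inclusive coordinates, `key=value` attributes separated by semicolons). The example line, its seqid, and its attribute values are hypothetical.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with integer coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute text.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical feature line, for illustration only.
example = "chr_demo\tdemo_source\tregion\t100\t250\t.\t+\t.\tID=demo_1;Name=Example%20feature"
rec = parse_gff3_line(example)
feature_length = rec["end"] - rec["start"] + 1  # coordinates are 1-based and inclusive
print(rec["seqid"], feature_length, rec["attributes"]["Name"])
```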

Mapped to latest GRCh38 and GRCh37.p13

GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs

GRC Credits

https://genomereference.org

Alt Loci: Informatics Challenges

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly.

Mask1: mask chr for fix patches, scaffold for novel/alts.
Mask2: mask only on scaffolds.

Simulated Reads
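
The two masking schemes can be sketched as a rule for choosing which member of an aligned chromosome/scaffold pair gets masked, plus a helper that applies the mask. Hard-masking with `N` is an assumption here, since the slide does not say how the masks are implemented. The role labels and the function names `mask_target` and `apply_mask` are hypothetical.

```python
def mask_target(scaffold_role, scheme):
    """Return which sequence to mask for a chromosome/scaffold pair.

    scaffold_role: "fix-patch", "novel-patch", or "alt-scaffold" (illustrative labels).
    scheme: "mask1" or "mask2", following the slide's naming.
    """
    if scheme == "mask1":
        # Mask1: mask the chromosome for fix patches, the scaffold for novel patches and alts.
        return "chromosome" if scaffold_role == "fix-patch" else "scaffold"
    if scheme == "mask2":
        # Mask2: mask only on scaffolds.
        return "scaffold"
    raise ValueError(f"unknown scheme: {scheme}")


def apply_mask(sequence, intervals):
    """Replace bases inside half-open (start, end) intervals with 'N' (assumed hard mask)."""
    bases = list(sequence)
    for start, end in intervals:
        end = min(end, len(bases))  # clamp so the sequence length never changes
        for i in range(max(start, 0), end):
            bases[i] = "N"
    return "".join(bases)


print(mask_target("fix-patch", "mask1"))
print(mask_target("fix-patch", "mask2"))
print(apply_mask("ACGTACGT", [(2, 5)]))
```

Under either scheme only one copy of each duplicated stretch remains available to the aligner, which is what reduces ambiguous placements.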

GRCh38: Alt Loci

The Changing Reference

Dilthey et al.; Paten et al.

Outline

• Distinguishing features of the human reference assembly
• Implications for genomic analyses and tools
• Where do you get assembly-relevant data?
