hg19 (GRCh37) vs. hg38 (GRCh38) Human Genome Reference Comparison Zuotian Tatum Department of Human Genetics Leiden University Medical Center


Post on 19-Dec-2015



Timeline

GRCh37:

First release: Feb 27, 2009

Latest patch: Jun 28, 2013 (p13)

GRCh38:

First release: Dec 24, 2013

Latest patch: Oct 14, 2014 (p1)

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/

Content

GRCh37.p13:

Total bases: 3.23 Billion (2.99 Billion without N)

N50: 46 Million

Number of alternative loci: 9

Non-nuclear genome: No

GRCh38.p2:

Total bases: 3.21 Billion (3.05 Billion without N)

N50: 67 Million

Number of alternative loci: 261

Non-nuclear genome: Yes

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/
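The N50 and "without N" figures above can be reproduced from a list of sequence lengths and base counts. The following is a minimal sketch, not the GRC's own tooling: the function names and the toy lengths are illustrative assumptions, and only the total and non-N base counts come from the slide.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def n_fraction(total_bases, non_n_bases):
    """Fraction of the assembly that consists of N (gap) bases."""
    return (total_bases - non_n_bases) / total_bases


# Made-up lengths for illustration only; these are not real scaffolds.
toy_lengths = [80, 70, 50, 40, 30, 20]
print("toy N50:", n50(toy_lengths))

# Totals in billions of bases, as quoted on the slide.
print("GRCh37.p13 N fraction:", round(n_fraction(3.23, 2.99), 4))
print("GRCh38.p2  N fraction:", round(n_fraction(3.21, 3.05), 4))
```

With the slide's figures, the share of N bases drops from roughly 7% in GRCh37.p13 to roughly 5% in GRCh38.p2.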

UCSC tracks for GRCh38

UCSC RefSeq available since April 2014.

Ensembl regulatory build available since September 2014.

dbSNP 141 available since October 2014.

ENCODE and FANTOM5 track hubs are still not available (Nov 2014).

New in GRCh38 release

Three new sequence files, in addition to the standard assembly files:

- GCA_000001405.15_GRCh38_top-level.fna.gz

- GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

- GCA_000001405.15_GRCh38_full_analysis_set.fna.gz

The analysis set files are created to avoid false mapping in NGS alignment pipelines.

GCA_000001405.15_GRCh38_top-level.fna.gz

All the top-level objects in the full assembly:

- Chromosomes

- Unlocalized scaffolds

- Unplaced scaffolds

- Alternate locus scaffolds

- Mitochondrial genome

The sequence identifiers are International Sequence Database Collaboration (INSDC) accession.versions and the definition lines are GenBank style.

No sequences have been hard-masked.

GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

Chromosomes from the GRCh38 Primary Assembly unit.

Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns.

Mitochondrial genome from the GRCh38 non-nuclear assembly unit.

Unlocalized scaffolds from the GRCh38 Primary Assembly unit.

Unplaced scaffolds from the GRCh38 Primary Assembly unit.

Epstein-Barr virus (EBV) sequence

Note: The EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.
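Hard-masking, as described for the PAR regions on chrY, replaces bases with Ns in place, so the sequence keeps its coordinates but is no longer identical to the GenBank record. A minimal sketch of the idea follows; the function name, the toy sequence and the intervals are illustrative assumptions, not the real PAR coordinates.

```python
def hard_mask(sequence, intervals):
    """Replace each 0-based, half-open [start, end) interval with Ns.

    The output has the same length as the input, so downstream
    coordinates are unchanged.
    """
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval out of range: {(start, end)}")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


toy = "ACGTACGTACGT"                      # toy sequence, not real chrY
masked = hard_mask(toy, [(0, 3), (9, 12)])  # made-up intervals
print(masked)                  # NNNTACGTANNN
print(len(masked) == len(toy))  # True: coordinates are preserved
```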

GCA_000001405.15_GRCh38_full_analysis_set.fna.gz

= GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

+ alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units
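The relation between the two analysis sets can be sketched as a filter on sequence names. This assumes the UCSC-style naming in which alternate locus scaffolds carry an "_alt" suffix; the helper name and the short example list are illustrative and should be checked against the headers of the file actually in use.

```python
def drop_alt_scaffolds(names):
    """Keep every sequence name that is not an alternate locus scaffold.

    Assumes alt scaffolds are named with a trailing "_alt".
    """
    return [name for name in names if not name.endswith("_alt")]


# Illustrative subset of names, not the complete analysis set.
full_set = [
    "chr1",
    "chr6",
    "chr6_GL000250v2_alt",
    "chrUn_KI270302v1",
    "chrM",
    "chrEBV",
]
no_alt_set = drop_alt_scaffolds(full_set)
print(no_alt_set)
```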

Alt-loci add complexity to RNASeq quantification

Ideogram of GRCh38.p2

RNASeq quantification

- Fragments (reads) per kilobase per million mapped reads (FPKM/RPKM) values quantify gene expression

- Unique mapping only: analysis tools do not distinguish allelic duplication from paralogous duplication

- Non-overlapping gene regions
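The RPKM value can be written out as a short calculation. This is a minimal sketch of the standard formula; the function name and the example counts are illustrative assumptions, and the read count is expected to include uniquely mapped reads only, as noted above.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp gene, 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

If reads that map equally well to the primary chromosome and to an alt locus are discarded as non-unique, read_count falls and so does the RPKM of genes in that region.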

To understand the effect of alt-loci on RNASeq quantification

Compare alignment of the chromosome 6 MHC region between

- hg19 full set with 7 alt-loci

- hg38 analysis set without alt-loci

Sequence content is largely unchanged between hg19 and hg38.

Mapping/alignment for RNASeq

Metric                 hg19          hg38
mapped                 14,655,299    14,704,427
mappedDiffChr          4,959         4,017
mappedPairProper       14,639,261    14,690,090
mappedPairProperPct    92.62         92.94
total                  15,805,561    15,805,561
totalSplice            5,060,829     5,078,133
unmapped               1,150,262     1,101,134

hg19: with alt loci; hg38: without alt loci
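The counts in the table are internally consistent, which a few lines of arithmetic confirm. The sketch below uses only the numbers from the table; the dictionary layout and variable names are illustrative.

```python
stats = {
    "hg19": {"total": 15_805_561, "mapped": 14_655_299,
             "unmapped": 1_150_262, "proper": 14_639_261},
    "hg38": {"total": 15_805_561, "mapped": 14_704_427,
             "unmapped": 1_101_134, "proper": 14_690_090},
}

for build, s in stats.items():
    # Mapped and unmapped reads must add up to the total.
    assert s["mapped"] + s["unmapped"] == s["total"]
    proper_pct = round(100 * s["proper"] / s["total"], 2)
    print(build, "properly paired (% of total):", proper_pct)

# Reads that map to hg38 (no alt loci) but not to hg19 (with alt loci).
extra_mapped = stats["hg38"]["mapped"] - stats["hg19"]["mapped"]
print("additional reads mapped on hg38:", extra_mapped)  # 49128
```

The properly paired percentages come out as 92.62 and 92.94, matching the mappedPairProperPct row, so that row is computed relative to the total read count rather than to the mapped reads.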

Effect of alt loci in RNASeq alignments

[Figure; axis label: Gene RPKM (hg38)]

Major Histocompatibility complex region on chromosome 6

[Figure: HLA-A, track D1, hg19 full set; panels: chr6, chr6_mann_hap4, chr6_qbl_hap6, chr6_dbb_hap3]

[Figure: HLA-A, tracks D1, D2, D3; panels: hg19 full set (chr6), hg38 analysis set]

[Figure: HLA-C, tracks D1, D2, D3; panels: hg19 full set, hg38 analysis set]

[Figure: HLA-DRA, tracks D1, D2, D3; panels: hg19 full set, hg38 analysis set]

Major Histocompatibility complex region on chromosome 6

Class III

MHC Class III: 700 kb stretch, 60 genes.

The most gene-dense region of the human genome

> 14% coding, ~ 72% transcribed

Highly conserved

Only a few genes have a clearly defined and proven function

[Figure: TNF, tracks D1.control and D1.treated; panels: hg19 full set (chr6), hg38 analysis set (chr6)]

Highly variant immune regions retiled

LILRA3 moved to alt-loci in hg38

hg19: LILRB2, LILRA3, LILRA5

hg38: LILRB2, LILRA5

Phantom LILRA3

[Figure: LILRA3 in hg19; labels: Intergenic, LILRB3, LILRA4, LILRB5]

Need more comprehensive approach to genome variation

Assembly model is neither haploid nor diploid

Analysis tools penalize reads mapping to more than one location and do not distinguish allelic duplication from paralogous duplication

A graph structure is a natural way to represent a population-based genome assembly
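As a toy illustration of the graph idea, a locus with one alternate allele can be stored as shared flanking nodes plus two branch nodes, so the flanks appear only once. This is a minimal sketch under assumed names and made-up sequences; it does not represent any particular graph genome format or tool.

```python
# Node name -> sequence (made-up bases for illustration).
nodes = {"left": "ACGT", "ref": "TTG", "alt": "TCAG", "right": "GGCA"}

# Directed edges: the two alleles branch after "left" and rejoin at "right".
edges = {"left": ["ref", "alt"], "ref": ["right"], "alt": ["right"], "right": []}


def spell_paths(node, nodes, edges):
    """Yield every sequence spelled by a path from `node` to a sink node."""
    successors = edges[node]
    if not successors:
        yield nodes[node]
        return
    for nxt in successors:
        for suffix in spell_paths(nxt, nodes, edges):
            yield nodes[node] + suffix


haplotypes = sorted(spell_paths("left", nodes, edges))
print(haplotypes)
```

A read that falls entirely within the shared flanks has a single location in such a graph, whereas in a linear reference with a separate alt scaffold the same read has two equally good placements.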

Conclusions

RPKM values are highly correlated between hg19 and hg38.

The analysis set is preferred for expression analysis. Additional analysis may be performed to use the alt-loci separately.

Annotations for hg38 are still lacking and need contributions from the community.

Improve modeling of genome variability in population.
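A comparison of per-gene RPKM values between two builds, as in the first conclusion, could be carried out along these lines. This is a sketch only: the talk does not state which correlation measure or transformation was used, so the Pearson coefficient on log2(RPKM + 1) is an assumption, and the values below are made-up toy numbers, not results from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either input has zero variance.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_rpkm(values):
    """log2(RPKM + 1), so that genes with zero expression stay defined."""
    return [math.log2(v + 1) for v in values]


# Toy per-gene RPKM values, in the same gene order for both builds.
toy_hg19 = [0.0, 1.2, 5.0, 20.0, 150.0]
toy_hg38 = [0.0, 1.1, 5.4, 19.0, 160.0]

r = pearson(log_rpkm(toy_hg19), log_rpkm(toy_hg38))
print(f"Pearson r on log2(RPKM + 1): {r:.3f}")
```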

Questions?