ASHG 2015 - Redundant Annotations in Tertiary Analysis


Upload: james-warren

Post on 08-Apr-2017



www.bina.com

Fig. 3) Percentage of SNPs predicted as damaging by 7 different algorithms.

Fig. 1) 1000 Genomes overlap with transcription, coding and exonic regions.

Transcription and coding regions

Ensembl and RefSeq are standard references for transcription, coding region and exon locations. Figure 1 displays how many of the 85M unique variants from the 1000 Genomes Project overlap with the genomic regions as defined by these two sources.
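The overlap count described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the poster: the function names, the toy positions and the two exon lists (`source_a_exons`, `source_b_exons`) are hypothetical, and the intervals are assumed to be half-open `[start, end)` on a single chromosome. Real annotation files use differing coordinate conventions and would need converting first.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def count_overlapping(positions, intervals):
    """Count positions that fall inside at least one interval."""
    merged = merge_intervals(intervals)
    starts = [start for start, _ in merged]
    count = 0
    for pos in positions:
        # Index of the last merged interval starting at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            count += 1
    return count


# Hypothetical toy data: variant positions and exon intervals from two sources.
variant_positions = [105, 250, 999, 1500, 2100]
source_a_exons = [(100, 200), (900, 1000), (2000, 2200)]
source_b_exons = [(100, 180), (240, 260), (2000, 2050)]

overlap_a = count_overlapping(variant_positions, source_a_exons)
overlap_b = count_overlapping(variant_positions, source_b_exons)
print(overlap_a, overlap_b)
```

Even in this toy case the two sources disagree on which variants are exonic, which is the kind of discrepancy Figure 1 quantifies at genome scale.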

An empirical evaluation of redundant annotations in common reference sources for tertiary analysis

James Warren¹, Jian Li¹, Aparna Chhibber¹, Emre Colak¹, Narges Bani Asadi¹, Sharon Barr¹, Hugo Y. K. Lam¹

¹ Bina Technologies, Roche Sequencing, Redwood City, CA 94065

MOTIVATION

After obtaining genetic variants from next-generation sequencing data, a preliminary step in tertiary analysis is to annotate each variant with the relevant information available [1]. There is no standardized compendium for this purpose; researchers must instead compile data from a motley collection of annotation tools and public datasets [2, 3]. These annotation sources are independently maintained, and accordingly there is limited concordance between their reported contents. The choice of annotation datasets thus has a direct and significant impact on the results of the analysis [4].

References:

1. Warren et al. A highly efficient and scalable compute platform for massive variant annotation and rapid genome interpretation. ASHG (2014)
2. Johnston and Biesecker. Databases of genomic variation and phenotypes: existing resources and future needs. Human Molecular Genetics (2013)
3. Peterson et al. Towards precision medicine: advances in computational approaches for the analysis of human variants. Journal of Molecular Biology (2013)
4. Taylor et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nature Genetics (2015)

METHODOLOGY

To empirically evaluate the differences between annotation data sources, we examined the overlap of the variants from the 1000 Genomes project with commonly used and publicly available sources containing information regarding gene transcription and coding regions, predicted functional impacts and population allele frequencies. For each of these dimensions, we compared the number of variants that met the specified criteria by at least one annotation from the data source.

CONCLUSIONS

The choice of data sources for variant annotation has a substantive impact on tertiary analysis results. When identifying variants that satisfy given criteria, the differences between sources can result in significantly different findings. As a best practice, multiple sources should be included in the analysis. This provides the ability to tune the specificity and sensitivity of the results by choosing to use intersections or unions at each filtering step.
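The union-versus-intersection choice can be illustrated with plain Python sets. This is a sketch under stated assumptions rather than the pipelines used for the poster: `combine`, `funnel`, the source names and the variant IDs are all hypothetical, and each filtering step is assumed to provide, per source, the set of variant IDs that pass it.

```python
def combine(calls_by_source, mode):
    """Combine per-source sets of passing variant IDs.

    mode="union" keeps a variant if any source passes it (more sensitive);
    mode="intersection" keeps it only if every source passes it (more specific).
    """
    sets = list(calls_by_source.values())
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode!r}")


def funnel(steps, mode):
    """Apply each filtering step in turn, keeping variants that pass all steps."""
    surviving = None
    for calls_by_source in steps:
        passed = combine(calls_by_source, mode)
        surviving = passed if surviving is None else surviving & passed
    return surviving if surviving is not None else set()


# Hypothetical toy data: two filtering steps, two sources each.
rare = {"source_a": {"v1", "v2", "v3"}, "source_b": {"v2", "v3", "v4"}}
damaging = {"source_a": {"v2", "v4"}, "source_b": {"v2", "v3", "v4"}}

sensitive = funnel([rare, damaging], mode="union")
specific = funnel([rare, damaging], mode="intersection")
print(sorted(sensitive), sorted(specific))
```

In this sketch the mode is the same for every step; passing a different mode per step would allow finer tuning of the final result set.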

Please direct questions to:

[email protected]

DATA SOURCES

All data sources are publicly available and based on the GRCh37 reference. They were downloaded from the following organizations:

• NCBI: 1000 Genomes, Phase 3, V.5b
• UCSC: dbSNP142, RefSeq 72, Ensembl 75
• U. Washington: ESP 6500 SI, V2
• U. Texas Health Science Center: dbNSFP 2.9

Damaging SNV predictions

dbNSFP compiles predicted effects for non-synonymous SNVs in the human genome. Of the 667K SNVs from 1000 Genomes that coincide with dbNSFP, 76.3% are predicted damaging by at least one prediction algorithm, whereas only 7.8% are predicted damaging by all algorithms.
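The two percentages above correspond to an "any" and an "all" aggregation over the per-algorithm calls. A minimal sketch, assuming each SNV carries one boolean damaging call per algorithm with no missing predictions (real dbNSFP records may lack some scores and would need explicit handling); the `predictions` table and the function name are hypothetical:

```python
def damaging_fractions(predictions):
    """Return (fraction damaging by any algorithm, fraction damaging by all)."""
    total = len(predictions)
    if total == 0:
        return 0.0, 0.0
    any_count = sum(1 for calls in predictions.values() if any(calls))
    # Require a non-empty list so an SNV with no calls is not counted as "all".
    all_count = sum(1 for calls in predictions.values() if calls and all(calls))
    return any_count / total, all_count / total


# Hypothetical toy data: one boolean call per algorithm for each SNV.
predictions = {
    "snv1": [True, True, True],
    "snv2": [True, False, False],
    "snv3": [False, False, False],
    "snv4": [True, True, False],
}

frac_any, frac_all = damaging_fractions(predictions)
print(f"any: {frac_any:.1%}, all: {frac_all:.1%}")
```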

Funnel analysis

We defined two separate annotation pipelines to identify NA12878 variants (NIST v.2.17) that are rare, predicted damaging, and within the exons of protein-coding genes. Both pipelines identified approximately 60-80 results but with significant differences in the actual variant sets.

Fig. 4) Comparative analysis of two annotation pipelines.

Population allele frequencies

Allele frequency can be used to identify or remove both rare and common variants. ESP, dbSNP and 1000 Genomes are all common sources for this purpose. Of the 506K SNVs shared by the three sources, we identified the overlap for rare, uncommon and common variants. ESP is considerably different, but the minor disagreement between dbSNP and 1000 Genomes is also noteworthy since the 1000 Genomes records are contributed to dbSNP.
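A comparison of this kind can be sketched as follows. The poster does not state the cutoffs used for rare, uncommon and common, so the thresholds here (below 1% rare, 1% to below 5% uncommon, 5% and above common) are assumptions, as are the function names, the source names and the toy frequencies.

```python
def frequency_class(af, rare_max=0.01, uncommon_max=0.05):
    """Bin an allele frequency; the default cutoffs are assumed, not from the poster."""
    if af < rare_max:
        return "rare"
    if af < uncommon_max:
        return "uncommon"
    return "common"


def concordance(af_by_source):
    """Compare frequency classes for variants present in every source.

    Returns (number of shared variants on which all sources agree,
             sorted list of shared variant IDs on which they disagree).
    """
    tables = list(af_by_source.values())
    if not tables:
        return 0, []
    shared = set.intersection(*(set(table) for table in tables))
    discordant = []
    for variant_id in sorted(shared):
        classes = {frequency_class(table[variant_id]) for table in tables}
        if len(classes) > 1:
            discordant.append(variant_id)
    return len(shared) - len(discordant), discordant


# Hypothetical toy data: allele frequency per variant ID in three sources.
af_by_source = {
    "source_a": {"v1": 0.004, "v2": 0.03, "v3": 0.20, "v4": 0.009},
    "source_b": {"v1": 0.006, "v2": 0.06, "v3": 0.25},
    "source_c": {"v1": 0.002, "v2": 0.04, "v3": 0.30, "v5": 0.50},
}

agree_count, discordant_ids = concordance(af_by_source)
print(agree_count, discordant_ids)
```

Here `v2` falls on either side of the assumed 5% cutoff depending on the source, so a frequency filter would keep or drop it depending on which source is consulted.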

Fig. 2) Overlap of population allele frequency sources.