nanotyper: an hla genotyping algorithm for nanopore sequencing … · 2019. 11. 19. · nanotyper:...
nanotyper: an HLA genotyping algorithm for nanopore sequencing data
Steffen Klasberg¹, Kathrin Putke¹, Vineeth Surendranath¹, Alexander H Schmidt¹,², Vinzenz Lange¹, Gerhard Schöfl¹
¹DKMS Life Science Lab, St. Petersburger Str. 2, 01069 Dresden, Germany
²DKMS, Kressbach 1, 72072 Tübingen, Germany
Introduction
Nanopore sequencing may put sequence-based HLA genotyping within reach of even the smallest immunogenetic laboratories. It requires no major capital investment and delivers very long reads that easily span both HLA class I and class II genes. In addition, results can be obtained rapidly and at reasonable cost. However, these advantages come at the price of a lower per-read accuracy of only ~85 to 90 %, compared with the high accuracy of NGS reads.
Here, we present an algorithmic approach to obtain accurate genotyping results from noisy third-generation sequencing data. Nanotyper is a two-step pipeline: first, allele-specific mappings are created using the longread path of DR2S; second, hierarchical probabilistic typing is performed using Typlomat.
[Figure labels: DR2S-LR, Typlomat]
Results
Our still-maturing nanotyper pipeline consists of the freely available, open-source R packages DR2S and Typlomat, both developed in-house. It infers the genotype of HLA class I and class II genes using only data from ONT's nanopore sequencers. Nanotyper delivers ultra-high 4-field resolution while keeping ambiguities low. DR2S and Typlomat can be used in an interactive R session or from the Linux command line on a standard workstation. Both modules benefit from parallel computing frameworks. The runtime of DR2S for a single sample is about 10 min on 8 cores, depending on the target size and read coverage. Typlomat can be run on a batch of samples simultaneously and takes about 15 min + 1 min per sample per core. The accuracy and performance of the resulting genotypes are discussed on poster P184.
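As a rough planning aid, the Typlomat figure quoted above can be turned into a small estimator. This is only a sketch under one possible reading of "15 min + 1 min per sample per core": a fixed 15 min overhead, plus 1 min for each sample, with samples processed in parallel "waves" of one sample per core. The function name and both default parameters are hypothetical and are not part of DR2S or Typlomat.

```python
import math


def typlomat_batch_minutes(n_samples, n_cores, fixed_min=15.0, per_sample_min=1.0):
    """Estimate wall-clock minutes for one Typlomat batch.

    Assumption (not confirmed by the poster): a fixed overhead plus
    per-sample work that is spread evenly over the available cores,
    one sample per core at a time.
    """
    if n_samples < 0 or n_cores < 1:
        raise ValueError("need n_samples >= 0 and n_cores >= 1")
    # Number of parallel rounds needed to get through all samples.
    waves = math.ceil(n_samples / n_cores)
    return fixed_min + per_sample_min * waves


# Example: a 96-sample plate on the 8-core workstation mentioned above.
estimate = typlomat_batch_minutes(96, 8)
print(f"estimated runtime: {estimate:.0f} min")
```

Under this reading the fixed overhead dominates for small batches, so batching many samples into one run is the cheaper option.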
Figure 1: First step of the nanotyper pipeline. The nanopore reads
of heterozygous samples are filtered and, after initial mapping,
assigned to individual alleles based upon true polymorphic
positions. The best possible results are obtained by iteratively
mapping reads to consensus sequences from a previous mapping
step. The resulting allele-specific mappings serve as the input for
the next step of nanotyper.
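The read-assignment idea in Figure 1 can be illustrated with a toy example. The sketch below is not the DR2S implementation; it assumes reads that are already aligned to a common reference and padded to equal length, and it uses a hypothetical minor-base frequency threshold to separate true polymorphic positions from sequencing noise. The iterative remapping to consensus sequences is not modelled. All function names are made up for illustration.

```python
from collections import Counter


def polymorphic_positions(reads, min_minor_fraction=0.25):
    """Return columns whose second most common base is too frequent to be noise."""
    sites = []
    for col in range(len(reads[0])):
        top = Counter(read[col] for read in reads).most_common(2)
        if len(top) == 2 and top[1][1] / len(reads) >= min_minor_fraction:
            sites.append(col)
    return sites


def partition_reads(reads, min_minor_fraction=0.25):
    """Split read indices into two allele groups, "A" and "B"."""
    sites = polymorphic_positions(reads, min_minor_fraction)
    if not sites:
        # No polymorphic position found: treat the sample as homozygous.
        return {"A": list(range(len(reads))), "B": []}

    # Seed two groups from the major base at the first polymorphic position.
    anchor = sites[0]
    major = Counter(read[anchor] for read in reads).most_common(1)[0][0]
    seed_a = [r for r in reads if r[anchor] == major]
    seed_b = [r for r in reads if r[anchor] != major]

    # Per-group consensus base at every polymorphic position.
    hap_a = {c: Counter(r[c] for r in seed_a).most_common(1)[0][0] for c in sites}
    hap_b = {c: Counter(r[c] for r in seed_b).most_common(1)[0][0] for c in sites}

    # Reassign every read by majority vote over all polymorphic positions,
    # so a single error at the anchor position does not misplace a read.
    groups = {"A": [], "B": []}
    for idx, read in enumerate(reads):
        votes_a = sum(read[c] == hap_a[c] for c in sites)
        votes_b = sum(read[c] == hap_b[c] for c in sites)
        groups["A" if votes_a >= votes_b else "B"].append(idx)
    return groups


reads = [
    "ACGTACGTAA", "ATGTACCTAG", "ACGTACGTAA", "ATGTACCTAG",
    "ACGAACGTAA",  # allele A read with a random error at column 3
    "ACGTACCTAG",  # allele B read with an error at the anchor column
    "ACGTACGTAA", "ATGTACCTAG",
]
print(partition_reads(reads))
```

The isolated error at column 3 falls below the threshold and is ignored, while the read with an error at the anchor column is still placed with allele B by the vote over the remaining positions.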
Mind the gap!
Long homopolymer stretches are a particular challenge for third-generation sequencing. Current platforms cannot accurately capture the number of nucleotides in a homopolymer. Therefore, the genotyping algorithm cannot distinguish alleles based on homopolymer length alone.
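One way to see the consequence is to compare two sequences after capping the length of every long homopolymer run. The sketch below is illustrative only and does not describe how Typlomat treats homopolymers; the run-length cutoff of 5 is an arbitrary assumption, and both function names are hypothetical.

```python
from itertools import groupby


def collapse_homopolymers(seq, min_run=5):
    """Cap every homopolymer run at min_run bases; shorter runs are unchanged."""
    parts = []
    for base, run in groupby(seq):
        length = sum(1 for _ in run)
        parts.append(base * min(length, min_run))
    return "".join(parts)


def differ_only_by_homopolymer_length(a, b, min_run=5):
    """True if a and b differ, but only in the length of long homopolymers."""
    return a != b and collapse_homopolymers(a, min_run) == collapse_homopolymers(b, min_run)


allele_1 = "ACGT" + "C" * 7 + "GATTA"
allele_2 = "ACGT" + "C" * 8 + "GATTA"    # one extra C in the homopolymer
allele_3 = "ACGT" + "C" * 7 + "GACTA"    # a substitution outside the homopolymer

print(differ_only_by_homopolymer_length(allele_1, allele_2))
print(differ_only_by_homopolymer_length(allele_1, allele_3))
```

Pairs flagged by such a check are the ones that would remain ambiguous when typing from nanopore data alone.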
Figure 2: The hierarchical mapping approach of Typlomat. (A) In the first round, only exons of the antigen-recognition domain are scored. The best-scoring 10 % of alleles are propagated into the next round, which scores (B) the remaining exons, followed by (C) non-coding features, to (D) finally arrive at the genotyping result.
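The round-by-round narrowing described in Figure 2 can be sketched as a generic filter. This is a minimal illustration, not Typlomat's code: it assumes that scores accumulate across rounds, that the top 10 % cut is applied after every round except the last, and that at least one candidate always survives. The function name, the `keep_fraction` parameter and the toy scoring functions are hypothetical.

```python
import math


def hierarchical_typing(candidates, rounds, keep_fraction=0.10):
    """Rank candidate alleles through successive scoring rounds.

    candidates: iterable of allele names.
    rounds: list of functions mapping an allele name to a score
            (higher is better), e.g. ARD exons, other exons, non-coding.
    """
    survivors = list(candidates)
    totals = {allele: 0.0 for allele in survivors}
    for i, score_round in enumerate(rounds):
        # Only alleles that survived the previous round are scored again.
        for allele in survivors:
            totals[allele] += score_round(allele)
        survivors.sort(key=lambda allele: totals[allele], reverse=True)
        if i < len(rounds) - 1:
            n_keep = max(1, math.ceil(len(survivors) * keep_fraction))
            survivors = survivors[:n_keep]
    return [(allele, totals[allele]) for allele in survivors]


# Toy data: 20 made-up alleles and three made-up scoring rounds.
alleles = [f"allele_{i:02d}" for i in range(20)]
ard_exons = lambda a: -float(a[-2:])
other_exons = lambda a: {"allele_00": -3.0, "allele_01": -1.0}.get(a, 0.0)
non_coding = lambda a: 0.0

print(hierarchical_typing(alleles, [ard_exons, other_exons, non_coding]))
```

The benefit of the hierarchy is that the later, larger feature sets are only scored for a small fraction of the database.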
Figure 3: Probabilistic scoring of Typlomat. The allele-specific longread mapping is transformed into a position-specific weight matrix (PWM). The consensus sequence of the mapping is added to a feature-wise pre-computed multiple sequence alignment (MSA) of target alleles. Terminal borders in the MSA denote feature boundaries of the allele. Insertions and deletions of the reference relative to the alleles in the target database are inferred from the MSA and injected into the PWM. The score for a feature is built up by summing the nucleotide weights. The scores of all features in a round are summed to rank the target alleles. Missing information, i.e. incomplete target alleles, is substituted by the best occurring score.
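The scoring idea in Figure 3 can be sketched in a few lines. This is a simplified illustration rather than Typlomat's implementation: the weights are assumed to be log frequencies with a pseudocount, the candidate sequences are assumed to be already expressed in PWM coordinates (so the MSA-based injection of insertions and deletions is not modelled), and a missing feature is assumed to receive the best score any allele achieved for that feature. All names, including the allele labels, are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"


def build_pwm(aligned_reads, pseudocount=1.0):
    """One dict of log-frequency weights per alignment column."""
    total = len(aligned_reads) + pseudocount * len(ALPHABET)
    pwm = []
    for col in range(len(aligned_reads[0])):
        counts = Counter(read[col] for read in aligned_reads)
        pwm.append({b: math.log((counts[b] + pseudocount) / total) for b in ALPHABET})
    return pwm


def score_feature(pwm, start, allele_seq):
    """Sum the weights of the allele's bases over one feature."""
    return sum(pwm[start + i][base] for i, base in enumerate(allele_seq))


def rank_alleles(pwm, features, alleles):
    """features: name -> start column; alleles: name -> {feature: seq or None}."""
    totals = {name: 0.0 for name in alleles}
    for feature, start in features.items():
        known = {
            name: score_feature(pwm, start, seqs[feature])
            for name, seqs in alleles.items()
            if seqs.get(feature) is not None
        }
        # Incomplete alleles are not penalised: they get the best known score.
        best = max(known.values(), default=0.0)
        for name in alleles:
            totals[name] += known.get(name, best)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACTTAC"]   # one read carries an error
features = {"exon": 0, "intron": 3}
alleles = {
    "X*01": {"exon": "ACG", "intron": "TAC"},
    "X*02": {"exon": "ACT", "intron": "TAC"},
    "X*03": {"exon": "ACG", "intron": None},      # intron sequence unknown
}
for name, score in rank_alleles(build_pwm(reads), features, alleles):
    print(name, round(score, 3))
```

In this toy case the incomplete allele ties with the fully matching one, which shows how substituting the best score keeps partially characterised alleles in the ranking instead of discarding them.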
DKMS Life Science Lab
St. Petersburger Str. 2 • 01069 Dresden • www.dkms-lab.de
Download from GitHub: https://github.com/DKMS-LSL