Typing in the third generation: Class I and Class II HLA genotyping performance using nanopore long-read data
Gerhard Schöfl1, Steffen Klasberg1, Vineeth Surendranath1, Kathrin Putke1, Alexander H Schmidt1,2, Vinzenz Lange1
1DKMS Life Science Lab, St. Petersburger Str. 2, 01069 Dresden, Germany; 2DKMS, Kressbach 1, 72072 Tübingen, Germany
Introduction
Full four-field allele-level resolution genotyping of the human leucocyte antigen (HLA) is a technical challenge due to the high polymorphism rate of the genomic region, the difficulty of phasing the observed polymorphisms, and the as yet fragmented nature of the reference database. Third-generation long-read sequencing platforms such as Oxford Nanopore Technologies’ MinION or PromethION have the potential to change the landscape of HLA genotyping. Their main advantages are rapid library preparation and, importantly, long single-molecule-derived reads that alleviate the phasing problem. However, high single-read error rates, which lead to inaccurate consensus sequences and unreliable genotyping with standard mapping approaches, remain a major limitation. Here, we report on the performance of a novel genotyping approach geared towards noisy long-read data that filters phase-informative polymorphic positions and uses a hierarchical probabilistic genotyping model.
Generating Benchmark Data
Sample Selection
Conclusion
Ninety-six samples were sequenced for HLA-A, -B, -C, -DRB1, -DQB1, and -DPB1 on the Illumina MiSeq and Oxford Nanopore MinION platforms. Illumina short-read data were used to
generate benchmark genotypes using GenDx’ NGSengine®. The overall typing accuracy currently reaches 100% and 92% for the Class I genes HLA-A and HLA-B, respectively. 22% and
33% of the respective typing results were unambiguous, with the remaining typing results comprising mostly 2 or more closely related fourth-field variants. Many of these are technical
ambiguities due to incomplete 3'-UTR sequence information. The remainder are due to an inability to distinguish between alleles differing in the lengths of homopolymer tracts
located in introns and UTRs. The genotyping pipeline still requires further adaptations for HLA-C and the HLA class II loci, but these results already demonstrate the viability of
nanopore sequencing as an efficient platform for full-length sequencing for immunogenetic genotyping.
Typing Algorithm
[Figure: benchmark data workflow for 96 samples (HLA-A, -B, -C, -DRB1, -DQB1, -DPB1). Stages: PCR (locus-specific long-range PCR, barcoding PCR); library preparation (pooling, end repair and dA-tailing, adapter ligation; fragmentation, adapter ligation, barcoding and pooling); sequencing (nanopore sequencing on ONT MinION; Illumina shotgun sequencing on MiSeq, 2x250 bp); analysis (DR2S-LR/Typlomat; NGSengine® (GenDx)).]
Ninety-six samples were selected from the DKMS donor pool to represent a highly diverse set of genotypes across Class I and Class II HLA loci.
Left: Unbiased HLA allele distribution in DKMS donor pool (subset); Top: Ninety-six samples enriched for low-frequency alleles to maximise allelic diversity in a ground-truth data set.
Left: HLA genotyping is performed on full-length amplicon sequences generated on the MinION:
a. Map reads to a generic reference of the target HLA gene.
b. Identify polymorphic positions.
c. Separate “true” polymorphisms from sequencing artefacts.
d. Cluster reads based on “true” polymorphisms.
e. Iteratively create the best possible multiple sequence alignment (MSA) for each allele.
f. Hierarchical genotyping by scoring each allele in the IPD-IMGT/HLA database against a position weight matrix created from the allele-specific MSA (for details see P192).
* Steps a to e are implemented in the long-read path of the DR2S R package available on GitHub
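Steps b to d can be illustrated with a toy example. The Python sketch below is not the DR2S implementation: the function names, the 20% minor-base threshold, and the exact-match grouping of reads are all assumptions chosen for illustration, and the input is taken to be reads already aligned to a common reference and trimmed to equal length.

```python
from collections import Counter


def polymorphic_positions(reads, min_minor_freq=0.2):
    """Return columns where the second most common base is frequent enough.

    The threshold is an illustrative assumption: columns whose minor base
    is rarer than this are treated as sequencing artefacts and dropped.
    """
    positions = []
    for col in range(len(reads[0])):
        top_two = Counter(read[col] for read in reads).most_common(2)
        if len(top_two) == 2 and top_two[1][1] / len(reads) >= min_minor_freq:
            positions.append(col)
    return positions


def cluster_reads(reads, positions):
    """Group reads by the bases they carry at the retained positions."""
    clusters = {}
    for read in reads:
        key = "".join(read[p] for p in positions)
        clusters.setdefault(key, []).append(read)
    return clusters


# Toy input: two haplotypes that differ at columns 1 and 4,
# plus one isolated error in the last column of the last read.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACGTAC",
    "ATGTGC",
    "ATGTGC",
    "ATGTGA",
]
positions = polymorphic_positions(reads)
clusters = cluster_reads(reads, positions)
```

In this toy input the isolated error in the last column falls below the threshold, so only the two phase-informative columns are used to split the reads into two groups. Real nanopore reads are far noisier, so grouping by exact match on the retained columns would probably need to be replaced by a more tolerant clustering method.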
Allele             Mean score  5' UTR  Exon 1  Intron 1  Exon 2  Intron 2  Exon 3  Intron 3  Exon 4  Intron 4  Exon 5  Intron 5  Exon 6  Intron 6  Exon 7  3' UTR
HLA-B*07:05:01:01  2.151       2.133   2.185   2.117     2.069   2.149     2.193   2.144     2.170   2.146     2.192   2.166     2.166   2.135     2.132   2.161
HLA-B*07:05:01:03  2.151       2.133   2.185   2.117     2.069   2.149     2.193   2.144     2.170   2.146     2.192   2.166     2.166   2.135     2.132   2.161
HLA-B*07:264       2.150       2.133   2.185   2.117     2.061   2.149     2.193   2.144     2.170   2.146     2.192   2.166     2.166   2.135     2.132   2.161
Ambiguous result set
Top: Ninety-six samples were independently sequenced using short-read and long-read technologies. Short-read data analysed with NGSengine to full four-field resolution provided ground-truth genotyping results against which the long-read data were compared.
Right: Typing concordance for HLA-A and HLA-B. Ambiguous results (result sets > 1) arise if (a) alleles differ in regions difficult to resolve with nanopore reads (homopolymers, CnGn-stretches); (b) alleles differ in regions not covered by the amplicons (some stretches of 3'-UTRs).
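One way to summarise concordance when calls can be ambiguous is sketched below in Python. This is an assumption about bookkeeping, not the evaluation code behind the poster: a call is counted as concordant when the ground-truth allele is contained in the result set, and as unambiguous when the result set has exactly one member. The function name, sample identifiers, and allele labels are hypothetical placeholders.

```python
def summarise_calls(calls, truth):
    """Return (concordant fraction, unambiguous fraction).

    calls: sample id -> set of candidate alleles (the result set)
    truth: sample id -> ground-truth allele
    Counting a call as concordant when the truth is in the result set
    is an assumption made for this sketch.
    """
    n = len(truth)
    concordant = sum(truth[sample] in calls[sample] for sample in truth)
    unambiguous = sum(len(calls[sample]) == 1 for sample in truth)
    return concordant / n, unambiguous / n


# Placeholder data, not results from the study.
truth = {"s1": "x1", "s2": "x2", "s3": "x3", "s4": "x4"}
calls = {
    "s1": {"x1", "x1b"},  # ambiguous, contains the truth
    "s2": {"x2"},         # unambiguous and correct
    "s3": {"x3", "x3b"},  # ambiguous, contains the truth
    "s4": {"y"},          # unambiguous but wrong
}
concordant_frac, unambiguous_frac = summarise_calls(calls, truth)
```

Reporting both fractions side by side keeps apart two separate questions: whether the correct allele was recovered at all, and how often the result set could be narrowed to a single allele.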
Top: Example for feature scoring and ambiguous result set. HLA-B*07:05:01:01 and HLA-B*07:05:01:03 differ in a part of the 3'-UTR not covered by our amplicon. The next closest allele, HLA-B*07:264, scores lower in Exon 2.
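The scoring idea behind such a result set can be sketched in a few lines of Python. This is a simplified illustration, not the hierarchical model used here (see P192): the log-odds weighting, the pseudocount, the uniform background, and all names are assumptions. Candidate alleles that are identical within the sequenced region receive identical scores and therefore end up in the same result set.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_weight_matrix(msa, pseudocount=1.0, background=0.25):
    """Build per-column log-odds weights from equal-length aligned reads."""
    total = len(msa) + pseudocount * len(BASES)
    pwm = []
    for col in range(len(msa[0])):
        counts = Counter(read[col] for read in msa)
        pwm.append({
            base: math.log2((counts[base] + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def mean_score(pwm, seq):
    """Average the per-position weights of a candidate sequence."""
    return sum(col[base] for col, base in zip(pwm, seq)) / len(seq)


def result_set(pwm, candidates, tol=1e-9):
    """Return all candidate names whose score ties with the best score."""
    scores = {name: mean_score(pwm, seq) for name, seq in candidates.items()}
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if best - s <= tol)


# Toy allele-specific MSA (one read carries an error in the last column).
msa = ["ACGT", "ACGT", "ACGT", "ACGA"]
pwm = position_weight_matrix(msa)

# Hypothetical candidates: allele_1 and allele_2 are identical within the
# sequenced region; allele_3 differs at the third position.
candidates = {"allele_1": "ACGT", "allele_2": "ACGT", "allele_3": "ACTT"}
best_alleles = result_set(pwm, candidates)
```

Computing the same mean over each gene feature separately (exons, introns, UTRs) would give per-feature scores of the kind tabulated above, which makes it easier to see where a lower-ranked allele loses.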
Typing Accuracy
DKMS Life Science Lab
St. Petersburger Str. 2 • 01069 Dresden • www.dkms-lab.de
Download from GitHub: https://github.com/DKMS-LSL