
Typing in the third generation: Class I and Class II HLA genotyping performance using nanopore long-read data

Gerhard Schöfl1, Steffen Klasberg1, Vineeth Surendranath1, Kathrin Putke1, Alexander H Schmidt1,2, Vinzenz Lange1

1DKMS Life Science Lab, St. Petersburger Str. 2, 01069 Dresden, Germany; 2DKMS, Kressbach 1, 72072 Tübingen, Germany

Introduction

Full four-field, allele-level genotyping of the human leucocyte antigen (HLA) genes is technically challenging because of the high degree of polymorphism in this genomic region, the difficulty of phasing the observed polymorphisms, and the still fragmentary nature of the reference database. Third-generation long-read sequencing platforms such as Oxford Nanopore Technologies’ MinION or PromethION have the potential to change the landscape of HLA genotyping. Their main advantages are rapid library preparation and, importantly, long reads derived from single molecules, which alleviate the phasing problem. However, high single-read error rates remain a major limitation: they lead to inaccurate consensus sequences and unreliable genotyping with standard mapping approaches. Here, we report on the performance of a novel genotyping approach geared towards noisy long-read data, which filters for phase-informative polymorphic positions and uses a hierarchical probabilistic genotyping model.

Generating Benchmark Data

Sample Selection

Conclusion

Ninety-six samples were sequenced for HLA-A, -B, -C, -DRB1, -DQB1, and -DPB1 on the Illumina MiSeq and Oxford Nanopore MinION platforms. Illumina short-read data were used to generate benchmark genotypes using GenDx’ NGSengine®. The overall typing accuracy currently reaches 100% and 92% for the Class I genes HLA-A and HLA-B, respectively. Of these typing results, 22% and 33%, respectively, were unambiguous; the remaining results mostly comprise two or more closely related fourth-field variants. Many of these are technical ambiguities due to incomplete 3’-UTR sequence information. The remainder are due to an inability to distinguish between alleles that differ in the lengths of homopolymer tracts located in introns and UTRs. The genotyping pipeline still requires further adaptations for HLA-C and the HLA Class II loci, but these results already demonstrate the viability of nanopore sequencing as an efficient platform for full-length sequencing in immunogenetic genotyping.

Typing Algorithm

[Workflow figure: 96 samples for HLA-A, -B, -C, -DRB1, -DQB1, -DPB1, processed along two paths.
Nanopore sequencing: locus-specific long-range PCR and barcoding PCR; library preparation by pooling, end repair, dA-tailing, and adapter ligation; sequencing on the ONT MinION; analysis with DR2S-LR/Typlomat.
Illumina shotgun sequencing: locus-specific long-range PCR; library preparation by fragmentation, adapter ligation, barcoding, and pooling; sequencing on the Illumina MiSeq (2x250 bp); analysis with NGSengine® (GenDx).]

Ninety-six samples were selected from the DKMS donor pool to represent a highly diverse set of genotypes across Class I and Class II HLA loci.

Left: Unbiased HLA allele distribution in the DKMS donor pool (subset); Top: Ninety-six samples enriched for low-frequency alleles to maximise allelic diversity in a ground-truth data set.

Left: HLA genotyping is performed on full-length amplicon sequences generated on the MinION:

a. Map reads to a generic reference of the target HLA gene.
b. Identify polymorphic positions.
c. Separate “true” polymorphisms from sequencing artefacts.
d. Cluster reads based on “true” polymorphisms.
e. Iteratively create the best possible multiple sequence alignment (MSA) for each allele.
f. Perform hierarchical genotyping by scoring each allele in the IPD-IMGT/HLA database against a position weight matrix created from the allele-specific MSA (for details see P192).

* Steps a to e are implemented in the long-read path of the DR2S R package available on GitHub.
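Steps b to d can be illustrated with a minimal Python sketch. It is not the DR2S implementation: the minor-base frequency threshold (`min_minor_freq`), the simple two-cluster reassignment loop, and all function names are assumptions chosen for illustration, and the reads are assumed to be already aligned to the reference and of equal length.

```python
from collections import Counter

def polymorphic_positions(reads, min_minor_freq=0.2):
    """Steps b and c: keep columns whose second most common base is frequent
    enough to be a plausible heterozygous site rather than random noise."""
    positions = []
    for i in range(len(reads[0])):
        counts = Counter(read[i] for read in reads)
        if len(counts) < 2:
            continue
        second = counts.most_common(2)[1][1]
        if second / len(reads) >= min_minor_freq:
            positions.append(i)
    return positions

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus(signatures):
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*signatures))

def cluster_reads(reads, positions, rounds=10):
    """Step d: split reads into two groups by their bases at the retained
    positions, refining the two group consensus signatures iteratively."""
    sigs = ["".join(read[i] for i in positions) for read in reads]
    centre_a = sigs[0]
    centre_b = max(sigs, key=lambda s: hamming(s, centre_a))
    labels = [0] * len(sigs)
    for _ in range(rounds):
        labels = [0 if hamming(s, centre_a) <= hamming(s, centre_b) else 1
                  for s in sigs]
        group_a = [s for s, lab in zip(sigs, labels) if lab == 0]
        group_b = [s for s, lab in zip(sigs, labels) if lab == 1]
        if not group_a or not group_b:
            break
        new_a, new_b = consensus(group_a), consensus(group_b)
        if (new_a, new_b) == (centre_a, centre_b):
            break
        centre_a, centre_b = new_a, new_b
    return labels

# Toy data: two haplotypes differing at positions 1 and 6, plus two reads
# that each carry one isolated sequencing error.
hap1, hap2 = "ACGTACGT", "ATGTACCT"
reads = [hap1, hap1, hap1, "ACGAACGT", hap1,
         hap2, hap2, "ATGTTCCT", hap2, hap2]
positions = polymorphic_positions(reads)
labels = cluster_reads(reads, positions)
```

In this toy case the isolated errors fall below the frequency threshold and are discarded, so only the two phase-informative positions drive the clustering.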

Feature      HLA-B*07:05:01:01   HLA-B*07:05:01:03   HLA-B*07:264
Mean score   2.151               2.151               2.150
5' UTR       2.133               2.133               2.133
Exon 1       2.185               2.185               2.185
Intron 1     2.117               2.117               2.117
Exon 2       2.069               2.069               2.061
Intron 2     2.149               2.149               2.149
Exon 3       2.193               2.193               2.193
Intron 3     2.144               2.144               2.144
Exon 4       2.170               2.170               2.170
Intron 4     2.146               2.146               2.146
Exon 5       2.192               2.192               2.192
Intron 5     2.166               2.166               2.166
Exon 6       2.166               2.166               2.166
Intron 6     2.135               2.135               2.135
Exon 7       2.132               2.132               2.132
3' UTR       2.161               2.161               2.161

Ambiguous result set: HLA-B*07:05:01:01 and HLA-B*07:05:01:03
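The listed mean scores are consistent with an unweighted mean of the 15 per-feature scores. The short Python check below recomputes them from the values shown and collects the alleles tied at the top. Treating alleles with equal rounded means as the ambiguous result set is an assumption made for illustration; the poster does not state the pipeline's actual tie criterion.

```python
FEATURES = ["5' UTR", "Exon 1", "Intron 1", "Exon 2", "Intron 2", "Exon 3",
            "Intron 3", "Exon 4", "Intron 4", "Exon 5", "Intron 5", "Exon 6",
            "Intron 6", "Exon 7", "3' UTR"]

# Per-feature scores copied from the table above.
shared = [2.133, 2.185, 2.117, 2.069, 2.149, 2.193, 2.144, 2.170,
          2.146, 2.192, 2.166, 2.166, 2.135, 2.132, 2.161]
feature_scores = {
    "HLA-B*07:05:01:01": list(shared),
    "HLA-B*07:05:01:03": list(shared),
    "HLA-B*07:264": shared[:3] + [2.061] + shared[4:],
}

# Unweighted mean over all features, rounded to the table's precision.
means = {allele: round(sum(v) / len(v), 3) for allele, v in feature_scores.items()}

# Assumed tie rule: every allele sharing the best rounded mean is reported.
top = max(means.values())
result_set = sorted(allele for allele, m in means.items() if m == top)

# Features in which the next-closest allele departs from the top-scoring one.
differing = [name for name, x, y in zip(FEATURES,
                                        feature_scores["HLA-B*07:05:01:01"],
                                        feature_scores["HLA-B*07:264"])
             if x != y]
```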

Top: Ninety-six samples were independently sequenced using short-read and long-read technologies. Short-read data analysed with NGSengine to full four-field resolution provided the ground-truth genotyping results against which the long-read data were compared.

Right: Typing concordance for HLA-A and HLA-B. Ambiguous results (result sets with more than one allele) arise if (a) alleles differ in regions that are difficult to resolve with nanopore reads (homopolymers, CnGn stretches); or (b) alleles differ in regions not covered by the amplicons (some stretches of the 3’-UTRs).
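One way to summarise such a comparison is sketched below in Python. The definitions are assumptions, not taken from the poster: a call counts as concordant if the ground-truth allele is contained in the long-read result set, and as unambiguous if the result set contains exactly that one allele. Sample identifiers and allele labels are toy placeholders, not study data.

```python
def summarise_concordance(truth, calls):
    """truth: sample -> ground-truth allele.
    calls: sample -> set of alleles reported from long-read data."""
    n = len(truth)
    concordant = sum(1 for sample, allele in truth.items()
                     if allele in calls.get(sample, set()))
    unambiguous = sum(1 for sample, allele in truth.items()
                      if calls.get(sample, set()) == {allele})
    return {"accuracy": concordant / n, "unambiguous": unambiguous / n}

# Toy placeholders only.
truth = {"s1": "allele_x", "s2": "allele_y", "s3": "allele_z", "s4": "allele_w"}
calls = {
    "s1": {"allele_x"},               # concordant and unambiguous
    "s2": {"allele_y", "allele_y2"},  # concordant but ambiguous
    "s3": {"allele_q"},               # discordant
    "s4": {"allele_w"},               # concordant and unambiguous
}
summary = summarise_concordance(truth, calls)
```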

Top: Example of feature scoring and an ambiguous result set. HLA-B*07:05:01:01 and HLA-B*07:05:01:03 differ in a part of the 3’-UTR not covered by our amplicon. The next-closest allele, HLA-B*07:264, scores lower in Exon 2.
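Per-feature scoring against a position weight matrix can be sketched in Python as follows. This is one plausible formulation, not the pipeline's actual model (for that, the poster refers to P192): the log-odds against a uniform background, the pseudocount, the five-letter alphabet, and the feature coordinates are all assumptions, and the sequences are toy data.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"

def position_weight_matrix(msa, pseudocount=1.0):
    """One dict of log2-odds (versus a uniform background) per MSA column."""
    background = 1.0 / len(ALPHABET)
    pwm = []
    for column in zip(*msa):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pwm.append({base: math.log2((counts[base] + pseudocount) / total / background)
                    for base in ALPHABET})
    return pwm

def score_features(pwm, allele_seq, features):
    """Mean log-odds of a candidate allele within each (start, end) feature."""
    return {name: sum(pwm[i][allele_seq[i]] for i in range(start, end)) / (end - start)
            for name, (start, end) in features.items()}

# Toy allele-specific MSA: four reads, one carrying an error at column 4.
msa = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
pwm = position_weight_matrix(msa)

# Hypothetical feature coordinates (half-open intervals).
features = {"exon": (0, 3), "intron": (3, 6)}
scores_a = score_features(pwm, "ACGTAC", features)  # matches the read majority
scores_b = score_features(pwm, "ACGTTC", features)  # differs in the "intron"
```

Scoring per feature makes it visible where two candidates diverge: here both score identically in the toy exon, and the second candidate scores lower only in the toy intron.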

Typing Accuracy

DKMS Life Science Lab

St. Petersburger Str. 2 • 01069 Dresden • www.dkms-lab.de

Download from GitHub: https://github.com/DKMS-LSL
