Typing in the third generation: Class I and Class II HLA genotyping performance using nanopore long-read data

Gerhard Schöfl¹, Steffen Klasberg¹, Vineeth Surendranath¹, Kathrin Putke¹, Alexander H Schmidt¹,², Vinzenz Lange¹

¹ DKMS Life Science Lab, St. Petersburger Str. 2, 01069 Dresden, Germany; ² DKMS, Kressbach 1, 72072 Tübingen, Germany

Introduction

Full four-field, allele-level genotyping of the human leucocyte antigen (HLA) genes is technically challenging because of the high polymorphism rate of the genomic region, the difficulty of phasing the observed polymorphisms, and the still fragmented nature of the reference database. Third-generation long-read sequencing platforms such as Oxford Nanopore Technologies' MinION or PromethION have the potential to change the landscape of HLA genotyping. Their main advantages are rapid library preparation and, importantly, long single-molecule-derived reads that alleviate the phasing problem. However, high single-read error rates, which lead to inaccurate consensus sequences and unreliable genotyping with standard mapping approaches, remain a major limitation. Here, we report on the performance of a novel genotyping approach geared towards noisy long-read data, which filters for phase-informative polymorphic positions and uses a hierarchical probabilistic genotyping model.

Conclusion

Ninety-six samples were sequenced for HLA-A, -B, -C, -DRB1, -DQB1, and -DPB1 on the Illumina MiSeq and Oxford Nanopore MinION platforms. Illumina short-read data were used to generate benchmark genotypes using GenDx's NGSengine®. The overall typing accuracy currently reaches 100% and 92% for the Class I genes HLA-A and HLA-B, respectively. Of these typing results, 22% (HLA-A) and 33% (HLA-B) were unambiguous; the remaining results mostly comprised two or more closely related fourth-field variants. Many of these are technical ambiguities due to incomplete 3'-UTR sequence information. The remainder are due to an inability to distinguish between alleles that differ in the length of homopolymer tracts located in introns and UTRs. The genotyping pipeline still requires further adaptation for HLA-C and the HLA Class II loci, but these results already demonstrate the viability of nanopore sequencing as an efficient platform for full-length sequencing in immunogenetic genotyping.

Generating Benchmark Data

[Figure: benchmark data workflow. 96 samples for HLA-A, -B, -C, -DRB1, -DQB1, -DPB1 were processed in two arms, Illumina shotgun sequencing and nanopore sequencing, across four stages: PCR, library preparation, sequencing, and analysis. Locus-specific long-range PCR is shown for both arms. Further step labels in the figure: barcoding PCR; pooling; end repair and dA-tailing; adapter ligation; fragmentation; adapter ligation; barcoding and pooling. Sequencing: ONT MinION; Illumina MiSeq, 2x250 bp. Analysis: NGSengine® (GenDx); DR2S-LR/Typlomat.]

Sample Selection

Ninety-six samples were selected from the DKMS donor pool to represent a highly diverse set of genotypes across Class I and Class II HLA loci.

Figure (left): Unbiased HLA allele distribution in the DKMS donor pool (subset).
Figure (top): Ninety-six samples enriched for low-frequency alleles to maximise allelic diversity in a ground-truth data set.

Typing Algorithm

Figure (left): HLA genotyping is performed on full-length amplicon sequences generated on the MinION:

a. Map reads to a generic reference of the target HLA gene.
b. Identify polymorphic positions.
c. Separate “true” polymorphisms from sequencing artefacts.
d. Cluster reads based on the “true” polymorphisms.
e. Iteratively create the best possible multiple sequence alignment (MSA) for each allele.
f. Perform hierarchical genotyping by scoring each allele in the IPD-IMGT/HLA database against a position weight matrix created from the allele-specific MSA (for details see P192).

* Steps a to e are implemented in the long-read path of the DR2S R package, available on GitHub.
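To make steps b to d more concrete, the following is a minimal toy sketch in Python. It is not the DR2S implementation: it assumes reads that are already aligned to the same length and contain substitution errors only, uses an arbitrary minor-base frequency threshold (`min_minor_freq`) to separate likely true polymorphisms from artefacts, and clusters reads around the two most common base patterns. All function names, the threshold, and the example reads are hypothetical.

```python
from collections import Counter


def polymorphic_positions(reads, min_minor_freq=0.2):
    """Return columns whose second most common base reaches min_minor_freq.

    Columns where the minor base is rare are treated as sequencing
    artefacts (assumed threshold, not taken from the poster).
    """
    positions = []
    for col in range(len(reads[0])):
        counts = Counter(read[col] for read in reads).most_common(2)
        if len(counts) == 2 and counts[1][1] / len(reads) >= min_minor_freq:
            positions.append(col)
    return positions


def cluster_reads(reads, positions):
    """Group reads by their bases at the informative positions.

    The two most common base patterns serve as cluster centres; every
    read is assigned to the centre with the fewest mismatches.
    """
    def signature(read):
        return "".join(read[p] for p in positions)

    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))

    signatures = [signature(read) for read in reads]
    centres = [sig for sig, _ in Counter(signatures).most_common(2)]
    clusters = {centre: [] for centre in centres}
    for read, sig in zip(reads, signatures):
        nearest = min(centres, key=lambda centre: distance(sig, centre))
        clusters[nearest].append(read)
    return clusters


# Toy data: two haplotypes differing at columns 2 and 7,
# plus one read carrying a substitution error at column 5.
reads = (
    ["ACGTACGTAC"] * 3
    + ["ACGTAGGTAC"]
    + ["ACCTACGAAC"] * 4
)

positions = polymorphic_positions(reads)
clusters = cluster_reads(reads, positions)
print(positions)  # [2, 7]
print({pattern: len(members) for pattern, members in clusters.items()})
```

In this toy case the isolated error at column 5 falls below the threshold and is ignored, so the reads separate cleanly into two groups of four. Real nanopore data are dominated by indel and homopolymer errors, which this sketch does not model.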

Typing Accuracy

Figure (top): The 96 samples were independently sequenced using short-read and long-read technologies. Short-read data analysed with NGSengine to full four-field resolution provided the ground-truth genotypes against which the long-read results were compared.

Figure (right): Typing concordance for HLA-A and HLA-B. Ambiguous results (result sets containing more than one allele) arise if (a) alleles differ in regions that are difficult to resolve with nanopore reads (homopolymers, CnGn stretches) or (b) alleles differ in regions not covered by the amplicons (some stretches of the 3'-UTRs).

Table: Example of feature scoring and an ambiguous result set. HLA-B*07:05:01:01 and HLA-B*07:05:01:03 differ in a part of the 3'-UTR not covered by our amplicon. The next closest allele, HLA-B*07:264, scores lower in Exon 2.

Feature       HLA-B*07:05:01:01   HLA-B*07:05:01:03   HLA-B*07:264
Mean score    2.151               2.151               2.150
5' UTR        2.133               2.133               2.133
Exon 1        2.185               2.185               2.185
Intron 1      2.117               2.117               2.117
Exon 2        2.069               2.069               2.061
Intron 2      2.149               2.149               2.149
Exon 3        2.193               2.193               2.193
Intron 3      2.144               2.144               2.144
Exon 4        2.170               2.170               2.170
Intron 4      2.146               2.146               2.146
Exon 5        2.192               2.192               2.192
Intron 5      2.166               2.166               2.166
Exon 6        2.166               2.166               2.166
Intron 6      2.135               2.135               2.135
Exon 7        2.132               2.132               2.132
3' UTR        2.161               2.161               2.161

Ambiguous result set: HLA-B*07:05:01:01 and HLA-B*07:05:01:03 (identical scores across all features).
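The idea of scoring database alleles against a position weight matrix and reporting a result set can be sketched as follows. This is an illustration under stated assumptions, not the poster's scoring model: the score used here (mean log-probability per position with a pseudocount) is an assumed stand-in, since the actual scoring function is not given, and the allele names and sequences are hypothetical toy values. The sketch also shows how a result could be classified against a benchmark genotype as unambiguous, ambiguous, or discordant.

```python
import math
from collections import Counter

BASES = "ACGT-"


def position_weight_matrix(msa, pseudocount=1.0):
    """Per-column base probabilities from an allele-specific MSA."""
    total = len(msa) + pseudocount * len(BASES)
    pwm = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm


def score(allele_seq, pwm):
    """Mean log-probability of a candidate sequence (assumed score)."""
    return sum(math.log(col[b]) for b, col in zip(allele_seq, pwm)) / len(pwm)


def result_set(candidates, pwm, tol=1e-9):
    """All candidate alleles whose score ties with the best score."""
    scores = {name: score(seq, pwm) for name, seq in candidates.items()}
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if best - s <= tol)


def classify(result, truth):
    """Compare a result set with the benchmark allele."""
    if truth not in result:
        return "discordant"
    return "unambiguous" if len(result) == 1 else "ambiguous"


# Toy MSA of reads assigned to one allele (one read has an error).
msa = ["ACGTAC"] * 9 + ["ACGTAG"]
pwm = position_weight_matrix(msa)

# Hypothetical candidates: allele_1 and allele_2 are identical within
# the covered region (they would differ outside the amplicon).
candidates = {
    "allele_1": "ACGTAC",
    "allele_2": "ACGTAC",
    "allele_3": "ACCTAC",
}

result = result_set(candidates, pwm)
print(result)                        # ['allele_1', 'allele_2']
print(classify(result, "allele_1"))  # ambiguous
```

Two candidates that are identical over the sequenced region necessarily receive the same score, which mirrors the situation where alleles differ only in a part of the 3'-UTR that the amplicon does not cover.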

DKMS Life Science Lab

St. Petersburger Str. 2 • 01069 Dresden • www.dkms-lab.de

Download from GitHub: https://github.com/DKMS-LSL