characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing of rearranged...


1

Characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing of rearranged immunoglobulin genes

Bruno Gaëta, School of Computer Science and Engineering, UNSW


2

Immunoglobulin Rearrangement

Katherine Jackson


3

So what do we know about the process?

• Combinatorial diversity (VDJ)

• P-addition

• N-addition

• Exonuclease action

• Somatic hypermutation


4

Human heavy chain immunoglobulin variable region genes (IGH)

• IGHV

About 46 functional genes (7 families)

Up to 20 (?) reported alleles per gene

• IGHD

About 24 genes (7 families), 1-3 known alleles/gene

• IGHJ

6 functional genes, 1-4 known alleles/gene

Still very controversial!
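
As a rough sense of scale, the gene counts on this slide give an upper bound on combinatorial (V-D-J) diversity alone. The sketch below simply multiplies the approximate counts quoted above. The constant and function names are illustrative, and the calculation ignores alleles, P- and N-addition, exonuclease action and somatic hypermutation, all of which add further diversity.

```python
# Approximate gene counts taken from the slide above (not exact figures).
IGHV_GENES = 46  # "about 46 functional genes"
IGHD_GENES = 24  # "about 24 genes"
IGHJ_GENES = 6   # "6 functional genes"


def combinatorial_vdj(n_v: int, n_d: int, n_j: int) -> int:
    """Number of distinct V-D-J gene combinations, one gene of each type."""
    return n_v * n_d * n_j


combinations = combinatorial_vdj(IGHV_GENES, IGHD_GENES, IGHJ_GENES)
print(combinations)
```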


5

The Immunoglobulin FactsBook (Lefranc and Lefranc 2001)


6


7

Characterizing variation in the IGH locus

• The Human Genome Project, HapMap and the 1000 Genomes Project have all ignored the IGH locus

• Conventional methods are difficult to apply to this locus

• Our approach focuses on mass sequencing of rearranged sequences


8

Data generation

Blood sample → Multiplex PCR → Sequencing (454) → Rearranged IGH gene sequences (VDJ)

Data from Stanford University (Lyndon Zhang, Katherine Jackson, Scott D. Boyd, Andrew Z. Fire)


9

Bioinformatics analysis

Rearranged IGH gene sequences (VDJ)

Haplotype

Genotype

Identify germline genes

Draft genotype

iHMMune-align model


10

iHMMune-align

• Hidden Markov model of immunoglobulin rearrangement and diversification processes

• Designed to identify the most likely germline gene segments in a rearranged Ig gene sequence and partition the sequence

• Can also be used to calculate the probability of a sequence originating from a specific germline gene
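
To illustrate what "the probability of a sequence originating from a specific germline gene" means, here is a deliberately simplified sketch. It is not the iHMMune-align HMM: it assumes the read is already aligned to the germline at equal length and that each base mutates independently with a fixed rate. The function names, the `mutation_rate` value and the example sequences are all hypothetical.

```python
import math


def log_prob_given_germline(read: str, germline: str, mutation_rate: float = 0.05) -> float:
    """Log P(read | germline) under an independent per-site substitution model.

    ASSUMPTION: read and germline are pre-aligned and of equal length;
    no insertions, deletions or junctional processing are modelled.
    """
    if len(read) != len(germline):
        raise ValueError("read and germline must be aligned to equal length")
    log_match = math.log(1.0 - mutation_rate)
    log_mismatch = math.log(mutation_rate / 3.0)  # three alternative bases
    return sum(log_match if r == g else log_mismatch for r, g in zip(read, germline))


def most_likely_germline(read: str, germlines: dict) -> str:
    """Name of the germline sequence with the highest log-probability for the read."""
    return max(germlines, key=lambda name: log_prob_given_germline(read, germlines[name]))


# Hypothetical toy sequences, not real IGHV alleles.
germlines = {"alleleA": "ACGTACGT", "alleleB": "ACGTTCGA"}
print(most_likely_germline("ACCTACGT", germlines))
```

The real model additionally partitions the sequence into V, D and J segments and accounts for the junctional processes listed earlier, which this toy version leaves out.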


11

iHMMune-align HMM topology

Gaëta et al. (2007) Bioinformatics 23:1580


12

Genotyping

• Find the combination of alleles most likely to generate the observed data


13

IGHV Genotyping

• Pre-align sequences (Vmatch) with the IGHV repertoire to filter out unlikely alleles (draft genotype)

• Calculate P(s_i | g_n), the probability of sequence s_i given germline allele g_n, using the iHMMune-align model

• Calculate likelihood of sequence set for each combination of alleles in the draft genotype

• Select most likely genotype
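
The last two steps can be sketched as follows, for a single gene. This is only a minimal illustration under stated assumptions: the slides do not say how per-sequence probabilities are combined, so the sketch assumes each sequence is equally likely to come from any allele in the candidate genotype (an equal-weight mixture) and that a genotype holds one or two alleles. The function names and the probability values are hypothetical.

```python
import math
from itertools import combinations


def genotype_log_likelihood(seq_probs, genotype):
    """Log-likelihood of the whole sequence set given a candidate genotype.

    seq_probs: list of dicts mapping allele name -> P(s_i | g_n).
    ASSUMPTION: equal-weight mixture over the alleles in the genotype.
    """
    total = 0.0
    for probs in seq_probs:
        mixture = sum(probs[allele] for allele in genotype) / len(genotype)
        if mixture <= 0.0:
            return float("-inf")
        total += math.log(mixture)
    return total


def best_genotype(seq_probs, draft_alleles, max_alleles=2):
    """Score every combination of draft alleles and return the most likely one."""
    candidates = [
        combo
        for k in range(1, max_alleles + 1)
        for combo in combinations(sorted(draft_alleles), k)
    ]
    return max(candidates, key=lambda g: genotype_log_likelihood(seq_probs, g))


# Hypothetical probabilities: three reads favour a1, three favour a2.
seq_probs = [{"a1": 0.9, "a2": 0.01, "a3": 0.01}] * 3 + [{"a1": 0.01, "a2": 0.9, "a3": 0.01}] * 3
print(best_genotype(seq_probs, ["a1", "a2", "a3"]))
```

Restricting the search to the draft genotype keeps the number of combinations small, which is presumably why the pre-alignment filter comes first.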


14

IGHD Genotyping

• IGHD genes are very short and difficult to identify unambiguously: use a combination of iHMMune-align (with only the IGHV alleles present in the genotype) and specific pattern searches

• Calculate P(s_i | g_n) using a simplified iHMMune-align model


15

IGHJ Genotyping

• Similar to IGHV genotyping but with a simplified iHMMune-align model


16


17

Genotyping - evaluation


18

Once the genotype is determined…

• Re-identify germline genes in the sequence set, using only germline genes present in the genotype (iHMMune-align)


19

Determination of phased haplotypes

Only possible for subjects heterozygous at the IGHJ4 or IGHJ6 loci

IGHV1-2*01 IGHJ6*02

IGHV1-2*01 IGHJ6*02

IGHV1-2*04 IGHJ6*03

IGHV1-2*04 IGHJ6*03

IGHV1-2*04 IGHJ6*03
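
Because the V and J segments of one rearrangement come from the same chromosome, a heterozygous IGHJ allele can serve as a tag for phasing the IGHV alleles. The sketch below is a minimal majority-vote illustration using the allele pairs shown on this slide. The function name is hypothetical, and the simple vote does not handle sequencing or assignment errors, or a V allele that is present on both chromosomes.

```python
from collections import Counter, defaultdict


def phase_by_j_allele(pairs):
    """Assign each V allele to the J allele it co-occurs with most often.

    pairs: iterable of (v_allele, j_allele), one per rearranged sequence.
    Returns a dict mapping j_allele -> set of V alleles on that haplotype.
    """
    counts = defaultdict(Counter)
    for v_allele, j_allele in pairs:
        counts[v_allele][j_allele] += 1

    haplotypes = defaultdict(set)
    for v_allele, j_counts in counts.items():
        best_j, _ = j_counts.most_common(1)[0]
        haplotypes[best_j].add(v_allele)
    return dict(haplotypes)


# Allele pairs as shown on the slide.
observed = [
    ("IGHV1-2*01", "IGHJ6*02"),
    ("IGHV1-2*01", "IGHJ6*02"),
    ("IGHV1-2*04", "IGHJ6*03"),
    ("IGHV1-2*04", "IGHJ6*03"),
    ("IGHV1-2*04", "IGHJ6*03"),
]
haplotypes = phase_by_j_allele(observed)
print(haplotypes)
```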


20

Automated classification

5. Multinomial Logistic Regression for the Identification of Immunoglobulin Haplotypes

Figure 5.4: Classification error rates of different algorithms for IGHD haplotyping. The error rates of using ‘Counts of Sequences’ (CoS), ‘Binomial Probabilities’ (BP) and ‘Counts of Sequences Plus Binomial Probabilities’ (CoSBP) as attributes in the classifications were also compared.

... respectively. The classification correctness given by different algorithms using different attributes is shown in Table 5.5.

Figure 5.4 compares the performance of Logistic Regression, Linear Regression, SVM and Decision Tree. The logistic regression using ‘Binomial Probabilities’ (BP) as classification attributes gave the best classification.

Table 5.6 shows the difference of haplotypes identified by manual and automatic haplotyping. Excellent agreement was observed between manually and ...

Figure 5.3: Classification error rates of different algorithms for IGHV haplotyping. Logistic regression, linear regression and J48 decision tree’s performances were compared. The error rates of using ‘Counts of Sequences’ (CoS), ‘Binomial Probabilities’ (BP) and ‘Counts of Sequences Plus Binomial Probabilities’ (CoSBP) as attributes in the classifications were also compared.


21


22

IGHD Haplotypes


23

IGHV Haplotypes

Ambiguity

Duplication

D Deletion


24

The team…

• BABS, UNSW

– Marie Kidd

– Yan Wang

– Mark Tanaka

– Andrew Collins

• CSE, UNSW

– Zhiliang Chen

– Bruno Gaëta

• Pathology, Stanford

– Lyndon Zhang

– Katherine Jackson

– Scott Boyd

– Andrew Fire
