characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing of rearranged...
DESCRIPTION
The study of inherited variation in the immunoglobulin heavy chain (IGH) locus has lagged behind that of other loci. This locus undergoes recombination during B-lymphocyte differentiation, as well as somatic hypermutation after antigen challenge, and the resulting variation is difficult to distinguish from inherited polymorphisms. In addition, most large-scale human genomics projects (including the Human Genome Project and the 1000 Genomes Project) have ignored the IGH locus as they are based on sequencing DNA from lymphoblastoid cells in which the IGH locus has been recombined. As an alternative, our group has pioneered the use of ultra-deep sequencing of rearranged immunoglobulin genes to understand inherited variation in the germline locus. By sampling and comparing tens of thousands of rearranged sequences from an individual it is possible to identify the patterns of variation that are consistent with inherited polymorphisms instead of resulting from somatic mutation. It is also possible to genotype, and in some cases haplotype, the IGH loci for this individual. This approach has required the development of a whole new range of bioinformatics algorithms tailored to immunoglobulin genes, and has resulted in the discovery of several new polymorphisms as well as providing the basis for in-depth population analysis of the IGH locus. In this presentation I will outline the difficulties in applying standard genomic techniques to immunoglobulin genes and describe the bioinformatics methods we developed to study this unusual locus.
TRANSCRIPT
1
Characterising the human immunoglobulin heavy chain locus by ultra-deep sequencing
of rearranged immunoglobulin genes
Bruno Gaëta
School of Computer Science and Engineering, UNSW
2
Immunoglobulin Rearrangement
Katherine Jackson
3
So what do we know about the process?
• Combinatorial diversity (VDJ)
• P-addition
• N-addition
• Exonuclease action
• Somatic hypermutation
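The five processes listed on this slide can be illustrated with a small simulation. This is only a rough sketch: the germline segments are invented toy sequences (not real IGH genes), and the trimming lengths, N-addition lengths and mutation rate are arbitrary assumptions chosen to keep the example short.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Toy germline segments: invented sequences, NOT real IGH genes.
V_GENES = {"toyV1": "CAGGTGCAGCTGGTG", "toyV2": "GAGGTCCAGCTGGTA"}
D_GENES = {"toyD1": "GGTATAGCAGC", "toyD2": "TACTACGGTGA"}
J_GENES = {"toyJ1": "ACTACTTTGACTAC", "toyJ2": "ATGCTTTTGATATC"}


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def process_end(segment, side, rng, max_trim=3, max_p=2):
    """Exonuclease trimming of one coding end, or P-addition if untrimmed."""
    trim = rng.randint(0, max_trim)
    if trim > 0:
        return segment[:-trim] if side == "3prime" else segment[trim:]
    p_len = rng.randint(0, max_p)
    if p_len == 0:
        return segment
    if side == "3prime":
        # P nucleotides are palindromic to the untrimmed coding end.
        return segment + revcomp(segment[-p_len:])
    return revcomp(segment[:p_len]) + segment


def n_addition(rng, max_n=6):
    """Random non-templated nucleotides inserted at a junction."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_n)))


def hypermutate(seq, rate, rng):
    """Point-mutate each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def rearrange(rng, shm_rate=0.02):
    """Pick one V, D and J segment, join them, then apply hypermutation."""
    v_name = rng.choice(sorted(V_GENES))
    d_name = rng.choice(sorted(D_GENES))
    j_name = rng.choice(sorted(J_GENES))
    v = process_end(V_GENES[v_name], "3prime", rng)
    d = process_end(process_end(D_GENES[d_name], "5prime", rng), "3prime", rng)
    j = process_end(J_GENES[j_name], "5prime", rng)
    naive = v + n_addition(rng) + d + n_addition(rng) + j
    return (v_name, d_name, j_name), hypermutate(naive, shm_rate, rng)


rng = random.Random(0)
genes, sequence = rearrange(rng)
print(genes, sequence)
```

Even in this toy setting the junctions quickly stop resembling any germline segment, which is why the short D segment in particular is hard to identify in real data.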
4
Human heavy chain immunoglobulin variable region genes (IGH)
• IGHV
– About 46 functional genes (7 families)
– Up to 20 (?) reported alleles per gene
• IGHD
– About 24 genes (7 families), 1-3 known alleles/gene
• IGHJ
– 6 functional genes, 1-4 known alleles/gene
Still very controversial!
5
The Immunoglobulin FactsBook (Lefranc and Lefranc 2001)
6
7
Characterizing variation in the IGH locus
• The Human Genome Project, HapMap and the 1000 Genomes Project have all ignored the IGH locus
• Conventional methods are difficult to apply to this locus
• Our approach focuses on mass sequencing of rearranged sequences
8
Data generation
Blood sample → Multiplex PCR → Sequencing (454) → Rearranged IGH gene sequences (VDJ)
Data from Stanford University (Lyndon Zhang, Katherine Jackson, Scott D. Boyd, Andrew Z. Fire)
9
Bioinformatics analysis
Rearranged IGH gene sequences (VDJ)
Identify germline genes
Draft genotype
Genotype
Haplotype
iHMMune-align model
10
iHMMune-align
• Hidden Markov model of immunoglobulin rearrangement and diversification processes
• Designed to identify the most likely germline gene segments in a rearranged Ig gene sequence and partition the sequence
• Can also be used to calculate the probability of a sequence originating from a specific germline gene
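To make the idea of "the probability of a sequence originating from a specific germline gene" concrete, here is a heavily simplified sketch. It is not the iHMMune-align model: it is a toy left-to-right HMM covering only a V segment followed by an N region, with made-up parameters (`mu`, `p_exit`, `p_n`) and invented germline sequences. The forward algorithm sums over every point at which the V segment could end.

```python
# Toy germline alleles: invented sequences that differ at one position.
GERMLINES = {
    "toyV*01": "ACGTACGTAC",
    "toyV*02": "ACGTTCGTAC",
}


def forward_likelihood(seq, germline, mu=0.05, p_exit=0.02, p_n=0.8):
    """P(seq | germline) under a toy V-then-N hidden Markov model.

    States: one match state per germline position (emits the germline base
    with probability 1 - mu, any other base with mu / 3), then a single N
    state with uniform emissions and a self-loop of probability p_n.
    All parameter values are illustrative assumptions.
    """
    n = len(germline)

    def v_emit(k, base):
        return 1 - mu if germline[k] == base else mu / 3

    def v_leave(k):
        # Leaving V is forced after the last germline position.
        return 1.0 if k == n - 1 else p_exit

    in_v = v_emit(0, seq[0])  # the alignment starts at germline position 0
    in_n = 0.0
    for t in range(1, len(seq)):
        # Enter or stay in the N state, then emit a uniform base.
        new_n = (in_v * v_leave(t - 1) * p_n + in_n * p_n) * 0.25
        # Continue along the germline if there is any germline left.
        if t < n:
            new_v = in_v * (1 - v_leave(t - 1)) * v_emit(t, seq[t])
        else:
            new_v = 0.0
        in_v, in_n = new_v, new_n
    # Terminate from either the V segment or the N region.
    return in_v * v_leave(len(seq) - 1) * (1 - p_n) + in_n * (1 - p_n)


def most_likely_germline(seq, germlines):
    """Return the (name, likelihood) of the best-scoring germline."""
    scores = {name: forward_likelihood(seq, g) for name, g in germlines.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


print(most_likely_germline("ACGTACGTACGG", GERMLINES))
```

A full model would also need states for the D and J segments, both N regions, P nucleotides and exonuclease trimming; the sketch only shows how one likelihood per candidate germline gene can be obtained and compared.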
11
iHMMune-align HMM topology
Gaëta et al (2007) Bioinformatics 23:1580
12
Genotyping
• Find the combination of alleles most likely to generate the observed data
13
IGHV Genotyping
• Pre-align sequences (Vmatch) with the IGHV repertoire to filter out unlikely alleles (draft genotype)
• Calculate P(s_i | g_n) using the iHMMune-align model
• Calculate likelihood of sequence set for each combination of alleles in the draft genotype
• Select most likely genotype
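The last two steps can be sketched as follows. The slides do not give the likelihood formula, so this example assumes a simple mixture: each sequence is equally likely to come from any allele in the candidate genotype, and sequences are independent. The per-sequence probabilities are made-up numbers standing in for the model output, and `best_genotype` is a hypothetical helper name.

```python
import math
from itertools import combinations


def genotype_log_likelihood(seq_probs, genotype):
    """Log-likelihood of the sequence set under one candidate genotype.

    seq_probs: one dict per sequence, mapping allele name to P(s_i | g_n).
    Assumes equal usage of the alleles in the genotype (an assumption).
    """
    total = 0.0
    for probs in seq_probs:
        mix = sum(probs[a] for a in genotype) / len(genotype)
        total += math.log(max(mix, 1e-300))  # floor avoids log(0)
    return total


def best_genotype(seq_probs, draft_alleles, max_alleles=2):
    """Try every 1- or 2-allele combination from the draft genotype."""
    candidates = [
        combo
        for k in range(1, max_alleles + 1)
        for combo in combinations(sorted(draft_alleles), k)
    ]
    return max(candidates, key=lambda g: genotype_log_likelihood(seq_probs, g))


# Made-up probabilities: five sequences look like *01, five look like *04.
like_01 = {"IGHV1-2*01": 0.9, "IGHV1-2*02": 0.01, "IGHV1-2*04": 0.01}
like_04 = {"IGHV1-2*01": 0.01, "IGHV1-2*02": 0.01, "IGHV1-2*04": 0.9}
observed = [like_01] * 5 + [like_04] * 5

print(best_genotype(observed, like_01.keys()))
```

Under this assumption a heterozygous genotype is only preferred when both alleles actually explain some of the sequences, because adding an unused allele halves the weight given to the one that is used.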
14
IGHD Genotyping
• IGHD genes very short and difficult to identify unambiguously: use a combination of iHMMune-align (with only IGHV alleles present in the genotype) and specific pattern searches
• Calculate P(s_i | g_n) using a simplified iHMMune-align model
15
IGHJ Genotyping
• Similar to IGHV genotyping but with a simplified iHMMune-align model
16
17
Genotyping - evaluation
18
Once the genotype is determined…
• Re-identify germline genes in the sequence set, using only germline genes present in the genotype (iHMMune-align)
19
Determination of phased haplotypes
Only possible for subjects heterozygous at the IGHJ4 or IGHJ6 loci
IGHV1-2*01 IGHJ6*02
IGHV1-2*01 IGHJ6*02
IGHV1-2*04 IGHJ6*03
IGHV1-2*04 IGHJ6*03
IGHV1-2*04 IGHJ6*03
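The principle behind this slide is that V and J segments joined in one rearrangement come from the same chromosome, so a heterozygous IGHJ locus can anchor each IGHV allele to a haplotype. A minimal counting sketch, using the allele pairs shown on the slide as input, might look like this. The function name and the 90% assignment threshold are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def phase_by_j_anchor(rearrangements, j_alleles, min_fraction=0.9):
    """Assign each V allele to the J allele it is rearranged with.

    rearrangements: iterable of (v_allele, j_allele) pairs, one per sequence.
    j_alleles: the two alleles of the heterozygous IGHJ gene used as anchor.
    min_fraction: share of a V allele's sequences that must carry one J
    allele before it is assigned (arbitrary threshold, an assumption).
    """
    counts = defaultdict(Counter)
    for v, j in rearrangements:
        if j in j_alleles:
            counts[v][j] += 1

    haplotypes = {j: [] for j in j_alleles}
    unassigned = []  # V alleles seen with both J alleles, e.g. homozygous
    for v, by_j in sorted(counts.items()):
        j_top, n_top = by_j.most_common(1)[0]
        if n_top / sum(by_j.values()) >= min_fraction:
            haplotypes[j_top].append(v)
        else:
            unassigned.append(v)
    return haplotypes, unassigned


pairs = (
    [("IGHV1-2*01", "IGHJ6*02")] * 2
    + [("IGHV1-2*04", "IGHJ6*03")] * 3
)
print(phase_by_j_anchor(pairs, ("IGHJ6*02", "IGHJ6*03")))
```

A V allele present on both chromosomes appears with both J alleles and ends up in `unassigned` here; real data would also need a minimum sequence count before any assignment is trusted.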
20
Automated classification
5. Multinomial Logistic Regression for the Identification of Immunoglobulin Haplotypes
Figure 5.4: Classification error rates of different algorithms for IGHD haplotyping. The error rates of using 'Counts of Sequences' (CoS), 'Binomial Probabilities' (BP) and 'Counts of Sequences Plus Binomial Probabilities' (CoSBP) as attributes in the classifications were also compared.
... respectively. The classification correctness given by different algorithms using different attributes is shown in Table 5.5.
Figure 5.4 compares the performance of Logistic Regression, Linear Regression, SVM and Decision Tree. The logistic regression using 'Binomial Probabilities' (BP) as classification attributes gave the best classification.
Table 5.6 shows the difference of haplotypes identified by manual and automatic haplotyping. Excellent agreement was observed between manually and ...
Figure 5.3: Classification error rates of different algorithms for IGHV haplotyping. Logistic regression, linear regression and J48 decision tree's performances were compared. The error rates of using 'Counts of Sequences' (CoS), 'Binomial Probabilities' (BP) and 'Counts of Sequences Plus Binomial Probabilities' (CoSBP) as attributes in the classifications were also compared.
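The excerpt names 'Counts of Sequences' and 'Binomial Probabilities' as classification attributes but does not define how they are computed. One plausible reading, assumed here purely for illustration, is that the counts are the numbers of sequences in which a gene is rearranged with each of the two anchoring J alleles, and the binomial probability is the chance of that split if both chromosomes contributed equally. The function names are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def haplotype_features(n_with_j_a, n_with_j_b, p=0.5):
    """Build CoS and BP attributes for one gene (assumed definitions)."""
    n = n_with_j_a + n_with_j_b
    return {
        "CoS": (n_with_j_a, n_with_j_b),
        "BP": binomial_pmf(n_with_j_a, n, p),
    }


# A gene seen 10 times with one J allele and never with the other:
print(haplotype_features(10, 0))
# A gene seen 5 times with each J allele:
print(haplotype_features(5, 5))
```

Attributes of this kind could then be passed to a classifier such as the logistic regression mentioned in the excerpt; that step is left out here because the training data and class labels are not described on the slide.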
21
22
IGHD Haplotypes
23
IGHV Haplotypes
Ambiguity
Duplication
D Deletion
24
The team…
• BABS, UNSW
– Marie Kidd
– Yan Wang
– Mark Tanaka
– Andrew Collins
• CSE, UNSW
– Zhiliang Chen
– Bruno Gaëta
• Pathology, Stanford
– Lyndon Zhang
– Katherine Jackson
– Scott Boyd
– Andrew Fire