the fundamental principles of antibody repertoire ...63 biological insight of repertoire...

24
1 The fundamental principles of antibody repertoire architecture revealed by 1 large-scale network analysis 2 Enkelejda Miho 1 †, Victor Greiff 1 †, Rok Roškar 2 , Sai T. Reddy 1 * 3 Affiliations: 4 1 Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland. 5 2 Research Informatics, Scientific IT Services, ETH Zürich, Zürich, Switzerland. 6 *Correspondence to: [email protected] 7 †These authors contributed equally to this study. 8 9 10 ABSTRACT 11 The antibody repertoire is a vast and diverse collection of B-cell receptors and antibodies that 12 confer protection against a plethora of pathogens. The architecture of the antibody repertoire, 13 defined by the network similarity landscape of its sequences, is unknown. Here, we established 14 a novel high-performance computing platform to construct large-scale networks from high- 15 throughput sequencing data (>100’000 unique antibodies), in order to uncover the architecture 16 of antibody repertoires. We identified three fundamental principles of antibody repertoire 17 architecture across B-cell development: reproducibility, robustness and redundancy. 18 Reproducibility of network structure explains clonal expansion and selection. Robustness 19 ensures a functional immune response even under extensive loss of clones (50%). Redundancy 20 in mutational pathways suggests that there is a pre-programmed evolvability in antibody 21 repertoires. Our analysis provides guidelines for a quantitative network analysis of antibody 22 repertoires, which can be applied to other facets of adaptive immunity (e.g., T cell receptors), 23 and may direct the construction of synthetic repertoires for biomedical applications. 24 25 Keywords: systems biology, antibody repertoires, networks, large-scale, systems immunology 26 27 INTRODUCTION 28 The high diversity of antibody repertoires enables broad and protective humoral immunity, thus 29 understanding their system sequence-related properties is essential to the development of new 30 therapeutics and vaccines 1,2 . The source of antibody diversity has long been identified to be the 31 V-, (D- in the heavy chains) and J-gene somatic recombination 3 . Further additions and 32 deletions of nucleotides at the junctions of the gene segments generate a large collection of 33 antibodies and B-cell receptors, which is called the antibody repertoire 4,5 . Antibody identity 34 (clonality) and antigen specificity are primarily encoded in the highly diverse junctional site of 35 recombination in the variable heavy chain, known as the complementarity determining region 3 36 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not this version posted April 5, 2017. ; https://doi.org/10.1101/124578 doi: bioRxiv preprint

Upload: others

Post on 24-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

1

The fundamental principles of antibody repertoire architecture revealed by 1

large-scale network analysis 2

Enkelejda Miho1†, Victor Greiff1†, Rok Roškar2, Sai T. Reddy1* 3

Affiliations: 4 1Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland. 5 2Research Informatics, Scientific IT Services, ETH Zürich, Zürich, Switzerland. 6 *Correspondence to: [email protected] 7 †These authors contributed equally to this study. 8 9

10

ABSTRACT 11

The antibody repertoire is a vast and diverse collection of B-cell receptors and antibodies that 12 confer protection against a plethora of pathogens. The architecture of the antibody repertoire, 13 defined by the network similarity landscape of its sequences, is unknown. Here, we established 14 a novel high-performance computing platform to construct large-scale networks from high-15 throughput sequencing data (>100’000 unique antibodies), in order to uncover the architecture 16 of antibody repertoires. We identified three fundamental principles of antibody repertoire 17 architecture across B-cell development: reproducibility, robustness and redundancy. 18 Reproducibility of network structure explains clonal expansion and selection. Robustness 19 ensures a functional immune response even under extensive loss of clones (50%). Redundancy 20 in mutational pathways suggests that there is a pre-programmed evolvability in antibody 21 repertoires. Our analysis provides guidelines for a quantitative network analysis of antibody 22 repertoires, which can be applied to other facets of adaptive immunity (e.g., T cell receptors), 23 and may direct the construction of synthetic repertoires for biomedical applications. 24

25 Keywords: systems biology, antibody repertoires, networks, large-scale, systems immunology 26 27

INTRODUCTION 28

The high diversity of antibody repertoires enables broad and protective humoral immunity, thus 29 understanding their system sequence-related properties is essential to the development of new 30 therapeutics and vaccines 1,2. The source of antibody diversity has long been identified to be the 31 V-, (D- in the heavy chains) and J-gene somatic recombination 3. Further additions and 32 deletions of nucleotides at the junctions of the gene segments generate a large collection of 33 antibodies and B-cell receptors, which is called the antibody repertoire 4,5. Antibody identity 34 (clonality) and antigen specificity are primarily encoded in the highly diverse junctional site of 35 recombination in the variable heavy chain, known as the complementarity determining region 3 36

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 2: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

2

(CDR3) 6. Therefore, the similarity landscape of CDR3 sequences represents the clonal 37 architecture of an antibody repertoire, which reflects the breadth of antigen-binding and 38 correlates to humoral immune protection and function. However, due to limitations in 39 technological sequencing depth and computing power, the fundamental principles that govern 40 antibody repertoire architecture have remained unknown, thereby hindering a profound 41 understanding of humoral immunity. 42

Recently, different aspects of network analysis have been employed to investigate antibody 43 repertoires in health and disease. Antibody repertoire networks represent CDR3 sequence-44 nodes connected by similarity-edges 7–11. Sequence-based networks have first been used to 45 show immune responses defined by similarity between clones, a proxy for clonal expansion 8. 46 Network connectivity was later also used to discriminate between diverse repertoires of healthy 47 individuals and clonally expanded repertoires from individuals with diseases such as chronic 48 lymphocytic leukemia 7 and HIV-1 infection 10. A predominant part of network analysis has 49 involved visualization of clusters and the display of clonal composition 7–11. Yet, visualization 50 alone does not provide quantitative insights into the architecture of antibody repertoires and is 51 limited to the informative graphical display of a few hundred nodal clones. It has been shown 52 that the natural antibody repertoire exceeds the informative visualization threshold (hundreds 53 clonal nodes) by at least three orders of magnitude (Glanville et al., 2009), a limit that previous 54 research did not explore given the lower biological coverage (102–103 unique clones analyzed). 55 Consequently, computational methods for constructing large-scale networks with more than 103 56 nodes have remained underdeveloped in systems biology (Kidd et al., 2014). Furthermore, only 57 networks expressing clonal similarity relations of one nucleotide (nt) or one amino acid (a.a.) 58 between sequences have been analyzed so far, which is insufficient in covering the clonal 59 relationships possible considering the extensive mutational landscape of somatic hypermutation 60 12,13. Thus, the lack of quantitative investigation of a relatively (and exceedingly) small subset of 61 the antibody repertoire, with respect to clone numbers and network size, has limited the 62 biological insight of repertoire architecture. 63

To reveal the fundamental principles of antibody repertoire architecture, we implemented a 64 large-scale network analysis platform coupled to high-coverage antibody repertoire high-65 throughput sequencing data to answer the following questions: (i) Does sequence similarity 66 among clones show reproducible signatures across individuals? (ii) How robust are antibody 67 repertoires to removal of a fraction of clones, given their kinetics and rapid turnover? (iii) To 68 what extent is the repertoire architecture intrinsically redundant? (Figure 1). 69

70

RESULTS 71

A high-performance computing platform for large-scale network analysis of antibody 72 repertoires 73

The global landscape of antibody clonal similarities is vast and complex; for example, on the 74 amino acid level, the size of the distance matrix of all-against-all sequences is ≈1010 for a 75 representative repertoire of ≈105 clones (murine naïve B cells). In order to extract the 76 construction principles of antibody repertoires from the high-dimensional similarity space, we 77

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 3: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

3

developed a large-scale network analysis approach, which was based on representing CDR3 78 a.a. clones as sequence-nodes connected by similarity-edges. Specifically, we developed a 79 computational platform that leverages the power of distributed cluster computing, which is able 80 to compute the extremely large distance matrices required for entire repertoires (≥105 CDR3 81 sequences, Figure S1). Analysis of circa 105 or more sequences is an intractable problem 82 without parallel computing, thus our implementation utilized the Apache Spark 14 distributed 83 computing framework to partition the work among a cluster of machines (Figure S1B). The 84 construction of large-scale networks is computationally demanding: a large network comprised 85 of 1.6 million nodes (simulated strings) required 15 minutes if the calculation was performed 86 simultaneously on 625 computational cores (Figure S1C). Computational costs could have been 87 lowered by performing network analysis on a subsample of the repertoire (e.g., 103), as was 88 done in previous studies 7–11. However, extensive analysis into sub-networks has revealed that 89 they are not statistically representative of entire networks, specifically in that sub-network 90 measurements are not always representative of key parameters such as degree distribution, 91 betweenness, assortativity and clustering 15,16. Thus, it was imperative to construct and analyze 92 large-scale networks based on a similarity distance matrix that covers the full clonal diversity of 93 biological antibody repertoires. 94

Comprehensive biological sampling of antibody repertoires was ensured by the usage of high-95 throughput RNA sequencing data (≈400 million full-length antibody sequence reads) from 96 murine B-cell populations, isolated at key stages in humoral development (data was provided by 97 Greiff et al.). Data was analyzed from pre-B cells (pBC), naïve B cells (nBC) and memory 98 plasma cells (PC) isolated from 19 mice, which were stratified into one unimmunized and three 99 immunized cohorts. The experimental design allowed for the assessment of antibody 100 architecture across several important parameters: i) across key stages of B-cell development, ii) 101 before (pBC, nBC) and after antigen-driven clonal selection and expansion (PC), (iii) differences 102 in the complexity of the protein antigen [hepatitis B surface antigen (HBsAg), ovalbumin (OVA) 103 and nitrophenylacetyl-conjugated hen egg lysozyme (NP-HEL)], and (iv) across a scale of 104 different repertoire sizes (102–105 of unique CDR3 clones). The experimental data provided 105 maximal technological and high biological coverage (Greiff et al.), enabling comprehensive 106 assessment of clonal diversity and the global similarity landscape and architecture of antibody 107 repertoires. 108

For each sample (n=57, from 19 mice and three B-cell stages), antibody repertoire architecture 109 was based on the pairwise a.a. sequence similarity of all clones (Levenshtein distance (LD) 110 matrix, hereafter referred to as similarity layer, Figure 1A). When two sequences were similar 111 within a defined threshold, they were connected in the repertoire network (i.e., similarities of 1 112 a.a. differences were captured in similarity layer 1, LD1, 2 a.a. in LD2 and so on). 113

114

Global patterns of antibody repertoire networks are reproducible 115

In order to quantify the extent to which architectural patterns are reproducible across antibody 116 repertoires, we analyzed the base similarity layer in antibody repertoires (similarity layer LD1). 117 The base layer of the network organization provides information regarding the minimal 118 differences (e.g., 1 a.a.) of all antibody sequences that compose the repertoire. Solely the base 119

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 4: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

4

similarity layer, LD1, has previously been analyzed to describe antibody repertoires as networks 120 7–11. Although repertoires varied highly among mice (74–85% clonal variability, Figure S2A), we 121 found a remarkable cross-mouse consistency in patterns of clonal interconnectedness (similarity 122 of antibody sequences) within each B-cell stage: the average number of edges among clones 123 (𝐸"#$ = 230,395±23,048; 𝐸,#$ = 1,016,928±67,080; 𝐸01 = 45±10), the average size of the 124 largest component (𝑝𝐵𝐶 = 46±0.7%; 𝑛𝐵𝐶 = 58±0.5%; 𝑃𝐶 = 10±1.6%; Figure 2A) and cluster 125 composition (Figure S2B) varied negligibly across mice (see Methods, Network analysis). Thus, 126 although unique sequence composition varied substantially between individuals, the overall 127 structure of the network remained similar. 128

Along B-cell development, PC repertoires were five-fold more disconnected than pBC and nBC 129 networks (PC largest component was nearly 5 times smaller than pBC and nBC, Figure 2A), 130 and their centrality was concentrated on specific clones compared to the homogeneously 131 connected clones in pBC and nBC networks (centralization 𝑧01 = 0.05, density 𝐷01 = 0.01, 132 𝑧;<1,=<1 ≈ 𝐷=01,=<1 ≈ 0, Figure S2C). Compared to pBC networks, nBC were on average 4–5 133 times larger and showed a higher average degree (𝑘"#$ = 3, 𝑘,#$ = 5, 𝑘@$ = 1, Figure S2B) 134 although both pre- and naïve B-cell repertoires had identical diameter (𝑑"#$,,#$ = 26, 𝑑@$ = 6, 135 Figure 2B), indicating a similar coverage of the sequence space. We observed that clones in 136 pBC and nBC repertoires connected to comparable clones in terms of degree (assortativity 17–19, 137 𝑟"#$ = 0.48, 𝑟,#$ = 0.41), whereas PC networks were consistently disassortative: their highly 138 connected clones were linked to clones with few connections (𝑟@$ = −0.09, Figure 2B). The 139 characterization of the global patterns of antibody repertoire networks indicated that pBC, nBC 140 and PC repertoires were reproducible. pBC and nBC clones cover a larger diversity space than 141 clones in PC repertoires, where sequence similarity showed to be centralized and targeted 142 towards certain clones. 143

144

Clonal features of antibody repertoire networks are reproducible 145

Antibody repertoire architecture was also reproducible at the level of clonal (local) features in 146 pBC and nBC networks, which were characterized by a low variability (coefficient of variation, 147 CV) across various clonal parameters. The low variability of clonal parameters in pBC and nBC 148 networks (𝐶𝑉"#$ = 2 − 28%, 𝐶𝑉,#$ = 1 − 24%) was in contrast to the higher variability observed 149 in PC repertoires (𝐶𝑉@$ = 13 − 118%, Figure 2C). Specifically, low variability across different 150 individuals was observed in several average clonal parameters such as degree, transitivity, 151 authority and PageRank, closeness and betweenness. Variation analysis of the similarity 152 degree indicated that the average number of similar clones to each of the clones in a repertoire 153 varied marginally in pBC and nBC (𝐶𝑉"#$,,#$ = 5, 6%). Transitivity showed that the similarity 154 between clones both similar to a third CDR3 clone varied only negligibly between individuals 155 (𝐶𝑉"#$,,#$ = 1, 2%). Authority and PageRank showed that the centrality of a CDR3 in the 156 repertoire topology varied respectively 𝐶𝑉;<1,=<1 = 11% and 25% across individuals, suggesting 157 that individual repertoires were centered variably around certain CDR3 clones which were 158 centers of highly connected (similar) clonal regions compared to less connected regions in the 159 same repertoire network. 160

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 5: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

5

Closeness analysis revealed that an analogous number of similarity edges were required to 161 access every other CDR3 from a given CDR3 clone in antibody repertoire networks of different 162 individuals, as the similarity of a clone to every other CDR3 clone in the repertoire varied by 163 𝐶𝑉"#$,,#$ = 17%. Betweenness, the “bridge” function of a clone in sequence similarity, varied 164 slightly across individuals with 𝐶𝑉"#$,,#$ = 28%, suggesting a comparable structure of the 165 similarity route function of CDR3 sequences in these repertoires. These characteristics reflect 166 the transversal diversity of pBC and nBC antibody repertoires where the clones cover a larger 167 space and their similarity is more homogenously distributed at the global repertoire level. 168

Although a higher variability was detected across PC repertoire networks (Figure 2C), clonal 169 parameters were specific to B-cell stages (ppBC,nBC/PC < 0.05): PC clones possessed higher 170 centrality compared to pBC and nBC (closeness 20, eigenvector 21, and PageRank 22), while 171 antigen-inexperienced clones bridged sequence similarity (betweenness 23, Figure S2D). 172 Furthermore, in contrast to pBC and nBC, PC network clonal parameters correlated with CDR3 173 frequency (clonal degree median 𝑟@HIJKL, = 0.55, betweenness 𝑟@HIJKL, = 0.82) suggesting that 174 clonally expanded CDR3 sequences were structural centers of similar clones (Figure S2E, F). 175 CDR3 authority correlated positively with germline V-gene frequency in PC clones (𝑟@HIJKL, =176 0.39), denoting the potential role of the V-gene usage in the centralization of these networks 177 (Figure S2G). Thus, certain high frequency V-genes predispose clones to be highly connected 178 and similar to other clones. 179

180

The structure of antibody repertoires is reproducible and depends on the immune status 181

Network analysis revealed that antibody repertoires were constricted along B-cell development 182 throughout all similarity layers. At LD1, 44–62% of clones were similar (connected) to at least 183 one other clone in all B-cell stages, revealing a high degeneracy in clonal generation and 184 selection (Figure 2D). This indicated that nearly half of antibodies in the respective repertoires 185 had similar clones, thus demonstrating the extent of constriction present in antibody repertoires. 186

In order to understand if such degeneracy in CDR3 sequence similarity translated into 187 reproducible repertoire network structures 21, we determined the clonal empirical degree 188 distribution. The degree distribution is a distinctive feature of different types of networks and it 189 provides an immediate indication of how similarities (degrees) between antibody sequences are 190 distributed in repertoires. Analysis of the cumulative degree distribution revealed that antigen-191 inexperienced pBC, nBC and unimmunized PC repertoires were exponentially distributed (LD1), 192 whereas PC repertoires of immunized cohorts were power-law distributed (base similarity layer 193 LD1, Figure 2E, Figure S3D, E, F, G). Clusters of connected CDR3 clones showed a typical 194 tree-like structure for pBC and nBC, and a star-like structure for PC. The structure of the 195 network suggested an extended and chain-wise sequence similarity of the antibody clones in 196 pBC and nBC repertoires and targeted expansion of certain clones in PC after immunization. 197

In order to prove the tree-/star-like hypothesis and further investigate the sequence similarity 198 space, we performed k-core 24 decomposition and neighborhood analysis (Figure 2F, G, H). The 199 k-core decomposition revealed that the largest k-cores (after all external shells with k<kmax were 200 removed, where k is the degree, i.e. number of similar clones, see Methods) of pBC and nBC 201

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 6: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

6

(0.04% and 0.06% of CDR3 clones in k-core, respectively) were 200-fold smaller than those of 202 PC (8.2%, Figure 2F). Antigen-inexperienced repertoires were thus characterized by larger 203 coreness values (>20), signifying a more layered structure of CDR3 similarity (Figure S2J, K) 204 and confirming their tree-like structure. Furthermore, the high convergence of V-genes at the 205 core-level of antibody repertoire networks (pBC=50%, nBC=70%, PC=1–10%, Figure 2G), in 206 contrast with the low exact CDR3 sequence core-overlap (Figure S3A, B, C), suggested a 207 genetically determined origin of the structure. 208

The average CDR3 neighborhood size, which designated the set of similar CDR3 clones along 209 each sequential step of similarity from a certain clone (orders n=1–50), was order-independent 210 in PC and plateaued at 2% of the network, confirming that PC clones were connected to one 211 central clone in a star-like similarity structure, reflecting clonal selection and expansion 212 signatures. Neighborhood size 25, the number of similar clones to each clone, increased order-213 wise in antigen-inexperienced cells up to 34% (Figure 2H), signifying tree-like similarity 214 structures that enable maximal exploration of sequence space within the genetically 215 predetermined repertoire constriction space, suggesting that antibody repertoires are 216 evolutionarily wired to respond to diverse antigenic stimuli. 217

218

Antibody repertoires are highly robust systems 219

We hypothesized that the reproducible architecture of antibody repertoires may have evolved to 220 be robust to fluctuations in clonal composition. It is known that antibody repertoires are very 221 dynamic systems characterized by a high turnover rate 26–28. Therefore, we investigated the 222 robustness of antibody repertoire architecture to clonal removal (deletion). 223

It is has been recently established that individual repertoires have public clones, which are 224 defined as identical clones present in multiple individuals 29. While mostly distinct, antibody 225 repertoires possessed a fraction of public clones (15–26% along B-cell development, Figure 226 S2A). Given their regular presence, we determined if public clones were essential to the 227 maintenance of antibody repertoire architecture. We found that the highest authority clones 228 were public (Figure 3A) and up to 74% of private clones (specific to an individual) were 229 connected to at least one public clone (Figure S2I). To quantify the extent to which public clones 230 maintain the architecture of antibody repertoires, we tested the effect of removing public clones 231 on CDR3 degree distributions. In pBC and nBC, removal of all public clones transformed their 232 network structure from exponential to power-law; in contrast, removal of public clones led to no 233 change in PC network structure (Figure 3B). To assess if such a structural shift was specifically 234 due to the deletion of public clones, we removed (repeatedly) random subsets of clones 235 representing a similar fraction of public clones. The structure of antibody repertoires was robust 236 along B-cell stages at up to 50% removal of random clones. The same structural shift in 237 repertoire structure caused by the deletion of public clones could only be replicated by removing 238 90% of random clones (Figure 3C). Therefore, public clones represent pillars that are critical for 239 maintaining the architecture of an antibody repertoire, and the robustness of this architecture 240 suggests a functional immunity is preserved even after extensive (random) loss of antibody 241 clones (or B cells). 242

243

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 7: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

7

Antibody repertoires are evolutionary redundant 244

Redundancy is a hallmark of robust systems; for example, redundancy in genes with the same 245 function is the main mechanism of robustness against mutations in genetic networks 30. To 246 investigate the extent of redundancy within antibody networks, we examined whether their 247 architecture at the base similarity layer (LD1) was manifested in higher order similarity layers 248 (LD>1). Differences greater than one a.a. between antibody sequences could represent the 249 potential personal scenarios of antibody repertoire evolution (somatic hypermutation 12,13), a 250 result of successful survival through selective processes. Specifically, if a clone connected to 251 many other clones in the LD1 similarity layer mutates into a similar clone at a specific a.a. 252 position, this potential clone will be connected to many clones in the LD2 similarity layer. Thus, 253 higher order similarity layers can serve as surrogates for the evolution of potential antibody 254 repertoires from antigen-inexperienced B-cell populations. 255

To quantify the extent of redundancy across similarity layers, we calculated the prediction 256 accuracy of LD1 versus similarity layers LD2–12 using a leave-one-out cross-validation approach 257 (Figure 3D, Figure S3H and S3I). Specifically, quantitative redundancy was low in PC 258 (LD1→LD2–3 prediction accuracy was 28% on average); however, LD1 of pBC and nBC predicted 259 CDR3 degree profiles of proximal similarity layers LD2–3 with ≥80% accuracy (Figure 3D and 260 3E), thereby indicating a high redundancy in antibody repertoire architecture. This high 261 redundancy is explained by the structure of the antibody networks (Figure 2E–H). Although the 262 distance between proximal similarity layers (LD1 to LD3) seems small (1–3 a.a. CDR3 sequence 263 differences), it represents ≈20% of potential change in clonal a.a. sequence (99% of CDR3 264 clones are 4–20 a.a. long), which is in the range of highly mutated antibodies (e.g., broadly-265 neutralizing HIV-specific 31). Therefore, redundancy in the antigen-inexperienced repertoire is 266 maintained throughout a large sequence space and provides details on the pre-programmed 267 evolvability 32,33 of antibody responses. 268

269

DISCUSSION 270

In summary, leveraging a custom-developed analysis platform for generating large-scale 271 networks from datasets of millions of unique sequences, we have discovered three fundamental 272 principles of antibody repertoire architecture: (i) reproducibility (ii) robustness and (iii) 273 redundancy. We were able to detect a high cross-individual reproducibility by quantifying 274 network parameters 17–19 at the global (size, diameter and assortativity) and clonal level (degree, 275 transitivity, authority, closeness, betweenness, PageRank) of antibody repertoires along B-cell 276 development. Importantly, the reproducible clonal similarity properties were suggestive of the 277 underlying immunobiology of each B-cell stage: antigen-inexperienced repertoires covered an 278 extended sequence diversity space (tree-like exponential similarity structure) to counter high 279 antigen diversity whereas, antigen-experienced repertoires presented a centralized network 280 structure (star-like, power-law), with many clones being similar to one central clone possibly 281 originating from antigen-dependent clonal expansion and selection 34. 282

Large-scale network analysis of entire antibody repertoires revealed that these systems are 283 robust enough to be amenable to subsampling, which is in contrast to other systems (Lee et al., 284 2006; Sethu and Chu, 2012). Specifically, we showed that the structure of antibody repertoire 285

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 8: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

8

networks was robust to extensive subsampling of up to 50% of the clones. This result is crucial 286 for the network analysis of human antibody repertoires, where biological subsampling remains 287 an important problem 35,36. The robustness of antibody repertoires might also explain their 288 functionality despite the large fluctuations of antibody repertoire composition over time 26–28. 289 Interestingly, the structure of antibody repertoires was fragile to the removal of public clones. 290 The crucial role that public clones play as pillars of antibody repertoire architecture was 291 revealed by large-scale networks, yet future research will need to determine the functional role 292 (antigen specificity) of public clones in a humoral response. 293

We found that antibody repertoires presented intrinsic redundancy across similarity layers. This 294 means that not only minimal differences (1 a.a. of the base layer LD1) but also further 295 diversification (> 1 a.a. differences between antibody sequences) may be hardcoded into the 296 constricted sequence space of antibody repertoires, thus rendering their evolvability robust 297 (analogously to other biological systems such as transcription factor networks 33). 298

This work delineates guidelines for the large-scale network construction and analysis of large 299 and diverse immune repertoires. In particular, our network analysis approach can be used 300 where a partial biological coverage of the repertoire is available, although this might depend on 301 the B-cell stage, species, and similarity layer investigated. The network quantitative analysis of 302 global and clonal properties of adaptive immune repertoires (antibody and T cell receptor 303 repertoires) in health and disease is essential to comprehensively understand their architecture 304 and may resolve limitations arising from visualization of graphics featuring high-dimensional 305 data. 306

The three fundamental principles of the architecture of antibody repertoires uncovered here 307 through network analysis may serve as a blueprint for the construction of synthetic antibody 308 repertoires, which may be used to simulate natural humoral immunity for monoclonal antibody 309 drug discovery and vaccine development 32,37. Large-scale antibody network analysis could be 310 useful in personalized medicine in the prediction of immunity scenarios for repertoire-311 transforming diseases such as autoimmunity or lymphomas, which lead to major alterations in 312 repertoire composition 38,39; this may allow for interventions to modify disease progression on 313 the repertoire level by precision therapeutic clonal targeting. Finally, we believe the stage is set 314 for a rapid progression of the present guidelines into what was long ago envisioned by Niels K. 315 Jerne 40: the field of network systems immunology, which offers the potential to obtain greater 316 understanding of the complexity of immune responses. 317 318 319 320 321 322 323 324 325 326 327 328 329

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 9: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

9

REFERENCES 330 331 1. Bonsignori, M. et al. Antibody-virus co-evolution in HIV infection: paths for HIV vaccine 332

development. Immunol. Rev. 275, 145–160 (2017). 333 2. Romero, P. et al. The Human Vaccines Project: A roadmap for cancer vaccine development. 334

Sci. Transl. Med. 8, 334ps9-334ps9 (2016). 335 3. Hozumi, N. & Tonegawa, S. Evidence for somatic rearrangement of immunoglobulin genes 336

coding for variable and constant regions. Proc. Natl. Acad. Sci. 73, 3628–3632 (1976). 337 4. Murphy, K., Travers, P. & Walport, M. Janeway’s immunobiology. N. Y. NY Garland Sci. 338

(2012). 339 5. Weinstein, J. A., Jiang, N., White, R. A., Fisher, D. S. & Quake, S. R. High-Throughput 340

Sequencing of the Zebrafish Antibody Repertoire. Science 324, 807–810 (2009). 341 6. Tonegawa, S. Somatic generation of antibody diversity. Nature 302, 575–581 (1983). 342 7. Bashford-Rogers, R. J. M. et al. Network properties derived from deep sequencing of human 343

B-cell receptor repertoires delineate B-cell populations. Genome Res. 23, 1874–1884 (2013). 344 8. Ben-Hamo, R. & Efroni, S. The whole-organism heavy chain B cell repertoire from 345

Zebrafish self-organizes into distinct network features. BMC Syst. Biol. 5, 27 (2011). 346 9. Chang, Y.-H. et al. Network Signatures of IgG Immune Repertoires in Hepatitis B 347

Associated Chronic Infection and Vaccination Responses. Sci. Rep. 6, 26556 (2016). 348 10. Hoehn, K. B. et al. Dynamics of immunoglobulin sequence diversity in HIV-1 infected 349

individuals. Philos. Trans. R. Soc. B Biol. Sci. 370, 20140241 (2015). 350 11. Lindner, C. et al. Diversification of memory B cells drives the continuous adaptation of 351

secretory antibodies to gut microbiota. Nat. Immunol. 16, 880–888 (2015). 352 12. Klein, F. et al. Somatic Mutations of the Immunoglobulin Framework Are Generally 353

Required for Broad and Potent HIV-1 Neutralization. Cell 153, 126–138 (2013). 354 13. Wu, X. et al. Maturation and Diversity of the VRC01-Antibody Lineage over 15 Years of 355

Chronic HIV-1 Infection. Cell 161, 470–485 (2015). 356 14. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S. & Stoica, I. Spark: Cluster 357

Computing with Working Sets. HotCloud 10, 95 (2010). 358 15. Lee, S. H., Kim, P.-J. & Jeong, H. Statistical properties of sampled networks. Phys. Rev. E 359

73, (2006). 360 16. Sethu, H. & Chu, X. A new algorithm for extracting a small representative subgraph from a 361

very large graph. ArXiv Prepr. ArXiv12074825 (2012). 362 17. Amit, I. et al. Unbiased Reconstruction of a Mammalian Transcriptional Network Mediating 363

Pathogen Responses. Science 326, 257–263 (2009). 364 18. Newman, M. Networks: an introduction. (Oxford University Press Inc., 2010). 365 19. Pavlopoulos, G. A. et al. Using graph theory to analyze biological networks. BioData Min. 4, 366

10 (2011). 367 20. Newman, M. E. Assortative mixing in networks. Phys. Rev. Lett. 89, 208701 (2002). 368 21. Freeman, L. C. Centrality in social networks conceptual clarification. Soc. Netw. 1, 215–239 369

(1978). 370 22. Bonacich, P. Power and centrality: A family of measures. Am. J. Sociol. 92, 1170–1182 371

(1987). 372 23. Brin, S. & Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. 373

Netw. ISDN Syst. 30, 107–117 (1998). 374 24. Barabási, A.-L. & Oltvai, Z. N. Network biology: understanding the cell’s functional 375

organization. Nat. Rev. Genet. 5, 101–113 (2004). 376

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 10: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

10

25. Seidman, S. B. Network structure and minimum degree. Soc. Netw. 5, 269–287 (1983). 377 26. Galson, J. D. et al. In-Depth Assessment of Within-Individual and Inter-Individual Variation 378

in the B Cell Receptor Repertoire. Front. Immunol. 6, (2015). 379 27. Horns, F. et al. Lineage tracing of human B cells reveals the in vivo landscape of human 380

antibody class switching. eLife 5, e16578 (2016). 381 28. Jiang, N. et al. Lineage structure of the human antibody repertoire in response to influenza 382

vaccination. Sci. Transl. Med. 5, 171ra19–171ra19 (2013). 383 29. Jackson, K. J. L., Kidd, M. J., Wang, Y. & Collins, A. M. The Shape of the Lymphocyte 384

Receptor Repertoire: Lessons from the B Cell Receptor. Front. Immunol. 4, (2013). 385 30. Wagner, A. Robustness against mutations in genetic networks of yeast. Nat. Genet. 24, 355–386

361 (2000). 387 31. Burton, D. R. & Hangartner, L. Broadly Neutralizing Antibodies to HIV and Their Role in 388

Vaccine Design. Annu. Rev. Immunol. 34, 635–659 (2016). 389 32. Briney, B. et al. Tailored Immunogens Direct Affinity Maturation toward HIV Neutralizing 390

Antibodies. Cell 166, 1459–1470.e11 (2016). 391 33. Payne, J. L. & Wagner, A. The robustness and evolvability of transcription factor binding 392

sites. Science 343, 875–877 (2014). 393 34. Burnet, S. F. M. The clonal selection theory of acquired immunity. 3, (Vanderbilt University 394

Press Nashville, 1959). 395 35. Greiff, V., Miho, E., Menzel, U. & Reddy, S. T. Bioinformatic and Statistical Analysis of 396

Adaptive Immune Repertoires. Trends Immunol. 36, 738–749 (2015). 397 36. Yaari, G. & Kleinstein, S. H. Practical guidelines for B-cell receptor repertoire sequencing 398

analysis. Genome Med. 7, (2015). 399 37. Sidhu, S. S. & Fellouse, F. A. Synthetic therapeutic antibodies. Nat. Chem. Biol. 2, 682–688 400

(2006). 401 38. Logan, A. C. et al. High-throughput VDJ sequencing for quantification of minimal residual 402

disease in chronic lymphocytic leukemia and immune reconstitution assessment. Proc. Natl. 403 Acad. Sci. 108, 21194–21199 (2011). 404

39. Stern, J. N. et al. B cells populating the multiple sclerosis brain mature in the draining 405 cervical lymph nodes. Sci. Transl. Med. 6, 248ra107–248ra107 (2014). 406

40. Jerne, N. K. Towards a network theory of the immune system. in Annales d’immunologie 407 125, 373–389 (1974). 408

41. Menzel, U. et al. Comprehensive Evaluation and Optimization of Amplicon Library 409 Preparation Methods for High-Throughput Antibody Sequencing. PLoS ONE 9, e96727 410 (2014). 411

42. Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. 412 Methods 12, 380–381 (2015). 413

43. Lefranc, M.-P. et al. IMGT, the international ImMunoGeneTics database. Nucleic Acids Res. 414 26, 297–303 (1998). 415

44. Dave, A. et al. Graphframes: an integrated api for mixing graph and relational queries. in 416 Proceedings of the Fourth International Workshop on Graph Data Management Experiences 417 and Systems 2 (ACM, 2016). 418

45. Schult, D. A. & Swart, P. Exploring network structure, dynamics, and function using 419 NetworkX. in Proceedings of the 7th Python in Science Conferences (SciPy 2008) 2008, 11–420 16 (2008). 421

46. Peixoto, T. P. The graph-tool python library. figshare (2014). 422

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 11: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

11

47. Walt, S. van der, Colbert, S. C. & Varoquaux, G. The NumPy array: a structure for efficient 423 numerical computation. Comput. Sci. Eng. 13, 22–30 (2011). 424

48. Clauset, A., Shalizi, C. R. & Newman, M. E. J. Power-Law Distributions in Empirical Data. 425 SIAM Rev. 51, 661–703 (2009). 426

49. Gillespie, C. S. Fitting heavy tailed distributions: the poweRlaw package. J. Stat. Softw. 64, 427 (2015). 428

50. Virkar, Y. & Clauset, A. Power-law distributions in binned empirical data. Ann. Appl. Stat. 8, 429 89–119 (2014). 430

51. Newman, M. E. The structure and function of complex networks. SIAM Rev. 45, 167–256 431 (2003). 432

52. Newman, M. E. Random graphs as models of networks. ArXiv Prepr. Cond-Mat0202208 433 (2002). 434

53. Kleinberg, J. M. Authoritative sources in a hyperlinked environment. J. ACM JACM 46, 435 604–632 (1999). 436

54. Csárdi, G. & Nepusz, T. The igraph library. http://igraph.org/ (2006). 437 55. Hyndman, R. J. & Khandakar, Y. Automatic time series for forecasting: the forecast package 438

for R. (Monash University, Department of Econometrics and Business Statistics, 2008). 439 56. R Core Team. R: A Language and Environment for Statistical Computing. R Found. Stat. 440

Comput. (2016). 441 57. Wickham, H. ggplot2: elegant graphics for data analysis. Springer N. Y. (2009). 442 58. Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC 443

Bioinformatics 11, 367 (2010). 444 59. Bastian, M., Heymann, S., Jacomy, M. & others. Gephi: an open source software for 445

exploring and manipulating networks. ICWSM 8, 361–362 (2009). 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 12: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

12

MAIN FIGURE LEGENDS 467 468 Figure 1. Large-scale network analysis reveals the architecture of antibody repertoires 469 and its three main principles. 470 (A) Large-scale networks (>500,000 nodes) of antibody repertoires were constructed from the 471 Levenshtein distance (LD, edit string distance) matrix of CDR3 clonal sequences (a.a) using a 472 custom high-performance computing platform (see Methods). Networks represent antibody 473 repertoires of similar CDR3 nodes connected by edges when amino acid CDR3 sequences 474 differ by a predetermined LD. All clones of a repertoire connected at a given LD form a similarity 475 layer (LDn). 476 (B) Deconvolution of the complexity of antibody repertoire architecture was performed by 477 quantifying (i) its reproducibility through global and clonal (local) properties, (ii) robustness to 478 clonal deletion and (iii) redundancy across its similarity layers in the sequence space (Figure 479 S1). 480 481 Figure 2. Global and clonal properties of antibody repertoire networks are reproducible. 482 (A) Network size of antibody repertoires. The y-axis indicates the absolute number count of 483 CDR3 nodes, CDR3 edges (similarities) and CDR3 clones in the largest component. The mean 484 percentage of the CDR3s belonging to the largest component by B-cell development stage is 485 shown on top of the dark blue bar. 486 (B) Global properties, diameter and assortativity coefficient are shown for pre-B cells (pBC), 487 naïve B cells (nBC) and plasma cells (PC). 488 (C) The mean value of the coefficient of variation for clonal properties in pBC, nBC and PC 489 repertoires. Wilcoxon test, ppBC,nBC/PC < 0.05 (see Methods). 490 (D) Percentage of clones connected to at least one other clone in the repertoire at LD1, LD≤2, …, 491 LD≤12 in pre-B cells, naïve B cells, plasma cells and randomly constructed CDR3 strings. 492 (E) The power-law (orange), exponential (red) and Poisson (grey) distributions were fit to the 493 cumulative degree distributions of naïve B cell and plasma cell (unimmunized) repertoires of a 494 mouse for similarity layers LD1,3,7 (log-log scale). Representative clusters are shown for LD1. 495 (F) Percentage of CDR3 clones (mean±s.e.m) that compose the maximal core. Subgraph of the 496 maximal k-core (red), and k-1 (black), k-2 (dark grey) and k-3 (light grey) cores in a 497 representative pBC sample (Unimmunized mouse n. 2). 498 (G) Percentage overlap of CDR3 germline V-genes in the maximal core of nBC repertoires (n = 499 5 mice and data sets for Unimm, OVA, NP-HEL, n = 4 mice sets for HBsAg). 500 (H) Normalized neighborhood size for orders n={1–10, 15, 20, 30, 40, 50} across CDR3 clones 501 (similarity layer LD1). 502 For 2A, B, D, barplots show mean±s.e.m; for 2A-E, each B-cell stage n = 19 mice. 503 504 Figure 3. The architecture of antibody repertoires is robust and redundant. 505 (A) CDR3 clones of an exemplary naïve B-cell repertoire (from OVA-immunized mouse n. 1) 506 have been ordered from increasing to decreasing frequency (CDR3 rank). Public clones are 507 color-coded in red. 508 (B) Bootstrapped p-values of the power-law fit are shown for complete antibody repertoires and 509 after removing public clones. Power law is a good fit to degree distributions for p-values above 510

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 13: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

13

the dashed red line (p-value = 0.1). Examples of exponential (red) and power-law (grey) 511 networks are shown on the top panel. 512 (C) CDR3 clones were removed randomly at 10%, 50% and 90% from each original repertoire 513 (20 times) and the power-law distribution was fit to the cumulative degree distributions of the 514 remaining CDR3 clones. A p-value=0.1 is indicated as a red dashed line. In PC samples a fit 515 was not feasible after removal of 90% of CDR3 clones (NA). 516 (D) Heatmaps indicate the mean prediction accuracy (Q2, leave-one-out cross-validated R2) of 517 similarity layer LD1 versus similarity layers LD2–12. The scatterplot shows Q2 for LD1 vs. LD2 for 518 each CDR3 clone. 519 (E) Prediction accuracy (Q2) for LD1 vs. LD2 and LD3. 520 For 2B, C, E, barplots show mean±s.e.m. 521 522 SUPPLEMENTARY FIGURE LEGENDS 523 524 Figure S1. High-performance computing platform to construct and analyze large-scale 525 networks from entire antibody repertoires. 526 (A) Data preprocessing, network construction and model fits to degree distribution (see 527 Methods, Degree distribution fits for further details). Network parameters (global and mean 528 local/clonal) are shown for the exemplary network. 529 (B) Software schematics showing the distributed parallel computing platform used to partition 530 the work among a cluster of many workers. 531 (C) Computation time to construct large-scale networks depends on the number of CDR3 532 sequences and the number of cores used. 533 534 Figure S2. Global and clonal (local) network parameters of antibody repertoires of pre-B 535 cells (pBC), naïve B cells (nBC) and plasma cells (PC). 536 (A) Percentage of public clones, shared CDR3 clones between mice in pre-B cell (pBC), naïve B 537 cell (nBC) and plasma cell (PC) repertoires. 538 (B–C) Global properties: Cluster analysis shows the average normalized cluster size and cluster 539 number in the antibody repertoire networks. Average degree, clustering coefficient, density and 540 (degree) centralization characterize the networks at the global level. 541 (D) Local properties: authority, PageRank, eigenvector, closeness and betweenness describe 542 each clone in the network. Average values are shown for each B cell population, pre-B cells 543 (pBC), naïve B cells (nBC) and plasma cells (PC). Barplots show mean±s.e.m, mice n=19. 544 (E) Pairwise Pearson correlation (r, mean±s.e.m) of CDR3 degree with CDR3 frequency in pre-545 B cells (pBC), naïve B cells (nBC) and plasma cells (PC) antibody repertoire networks. 546 (F) Pairwise Pearson correlation of local properties with CDR3 frequency (median, mice n=19). 547 (G) Pairwise Pearson correlation of local properties with germline V-gene frequency (mean, 548 mice n=19). 549 (H) Pairwise Pearson correlation of CDR3 clonal (local) properties with public (1) vs. non-public 550 (0) CDR3 clones (mean, mice n=19). 551 (I) Percentage of public clones similar (connected) to at least one other public CDR3 clone 552 sequence by cohort (mean, mice n=19). 553

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 14: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

14

(J) Coreness density distribution for the unimmunized cohort of pre-B cells (pBC), naïve B cells 554 (nBC) and plasma cells (PC). The x-axis shows the k-core (after removing sequentially shells of 555 nodes of degree k-1). Line colors depict different mice. 556 (K) Example of CDR3 clones in the largest clique (complete connected subgraph) from a pre-B 557 cell repertoire (NP-HEL cohort). 558 For S2A-E, I, barplots represent mean±s.e.m; for each B-cell stage, n=19 mice. 559 560 561 Figure S3. Core and structure (degree distributions of CDR3 similarity) analysis, and 562 similarity layer prediction of antibody repertoire networks. 563 (A) Maximal core CDR3 clones overlap in pre-B cells (pBC), naïve B cells (nBC) and plasma 564 cell (PC) repertoire networks. 565 (B) Maximal core germline V-genes overlap in pre-B cells and plasma cell. 566 (C) Percentage of the largest cliques (completely connected subgraph) along B cell 567 development. Barplots represent mean±s.e.m; for each B-cell stage, n=19 mice. 568 (D) Cumulative degree distributions (CDF). Each distribution line (different symbols) depicts one 569 similarity layer LD1–12 (HBsAg-immunized mouse n. 4). 570 (E) p-values (mean±s.e.m) of the power-law fit for each cohort. 571 (F) One-sided and two-sided p-values (mean±s.e.m) for the discrimination between the 572 exponential (one-sided p-value=1, two-sided p-value=0) and the power-law fits. 573 (G) Graphics of power-law (γ =2.2), exponential and random network models of 100 nodes. 574 (H) Prediction accuracy (Q2, leave-one-out cross-validated R2, mean±s.e.m) of selected distant 575 similarity layers LD4–12 from LD1. 576 (I) Prediction accuracy (Q2, mean±s.e.m) of all similarity layers (LD2–12) from LD1. 577 578 579 580

581 582 583

584 585

586 587

588 589

590 591

592

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 15: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

15

METHODS 593 594 EXPERIMENTAL MODEL AND SUBJECT DETAILS 595 596 Dataset 597 The dataset analyzed was produced as described by Greiff et al. in the attached manuscript. 598 Briefly, murine B-cell populations of pre-B cells (pBC, IgM, bone marrow), naïve follicular B cells 599 (nBC, IgM, spleen), and memory plasma cells (PC, IgG, bone marrow) were sorted using 600 fluorescence-activated cell sorting (FACS) from C57BL/6J mice unimmunized (n=5) or prime-601 boost immunized with alum-precipitated antigens: nitrophenylacetyl-conjugated hen egg 602 lysozyme (NP-HEL, n=5), ovalbumin (OVA, n=5) or Hepatitis B virus surface antigen (HBsAg, 603 n=4). Following total RNA extraction, full-length antibody variable heavy chain (VDJ) libraries 604 were generated by a two-step PCR process, as described previously 41. Libraries were 605 sequenced using the Illumina MiSeq (2x300bp) platform. Mean Phred-scores of raw data were ≥ 606 30. Approximate paired-end reads (full-length VDJ) were: pBC 5x106 reads, nBC 10x106 reads 607 and PC 4x106 reads. 608 609 METHOD DETAILS 610 611 Data preprocessing and CDR3 clonal analysis 612 Antibody sequences have been preprocessed and VDJ annotated with MiXCR 42 and further 613 filtered to retain only those sequences that had CDR3 length ≥ 4 a.a. and occurred more than 614 once in each CDR3 repertoire data set (Figure S1A). Clones were defined by 100% a.a. 615 sequence identity of CDR3 regions. CDR3 regions were defined by MiXCR according to the 616 nomenclature of the Immunogenetics Database (IMGT) 43. 617 618 Network construction 619 To construct networks (graphs), a sparse triangle matrix of pairwise Levenshtein distances (LD) 620 between CDR3s must first be computed. For small samples (up to 100,000 unique CDR3 621 sequences) such a calculation is relatively quick on a single computer. However, due to the N2 622 complexity of required calculations, computing the pairwise matrix for samples of >100,000 623 unique CDR3 sequences becomes prohibitively expensive. To perform these computations, we 624 developed software that utilizes the Apache Spark (2) distributed computing framework to 625 partition the work among a cluster of many machines (Figure S1B). We chose specifically 626 Apache Spark because i) its deployment is very flexible with regard to underlying computing 627 infrastructure and ii) for similarity layers LD>1, the networks become extremely large and difficult 628 to process. In these cases, our package can take advantage of the Spark GraphFrames 629 distributed graph library 44, which allows scaling to even larger samples with millions of 630 sequences (Figure S1C). With this approach we were able to compute the distance matrices for 631 large samples (>100,000 unique CDR3 sequences) within minutes (Figure S1, B and C). 632 In addition to the computational complexity inherent in creating the distance matrix, the 633 construction of networks for large LD is very computationally and time-wise costly. We therefore 634 avoided constructing networks altogether for calculating the node degrees and instead used a 635 map-reduce distributed algorithm. For practical purposes, the construction of small networks 636 was performed using the Networkx library 45. For generating and outputting the largest graphs to 637 disk in common network formats, we used the efficient graph-tool library (https://graph-638 tool.skewed.de/, 46). For manipulating and analyzing the largest networks, our software package 639 took advantage of the Spark GraphFrames distributed graph library 44. 640

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 16: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

16

The software was developed in python (https://www.python.org/) using the Numpy/Scipy 47 641 scientific libraries for matrix and array manipulation and Apache Spark 14 as the distributed 642 backend. Our software package for antibody repertoires imNet is available 643 (https://github.com/rokroskar/imnet) and includes tutorials and demos, including scripts to set up 644 the distributed computation environment on commonly-used compute cluster infrastructure. The 645 results shown in this work were obtained using 1–625 cores of the Euler parallel-computing 646 cluster operated by ETH Zürich. 647 648 QUANTIFICATION AND STATISTICAL ANALYSIS 649 650 Degree distribution fits 651 Degrees (number of similar CDR3 sequences to a specific CDR3 sequence) were calculated for 652 each of the similarity layers LD1–12 for each CDR3 sequence in each sample. CDR3 with zero 653 degrees that were not similar to any other CDR3 in the network were excluded in order to fit 654 degree distributions. The power-law, exponential and Poisson distributions were fitted to the 655 empirical degree distributions of the networks, constructed as described in Network 656 construction, by estimating xmin (estimated lower degree threshold by minimizing the 657 Kolmogorov-Smirnoff statistic 48) and optimizing model parameters using the poweRlaw 49 658 package. We first discriminated if the power-law distribution could describe the best fit to the 659 degree distribution by bootstrapping 100 times the power-law p-value obtained from each 660 sample after estimating xmin. Following the approach described by Virkar and Clauset 50, a p-661 value ≥ 0.1 indicated that the power-law distribution described the degree distribution (Figure 662 S1A). To determine the degree distribution in cases where the power law was not the best 663 distribution fit (p-value < 0.1), we compared the exponential and the Poisson fits. Two-sided p-664 value≈0 indicated that the fitted models could be discriminated, and one-sided p-value≈1 665 indicated that the first (for example exponential) model was the best fit for the data 49. 666 667 Robustness of the architecture of antibody repertoire networks 668 Public clones were defined as clones shared among subjects in a cohort (Figure S2). In order to 669 assess the robustness of the architecture of antibody repertoire networks we removed public 670 clones from each sample-repertoire. As controls, we performed repeated removal (20 times) of 671 randomly selected clones in the size of public clones. The p-values for the power-law fit were 672 calculated after 100x bootstrapping for each repertoire; one-sided and two-sided p-values were 673 used for the comparison between the exponential and the Poisson fits (see Degree distribution 674 fits). 675

676 Network analysis 677 Drawing from network theory 51, we translated the concepts of network analysis 18 to antibody 678 repertoires. An antibody repertoire network is an undirected graph G = (V, E) described as a set 679 of nodes (CDR3 vertices, V) together with a set of connections (similarity edges, E), 680 representing the adjacency matrix A of pairwise Levenshtein distances (LD) between CDR3 a.a. 681

sequences𝐀 =0 ⋯ LDT,⋮ ⋱ ⋮

LD,T ⋯ LD,,. 682

In the context of antibody repertoires, we let N = |V| and L = |E|. The order of a graph N 683 represents the number of its unique CDR3 clones (nodes). The size of a graph L is the number 684 of its CDR3 similarity connections (edges). The degree k, that represents the edges connected 685 to a node, describes the count of all similar CDR3 clones to a CDR3 based on LD. Because the 686 degree indicates how active a node is, it could be interpreted as a measure of how central a 687

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 17: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

17

CDR3 clone is in the antibody repertoire network. In simpler terms, it quantifies the number of 688 CDR3 clones that are similar to a certain CDR3, and thus the potential development or the 689 evolutionary routes to this CDR3. 690 The average degree 𝑘 ≡ XYZ

Y[\]

= ^_]

is the average number of similar CDR3 clones. The 691 degree distribution P k = NX/N, defined as the fraction of nodes with degree 𝑘 (NX) in total 692 nodes, represents the fraction of CDR3 clones that have the same number of similar CDR3s. 693 The cumulative degree distribution PX = pXef

XegX describes the fraction of nodes with degree 694 greater than or equal to k′. In Erdős–Rényi (ER) random graph models, degrees follow a 695 Poisson distribution P k = X iHj i

X! in the limit of large numbers of nodes, while degree 696

distributions have an exponential tail P k ~enoX in exponential networks 52. 697 Global characterization 18 described the network as a whole, such as degree distribution, 698 centralization, largest component, diameter, clustering coefficient, assortativity and coreness. 699 The centralization analysis indicates if the network is homogeneous (clones are connected in 700 the same way) or is centered around certain nodes (highly connected clonal regions compared 701 to less connected regions in the same network). The largest component is the largest cluster of 702 connected CDR3 clones. The diameter (d) is the maximum distance (shortest path between two 703 nodes) between any pair of CDR3 sequences. The clustering coefficient (C) represents the 704 probability that neighbors of a node are also connected, which translates in antibody repertoires 705 as the probability that CDR3 clones similar to a specific CDR3 are also similar among one 706 another. Network density (D) is the ratio of the number of edges (CDR3 similarities) and the 707 number of all possible edges in the network. The assortativity coefficient (r) indicates if nodes in 708 a network connect to nodes with similar characteristics. It is positive if nodes tend to connect to 709 nodes that are similar to them (i.e. highly connected CDR3 sequences are similar and connect 710 to highly connected CDR3 sequences), and negative otherwise. Coreness is a measure of the 711 network’s cohesion and allows one to understand the global network structure and is useful in 712 comparing complex networks by analyzing the subsets of CDR3-cores that form layers in the 713 antibody repertoire. K-core decomposition is a process that is performed by iteratively removing 714 shells of all vertices of degree less than k (k<kmax) leaving the k-cores of a network (its 715 connected component). The k-core of a graph is the maximal subgraph in which each node has 716 at least degree k. We have computed the maximal k-core of antibody repertoire networks (the 717 innermost core, kmax) and the core distribution along k degrees. 718 Clonal (local) characterization of antibody repertoires was performed by analyzing local 719 properties of the networks 18. The importance of CDR3 clones was measured by calculating the 720 authority 53, eigenvector 22 and PageRank 23 scores of each node in repertoire networks. In 721 particular, the authority (a) of nodes is defined as the principal eigenvector of the transpose 722 matrix t 𝐀 ∗ 𝐀, where 𝐀 is the adjacency matrix of the network. Eigenvector centrality indicates 723 the centrality of a CDR3 clone, not only dependent on the number of similar CDR3 (number of 724 degree, connections) but also on the quality of those connections: CDR3-nodes with high 725 eigenvector values are connected to many other nodes which are, in turn, connected to many 726 others (and so on). PageRank measures the importance of the similarity between two CDR3 727 clones within the network extending beyond the approximation of a CDR3 importance or quality. 728 Closeness (centrality 21) (c) was calculated to measure how many steps were required to 729 access every other CDR3 from a given CDR3 clone in antibody repertoire networks. We 730 calculated the normalized closeness by multiplying the raw closeness by n-1, where n was the 731 number of nodes in the network. Clique analysis identified maximally-connected subgraphs (a 732 subset of nodes) in which every CDR3 was similar to every other CDR3 sequence and the 733 largest clique was the maximal completed subgraph which had more nodes than any other 734 clique in the network. The node betweenness (b) is the number of geodesics (shortest paths) 735

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 18: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

18

going through a node and indicates the “bridge” function of a CDR3 sequence. Network 736 properties were calculated using the igraph 54 R package. 737

738 Quantifying the predictive performance (Q2) of linear regression models 739 The predictive performance (𝑄^) of each linear regression model (𝒀 = 𝑿𝛽 + 𝜺) was calculated 740 using leave-one-out cross-validation (LOOCV): 𝑄^ = 1 − PRESS

TSS∙ 100, where PRESS is the 741

predictive error sum of squares ( 𝒀𝒋 − 𝒀[|]^=

|gT with 𝒀[|] denoting the prediction of the model 742 when the j-th case is deleted from the training set and TSS is the total sum of squares 743 ( 𝒀𝒋 − 𝒀

^=~gT ) (Greiff et al., 2012). X and Y are CDR3 degree vectors of repertoires at each 744

LD1–12. LOOCV was performed using the forecast R package 55. Cross-validation was used 745 because, in contrast to regular regression analysis, it enables the quantification of the predictive 746 performance of each regression model. 747 748 Simulated networks 749 Networks (nodes V=102–105) were simulated with the ER, exponential and power-law models 750 using base R 56 and igraph 54. Random networks were simulated according to the ER model, 751 exponential networks were simulated setting a probability of a connection between two nodes 752 p=0.5 and scale-free networks were simulated using the Barabási-Albert model (Barabási and 753 Albert, 1999). 754

755 Graphics 756 Graphic representations were produced using base R 56 and the ggplot2 R package 57 . 757 Heatmaps were produced using the NMF package 58. Networks and network clusters 758 visualization were performed using igraph 54 employing the Fruchterman–Reingold force-759 directed and Kamada–Kawai layout algorithms. Large-scale networks (Figure 1a) were 760 visualized using Gephi (version 0.9.1) 59; node size was scaled 10–100 proportional to the 761 degree of a node and a blue to grey color gradient was applied to nodes from high to low 762 degrees. 763 764 Statistical significance 765 Statistical significance was tested using the Wilcoxon rank-sum test. Results were considered 766 significant for p<0.05. 767 768 DATA AND SOFTWARE AVAILABILITY 769 770 Antibody repertoire sequencing data analyzed is available with ArrayExpress accession 771 number: E-MTAB-5349. Software is available at https://github.com/rokroskar/imnet. 772 773 Acknowledgments 774 775 We thank Manuel Kohler from the Scientific IT Services of ETH Zürich for technical support. We 776 thank Dr. Antonios Garas, Ulrike Menzel and Simon Friedensohn for scientific discussions, and 777 Dr. Laura Prochazka, Alexander Yermanos and Cédric Weber for critically reading the 778 manuscript. This work was funded by the Swiss National Science Foundation (Project no: 779 31003A_143869, 31003A_170110 to STR), SystemsX.ch – AntibodyX RTD project (to STR), 780 Swiss Vaccine Research Institute (to STR). The professorship of STR is made possible by the 781 generous endowment of the S. Leslie Misrock Foundation. 782

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 19: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

Figure 1

20

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 20: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

21

Figu

re 2

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 21: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

22

Figu

re 3

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 22: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

23

0

5

10

15

Number of CDR3 strings (millions)

Tim

e (m

inut

es) Number

of cores4080160320625

0.1 0.2 0.4 0.8 1.6

Full VDJ

sequences

Annotated

sequences

CDR3 a.a.

sequences

Network

G = (V,E)

Global

properties

Clonal

parameters

Cumulative

degree distribution

Pk =

P1k0=k pk0

Estimate x

min

x

min

6= x̂

min

Set x

min

min

Estimate parameters

one-sided

two-sided

p-values

p-value

Powerlaw

p-value � 0.1

Model discrimination

one-sided p-value ⇡ 0

two-sided p-value ⇡ 1

MiXCR

CDR3 length � 4 a.a.

CDR3 count > 1

b)

minimize KS

Power law

Exponential

Poisson

yes

no

Bootstrap

n=100

Compare

models

A

Network parameters of exemplary networkGlobal Local MeanNodes (V) 13 Transitivity 0.54Edges (E) 12 Authority 0.41Largest component (%)

77 PageRank 0.08Maximal k-core (V) 2 Eigenvector 0.32Clique (%) 23 Closeness 0.01Diameter 4 Betweenness 4.38Assortativity -0.42Average cluster size 6.5Average cluster # 2Average degree 2Clustering coefficient 0.14Density 0.15Centralization 0.34

B

0

BBB@

CDR3 repertoire AA AAB . . . CDR3

10

2n<10

6

AA 0 1 . . . LD

1n

AAB 1 0 . . . LD

2n.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

CDR3

10

2n<10

6LDn1 LDn2 . . . LDnn

1

CCCA

| {z }Levenshtein distance (LD) matrix

CDR3 a.a.sequences

ID CDR3 string (s)0 CARGH1 CARAL2 CARGR3 CARAG

Indices = [3, 0, 2, 1]

Worker 1indices = [3, 0]

Worker 2indices = [2, 1]

0 1 2 30 0 21 12 23 0

0 1 2 30 2 11 0 22 03

0 1 2 30 0 2 1 21 0 2 12 0 23 0

NetworkG = (V,E)

Index

Randomize indices

Distribute

Ld(s1, s2) Ld(s1, s2)

Combine

Exemplary network

C

Figure S1SUPPLEMENTARY INFORMATION

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 23: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

24

15

2326

0

10

20

30

40

pBC nBC PC

Publ

ic c

lone

s (%

)

0.490.39

0.7

0.0

0.2

0.4

0.6

0.8

pBC nBC PCAver

age

clus

ter n

umbe

r

0 0

0.02

0.000

0.005

0.010

0.015

0.020

pBC nBC PCAv

erag

e cl

uste

r siz

eA

0.01

000.000

0.005

0.010

0.015

pBC nBC PC

Closeness

0.01 0

0.21

0.0

0.1

0.2

0.3

pBC nBC PC

Eigenvector

0.01

000.000

0.005

0.010

0.015

pBC nBC PC

Pagerank

C

1

5

3

0

2

4

6

pBC nBC PC

Aver

age

degr

ee

0.230.22

0.25

0.0

0.1

0.2

0.3

pBC nBC PCClu

ster

ing

coef

ficie

ntB

0 0

0.01

0

0.002

0.004

0.006

0.008

0.01

pBC nBC PC

Density

0 0

0.05

0.00

0.02

0.04

0.06

pBC nBC PC

Centralization

0 0

0.05

0.00

0.02

0.04

0.06

pBC nBC PC

Authority

175261571381

1101

103

105

pBC nBC PC

Betweenness

D

0.3

0.41

0.55

0.0

0.2

0.4

0.6

pBC nBC PCr (d

egre

e−fre

quen

cy) FE

Degree

Authority

Closeness

Betweenness

Pagerank

Plasma cell

Naïve B cell

Pre-B cell

-0.1

-0.05

0

0.05

0.1

0.15

0.2

Public clonesHG

I

Degree

Authority

Closeness

Betweenness

Pagerank

Plasma cell

Naïve B cell

Pre-B cell

0.1

0.2

0.3

0.4

0.5

0.6

CDR3 frequency Pearson r Pearson r

Degree

Authority

Closeness

Betweenness

Pagerank

Plasma cell

Naïve B cell

Pre-B cell

0.05

0.1

0.15

0.2

0.25

0.3

0.35

V-gene frequency Pearson r

Pea

rson

2412

74

0

20

40

60

80

100

pBC nBC PC

Priva

te s

imila

r to

publ

ic (%

)P

rivat

e si

mila

r to

publ

ic (%

)

0 5 10 15 20

0.0

0.5

1.0

1.5

Unimmunized

Coreness

Density

0 5 10 15 20

0.0

0.5

1.0

1.5

NP−HEL

Coreness

Density

0 5 10 15 20

0.0

0.5

1.0

1.5

OVA

Coreness

Density

0 5 10 15 20

0.0

0.5

1.0

1.5

HbsAg

Coreness

Density

pBC

0 5 10 15 20

0.0

0.2

0.4

0.6

Unimmunized

Coreness

Density

0 5 10 15 20

0.0

0.2

0.4

0.6

NP−HEL

Coreness

Density

0 5 10 15 20

0.0

0.1

0.2

0.3

0.4

0.5

0.6

OVA

Coreness

Density

0 5 10 15 20

0.0

0.2

0.4

0.6

HbsAg

Coreness

Density

nBC

0 1 2 3

0.0

0.5

1.0

1.5

2.0

Unimmunized

Coreness

Density

0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

1.2

NP−HEL

Coreness

Density

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

01

23

4

OVA

Coreness

Density

−1 0 1 2 3 4

01

23

HbsAg

Coreness

Density

PC

K

J

Figure S2

certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint

Page 24: The fundamental principles of antibody repertoire ...63 biological insight of repertoire architecture. 64 To reveal the fundamental principles of antibody repertoire architecture,

25

A

B

Core CDR3

Core V-genepBC PC 3.53

0.010.010

1

2

3

4

pBC nBC PC

Larg

est

cliq

ue s

ize

(%)

C

PCnBCpBC

D

0.03 0.04 0.04

0.03 0.01

0.57

0

0.30.4

0 0.03

0.52

0.000.250.500.751.00

0.000.250.500.751.00

0.000.250.500.751.00

0.000.250.500.751.00

UntreatedNP−HEL

OVAHBsAg

Pre−B cell Naïve B cell Plasma cell

Powe

rlaw

p−va

lue

Exponential vs. Powerlaw Exponential vs. Poisson

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

10.0

0.5

1.0

0.0

0.5

1.0

UntreatedIm

munized

Pre−B cell Naïve B cell Pre−B cell Naïve B cell

p−va

lue

Two−sided One−sided

Degree

CD

F

Exponential

γ=1

●●

●●

●●●

●●

● ●

●●

●●●

●●

●●

●●

● ●

● ●

● ●

●●

●●

γ=1.5

●●

●●

●●●

●●

●●

●●

●●●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

γ=2.2

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

γ=3.7

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

RandomPower law (scale-free) Exponential Random

E F

G

H I

4046

56

0

20

40

60

pBC nBC PC

Pred

ictio

n ac

cura

cy (%

)

3138

61

0

20

40

60

80

pBC nBC PC

Pred

ictio

n ac

cura

cy (%

)

Figure S3certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

The copyright holder for this preprint (which was notthis version posted April 5, 2017. ; https://doi.org/10.1101/124578doi: bioRxiv preprint