
Integrating Biologic and Clinical Data towards Resolving Heterogeneity in Childhood Inflammatory Diseases

by

Andrey Mikhaylov

A thesis submitted in conformity with the requirements for the degree of Master of Science

Department of Immunology

University of Toronto

© Copyright by Andrey Mikhaylov 2016



Abstract

Kawasaki disease (KD) is the leading cause of acquired heart disease in children from the

developed world, with up to 25% risk of developing aneurysms if untreated. Diagnosis uses a set

of classical clinical symptoms, which fail to capture the heterogeneity in KD. One solution is to

incorporate new biomarkers and expanded biologic datasets to generate new predictive models

that can better discern homogeneous groups of patients. Using Similarity Network Fusion (SNF),

a novel computational technique, we uncovered 3 robust clusters of patients after fusing gene

expression and clinical datasets for 171 KD patients. The first cluster comprises older females

with marked activation of the innate immune response, the second comprises patients with

prolonged fever and markers of activation of the adaptive response, and the third comprises males

with no lymphadenopathy and a less severe innate immune response. SNF identified clinically meaningful

clusters of patients and is a promising new tool for future KD studies.


Acknowledgments

I would like to express my deepest gratitude to my supervisor Dr. Rae Yeung for all the guidance

and support during my time in the lab. I would also like to acknowledge and thank my committee

members Dr. Pamela Ohashi, Dr. Anna Goldenberg, and Dr. Shannon Dunn, for providing

invaluable feedback for my thesis project. A huge thanks to Dr. Trang Duong for having patience

with me and helping me at every step of this journey. Lastly, I am very grateful for meeting

every single member of the Yeung lab – thank you for all the help and the fun times!


Table of Contents

Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1. Introduction
1.1 Kawasaki Disease - overview and epidemiology
1.1.1 Overview
1.1.2 Incidence rates
1.1.3 Seasonal outbreaks
1.2 Kawasaki Disease - Diagnosis and Treatment
1.2.1 Clinical symptoms
1.2.2 Laboratory tests
1.2.3 Extra-cardiac findings
1.2.4 Cardiac findings
1.2.5 KD Treatment
1.2.6 AHA Diagnostic criteria sensitivity and specificity
1.2.7 Risk scoring systems
1.3 Etiology
1.3.1 Immune response
1.3.2 Environmental triggers
1.4 Translational studies
1.4.1 Linkage analysis
1.4.2 Genome-wide association studies (GWAS)
1.4.3 Gene expression
1.5 Post-translational studies in children
1.5.1 Candidate gene approach
1.5.2 ITPKC and CASP3
1.5.3 FCGR2A
1.5.4 MHCII, CD40, and BLK
1.5.5 Summary of findings
1.6 Computational Analysis
1.6.1 Introduction
1.6.2 Data aggregation
1.6.3 Approach to computational analysis
1.6.4 Similarity network fusion
1.6.5 Gene enrichment analysis
1.6.6 Feature selection and classifiers
1.6.7 Heterogeneity in KD
1.6.8 Rationale
1.6.9 Hypothesis and objectives
2 Methods
2.1 KD Cohort
2.2 Gene expression microarray
2.3 Datasets
2.4 Computational analysis workflow
2.5 Data pre-processing
2.6 Similarity network fusion
2.7 Gene enrichment analysis
2.8 Co-clustering probability
2.9 Statistical analysis
2.10 FeaLect feature selection
3 Results
3.1 KD cohort and data pre-processing
3.2 Three unique clusters were identified after aggregation of clinical and gene expression datasets with SNF
3.3 High robustness and low clinical feature sensitivity amongst the 3 clusters
3.4 Unique clinical profiles characterize the 3 clusters
3.5 Unique gene expression profiles characterize the 3 clusters
3.6 Variation in treatment response and coronary outcome across the 3 clusters
3.7 Unique clinical and biological classifiers for predicting cluster assignment
4 Discussion
5 Study Limitations
6 Conclusions
7 References


List of Tables

Table 1. Laboratory measures and clinical characteristics for the KD cohort.
Table 2. Biologic and clinical datasets used for SNF analysis.
Table 3. List of group 1 significant GO terms from the DAVID gene enrichment analysis.
Table 4. List of group 2 significant GO terms from the DAVID gene enrichment analysis.
Table 5. Description of FeaLect classifiers for predicting cluster assignment.


List of Figures

Figure 1. Similarity Network Fusion Algorithm.
Figure 2. Illustration of Gene Ontology hierarchy.
Figure 3. Steps involved in the computational analysis of the KD cohort.
Figure 4. Three distinct clusters of patients recovered using SNF.
Figure 5. SNF displays high robustness in identifying the three clusters in response to removal of patients.
Figure 6. The 3 clusters are most sensitive to removal of ‘Proportion Male’ and ‘Lymphadenopathy’ clinical variables.
Figure 7. Unique clinical and demographic profiles characterize the three clusters.
Figure 8. Ethnic group profiles are similar across the 3 clusters.
Figure 9. Unique laboratory test profiles characterize the three clusters.
Figure 10. The 3 clusters are characterized by unique gene expression profiles.
Figure 11. Group 1 genes are representative of inflammation and immune response related GO terms.
Figure 12. Group 2 genes are representative of metabolism related GO terms.
Figure 13. The 3 clusters vary with respect to treatment response and disease outcome.
Figure 14. FeaLect total feature scores.
Figure 15. Informative clinical and biologic features identified with FeaLect.
Figure 16. Relative gene expression profiles of biologic variables for each set of features extracted with FeaLect.


List of Abbreviations

AHA American Heart Association

ALT Alanine Transaminase

AST Aspartate Aminotransferase

BLK B Lymphocyte Kinase

CAA Coronary Artery Aneurysm

CAL Coronary Artery Lesions

CASP3 Caspase 3

CD40L CD40 Ligand

cDNA Complementary DNA

CRP C-Reactive Protein

DAVID Database for Annotation, Visualization and Integrated Discovery

DC Dendritic Cell

DC-SIGN Dendritic Cell-Specific Intercellular adhesion molecule-3-Grabbing Non-integrin

EBV Epstein-Barr Virus

ESR Erythrocyte Sedimentation Rate

FAM167A Family with Sequence Similarity 167, Member A

FCGR1C Fc Fragment Of IgG, High Affinity Ic, Receptor

FCGR2A Low affinity immunoglobulin gamma Fc region receptor II-a

FcγR Fc-Gamma Receptor

G-CSF Granulocyte-Colony Stimulating Factor

GGT Gamma-Glutamyl Transpeptidase

GO Gene Ontology

GPD1L Glycerol-3-Phosphate Dehydrogenase 1-Like

GSEA Gene Set Enrichment Analysis

GWAS Genome-Wide Association Study

HLA Human Leukocyte Antigen

HLA-DMA Major Histocompatibility Complex, Class II, DM Alpha

HLA-DMB Major Histocompatibility Complex, Class II, DM Beta

HLA-DOB Major Histocompatibility Complex, Class II, DO Beta

HLA-DQB2 Major Histocompatibility Complex, Class II, DQ Beta 2

IgG Immunoglobulin G


IL-17 Interleukin-17

IL-18R1 Interleukin-18 receptor 1

IL-1R2 Interleukin-1 receptor 2

IL-1RAP Interleukin 1 Receptor Accessory Protein

IL-1β Interleukin 1 Beta

IL-6 Interleukin 6

IP3 Inositol trisphosphate

ITPKC Inositol-1,4,5 Trisphosphate 3-Kinase C

IVIG Intravenous Immunoglobulin

KD Kawasaki Disease

LAD Left Anterior Descending artery

LASSO Least Absolute Shrinkage and Selection Operator

LMCA Left Main Coronary Artery

LSA Least Square Adaptive

MHCII Major Histocompatibility Complex II

MMP Matrix Metalloproteinase

MMP-9 Matrix Metalloproteinase-9

MRPL2 Mitochondrial Ribosomal Protein L2

PCA Principal Component Analysis

POLR2G Polymerase (RNA) II (DNA directed) polypeptide G

RCA Right Coronary Artery

S100A12 S100 Calcium Binding Protein A12



SCN5A Sodium Channel, Voltage Gated, Type V Alpha Subunit

SNF Similarity Network Fusion

SNP Single Nucleotide Polymorphism

Th17 T Helper 17 Cell

TNF-α Tumour Necrosis Factor Alpha

VARS Valyl-tRNA Synthetase

VEGF Vascular Endothelial Growth Factor

WBC White Blood Cell


1. Introduction

1.1 Kawasaki Disease - overview and epidemiology

1.1.1 Overview

Kawasaki Disease (KD) is an acute systemic vasculitis that predominantly occurs in children

under the age of 5 (1) with up to 25% of children developing aneurysms if left untreated (2). KD

is most common in Asian populations, with the highest incidence rate found in Japan, and its

occurrence shows evidence of seasonality (3). KD exhibits a great deal of

heterogeneity, and the current set of clinical signs and symptoms used for diagnosis cannot

distinguish homogeneous groups of patients with respect to treatment response or disease outcome

(4-6). The overlap of KD symptoms with those of other infectious diseases necessitates better

diagnostic and prognostic tools (7).

1.1.2 Incidence rates

KD incidence has a gender bias, with a ratio of 1.5 to 1.7:1 (male to female) and most of the

affected children (76%) are under the age of 5 (8, 9). The incidence of KD differs greatly around

the world. The highest incidence is in Japan, with 239.6/100,000 children <5 years

old (2009 nationwide epidemiologic survey of KD) (3). The second highest incidence of KD in

the world is in Korea, with 134.4/100,000 children <5 years of age (based on a nationwide survey in

2011) (10), followed by Taiwan and Shanghai, China, with incidence rates of 66.24 (in 2006) and

30.3-71.9 (ranging from 2008 to 2012) per 100,000, respectively (11, 12). Incidence rates in

North America, though lower than in Asian countries listed above, differ depending on ancestry.

Children of Asian ethnicity had the highest incidence, 30.3/100,000 children <5 years old, from

1997 to 2007 (13). Children of African American and Hispanic ethnicities had the next highest

incidence, 17.5/100,000 and 15.7/100,000 respectively, while children of Caucasian origin had the

lowest rate amongst all the racial groups surveyed, with an incidence of 12/100,000 (13). As further

evidence supporting the genetic model of KD, the incidence in children of Japanese descent

living outside of Japan was 197.7/100,000 in a retrospective analysis of KD patients in

Hawaii from 1996 to 2001 (compared to 35.3/100,000 in Caucasian children) (14).

Furthermore, according to several Japanese studies, a 10-fold relative risk (2.1%) was

documented in siblings of KD patients, increasing to 13% if the siblings were twins (15, 16).

1.1.3 Seasonal outbreaks

KD exhibits seasonal patterns of occurrence, which supports an infectious trigger of the

disease. Japan, South Korea, and China have 2 peaks of disease

incidence, in winter and summer (3, 17). Shanghai, China, has peaks in spring and summer, while

Taiwan, with the 3rd highest incidence of KD, tends to peak in the summer months (18, 19). Winter

and spring seasonal peaks have also been observed in the US (3). Furthermore, epidemics of KD

were recorded in Japan in 1979, 1982, and 1986 (3). These reports,

together with incidence rates, collectively demonstrate the endemic and epidemic nature of

Kawasaki disease.

1.2 Kawasaki Disease - Diagnosis and Treatment

1.2.1 Clinical symptoms

KD is diagnosed in North America using a set of non-specific symptoms, which include

prolonged fever and at least 4 of the 5 following principal clinical symptoms: bilateral

conjunctival injection, oral mucosal inflammation, polymorphous skin rash, extremity changes,

and cervical lymphadenopathy (7). In contrast, the Japanese guidelines include fever in the

list of principal clinical symptoms, so diagnosis is based

on 5 of 6 criteria (20). The typical course of the disease can be divided into 3 stages: acute,

subacute, and convalescent (21). The acute stage is characterized by fever and presence of the

classical KD symptoms (potentially lasting 1 to 2 weeks if not treated), followed by the

subacute stage, in which fever and clinical symptoms subside but biochemical evidence of

inflammation persists (e.g. ESR and CRP) (21). The convalescent stage, when the signs of illness

have disappeared and inflammatory markers have also subsided, typically starts 4 to 6 weeks

after disease onset (21).
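The two diagnostic rules described above can be written as a short check. The following Python sketch is illustrative only: the symptom labels and function names are assumptions introduced here, not part of any guideline, and the check ignores clinical judgment and the definition of prolonged fever.

```python
from typing import Iterable

# Hypothetical labels for the five principal clinical symptoms listed above.
PRINCIPAL_SYMPTOMS = frozenset({
    "conjunctival_injection",
    "oral_mucosal_inflammation",
    "polymorphous_rash",
    "extremity_changes",
    "cervical_lymphadenopathy",
})


def count_principal_symptoms(observed: Iterable[str]) -> int:
    """Count how many of the five principal symptoms were observed."""
    return len(PRINCIPAL_SYMPTOMS & set(observed))


def meets_north_american_rule(prolonged_fever: bool, observed: Iterable[str]) -> bool:
    """Prolonged fever is required, plus at least 4 of the 5 principal symptoms."""
    return prolonged_fever and count_principal_symptoms(observed) >= 4


def meets_japanese_rule(fever: bool, observed: Iterable[str]) -> bool:
    """Fever counts as one of 6 criteria; at least 5 of the 6 must be present."""
    return count_principal_symptoms(observed) + int(fever) >= 5
```

Under these assumptions, a febrile child with three principal symptoms satisfies neither rule, while a child with all five principal symptoms but no recorded fever satisfies only the 5-of-6 rule.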

Fever is present in all KD patients at the onset of the disease and as such is a key diagnostic

criterion in North America. If untreated, it can last on average for 11 days, but due to its high

variability in duration, may extend up to 3 or 4 weeks longer in some patients (7). Conjunctivitis

is bilateral with a frequency of 80 to 90% in KD patients (22). It may involve various parts of the

conjunctiva, but most of the time it spares the limbus (7). It is typically painless and resolves

fairly quickly (7). The frequency of oral mucosal inflammation in KD patients is also around

80 to 90% (22). The symptoms manifest themselves as swelling and cracking of the lips, a

strawberry tongue, and diffuse erythema of the oropharynx (22). Polymorphous skin rash occurs

in more than 90% of patients and manifests itself within 5 days of the onset of fever (7, 22).

Beginning at the trunk, it can take many forms such as maculopapular eruption, scarlatiniform

rash, or an erythroderma, and can be found in limited distributions or generalized all over the

body (7). The rash may also progress to desquamation in the perineal region and in extremities

(7). Changes in the extremities occur in 80% of patients (22). First signs, including erythema and

induration of hands and feet, occur for a short duration (1 to 3 days) within the acute phase of the

disease, while desquamation typically occurs 2 or 3 weeks after onset (7, 22). Cervical


lymphadenopathy is mostly unilateral, with varying number and size of affected nodes (23). It is

the rarest of the clinical symptoms with an occurrence rate of around 50% (under-detection

during palpation may contribute to the low reported frequency) (22).

1.2.2 Laboratory tests

Since not all patients present with the full set of diagnostic symptoms, and there is overlap with

other febrile diseases due to the non-specificity of the diagnostic criteria, a number of laboratory

measures are used to aid KD diagnosis (7). About half of KD patients have increased

white blood cell (WBC) counts (> 15 000/mm3) due to the inflammation in these patients

(7). Consistent with fever being one of the key diagnostic criteria of KD, there is a marked

increase in the inflammation markers C-reactive protein (CRP) and Erythrocyte

Sedimentation Rate (ESR), lasting for 6 to 10 weeks (7). Both markers are used because they may

sometimes disagree, owing to differences in kinetics between the two measures and the potential

confounding effect of IVIG treatment on the sedimentation rate of erythrocytes, which affects the

ESR measure (7, 24). The increase in platelet count (thrombocytosis) is typically delayed, usually

starting 2 weeks after disease onset and often reaching 1 000 000/mm3 over time in the

subacute phase (7). Forty percent of patients exhibit increased levels of serum transaminases,

such as Gamma-Glutamyl Transpeptidase (GGT) and Alanine Transaminase (ALT) (25). Lower

levels of albumin, reflecting inflammation, are also common (7). Last but not least, anemia may

develop in 50% of the patients (often correlated with prolonged fever duration), with lower

hemoglobin counts (7, 22).


1.2.3 Extra-cardiac findings

In addition to the classical KD diagnostic criteria and the array of laboratory measures, KD

patients often present with a number of other clinical findings. Over 7% of children develop

arthritis, involving multiple joints, at diagnosis (7, 26). Vomiting, diarrhea, abdominal pain, and

other gastrointestinal complaints are common (in 1/3 of patients) (7). Hepatic abnormalities, such

as liver enlargement, jaundice, and hydrops of gallbladder (15% of patients), can present in

patients as well (7, 27). Other common clinical manifestations may include aseptic meningitis,

colitis, urethritis and anterior uveitis (26).

1.2.4 Cardiac findings

Coronary aneurysms, which may develop during the acute phase or 2 weeks after it, are the hallmark

of the disease and pose a great risk to a patient as they can clot or rupture, possibly leading to

myocardial infarction or death (22). As a result, having coronary abnormalities in the absence of

the rest of the classical KD symptoms is sufficient to diagnose a patient with KD (7). To account

for the variations in coronary artery measurements due to body size, coronary dimensions are

reported using z-scores (measures made only for Left Main Coronary Artery (LMCA), Left

Anterior Descending (LAD) artery, and Right Coronary Artery (RCA), where aneurysms are

most often associated with fatalities) (7). The measurements are adjusted for body surface area,

with a z-score greater than 2.5 statistically indicating an abnormality compared to the general

population (7, 28, 29). Aside from the main risk of coronary aneurysms, patients may also

develop a myriad of other cardiac related conditions. Examples include myocarditis and

arrhythmias, both of which may also present themselves in the acute stages of the disease (7).
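As a worked illustration of the z-score described above, a minimal sketch follows. It assumes that the mean and standard deviation of the coronary dimension expected for the patient's body surface area are already available from a normative reference. Those inputs, the function names, and the example numbers are hypothetical and are not taken from this thesis.

```python
def coronary_z_score(measured_mm: float, expected_mean_mm: float, expected_sd_mm: float) -> float:
    """Number of standard deviations by which a measurement exceeds the expected mean.

    expected_mean_mm and expected_sd_mm are assumed to come from a normative
    reference adjusted for body surface area (not provided here).
    """
    if expected_sd_mm <= 0:
        raise ValueError("expected_sd_mm must be positive")
    return (measured_mm - expected_mean_mm) / expected_sd_mm


def is_abnormal(z_score: float, threshold: float = 2.5) -> bool:
    """Flag a dimension as abnormal when the z-score is greater than 2.5."""
    return z_score > threshold


# Hypothetical example values, for illustration only.
z = coronary_z_score(measured_mm=4.0, expected_mean_mm=2.5, expected_sd_mm=0.5)
print(z, is_abnormal(z))
```

Expressing the dimension in standard deviation units is what allows a single cut-off to be applied to children of different body sizes.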


1.2.5 KD Treatment

Intravenous Immunoglobulin (IVIG) is administered at 2 g/kg, together with high-dose aspirin,

and is the main form of treatment in KD (7). Previous reports have attributed the effects of IVIG

in KD to reduction of key pro-inflammatory cytokines (IL-1β, IL-6, TNF-α), G-CSF, CRP, and

CD40L amongst other targets (30, 31). The effects of IVIG on Treg expansion and Th17 downregulation,

and its ability to neutralize superantigens, have also been implicated in some studies (30, 31). The

exact mechanism of action of IVIG in KD is still unknown. Due to its effectiveness in many

diseases other than KD, numerous modes of action have been proposed, including

complement binding, binding to DC-SIGN on DCs, FcγR interaction, and

neutralization/inhibition of soluble proteins or pathogens, to name a few (31). Unfortunately, up

to 20% of patients may not respond to IVIG treatment (32). Just as little is known about the

mechanism of action of IVIG in KD, the reasons for unresponsiveness are also poorly understood. These

patients undergo retreatment with IVIG, along with corticosteroids, in order to reduce the

prolonged fever duration and other symptoms (7). High-dose aspirin (80-100 mg/kg/day),

the efficacy of which is under debate (33), is part of the standard treatment for KD in North America

(during the acute stage of the disease) due to its anti-inflammatory effects (7). Once the patient’s

fever subsides, a switch is made to a low-dose aspirin regimen (3-5 mg/kg/day), used for its

anti-platelet properties for 6-8 weeks after disease onset (7).

1.2.6 AHA Diagnostic criteria sensitivity and specificity

Due to the lack of specificity in the AHA set of diagnostic symptoms, KD may often be

misdiagnosed as a number of other conditions. Examples cover a wide range of febrile

illnesses including bacterial infections (e.g. scarlet fever), viral infections (e.g. measles and EBV),

and drug reactions (e.g. Stevens-Johnson syndrome) (34). The classical guidelines, though

effective, do not exhibit high specificity and sensitivity for diagnosing the disease (22).

Patients who do not meet the full AHA diagnostic criteria for KD have previously been linked to

a significantly increased chance of CAA development, both in a single center study (20% of 127

patients vs 7% with full diagnosis) (6) and a nationwide survey in Japan (7.4% vs 2.5%) (35).

Such differences in disease outcome have been attributed to delayed IVIG treatment, as the

full AHA diagnostic criteria lack the sensitivity to identify KD patients in time (6).

Furthermore, the AHA criteria are not particularly good at discerning incomplete Kawasaki

disease, defined as febrile patients with fewer than 4 of the principal clinical features (7).

Incomplete KD is especially common in infants < 1 year old, with a previous study

reporting a 45% incidence rate amongst these KD patients compared to 12% for older KD patients (4).

Additional laboratory tests and presentation of other clinical findings may assist in diagnosing

incomplete KD, but as mentioned earlier, timing is of the essence in order to prevent aneurysm

formation.

1.2.7 Risk scoring systems

Over the years, a number of clinical scores have been devised in order to better identify KD

patients and predict disease outcome and treatment response. Early attempts were made by Asai

in 1983 (36), Nakano et al in 1986 (37), and Iwasa et al in 1987 (38). The first scoring system

did not utilize 2-D echocardiography, which is routinely used in diagnosis today, while the latter

two lacked statistical power (7). The Harada score was developed in Japan in 1991 in order to

determine whether a KD patient required IVIG treatment. Unlike under the current North

American treatment standard, not all KD patients received IVIG in Japan at that time, so a risk

stratification strategy and rational allocation of IVIG were needed (39). According to the score, in

order to qualify for IVIG treatment, patients had to fulfill 4 criteria within 9 days of disease onset

(39). The total set of criteria included elevated white blood cell count (> 12 000/mm3), increased

CRP levels (>3), platelet counts less than 350 000/mm3, albumin levels less than 3.5 g/dL, male

sex, and age less than 12 months (39).
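The allocation rule described above amounts to counting criteria. The following is a minimal sketch based only on the six criteria and thresholds listed in this section. The class and function names are assumptions introduced for illustration, "4 criteria" is read here as at least 4, and the CRP unit is left unspecified because the text does not state it.

```python
from dataclasses import dataclass


@dataclass
class HaradaInputs:
    wbc_per_mm3: float
    crp: float                # unit not stated in the text above
    platelets_per_mm3: float
    albumin_g_per_dl: float
    is_male: bool
    age_months: float
    days_since_onset: float


def harada_criteria_met(p: HaradaInputs) -> int:
    """Count how many of the six listed criteria a patient fulfills."""
    checks = [
        p.wbc_per_mm3 > 12_000,         # elevated white blood cell count
        p.crp > 3,                      # increased CRP
        p.platelets_per_mm3 < 350_000,  # platelet count below threshold
        p.albumin_g_per_dl < 3.5,       # low albumin
        p.is_male,                      # male sex
        p.age_months < 12,              # age under 12 months
    ]
    return sum(checks)


def qualifies_for_ivig(p: HaradaInputs) -> bool:
    """At least 4 criteria fulfilled within 9 days of disease onset (assumed reading)."""
    return p.days_since_onset <= 9 and harada_criteria_met(p) >= 4
```

A count-based rule of this kind is easy to apply at the bedside, but it weights every criterion equally.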

Since IVIG treatment is recommended for all KD patients in North America, Beiser et al

developed a score in the hopes of predicting coronary outcome instead (40). The laboratory

measures used in the score consisted of baseline neutrophil and band counts, hemoglobin and

platelet levels, as well as body temperature on the day following IVIG administration (40).

However, though it was able to identify low-risk patients, the system did not perform well in

identifying individuals with a high risk of aneurysm formation (40).

One of the latest and currently best performing risk scoring systems was developed by

Kobayashi et al in 2006 (41). While developing the score, Kobayashi et al identified high

serum AST and low sodium concentrations as strongly predictive of IVIG unresponsiveness

(41). Points in the Kobayashi score are accumulated based on decreased sodium concentration,

illness of 4 days or less at diagnosis, increased AST concentration, increased neutrophil

percentage amongst white blood cells, baseline platelet count, increased CRP levels, and age less

than 12 months (41). This system showed high specificity and sensitivity in predicting IVIG

unresponsive patients in a Japanese cohort, but did not perform well in a North American

population (42). Thus, none of the existing risk scoring systems in KD can generalize to all

ethnic groups around the world, and there is currently no dependable method to predict

IVIG response and coronary outcome in North American patients.


1.3 Etiology

1.3.1 Immune response

KD’s etiology is still unknown, with evidence for both genetic and environmental factors. On the

one hand, linkage and GWAS studies point to a genetic component; on the other, seasonal

occurrences and community outbreaks suggest an infectious cause of the disease (7). Taken

together, a popular model of the disease suggests that KD occurs in genetically predisposed

children via some common environmental trigger (7).

Early stages of Kawasaki disease begin as systemic inflammation, orchestrated by dissociation of

smooth muscle cells within the media of blood vessels throughout the body and an influx of

neutrophils within 7 to 9 days after onset of disease (7). TNF-α, the key inflammatory cytokine

in KD pathology, is increased to promote inflammation and recruitment of immune cells (43).

The innate immune response quickly transitions to proliferation of large mononuclear cells

(CD8+ T and IgA plasma cells), leading to destruction of internal elastic lamina (44, 45).

Destruction of elastin is mainly driven by Matrix Metalloproteinases (MMPs), most notably

MMP-9, which was previously shown to be necessary for aneurysm formation in a KD mouse

model (46). MMP-9, an elastolytic enzyme, is produced by coronary smooth muscle cells in

response to TNF-α (46).

1.3.2 Environmental triggers

Although a specific infectious agent has not been identified, there are many studies in literature

linking the cause of KD to viruses and bacteria, including EBV (47), rotavirus (48), parvovirus

B19 (49), adenovirus (50), and Chlamydia pneumoniae (51), amongst others. Much of the support

for the infectious model of KD also comes from animal models, the best known being the

Lactobacillus casei cell wall extract model developed by Lehman et al. (52) in 1985. The

induced disease model shares a lot with the human counterpart in terms of disease kinetics,

histologic changes, Vβ skewing (Vβ2, 4 and 6 in mice; Vβ2 and 8 in humans), and response to

IVIG treatment (7, 53).

1.4 Translational studies

1.4.1 Linkage analysis

Before the popularization of large scale techniques like GWAS, early steps in finding disease

susceptibility at the genetic level relied on locating the chromosomal regions via linkage analysis

in families of patients (54). The idea is based on the principle that traits encoded by genes are

often inherited together due to genes being close to each other on a chromosome (55). By

analyzing patterns of inheritance in families of affected individuals, microsatellite genetic

markers across the genome can be sequenced and used to identify chromosomal loci linked to

disease (55). Further linkage mapping studies with SNPs restricted to these regions of interest

can help identify specific genes that may be involved (54).

1.4.2 Genome-wide association studies (GWAS)

Advancements in technology now allow us to detect hundreds of thousands to millions of SNPs

at a genome-wide scale, giving us the power to genotype individuals and detect genetic

variation between them. A genome-wide association study (GWAS) applies this

concept to a group of affected individuals with the purpose of finding common genetic variants

that may be risk factors for a given disease (56). An underlying principle for GWAS is the

common disease/common variant hypothesis which states that complex genetic diseases can be

caused by small additive effects of many common variants that exist in the general population,

instead of a single risk allele that can be found in Mendelian diseases (57). Illumina (San

Diego, CA) and Affymetrix (Santa Clara, CA), two popular platforms for microarray

technologies, both provide products that are widely used for GWAS (56). Illumina uses

BeadArray technology (described in the next section) to identify SNPs, while Affymetrix

products print oligonucleotide sequences on a chip (instead of beads) that can specifically

recognize the SNPs via hybridization (56).

1.4.3 Gene expression

Microarray technologies have come a long way, now allowing scientists to quantify gene

expression on a genome-wide scale. Gene expression studies can now be performed on a small

number of individuals by profiling mRNA abundance (58). An example of such technology is

BeadArray, developed by Illumina (San Diego, CA) and used in their gene expression array

products (e.g. Illumina HT-12v4 used in our study) (58). It allows for such a large scale of gene

expression profiling by randomly assembling arrays of beads across the wells of a microplate,

with a bead identifier sequence and a gene-specific probe attached to each bead (58). Since each

bead type has hundreds of thousands of the same oligonucleotide probe sequences tethered to it,

hybridization with a target cDNA sequence can then be measured and quantified via

fluorescence intensity (58). Similar gene expression microarrays are manufactured by

Affymetrix (Santa Clara, CA), including their Human Genome U133 (U133), Human Exon

(HuEx), and Human Gene (HuGene) product series (59).

1.5 Post-translational studies in children

1.5.1 Candidate gene approach

Early attempts to link KD to genetic susceptibility were candidate-gene studies, where genes are

selected for association analysis based on prior knowledge of their function and how it may

relate to the disease (54). As such, several research groups have looked at associations between

KD susceptibility and/or CAL formation and polymorphisms in HLA (60), TNF-α (61), IL-4

(62), VEGF (63), MMPs (64), CRP (65), and other genes. However, this approach to identifying

associations at the genetic level suffered from conflicting results between studies, lack of

validation, and small patient numbers in these cohorts (54).

1.5.2 ITPKC and CASP3

The first large-scale attempt to find genetic susceptibility in KD was the 2007 genome-wide

linkage analysis of 78 sibling pairs from Japan (66). The study did not pinpoint any exact genes

that may be involved, but it did identify possible linkage at 12q24, along with 9 other

chromosomal regions that may relate to KD susceptibility (66). Using these regions as a

roadmap, further studies have identified polymorphisms in ITPKC and CASP3 genes as

associated with KD susceptibility and CAL formation (67, 68). Both hits were relevant to KD as

ITPKC is a kinase of IP3, acting as a negative regulator of the downstream Ca2+ pathway during

T cell activation, while caspase-3 is an enzyme involved in apoptosis and, as a result, in duration

of T cell immune response (68).

1.5.3 FCGR2A

Results of a GWAS study in a European population (405 KD patients, 6,252 controls) were

published in 2011, identifying 2 loci at genome-wide significance (69). The first one was related

to the FCGR2A gene, which encodes an IgG receptor, and the second was in the 19q13 region,

with a SNP in the ITPKC gene (69). Polymorphisms affecting IgG binding by Fc receptors may

potentially have an effect on IVIG response and as such, may be a contributing factor to

explaining IVIG unresponsiveness in KD patients (69).

1.5.4 MHCII, CD40, and BLK

In 2012, the same group conducted another GWAS study to find more susceptibility genes in

addition to CASP3 and ITPKC (70). This time, a Japanese cohort of 428 subjects (3,379

controls), validated on 754 cases (947 controls), identified additional significant associations in

the FAM167A-BLK, HLA-DQB2 – HLA-DOB, and CD40 regions (with FCGR2A association,

from the previous study, replicated as well) (70). BLK is a known kinase associated with B-cell

receptor signaling and recently implicated in development of IL-17 producing cells (70). HLA

genes as a GWAS hit also appear to be relevant to KD due to the known ethnic differences in

KD susceptibility that were discussed in detail earlier (54). At the same time as the Japanese

group, a GWAS study was performed in Taiwan with 622 KD patients (1,107 controls) (71). The

results independently identified loci in the BLK and CD40 genes at genome-wide significance

(71).

1.5.5 Summary of findings

The genes identified in previous GWAS studies, while relatively dispersed in terms of function,

are all related to immune function. Associations with antigen-presentation (MHCII and CD40), T

cell response (ITPKC and CASP3), and IgG binding (FCGR2A), are all related to the

components of the immune system that we believe to be involved in KD.

1.6 Computational Analysis

1.6.1 Introduction

The risk scoring systems detailed earlier used clinical information as their input in generating

their models. However, they failed to come up with a formula that can identify homogeneous

groups of patients, let alone predict coronary outcome and/or IVIG unresponsiveness effectively.

Considering the complex nature of KD etiology, which is still poorly understood, it is not

surprising that using only clinical variables may not help us solve the problem of classifying

patients. Given that the phenotypes we see are driven by the underlying genetic patterns, the next

step would be to incorporate the large amount of biological data into creating models for

identifying homogeneous groups of patients. While this may not have been possible decades ago,

high-throughput gene expression tools and the corresponding computational power to analyze this

big data have now caught up and can be used to further our understanding of complex diseases

such as KD.

1.6.2 Data aggregation

Though we may now have access to the resources for handling these big datasets, making

meaningful inference from analyzing multiple layers of data is far from trivial. Aggregation of

multiple datasets is a challenging task due to the heterogeneity found between datasets, both in

terms of size and types of variables (continuous, categorical, or binary). Some common

approaches to analyzing multiple datasets have trouble overcoming these obstacles. Appending

the datasets together risks diluting potentially important variables when merging with such big

data as gene expression. Similarly, analyzing each dataset separately (prior to integration) makes

it hard to combine the individual patterns afterwards (72). In all cases, extracting information

that is representative of all the datasets equally, and incorporating it into a single model that

describes the patients, is difficult.

1.6.3 Approach to computational analysis

Our approach to analyzing our KD cohort is to incorporate the vast gene expression data from

patients with their corresponding clinical information and laboratory tests. As discussed above,

integration and subsequent analysis of such different datasets pose many problems for

conventional techniques. To that end, we are using a novel computational technique called

Similarity Network Fusion, described in more detail below, that helps to identify homogeneous

groups of patients while using multiple datasets as input (72). Comparing features between the

retrieved clusters helps to lay out the differences that exist between the groups of patients. While

the clinical patterns are readily interpretable owing to the small number of clinical variables,

making sense of emerging patterns amongst thousands of genes from biological datasets is more

complicated. For that reason, gene enrichment analysis can better summarize the results. Last,

but not least, to further narrow down important features, supervised learning methods are used

for feature selection to identify the variables that may potentially be used as classifiers for

assigning new patients to the newly discovered clusters.

1.6.4 Similarity network fusion

Similarity Network Fusion (SNF), a novel multiple data integration tool that has already been

used successfully in discerning cancer subtypes and predicting patient survival, is able to

effectively address the problems associated with aggregation of heterogeneous data (72). As

conceptually demonstrated in Figure 1, it accomplishes this by independently expressing each

dataset as a network of patients, with the edges connecting each patient representing pairwise

similarity across all the features in each dataset. SNF is based on iterative fusion of such patient

similarity networks into a single shared network that integrates the patterns from all the data

layers (72). SNF tackles the challenges outlined above as it is effective with both

small and large datasets, can integrate datasets with varying numbers of features, and is robust to

noise (72).

Figure 1. Similarity Network Fusion Algorithm.

(Adapted from Wang et al. (72)) (A) Graphical representation of 2 datasets (biologic and clinical) describing a

cohort of patients. (B) Each dataset is independently converted to a patient-by-patient similarity matrix, which can

be visualized as a network, where each patient is represented as a node and pairwise similarities are depicted by an

edge/line. (C) The SNF algorithm iteratively updates each network using information derived from networks of the

other datasets. Every update cycle makes each network more similar to each other, until they all converge to the

final fused network (D). The edges have been color-coded to represent the source of information for each network.

1.6.5 Gene enrichment analysis

Working with large datasets, such as gene expression microarrays, allows for high-throughput

analysis that can greatly benefit a study by returning long lists of gene results. However, this in

itself becomes a problem because summarizing the findings from such big lists is difficult. To

address this challenge, many functional gene enrichment analysis tools have been created, such

as GSEA (73), GOstat (74), and Onto-express (75), among many others. The way they generate

results differs between each one, but the principle remains the same – map the input list of genes

to biological databases, sorting the results by annotations that are statistically the most enriched

with these genes (76). One such tool is the Database for Annotation, Visualization and Integrated

Discovery (DAVID) online computational tool created in 2003 (76). It compares the enrichment

of genes in a particular annotation database against a population background for a given species

(77). Amongst the many biological annotation databases that DAVID uses, one widely known

repository was formed by the Gene Ontology (GO) Consortium for the purpose of creating a

dynamic vocabulary of gene roles and products in any organism (77). The GO term, which is a

well-defined description of the gene relationships it contains, is in turn linked to many other

databases, such as SwissPROT, EMBL, etc, allowing the system to be dynamic and up to date

with rapidly changing biological knowledge (77). To better associate the GO terms together by

function, GO terms are further split into 3 categories: biological process, molecular function, and

cellular component (77). As illustrated in Figure 2, within each category, the GO terms form a

directed acyclic graph, where each term is a node in a network connected to the others through

pre-defined parent and child relationships (77). Each GO term may have more than one parent

node and the genes associated with each term are not exclusive.
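The two properties described above (a term may have several parents, and a gene may be annotated to several terms) can be illustrated with a minimal Python sketch. The term and gene names are taken from the Figure 2 caption; the root term "biological process" and the helper names are assumptions made for illustration and are not part of any GO software.

```python
# Toy fragment of a GO-style directed acyclic graph.
# Each term maps to the list of its direct parents; a term may have several.
# "biological process" is a hypothetical root added for illustration.
PARENTS = {
    "biological process": [],
    "DNA recombination": ["biological process"],
    "DNA repair": ["biological process"],
    "DNA ligation": ["DNA recombination", "DNA repair"],
}

# Gene annotations are not exclusive: the same gene can sit on several terms.
ANNOTATIONS = {
    "DNA recombination": {"CDC9"},
    "DNA ligation": {"CDC9"},
}


def ancestors(term):
    """Return every term reachable by following parent links upwards."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def terms_for_gene(gene):
    """Return all terms to which a gene is directly annotated."""
    return {term for term, genes in ANNOTATIONS.items() if gene in genes}
```

Calling `ancestors("DNA ligation")` walks both parent branches, which is why an enrichment result for a child term often appears alongside results for its parents.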

Figure 2. Illustration of Gene Ontology hierarchy.

(Adapted from Ashburner et al. (77)). This figure is for illustrative purposes only; it has been simplified and may

also not be accurate due to the constantly changing nature of GO terms. The subset of terms shown belongs to the

biological process category and depicts the kind of connections and hierarchical structure that GO terms exhibit

(examples of genes listed are from the species Saccharomyces cerevisiae). Furthermore, each node (GO term) can

have multiple parents (e.g. “DNA ligation” is a sub-node of both “DNA recombination” and “DNA repair”) and

genes represented by these nodes are not exclusive (e.g. CDC9 is part of both “DNA recombination” and “DNA

ligation”).

1.6.6 Feature selection and classifiers

A typical dataset is a collection of entries (such as patients), each with an array of values that

correspond to a set of features in that dataset (e.g. age, gender, presence of symptoms). The

values can be continuous, categorical, or binary. If the learning done on a dataset

by a machine learning algorithm considers only the features, without any labels for each entry

(example of a label is disease outcome for each patient), then it falls under a branch of machine

learning called Unsupervised Learning, popular examples of which include Principal Component

Analysis (PCA) and K-means clustering (78). SNF falls into this category because its purpose is

to aggregate multiple datasets into a single patient network, without taking into consideration

patient outcome during the learning process (e.g. coronary outcome or IVIG responsiveness in

the case of KD) (72). As a result, even if the resulting patient network may identify clusters of

patients that correlate with clinical outcome measures, the features that contribute

to the cluster formation cannot directly be used as classifiers that predict cluster assignment for new

patients. In order to further narrow down and identify features that can later be used as

classifiers, feature selection methods can be used.

One promising method of feature selection is FeaLect, developed by Zare et al. (79). It is based on

the Least Absolute Shrinkage and Selection Operator (LASSO), which is a regularization

technique (prevention of overfitting by penalizing having too many features when learning a

model) for linear regression. Unlike the common ridge regularization method, LASSO tends to

shrink some coefficients to a value of 0, thus also effectively acting as a subset selection

algorithm for predictors (80). Simply put, it includes only a subset of input variables when

returning a fitted model, thus making the results much easier to interpret and assist in narrowing

down the variables that may best serve as classifiers. To increase consistency and reliability of

LASSO in selection of important features, there are modifications of the algorithm, such as

Bolasso (81), which run the algorithm on many subsamples of the data and consequently select

features based on multiple models that were generated. FeaLect is a modification of LASSO as

well, but unlike Bolasso which is often too strict when it comes to feature selection, FeaLect is

less strict and has been demonstrated to perform well at selecting clinically relevant features in

real datasets (79).
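The difference between ridge and LASSO shrinkage can be illustrated with a minimal Python sketch. It assumes the textbook special case of an orthonormal design matrix, in which (for the objectives ½‖y − Xβ‖² + λ‖β‖₁ and ½‖y − Xβ‖² + (λ/2)‖β‖²) the LASSO solution is the soft-thresholded least-squares estimate and the ridge solution is a uniform rescaling of it. The function names and the example coefficients are hypothetical, and this is not the FeaLect or Bolasso procedure itself.

```python
def lasso_shrink(ols_coefficients, lam):
    """Soft-thresholding: coefficients within lam of zero become exactly 0."""
    shrunk = []
    for b in ols_coefficients:
        if abs(b) <= lam:
            shrunk.append(0.0)          # feature dropped from the model
        elif b > 0:
            shrunk.append(b - lam)
        else:
            shrunk.append(b + lam)
    return shrunk


def ridge_shrink(ols_coefficients, lam):
    """Ridge rescales every coefficient but never sets one exactly to 0."""
    return [b / (1.0 + lam) for b in ols_coefficients]


# Hypothetical least-squares coefficients for four features.
ols = [3.0, -0.5, 0.2, -2.0]
lasso = lasso_shrink(ols, lam=1.0)
ridge = ridge_shrink(ols, lam=1.0)
selected = [i for i, b in enumerate(lasso) if b != 0.0]
```

In this toy case LASSO keeps only the first and last features, whereas ridge retains all four with reduced magnitude, which is the subset-selection behaviour described above.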

1.6.7 Heterogeneity in KD

Heterogeneity exists in KD based on the clear differences between patients in coronary outcome

and IVIG resistance. However, it also manifests itself in clinical and biochemical measures. The

current set of AHA diagnostic criteria is far from being clear-cut and exhibits a lot of

heterogeneity, both in presentation and as variation in intensity, which is difficult to measure

objectively. Prime examples of this are polymorphous rash (which as mentioned before can take

many forms), fever (due to its high variability in duration), and cervical lymphadenopathy (the

number and size of affected nodes may vary with some being so dramatically enlarged and

inflamed that they are diagnosed as cervical adenitis) (7, 23). Additionally, each one of these

clinical features varies both in intensity/severity and in the dichotomous measure of presence or

absence of a symptom (26). Aside from the clinical diagnostic criteria, additional layers of

information exist in the form of presence and severity of additional symptoms and supporting

features that KD patients may have. The fact that myocarditis, gastrointestinal complaints, and

hydrops of the gall bladder, just to name a few, manifest themselves only in a subset of patients,

further demonstrates the heterogeneity amongst KD patients due to the combinatorial

presentation of these symptoms (7). The same goes with laboratory tests – patients may or may

not have increased white blood cell counts, serum transaminases, and there may be varying

levels of elevated CRP and ESR at diagnosis (7). As a result, even though KD patients may

appear, by presence of diagnostic criteria, to belong to a homogeneous group having the classic

diagnostic features, this is far from the truth as they differ across many of the other aspects that

may reflect underlying pathobiology. The inability to identify these

heterogeneous groups may contribute to differences in treatment response and the remaining risk

of aneurysm formation.

1.6.8 Rationale

The clinical differences that we see in terms of coronary outcome, IVIG responsiveness, and

presentation at diagnosis, demonstrate that heterogeneity exists within KD, but the clinical

measures alone are not able to capture it. Previous attempts at coming up with a scoring system

using only clinical features to capture heterogeneity in KD, such as the Kobayashi score

described earlier, failed to generalize to KD patients around the world. As has been suggested in

earlier studies (42), the solution to this problem is to develop new predictive models that are

more accurate and utilize new biomarkers and/or expanded biologic datasets that will bridge

ethnicities. After all, phenotype is driven by the underlying gene expression patterns, therefore

incorporating the large amounts of biologic data that we are now able to extract due to advances

in technology, may help us better identify homogeneous groups of patients and further our

understanding of KD overall. In recent years, genetic data have already proven their usefulness

in making new discoveries in KD via the several GWAS studies that have been conducted (66,

69, 70, 82). Since the first sequencing of the genome, there has been a huge effort and progress

in bringing up the computational power and tools necessary for analyzing such big datasets. We

are now capable of processing such large amounts of data by identifying, annotating, and linking

variants to diseases with computational tools that are widely used by scientists today (83).

Clinical information with gene expression data, together with the ability and computational

power to process and analyze such a huge amount of data, can greatly contribute to gaining

better insight into KD and help develop better diagnostic tools.

In summary, though we may now have access to the resources to handle these big datasets,

making meaningful inference from analyzing multiple layers of data is far from trivial. Aggregation

of multiple datasets is a challenging task due to the heterogeneity found between datasets.

Similarity Network Fusion (SNF) is a multiple data integration tool that effectively addresses

these problems and can therefore prove extremely useful in identification of homogeneous groups

of patients in KD.

1.6.9 Hypothesis and objectives

Our hypothesis is that SNF can combine clinical and biologic data to identify homogeneous

groups of KD patients. To test this hypothesis, our study objectives are:

1. To identify and determine robustness of homogeneous clusters of KD patients based

on the SNF-mediated fusion of clinical and biologic datasets

2. To characterize the unique gene expression and clinical profiles that define the

discovered clusters

3. To identify the subset of features that can be used as classifiers

2 Methods

2.1 KD Cohort

Children were included in these studies if they satisfied the American Heart Association (AHA)

diagnostic criteria for Kawasaki Disease (7). Informed consent for participation was obtained

from parents and informed consent or assent was obtained from patients as appropriate. The

patient cohort consisted of 171 children with KD from Rady Children’s Hospital.

Standardized clinical data and echocardiographic measurements were prospectively collected for

all patients according to protocol. Clinical summaries of the KD cohort are detailed in Table 1

(representing 159 patients used for analysis after missing data removal outlined below). Whole

blood RNA was collected in PAXgene tubes during the acute phase, before administration of

IVIG. A contemporaneous blood sample was used for complete blood counts and the rest of the

laboratory testing. Z-worst, defined as the maximal z-score of the internal diameter of the left

anterior descending and right coronary arteries normalized to body surface area during the first 6

weeks after onset of illness, was used to describe coronary artery dimensions. See Table 1 for

summary of clinical characteristics and laboratory tests.

2.2 Gene expression microarray

Gene expression data was measured using the Illumina HumanHT-12 V4.0 expression beadchip

(47 000 probes targeting gene transcripts, where a given gene may have multiple transcripts),

scanned using the Illumina Bead Array Reader confocal scanner, and checked for quality using

the Illumina QC kit, as described in a previously published protocol in greater detail (82). Gene

expression data was normalized by a log10 transformation, followed by Z score conversion (82).

Both the raw and normalized datasets are publicly available in the GEO database (Accession

number GSE63881).

2.3 Datasets

Three datasets were used for SNF: gene expression, continuous clinical data, and binary clinical data.

The Illumina HT-12v4 chip was used to generate gene expression data covering 19,539 genes across

the genome. Continuous clinical features include age, duration of fever at diagnosis, and

laboratory tests (platelet count, Hb z-scores (haemoglobin concentrations normalized for age),

WBC, bands%, neutrophils%, lymphocytes%, monocytes%, eosinophils%, ESR, CRP, ALT,

GGT, and urinalysis WBC). The categorical clinical dataset includes gender and the presence or

absence of each of the classical KD diagnostic features (conjunctivitis, oral changes,

rash, extremity changes, and lymphadenopathy).

2.4 Computational analysis workflow

Steps involved in the analysis of the KD cohort are summarized in Figure 3. Data is first pre-

processed via outlier removal and standardization of variables within the datasets. The datasets

are then fused using SNF and clusters are determined using spectral clustering of the shared

network. Spectral clustering is performed on the final fused network and is not part of the actual

SNF algorithm – it is just one of many methods of finding clusters from similarity matrices. The

algorithm itself is based on performing dimensionality reduction prior to running clustering

methods, such as K-means, which contributes to its high performance as a clustering

technique (84). The resulting clusters are annotated by comparing them across clinical variables

and functional groups of genes. Clinical and top biologic features are also used in FeaLect

analysis to determine classifiers that can predict cluster assignment. Lastly, performance of SNF

in analyzing the KD cohort is measured via robustness and feature sensitivity analysis. Each

method is described in more detail in sections below.

Figure 3. Steps involved in the computational analysis of the KD cohort.

Prior to the integration of clinical and biologic data using SNF, the datasets were pre-processed by removing outliers

and standardizing the features. Clusters of patients in the fused matrix were determined using spectral clustering.

The patterns between the groups of patients were then annotated by comparing the clusters across clinical variables

and functional gene groups (gene enrichment analysis). Clinical and top biologic features were also used for FeaLect

analysis to identify classifiers that can predict cluster assignment. Lastly, performance of SNF with respect to our

KD cohort was measured using robustness and sensitivity analysis.

2.5 Data pre-processing

Data were prepared for analysis first by removing any patients with >20% missing data across all

the variables within a dataset, followed by removal of extreme outliers (> 3 × interquartile range).

The variables were then standardized to have mean of 0 and standard deviation of 1.
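The three pre-processing steps can be sketched in Python as follows. This is a minimal illustration and not the code used in the study: missing values are represented as `None`, the outlier rule is assumed to flag values lying more than 3 × IQR beyond the first or third quartile (the text does not specify the reference point), flagged values are assumed to be set to missing, and the sample standard deviation is assumed for standardization.

```python
import statistics


def drop_sparse_patients(rows, max_missing=0.20):
    """Keep patients (rows) whose fraction of missing values is at most max_missing."""
    return [r for r in rows if sum(v is None for v in r) / len(r) <= max_missing]


def mask_extreme_outliers(column, k=3.0):
    """Set values more than k x IQR outside the quartiles to None (assumed rule)."""
    observed = [v for v in column if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if (v is not None and low <= v <= high) else None for v in column]


def standardize(column):
    """Rescale one variable to mean 0 and standard deviation 1, ignoring missing values."""
    observed = [v for v in column if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation (assumption)
    return [None if v is None else (v - mu) / sd for v in column]


# Hypothetical toy inputs: two patients with five variables, and one variable column.
patients = [[1, None, 3, 4, 5], [None, None, 3, 4, 5]]
kept = drop_sparse_patients(patients)
masked = mask_extreme_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
scaled = standardize([1.0, 2.0, 3.0])
```

With these toy inputs the first patient (exactly 20% missing) is kept, the second (40% missing) is dropped, and the value 100 is masked as an extreme outlier.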

2.6 Similarity network fusion

We used R statistical software v3.1.1 (www.r-project.org) with the “SNFtool” package v2.2

installed (cran.r-project.org/web/packages/SNFtool/) for running the SNF algorithm (72).

Patient-by-patient similarity networks were constructed for each individual dataset using a

Euclidean distance measure for continuous numerical data, or a chi-square distance for

categorical data, for every pairwise patient combination (72). The resulting matrices were used

as input for the network fusion algorithm where each network is simultaneously updated with

information from the other networks. In this manner, over the course of multiple iterations, all

the networks converge to a single shared network that integrates patterns derived from each of

the networks. Alternatively, these matrices can be visualized as similarity networks where nodes

correspond to individual patients and the edges represent patient-patient similarities (See Figure

1). Clustering of patients was done using spectral clustering.
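The cross-network update at the core of this procedure can be sketched in plain Python for two data views. This is a simplified illustration and not the SNFtool implementation: it assumes a Gaussian kernel with a fixed bandwidth on Euclidean distance for both views, plain row normalization, and a neighbourhood that includes the patient itself. All function names, parameter values and the toy data are hypothetical.

```python
import math


def affinity(view, sigma=1.0):
    """Gaussian similarity between every pair of patients within one data view."""
    return [[math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2)) for b in view] for a in view]


def row_normalise(w):
    """Scale each row so that it sums to 1."""
    return [[x / sum(row) for x in row] for row in w]


def local_kernel(w, k):
    """Keep each patient's k most similar patients (self included), then renormalise."""
    out = []
    for row in w:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        total = sum(row[j] for j in keep)
        out.append([row[j] / total if j in keep else 0.0 for j in range(len(row))])
    return out


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def fuse(view_a, view_b, k=2, iterations=10):
    """Iteratively update each network with information from the other, then average."""
    wa, wb = affinity(view_a), affinity(view_b)
    pa, pb = row_normalise(wa), row_normalise(wb)
    sa, sb = local_kernel(wa, k), local_kernel(wb, k)
    for _ in range(iterations):
        pa_next = row_normalise(matmul(matmul(sa, pb), transpose(sa)))
        pb_next = row_normalise(matmul(matmul(sb, pa), transpose(sb)))
        pa, pb = pa_next, pb_next
    n = len(pa)
    return [[(pa[i][j] + pb[i][j]) / 2 for j in range(n)] for i in range(n)]


# Hypothetical toy cohort: patients 0-1 and patients 2-3 resemble each other in both views.
expression = [(0.0, 1.0), (0.2, 0.9), (4.0, 3.0), (4.1, 3.2)]
clinical = [(0.0,), (0.1,), (5.0,), (5.1,)]
fused = fuse(expression, clinical)
```

In the fused matrix, pairs of patients that are similar in both views retain high similarity, while weak cross-group links stay small; a clustering method can then be run on `fused`.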

2.7 Gene enrichment analysis

Before proceeding with further analysis, genes strongly correlated with gender were removed

(Mann-Whitney test p-value<0.05), since there are inherent physical differences between males

and females which may otherwise confound the results. Gene enrichment was performed using

the following analysis: top genes, ranked by Kruskal-Wallis test between the clusters with an

adjusted p-value (Holm-Bonferroni) < 0.001, were hierarchically clustered. The identified clusters

were then used as input in the Database for Annotation, Visualization and Integrated Discovery

(DAVID) online computational tool to find enriched Gene Ontology (GO) terms. GO is a

collaborative initiative for creating consistent gene role and product descriptions using GO

terms, where the connections between the terms and their hierarchy are annotated (77). Genes

belonging to these groups were annotated using the RefSeq database (85).
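The principle behind an enrichment test can be sketched with a one-sided hypergeometric test in Python. This is a generic illustration under stated assumptions, not DAVID's exact scoring (DAVID applies its own modified Fisher's exact test); the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, list_size, annotated, background):
    """P(at least `hits` annotated genes in a random gene list of `list_size`).

    hits       -- genes in the input list that carry the annotation
    list_size  -- number of genes in the input list
    annotated  -- genes in the background that carry the annotation
    background -- total number of genes in the population background
    """
    upper = min(list_size, annotated)
    total = comb(background, list_size)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 5 of 5 listed genes share a term carried by 5 of 10 background genes.
p = enrichment_p_value(hits=5, list_size=5, annotated=5, background=10)
```

A small p-value indicates that the annotation occurs in the gene list more often than expected from the background alone.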

2.8 Co-clustering probability

Co-clustering probability was used as a measure of robustness to removal of patients and

individual features. In lay terms, it is the likelihood of any given pair of patients to remain in the

same cluster after changes to the dataset. It is defined as the fraction of original pair-wise co-

clustering relationships between every pairwise combination of patients that remained intact after

re-running SNF and spectral clustering. Co-clustering probability was measured for subsets of

data with a percentage of patients removed (5-60%). Each percent removal was repeated 10,000 times.

A similar method was applied for analyzing sensitivity of the clusters to each feature, where

co-clustering probability was also measured for subsets of data with each feature removed, one at a

time.
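The definition above can be sketched in Python as follows. The sketch assumes one particular reading of the definition: only pairs of retained patients that shared a cluster in the original analysis are counted, and a pair remains intact if the two patients share a cluster again after the re-run. The function name and the toy labels are hypothetical.

```python
from itertools import combinations


def co_clustering_probability(original, rerun):
    """Fraction of originally co-clustered patient pairs still co-clustered after a re-run.

    original -- dict mapping patient id to cluster label from the full analysis
    rerun    -- dict mapping patient id to cluster label after removal of patients
                or features; it may contain only a subset of the patients
    """
    kept = [p for p in original if p in rerun]
    together = [(a, b) for a, b in combinations(kept, 2) if original[a] == original[b]]
    if not together:
        return float("nan")
    intact = sum(rerun[a] == rerun[b] for a, b in together)
    return intact / len(together)


# Hypothetical example: patient p5 was removed and p3 switched cluster in the re-run.
original = {"p1": 1, "p2": 1, "p3": 1, "p4": 2, "p5": 2}
rerun = {"p1": "a", "p2": "a", "p3": "b", "p4": "b"}
probability = co_clustering_probability(original, rerun)
```

Cluster labels only need to be consistent within each run, since the measure compares pairs rather than label names. In a robustness analysis this value would be averaged over the repeated random removals.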

2.9 Statistical analysis

Kruskal-Wallis test was used for comparing clinical continuous numerical values between the

three clusters, while Fisher’s exact test was used for categorical variables. Ranking of genes was

done with the Kruskal-Wallis test and p-values were adjusted using the Holm-Bonferroni method.
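For reference, the Holm-Bonferroni step-down adjustment can be sketched in a few lines of Python. This is a generic illustration of the standard procedure; the function name and example p-values are hypothetical and the study itself may have used a library routine.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A gene would then be retained if its adjusted p-value falls below the chosen threshold.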

2.10 FeaLect feature selection

To identify possible classifiers, FeaLect, a method developed by Zare et al. (79), was used to

narrow down informative features. FeaLect, based on the Least Absolute Shrinkage and

Selection Operator (LASSO) method for linear regression (80), assigns scores to each feature

based on their relevance in the models being generated (79). The logs of scores are plotted in

increasing order, producing a three-segment graph, with the right-most non-linear portion containing

informative features. As a modification from the original protocol, features in this portion of the

graph were identified using the significance threshold of p<0.05 based on the frequency

distribution of the feature scores. FeaLect was carried out once for each cluster separately, where the

cluster labels were converted to a binary form. The “FeaLect” package v1.10

(https://cran.r-project.org/web/packages/FeaLect/) was used for running the FeaLect algorithm. Missing data

was imputed using Least Square Adaptive (LSA) computation method (83). Any identified genes

were annotated using the RefSeq database (85).

3 Results

3.1 KD cohort and data pre-processing

The pre-processing steps, along with the summary of the rest of the analysis, can be seen in

Figure 3. Prior to running SNF analysis, the KD cohort had to undergo several steps of data pre-

processing. The data from the 171 KD patient cohort used for SNF analysis is comprised of

biologic and clinical variables, the latter being split into continuous and categorical variables

(Table 2). The clinical variables were further split into two categories – clinical laboratory-based

features and the classical AHA KD symptoms (along with gender). Furthermore, the former set

of features is comprised of continuous variables, whereas the latter is categorical, each requiring

different distance measures for calculating pairwise similarity during similarity matrix

construction in SNF (72). Splitting the data into relevant datasets for fusion is crucial to ensure

the SNF matrix produces meaningful results, since converting multi-dimensional data to a single

patient-patient similarity matrix compresses all the variables into a single similarity measure.
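To illustrate why continuous and categorical variables are kept in separate datasets, the sketch below computes a pairwise patient distance matrix for each type. The specific measures are assumptions chosen for illustration (Euclidean distance for continuous values, and the fraction of mismatched entries for categorical values); the text above does not state which measures were used, and the patient values are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Distance between two patients described by continuous features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mismatch_distance(a, b):
    """Fraction of categorical features on which two patients disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pairwise(rows, dist):
    """Build the full patient-by-patient distance matrix for one dataset."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical data: three patients, two continuous and four categorical features.
continuous = [[0.0, 1.0], [3.0, 5.0], [0.0, 1.0]]
categorical = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 1, 1]]

d_continuous = pairwise(continuous, euclidean_distance)
d_categorical = pairwise(categorical, mismatch_distance)
```

Each distance matrix would subsequently be converted into a similarity matrix before fusion, so that every dataset contributes one patient-patient network regardless of how many variables it contains.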

The next step in pre-processing was outlier removal (any value greater than three times the

interquartile range), followed by removal of patients with too much missing data. Twelve

patients had 20% or more of their data missing and were excluded from

any further analysis. The remaining dataset of 159 patients is clinically representative of KD (86)

and is summarized in Table 1. The cohort also includes patients of different self-reported

ethnicities, with 17% Asian, 25% Caucasian, 30% Hispanic, and a relatively large mixed

proportion (24%), amongst others (Table 1). Unlike the Kobayashi score, which was effective only

in the Japanese population (42), running analysis on a mixed dataset may help uncover patterns

and generate statistical models that can be generalized to all ethnicities. As the last step in

preparing the data for SNF analysis, each feature was standardized to have a mean of 0 and a

standard deviation of 1.
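The three pre-processing steps described above can be sketched as below. Several details are assumptions made for illustration: outliers are taken to be values lying more than three interquartile ranges beyond the first or third quartile, flagged outliers are treated as missing values, and the sample standard deviation is used for scaling. The example values are hypothetical.

```python
import statistics

def mask_outliers(column, k=3.0):
    """Set values lying more than k * IQR outside the quartiles to None."""
    observed = [v for v in column if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if v is not None and low <= v <= high else None for v in column]

def drop_sparse_patients(rows, max_missing=0.20):
    """Keep only patients with less than max_missing of their values missing."""
    return [r for r in rows if sum(v is None for v in r) / len(r) < max_missing]

def standardize(column):
    """Rescale observed values to mean 0 and (sample) standard deviation 1."""
    observed = [v for v in column if v is not None]
    mean, sd = statistics.mean(observed), statistics.stdev(observed)
    return [None if v is None else (v - mean) / sd for v in column]

# Hypothetical values for one laboratory feature across ten patients.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
cleaned = mask_outliers(feature)   # the extreme value is replaced by None
scaled = standardize(cleaned)      # remaining values now have mean 0, SD 1
```

Missing entries are carried through as None so that a later imputation step can fill them in.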

Table 1. Laboratory measures and clinical characteristics for the KD cohort.

a Values are based on the cohort of 159 patients used for analysis, after patients with more than 20% missing data

have been removed from the initial 171 patient KD cohort

b Values are provided as number of patients (percent of cohort)

c Values are provided as median (range)

Table 2. Biologic and clinical datasets used for SNF analysis.

a Gene expression data is derived from a normalized Illumina HT-12 V4 chip that represents 19,539 genes

3.2 Three unique clusters were identified after aggregation of clinical and gene expression datasets with SNF

After constructing similarity matrices for each dataset (gene expression dataset, continuous

clinical dataset with laboratory test variables, and categorical clinical dataset with classic AHA

KD criteria and gender), spectral clustering of each matrix showed varying patterns within each

data layer. Similarity matrices for the numerical clinical and gene expression data had 2 and 4

clusters respectively, while the categorical clinical data had 5 distinct clusters of patients (Figure 4A).

Because each dataset alone shows a different pattern, with no clear visual overlap between

the datasets, it is difficult to draw any conclusions about the overall pattern across the

combination of all the datasets. SNF aggregation of the 3 datasets, in which the similarity matrices

for each dataset are updated simultaneously over several iterations until they converge to a

single network, produced a unified similarity network that incorporates the

patterns found in both the clinical and gene expression datasets. As a result, 3 clearly defined

clusters were identified, comprising 78, 33, and 48 patients respectively (Figure 4B and C).

Figure 4. Three distinct clusters of patients recovered using SNF.

(A) A similarity matrix was generated from each individual dataset, showing a different number of clusters for each

(2, 5, and 4 clusters respectively). Subsequent fusion of the 3 networks using SNF yielded (B) a fused matrix with 3

distinct clusters. (C) The outline representation of the fused matrix illustrates the composition of each cluster: 78, 33,

and 48 patients for clusters 1 to 3 respectively. In all instances, patients were grouped using spectral clustering.

3.3 High robustness and low clinical feature sensitivity amongst the 3 clusters

To examine the robustness of the 3 identified clusters (in other words, how stable these

clusters are and whether they are driven by patterns across the whole cohort or by only a

select few patients), we ran SNF on subsets of patients with increasing percentages of individuals

removed (5 to 60%, in increments of 5%). The measure of robustness was co-clustering

probability, which can be described as the likelihood that any pair of patients remains in the same

cluster. Each percent removal was repeated 10,000 times to best represent the different

combinations of patients left over. Our analysis, summarized in Figure 5, showed that SNF is

highly robust in identifying the 3 clusters, with 80% co-clustering probability maintained

even when 40% of patients were removed.
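One way to compute a co-clustering probability of this kind is sketched below. The exact definition is an assumption: here it is the fraction of patient pairs that shared a cluster in the original fused matrix and still share a cluster after SNF is re-run on a subset. The function name and the example assignments are hypothetical.

```python
from itertools import combinations

def co_clustering_probability(original, resampled):
    """Fraction of originally co-clustered pairs that remain together.

    Both arguments map patient ID -> cluster label. Only patients present in
    `resampled` are compared, so removed patients are ignored. Cluster labels
    need not match between runs, since only pair membership is compared.
    """
    kept = sorted(resampled)
    together = [(a, b) for a, b in combinations(kept, 2) if original[a] == original[b]]
    if not together:
        return float("nan")
    still_together = sum(resampled[a] == resampled[b] for a, b in together)
    return still_together / len(together)

# Hypothetical example: patient p5 was removed before re-clustering.
original = {"p1": 1, "p2": 1, "p3": 1, "p4": 2, "p5": 2}
resampled = {"p1": "a", "p2": "a", "p3": "b", "p4": "b"}
probability = co_clustering_probability(original, resampled)
```

Repeating this calculation over many random subsets at each removal percentage gives a distribution of probabilities, from which a median and percentile whiskers can be drawn.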

Figure 5. SNF displays high robustness in identifying the three clusters in response to removal of patients.

Whiskers represent 2.5 and 97.5 percentiles. Each percent removal of patients was repeated 10,000 times and co-

clustering probability was measured relative to the original fused matrix. Removal of 40% of patients maintained

80% co-clustering probability.

To identify the sensitivity of our 3 clusters to removal of clinical features, a similar analysis was

performed, but by separately removing each one of the clinical variables and calculating co-

clustering probability after re-running SNF. As summarized in Figure 6, a majority of the

features did not impact the formation of the 3 clusters when removed, with the exception of

‘Proportion Male’ and ‘Lymphadenopathy’. These results show that our clusters were

influenced more by these 2 clinical variables than by the rest. That gender plays an important role

in cluster formation is not surprising given the inherent differences between males and females,

and its effect is therefore difficult to remove fully. However, we did attempt to reduce the gender bias

when analyzing the clusters by removing any gene expression variables that were strongly

correlated with gender in our dataset (elaborated further in later sections).

Figure 6. The 3 clusters are most sensitive to removal of ‘Proportion Male’ and ‘Lymphadenopathy’ clinical

variables.

In order to assess the sensitivity of the 3 clusters to clinical variables, one at a time, variables were withheld from

SNF analysis and co-clustering probability was measured relative to the original fused matrix. ‘Proportion Male’

and ‘Lymphadenopathy’ appear to have the biggest impact on cluster formation.

3.4 Unique clinical profiles characterize the 3 clusters

Comparison of the three clusters across an array of clinical variables shows that the clusters are

clinically distinct from each other. According to the clinical and demographic features (Figure

7), cluster 1 appears to be composed of mostly older patients with a higher proportion of females.

The patients in cluster 2 appear to have longer duration of fever and lower incidence of the

principal diagnostic features of KD, including lower frequency of rash, oral changes, extremity

changes, and conjunctivitis (Figure 7) (7). Cluster 3 patients are all boys and are also distinguished

by the absence of cervical lymphadenopathy. Lastly, various ethnic profiles are

represented across the three clusters (Figure 8). Cluster 1 has a higher relative proportion of

children of Asian descent, while cluster 2 has a higher proportion of patients of African

American descent and a lower proportion of children of mixed ethnicities.

Figure 7. Unique clinical and demographic profiles characterize the three clusters.

Whiskers represent 2.5-97.5 percentiles. Clinical variables which are statistically significant (p < 0.05) are marked

with an asterisk (*). Based on clinical and demographic features, cluster 1 appears to be composed mostly of older

females, cluster 2 patients were diagnosed with a longer duration of fever with lower incidence of clinical

symptoms, and patients in cluster 3 were 100% male with no incidence of lymphadenopathy.

Figure 8. Ethnic group profiles across the 3 clusters

Patients in clusters 1, 2, and 3 represent groups of Asian, African American, Caucasian, Hispanic, and Mixed

ethnicities. Cluster 1 appears to have a higher proportion of patients of Asian descent, relative to the other 2 clusters.

Cluster 2 has a higher proportion of African American patients and a lower proportion of mixed ethnicity patients,

relative to clusters 1 and 3.

Patterns across the routine clinical laboratory test measures in Figure 9 also appear to show

marked differences between the 3 clusters. Cluster 1 displays higher levels of CRP, but lower

levels of ESR, in contrast to the other groups of patients. Even though there are no significant

differences in total white blood cell count, the cellular composition is distinct, with cluster 1

having considerably lower percentages of monocytes and lymphocytes, but higher percentages of

neutrophils with respect to the other clusters. Inflammatory markers for cluster 2, in contrast,

show opposite patterns, with higher ESR, but decreased CRP. Discrepancies in CRP and ESR

measurements have been previously described in KD and have been attributed to CRP being a

direct measure of inflammation with a faster onset, while ESR is indirect and has much slower

kinetics (24). The patients in cluster 2 also presented with the lowest hemoglobin levels, but the highest

platelet counts compared to the other groups. The composition of white blood cells also showed

marked differences, with the lowest neutrophil percentages but the highest percentages of lymphocytes

with respect to the other clusters. All these features for cluster 2 are consistent with the longer

duration of disease (as measured by fever duration) in this cluster. Lastly, cluster 3 patients

displayed lower levels of ESR, but higher levels of CRP, much like cluster 1. Patterns

across other panels, such as neutrophil and lymphocyte percentages, were intermediate between

clusters 1 and 2. Cluster 3 patients did, however, appear to have increased ALT levels and

slightly higher urinalysis white blood cell counts and eosinophil percentages compared to the other

clusters (though the latter 2 were not significant).

Figure 9. Unique laboratory test profiles characterize the three clusters.

Whiskers represent 2.5-97.5 percentiles. Clinical variables which are statistically significant (p < 0.05) are marked

with an asterisk (*). According to laboratory tests, cluster 1 had a high level of CRP, a lower percentage of

lymphocytes, and a higher number of neutrophils relative to the other groups. Cluster 2 had the highest platelet count

and lowest hemoglobin z-scores compared to the other 2 clusters. Cluster 2 also displayed lower levels of CRP and

lower neutrophil percentages, but higher lymphocyte percentages. Cluster 3 had mostly intermediate levels relative to clusters 1

and 2 across most of the variables. Patients in this cluster did, however, have higher urinalysis WBC, ALT levels, and

slightly higher eosinophil percentages compared to the other 2 groups.

From the patterns seen in Figure 9, cluster 2 appears to be the most distinct cluster across most of

the variables. The contrasting ESR and CRP patterns between cluster 2 and clusters 1 and 3,

along with other laboratory measures such as lymphocyte and neutrophil percentages, appear to

correlate with the observed pattern of early stages of inflammation (innate) in clusters 1 and 3,

and later stages of disease (adaptive) in cluster 2.

3.5 Unique gene expression profiles characterize the 3 clusters

Unlike the clinical features, which comprise only a few variables that can be easily summarized on a

single page with boxplots, the clusters cannot be compared in the same way across every gene

expression variable. Our dataset contains 19,539 genes, and results on that scale would be

overwhelming and hard to interpret. To aid in extracting meaningful

results, we removed from further analysis 1,355 genes that were strongly associated with

gender (Mann-Whitney test p<0.05), as gender can introduce considerable bias and confounding

effects into a dataset due to inherent differences between males and females in any mixed cohort

of patients (87). If these effects are not removed, the gene expression results may be diluted

with genes that are linked to gender, potentially masking important patterns in the

cohort. To proceed with the analysis, we identified the most significant genes based on

their ability to differentiate the clusters, using the Kruskal-Wallis test for each variable. Using this

method, we isolated 411 genes that had a p-value (Holm-Bonferroni adjusted) less than 0.001.

Because the Kruskal-Wallis test does not indicate whether a gene’s

expression goes up or down, we performed hierarchical clustering on these genes and identified 2

very distinct groups based on their patterns of expression in each patient cluster (Figure 10).

Group 1, the top part of the heatmap, contains genes where cluster 2 has a marked decrease in

expression relative to clusters 1 and 3. Genes in the group below display an opposite pattern

where clusters 1 and 3 have relatively lower expression in the genes that make up the group,

while cluster 2 is higher. To better characterize the genes that make up these 2 groups, each

group was used as input to the DAVID online computational tool (a tool that identifies publicly

available gene group annotations that best describe a given list of genes) to identify GO terms

that are enriched in each of these gene lists. It is important to note that GO terms vary in how

specifically they describe a particular biological process or molecular function, so their usefulness

in the analysis of our data is only as good as their curation. Tables 3 and 4 list the top GO

terms for the two groups respectively, with a p-value < 0.05 (Benjamini, a correction for multiple

hypothesis testing). For illustrative purposes, a subset of these functional groups (representative

of the overall patterns seen in these genes) is displayed as heatmaps in Figure 11 and Figure

12, the results of which are described in detail in the following sections.
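The Holm-Bonferroni adjustment applied to the per-gene Kruskal-Wallis p-values can be sketched as follows. This is a generic implementation of the step-down procedure, not the code used in this study; the gene names and raw p-values are hypothetical, and only the 0.001 cut-off comes from the text.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from a per-gene test across the 3 clusters.
raw = {"gene_a": 0.0001, "gene_b": 0.04, "gene_c": 0.0004}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
selected = [gene for gene, p in adjusted.items() if p < 0.001]
```

Because the multiplier shrinks as the rank increases, the procedure is less conservative than a plain Bonferroni correction while still controlling the family-wise error rate.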

Figure 10. The 3 clusters are characterized by unique gene expression profiles.

(A) Heatmap showing the hierarchical clustering of the top 411 genes (after removal of genes highly correlated with

gender; Kruskal-Wallis adjusted p-value (Holm-Bonferroni) < 0.001), which separated the genes into 2 groups based on

patterns of gene expression across the clusters.

In the first group of genes identified by hierarchical clustering (Figure 11 and Table 3), the

functional groups related to inflammatory and innate immune response, and protein kinase

cascade, display a relative increase in expression in clusters 1 and 3 (more intense in cluster 1),

which leads us to believe that these patients have a gene expression profile pointing to an active

innate immune response. The opposite is true for cluster 2, which appears to correlate with the

longer duration of fever in these patients; the innate immune response may therefore have decreased

in activity, with a transition already under way to an adaptive response profile. The rest of the GO terms, such as

‘Plasma Membrane’ and ‘Insoluble Fraction’ are significantly enriched but are too generic to

draw meaningful conclusions from. These terms belong to the Cellular Component category of

GO terms (which are less useful by themselves than the Molecular Function and Biological Process

GO categories when it comes to comparing groups of patients) and are also located closer to the

root of the GO terms hierarchy (the terms at the top have a much broader description and are

therefore more generic). As a result, even though they do tell us that gene expression patterns

differ amongst genes in the plasma membrane per se, without any functional descriptions in this

case, not much else can be inferred.

Figure 11. Group 1 genes are representative of inflammation and immune response related GO terms.

The genes in the first group were further analyzed using the DAVID online computational tool to find enriched

GO terms, which are annotated groups of genes with similar roles and descriptions. Cluster 2 had notably lower

levels of gene expression in GO terms related to innate immune response compared to the other clusters, while

cluster 1 had slightly higher levels compared to cluster 3. See Table 3 for the full list of significant GO terms.

Table 3. List of group 1 significant GO terms from the DAVID gene enrichment analysis.

Hierarchical clustering of the top 411 genes (Kruskal-Wallis adjusted p-value (Holm-Bonferroni) < 0.001) separated the genes into 2 groups (see Figure 10). GO terms

with an adjusted p-value <0.05 (Benjamini) are listed for group 1 genes.

Term Count P-Value List Total Benjamini

GO:0045087~innate immune response 11 4.59E-07 123 6.06E-04

IL18R1, CR1, NCF2, FCGR1C, CXCL16, FCGR1A, IL1RAP, VNN1, TLR5, TLR6, TLR8, RAB27A

GO:0006952~defense response 18 3.62E-05 123 2.36E-02

IL18R1, CR1, HIST1H2BC, NCF2, HIST1H2BE, STAT5B, FPR2, TLR5, TLR6, TLR8, MMP25, S100A12, HDAC4, NLRC4, FCGR1C, CXCL16, FCGR1A, IL1RAP, VNN1, RAB27A

GO:0000267~cell fraction 24 8.18E-05 116 1.61E-02

ATP6V0E1, CYP1B1, STX3, AQP9, LIMK2, FLOT1, MAN1A1, GYG1, GCLM, SOD2, S100A12, MCTP1, LIN7A, ACSL1, DGAT2, SH3GLB1, CD59, FAS, CEACAM4, ACSL4, LRRK2, CEACAM1, PSTPIP2, HIP1

GO:0009611~response to wounding 16 8.32E-05 123 3.60E-02

CR1, STAT5B, FPR2, TLR5, TLR6, TLR8, MMP25, S100A12, SOD2, HDAC4, NLRC4, CD59, IL1RAP, SERPINB2, VNN1, RAB27A

GO:0007243~protein kinase cascade 13 1.27E-04 123 4.12E-02

SOCS3, STAT5B, FPR1, TLR5, TLR6, TLR8, TANK, IFNAR1, TRIB1, OSM, TGFA, GADD45B, LRRK2

GO:0006955~immune response 18 1.49E-04 123 3.86E-02

IL1R2, IL18R1, CR1, AQP9, NCF2, BST1, NCF4, TLR5, TLR6, TLR8, OSM, FCGR1C, CXCL16, FCGR1A, IL1RAP, FCGR1B, VNN1, FAS, RAB27A

GO:0006954~inflammatory response 12 1.70E-04 123 3.67E-02

HDAC4, CR1, NLRC4, IL1RAP, STAT5B, VNN1, TLR5, FPR2, TLR6, TLR8, MMP25, S100A12

GO:0005626~insoluble fraction 19 4.81E-04 116 4.65E-02

ATP6V0E1, CYP1B1, STX3, AQP9, FLOT1, MAN1A1, S100A12, MCTP1, LIN7A, ACSL1, DGAT2, SH3GLB1, CD59, CEACAM4, ACSL4, LRRK2, CEACAM1, PSTPIP2, HIP1

GO:0005886~plasma membrane 51 1.00E-03 116 4.84E-02

GPR84, AQP9, TLR5, TLR6, MMP25, SLC2A3, IL1RAP, VNN1, TGFA, SV2A, CEACAM4, FAS, CEACAM1, RAB27A, PTPRJ, GPR97, STX3, NCF2, BST1, NCF4, FLOT1, IFNAR1, OSM, ARRB2, MGAM, LRRK2, SLC40A1, SLC2A14, FPR1, FPR2, KCNJ2, GPR141, ITGAM, ACSL1, FCGR1C, CD177, FCGR1A, FCGR1B, NUMB, ACSL4, IL18R1, CR1, TRIM25, S100A12, LIN7A, P2RY13, CXCL16, GNG10, CD59, RIT1, GK, FCGR2A

The second group of genes (Figure 12 and Table 4), where cluster 2 has relatively higher levels

of expression (whereas cluster 3 has slightly lower levels compared to cluster 1), corresponds to a

number of GO terms that relate to metabolism, including translation, RNA processing, as well as

mitochondrial and ribonucleoprotein complex related genes. The MHC protein binding group of

genes has also been identified as significant and has been previously associated with KD via

the GWAS described earlier (70). Taken together, these GO terms imply that cells in

these patients are undergoing increased levels of protein synthesis and MHC-mediated

presentation, both of which are consistent with an increasingly active adaptive immune

response.

Figure 12. Group 2 genes are representative of metabolism related GO terms.

The genes in the second group were further analyzed using the DAVID online computational tool to find

enriched GO terms, which are annotated groups of genes with similar roles and descriptions. Due to the large list of

GO terms, a representative sample was picked for illustrative purposes (see Table 4 for the full list). GO terms

related to metabolism displayed an increased pattern of expression in cluster 2, relative to clusters 1 and 3. Lower

levels of expression were observed in cluster 1 relative to cluster 3.

Table 4. List of group 2 significant GO terms from the DAVID gene enrichment analysis.

Hierarchical clustering of the top 411 genes (Kruskal-Wallis adjusted p-value (Holm-Bonferroni) < 0.001) separated the genes into 2 groups (see Figure 10). GO terms

with an adjusted p-value <0.05 (Benjamini) are listed for group 2 genes.

Term Count P-Value List Total Benjamini

GO:0070013~intracellular organelle lumen 60 6.89E-14 157 1.63E-11

MMS19, ATP5D, RNMT, LYAR, QARS, CDC16, PDHB, TMEM109, LONP1, MCCC1, SMARCD1, CIRH1A, PRPF31, ELP2, ANAPC5, BYSL, ERP29, RING1, MTA1, POLR1C, LAS1L, MCM3, RSL1D1, RPS15, EDF1, PCCB, NHP2, PMPCA, AARS2, GLTSCR2, POLR2G, SDAD1, TH1L, RPL36, C14ORF169, TRRAP, BOP1, BMS1, PRPF19, HNRNPM, RPA2, DDX47, REXO4, NAT10, GTF3C2, APEX1, SHMT2, TSR1, PHB, MPHOSPH10, SMAD3, ILF3, SF3A3, CDC25B, ILF2, SUMF2, ATP5A1, DDX54, PARP1, DAP3

GO:0031974~membrane-enclosed lumen 61 1.19E-13 157 1.41E-11

MMS19, ATP5D, RNMT, LYAR, QARS, CDC16, PDHB, TMEM109, LONP1, MCCC1, SMARCD1, TIMM9, CIRH1A, PRPF31, ELP2, ANAPC5, BYSL, ERP29, RING1, MTA1, POLR1C, LAS1L, MCM3, RSL1D1, RPS15, EDF1, PCCB, NHP2, PMPCA, AARS2, GLTSCR2, POLR2G, SDAD1, TH1L, RPL36, C14ORF169, TRRAP, BOP1, BMS1, PRPF19, HNRNPM, RPA2, DDX47, REXO4, NAT10, GTF3C2, APEX1, SHMT2, TSR1, PHB, MPHOSPH10, SMAD3, ILF3, SF3A3, CDC25B, ILF2, SUMF2, ATP5A1, DDX54, PARP1, DAP3

GO:0043233~organelle lumen 60 1.89E-13 157 1.50E-11

MMS19, ATP5D, RNMT, LYAR, QARS, CDC16, PDHB, TMEM109, LONP1, MCCC1, SMARCD1, CIRH1A, PRPF31, ELP2, ANAPC5, BYSL, ERP29, RING1, MTA1, POLR1C, LAS1L, MCM3, RSL1D1, RPS15, EDF1, PCCB, NHP2, PMPCA, AARS2, GLTSCR2, POLR2G, SDAD1, TH1L, RPL36, C14ORF169, TRRAP, BOP1, BMS1, PRPF19, HNRNPM, RPA2, DDX47, REXO4, NAT10, GTF3C2, APEX1, SHMT2, TSR1, PHB, MPHOSPH10, SMAD3, ILF3, SF3A3, CDC25B, ILF2, SUMF2, ATP5A1, DDX54, PARP1, DAP3

GO:0031981~nuclear lumen 47 5.20E-10 157 3.08E-08

MMS19, GLTSCR2, POLR2G, SDAD1, RNMT, LYAR, TH1L, RPL36, C14ORF169, TRRAP, BOP1, CDC16, BMS1, PRPF19, HNRNPM, RPA2, TMEM109, DDX47, REXO4, SMARCD1, NAT10, CIRH1A, GTF3C2, APEX1, PRPF31, ELP2, TSR1, ANAPC5, PHB, BYSL, RING1, MPHOSPH10, SMAD3, MTA1, POLR1C, LAS1L, ILF3, MCM3, SF3A3, CDC25B, RSL1D1, ILF2, RPS15, EDF1, DDX54, PARP1, NHP2

GO:0005730~nucleolus 29 2.12E-08 157 1.01E-06

GLTSCR2, SDAD1, LYAR, C14ORF169, RPL36, BOP1, BMS1, HNRNPM, DDX47, TMEM109, REXO4, SMARCD1, NAT10, CIRH1A, ELP2, TSR1, BYSL, RING1, MPHOSPH10, MTA1, ILF3, LAS1L, POLR1C, MCM3, RSL1D1, ILF2, DDX54, PARP1, NHP2

GO:0030529~ribonucleoprotein complex 23 2.87E-07 157 1.13E-05

MRPL2, MRPS27, SNRPA1, PRPF31, RPL19, PABPC4, MPHOSPH10, RPL36, EEF2, ILF3, BOP1, SF3A3, PRPF19, RSL1D1, HNRNPM, TARBP2, ILF2, RPS15, MRPL38, SNRNP40, APEX1, NHP2, DAP3

GO:0006412~translation 18 6.55E-07 167 8.13E-04

MRPL2, YARS, RPL19, PABPC4, RPL36, EEF2, QARS, VARS, RSL1D1, EIF3D, EIF3H, RPS15, EIF3F, EIF4A1, EIF3K, LGTN, AARS2, EIF2B4

GO:0006396~RNA processing 23 9.17E-07 167 5.70E-04

POLR2G, SNRPA1, PRPF31, RNMT, TSR2, PABPC4, MPHOSPH10, SMAD3, BOP1, RNMTL1, SF3A3, PRPF19, RSL1D1, HNRNPM, TARBP2, DDX39, DNAJC8, RPS15, SNRNP40, CPSF4, DDX54, NHP2, TYW1B

GO:0005654~nucleoplasm 28 7.79E-06 157 2.64E-04

MMS19, POLR2G, RNMT, TH1L, C14ORF169, TRRAP, BOP1, CDC16, PRPF19, RPA2, SMARCD1, GTF3C2, APEX1, PRPF31, ELP2, ANAPC5, PHB, RING1, SMAD3, MTA1, POLR1C, MCM3, CDC25B, SF3A3, RPS15, EDF1, PARP1, NHP2

GO:0003723~RNA binding 24 1.24E-05 155 4.37E-03

POLR2G, SNRPA1, PRPF31, YARS, RPUSD4, RNMT, RPL19, RPUSD2, PABPC4, ILF3, RNMTL1, RSL1D1, HNRNPM, TARBP2, DDX47, LONP1, DDX18, ILF2, EIF4A1, CPSF4, LGTN, DDX10, DDX54, NHP2

GO:0022613~ribonucleoprotein complex biogenesis 12 1.35E-05 167 5.56E-03

PRPF31, TARBP2, SDAD1, TSR1, TSR2, BYSL, RPS15, MPHOSPH10, BOP1, BMS1, NHP2, SF3A3

GO:0031967~organelle envelope 21 6.55E-05 157 1.94E-03

ATP5D, NXT1, SHMT2, NDUFB11, GIMAP5, SAMM50, NDUFB8, COX10, PHB, SMAD3, IPO9, TMEM109, NPIP, NUP205, MCCC1, TIMM9, ATP5A1, NDUFS3, PARP1, PMPCA, SLC25A17

GO:0031975~envelope 21 6.85E-05 157 1.80E-03

ATP5D, NXT1, SHMT2, NDUFB11, GIMAP5, SAMM50, NDUFB8, COX10, PHB, SMAD3, IPO9, TMEM109, NPIP, NUP205, MCCC1, TIMM9, ATP5A1, NDUFS3, PARP1, PMPCA, SLC25A17

GO:0003743~translation initiation factor activity 7 8.19E-05 155 1.44E-02

EIF3D, EIF3H, EIF4A1, EIF3F, EIF3K, LGTN, EIF2B4

GO:0044429~mitochondrial part 20 1.16E-04 157 2.73E-03

ATP5D, SHMT2, NDUFB11, GIMAP5, SAMM50, NDUFB8, COX10, PHB, QARS, PDHB, LONP1, MCCC1, TIMM9, ATP5A1, NDUFS3, AARS2, PCCB, PMPCA, DAP3, SLC25A17

GO:0005739~mitochondrion 29 1.17E-04 157 0.0025095

ATP5D, SAMM50, COX10, NDUFB8, QARS, VARS, PDHB, LONP1, MCCC1, TIMM9, MRPL38, NDUFS3, GTF3C2, RTN4IP1, MRPS27, MRPL2, NDUFB11, SHMT2, GIMAP5, PHB, ILF3, GLOD4, SMCR7L, ATP5A1, PMPCA, AARS2, PCCB, SLC25A17, DAP3

GO:0042254~ribosome biogenesis 9 1.28E-04 167 0.038887

SDAD1, TSR1, TSR2, BYSL, RPS15, MPHOSPH10, BOP1, BMS1, NHP2

GO:0008135~translation factor activity, nucleic acid binding 8 1.62E-04 155 0.0188477

EIF3D, EIF3H, EIF4A1, EIF3F, EIF3K, EEF2, LGTN, EIF2B4

GO:0042287~MHC protein binding 5 1.98E-04 155 0.0173521

TARP, ATP5A1, HLA-DMB, HLA-DMA, CD74

GO:0031980~mitochondrial lumen 11 4.68E-04 157 0.0092094

ATP5D, LONP1, SHMT2, MCCC1, QARS, ATP5A1, AARS2, PMPCA, PCCB, PDHB, DAP3

GO:0005759~mitochondrial matrix 11 4.68E-04 157 0.0092094

ATP5D, LONP1, SHMT2, MCCC1, QARS, ATP5A1, AARS2, PMPCA, PCCB, PDHB, DAP3

GO:0005852~eukaryotic translation initiation factor 3 complex 4 7.28E-04 157 0.013198

EIF3D, EIF3H, EIF3F, EIF3K

GO:0031966~mitochondrial membrane 14 0.001078 157 0.0180872

ATP5D, SHMT2, NDUFB11, GIMAP5, SAMM50, COX10, NDUFB8, PHB, MCCC1, TIMM9, ATP5A1, NDUFS3, PMPCA, SLC25A17

GO:0005740~mitochondrial envelope 14 0.001869 157 0.0291238

ATP5D, SHMT2, NDUFB11, GIMAP5, SAMM50, COX10, NDUFB8, PHB, MCCC1, TIMM9, ATP5A1, NDUFS3, PMPCA, SLC25A17

GO:0043228~non-membrane-bounded organelle 48 0.002223 157 0.0324264

GLTSCR2, SDAD1, RPL19, LYAR, RPL36, C14ORF169, BOP1, CDC16, BMS1, SUMO3, HNRNPM, RPA2, TMEM109, DDX47, LONP1, REXO4, CENPB, SMARCD1, NAT10, MRPL38, CIRH1A, APEX1, ZW10, MRPS27, MRPL2, SHMT2, ELP2, TSR1, BYSL, RING1, MPHOSPH10, MTA1, POLR1C, LAS1L, ILF3, MCM3, MPRIP, CDC25B, KLHDC3, RSL1D1, CCDC6, ILF2, RPS15, DDX54, BIN1, PARP1, NHP2, DAP3

GO:0043232~intracellular non-membrane-bounded organelle 48 0.002223 157 0.0324264

GLTSCR2, SDAD1, RPL19, LYAR, RPL36, C14ORF169, BOP1, CDC16, BMS1, SUMO3, HNRNPM, RPA2, TMEM109, DDX47, LONP1, REXO4, CENPB, SMARCD1, NAT10, MRPL38, CIRH1A, APEX1, ZW10, MRPS27, MRPL2, SHMT2, ELP2, TSR1, BYSL, RING1, MPHOSPH10, MTA1, POLR1C, LAS1L, ILF3, MCM3, MPRIP, CDC25B, KLHDC3, RSL1D1, CCDC6, ILF2, RPS15, DDX54, BIN1, PARP1, NHP2, DAP3

GO:0019866~organelle inner membrane 12 0.002347 157 0.0322239

ATP5D, NDUFB11, SHMT2, NDUFB8, PHB, MCCC1, TIMM9, SMAD3, ATP5A1, NDUFS3, PMPCA, SLC25A17

3.6 Variation in treatment response and coronary outcome across the 3 clusters

Figure 13 compares two clinical outcome measures highly relevant in KD: responsiveness to

IVIG treatment and coronary outcome. Across the 3 clusters, there appears to be a trend between

the groups of patients for

both variables. Looking first at clusters 1 and 3 (which appeared similar across most of the

features), cluster 1, which was earlier described as having a higher proportion of females, had the

higher fraction of patients that responded to treatment and relatively better coronary outcomes.

Cluster 3, which was exclusively male, had higher IVIG non-responsiveness

compared to the other clusters and also had the largest proportion of patients with a Z-worst

score > 2.5. Cluster 2, in contrast, had the lowest proportion of patients that were IVIG

resistant, compared to clusters 1 and 3. This result was surprising, considering the previously

reported association between longer duration of fever and relatively worse coronary outcome

(88-90). Despite the clear differences in clinical and gene expression patterns between the

3 clusters, these trends in outcome measures for KD did not reach statistical

significance.

Figure 13. The 3 clusters vary with respect to treatment response and disease outcome.

All patients were treated with the identical therapeutic protocol, which included first line treatment with IVIG, the

current standard of care. IVIG non-responsiveness exhibits a trend, though not statistically significant (Fisher’s

exact test, p-value of 0.051), where cluster 2 had the lowest proportion of children that were IVIG non-responsive, while

cluster 3 had a higher rate of IVIG non-responsiveness compared to clusters 1 and 2. A reversed trend was seen for

clusters 1 and 3 in terms of coronary outcome (Fisher’s exact test, p-value of 0.73), where cluster 3 showed an

increase in patients with poor coronary outcome (Z-worst score > 2.5), while cluster 1 showed a decreased

proportion.

3.7 Unique clinical and biological classifiers for predicting cluster assignment

The clinical and gene expression profiles shown in the previous figures describe only the three

groups of patients within our cohort. Further analysis is required to identify the specific

features that can classify a new patient into one of the 3 clusters. The supervised learning

approach used to select informative features in our study is FeaLect, which is based on the

LASSO method for linear regression and scores the features based on their relevance when

generating the models (79). To limit the number of features being tested, we used all the clinical

variables and the top 411 genes (as mentioned earlier) as input. FeaLect validates the results by

training models on 100 randomly generated subsamples (drawn without replacement, each ¾ the size of

the original cohort). As part of the analysis, FeaLect generates a score for each feature

from the input data; when the logs of the total scores are plotted in increasing order, they

produce a 3-segment graph with curved ends and a linear middle portion (79). The authors

hypothesized that the linear portion represents irrelevant features that most likely contribute to

overfitting, while the non-linear exponential curve represents the most informative variables that

can be used as classifiers (79). Figure 14 shows the feature score graphs that were generated for

our clusters, with the vertical line at the end of each plot pointing to the part of the exponential

curve we used for identifying our informative features. Unlike the

spline-construction method in the original paper (79), the lines were placed using the

standard significance threshold of p<0.05 based on the frequency distribution of the feature

scores. Figure 15 and Table 5 illustrate the informative features (for each cluster) that were

identified using FeaLect. Based on these results, the classifiers that appear to be the most important

in identifying cluster 1 patients are gender, lymphadenopathy, extremity changes, and interferon

alpha and beta receptor subunit 1 (IFNAR1), amongst others. Cluster 2 patients were

differentiated by extremity changes, conjunctivitis, and rash, all of which showed the largest

contrasting differences in the clinical profiles described earlier in Figure 7, as well as genes

relating to metabolism (e.g. polymerase (RNA) II (DNA directed) polypeptide G (POLR2G) and

mitochondrial ribosomal protein L2 (MRPL2), which encode a subunit of RNA polymerase II

and a mitochondrial ribosomal protein, respectively) that correlate with the transition to an adaptive

immune response in these patients. Lastly, for cluster 3 patients, gender and

lymphadenopathy were identified as informative features, both of which had very contrasting clinical profiles in

Figure 7 for this group of patients, as well as some genes previously linked to KD, namely

S100A12 (S100A calcium binding protein family) (91), amongst others.
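The threshold used to pick informative features from the sorted score curve can be illustrated with the sketch below. It reflects one plausible reading of the p<0.05 criterion, namely that a feature is kept when its log score lies in the upper 5% tail of the empirical distribution of all feature scores; this interpretation, the function name, and the example scores are assumptions rather than the procedure actually implemented.

```python
import math

def informative_features(scores, alpha=0.05):
    """Return features whose log score falls in the top `alpha` empirical tail."""
    logs = {name: math.log(s) for name, s in scores.items() if s > 0}
    ranked = sorted(logs, key=logs.get)   # increasing order, as in the score plot
    n = len(ranked)
    selected = []
    for position, name in enumerate(ranked):
        # Empirical probability of a score at least this large (ties not handled).
        upper_tail = (n - position) / n
        if upper_tail < alpha:
            selected.append(name)
    return selected

# Hypothetical total scores for 40 features.
scores = {f"feature_{i}": float(i) for i in range(1, 41)}
top = informative_features(scores)
```

With 40 hypothetical features, only the single highest-scoring feature passes the cut-off, which mirrors how the vertical line in each score plot isolates the extreme right-hand end of the curve.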

Figure 14. FeaLect total feature scores.

FeaLect feature scoring algorithm, using the top 411 genes from the gene expression dataset (Kruskal-Wallis

adjusted p-value (Holm-Bonferroni) < 0.001) and all the clinical variables, was performed 3 times, as a set of binary

regressions for each cluster assignment. Total feature scores (log-scale) were plotted for each cluster and

informative features were picked based on a p<0.05 statistical significance, as denoted by the vertical line at the end

of each graph. See Figure 15 and Table 5 for the list of extracted features.

Figure 15. Informative clinical and biologic features identified with FeaLect.

FeaLect feature scoring algorithm, using the top 411 genes from the gene expression dataset (Kruskal-Wallis

adjusted p-value (Holm-Bonferroni) < 0.001) and all the clinical variables, was performed 3 times, as a set of binary

regressions for each cluster assignment. The bar graphs represent the total feature scores (log-scale) of the informative

features (based on a p<0.05 statistical significance) that were extracted via FeaLect. See Table 5 for a list of all these

features along with a description of some biologic variables.

Table 5. Description of FeaLect classifiers for predicting cluster assignment.

Annotation of the features presented in Figure 15. Variables listed for each cluster represent the informative

features selected with FeaLect (based on a p<0.05 statistical significance) for each cluster. Relative expression of

some of the biologic variables between the clusters is displayed in Figure 16.

Figure 16. Relative gene expression profiles of biologic variables for each set of features extracted with

FeaLect.

These heatmaps represent the relative expression of genes from the informative features identified with FeaLect in

Table 5 and Figure 15.

4 Discussion

Our SNF analysis (Figure 4 B and C) yielded 3 distinct clusters consisting of 78, 33, and 48

patients from the merging of biologic and clinical datasets. Most importantly, these clusters

are the fused product of the patterns across each individual dataset. The problem with analyzing

multiple datasets separately is apparent in Figure 4A, where the similarity matrices for each

individual dataset are shown. Even though the original datasets were converted to patient

similarity matrices, which removes the large contrast in the number of features and allows for

easier comparison between gene expression (19,539 genes) and clinical data (20 variables), each

respective dataset still shows different patterns across the cohort. The gene expression dataset has

2 clusters, the clinical categorical dataset has 5, and the clinical numerical dataset has 4 when

clustered using spectral clustering. This makes it difficult to make inferences about the patients in

the cohort, because the clusters disagree between datasets. Fusion of the similarity

networks, however, led us to clearly identify the 3 clusters while incorporating the patterns from

the 3 datasets in the final network. It achieves this by strengthening any patterns common

between datasets, while weakening the patterns that are not shared.

As with any other clustering method, an important question is whether the patterns we are seeing

are based on the entire cohort or only a small subset of patients that are driving the cluster

formation. Co-clustering probability in Figure 5 shows that even when 40% of patients were

removed, we were still able to maintain 80% co-clustering probability. In other words, 80% of

the original pairwise patient-patient co-clustering relationships remained the same after re-

running the analysis with trimmed datasets. This is evidence that SNF is highly robust with

respect to our KD study and the 3 clusters can be re-constructed even with smaller sample sizes.
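
The pairwise reading of co-clustering probability described above can be sketched as follows. This is a minimal illustration in Python and not the code used in this study: the function name `co_clustering_agreement`, the dictionary layout, and the toy patient IDs are all assumptions, and it adopts one possible definition, namely the fraction of patient pairs whose same-cluster or different-cluster status is unchanged after re-clustering a subsample.

```python
from itertools import combinations


def co_clustering_agreement(full_labels, subsample_labels):
    """Fraction of patient pairs whose co-clustering status is preserved.

    Both arguments map a patient ID to a cluster label. Only patients present
    in both runs are compared, so removed patients are ignored.
    """
    shared = sorted(set(full_labels) & set(subsample_labels))
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two shared patients")
    preserved = sum(
        (full_labels[a] == full_labels[b])
        == (subsample_labels[a] == subsample_labels[b])
        for a, b in pairs
    )
    return preserved / len(pairs)


# Hypothetical toy data: five patients in the full run, one removed in the subsample.
full_run = {"p1": 1, "p2": 1, "p3": 2, "p4": 2, "p5": 3}
subsample_run = {"p1": "a", "p2": "a", "p3": "b", "p4": "a"}
print(co_clustering_agreement(full_run, subsample_run))
```

Because only pair relationships are compared, the measure does not depend on how the clusters happen to be numbered in each run.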

A similar question was asked for the effect of variables on cluster formation – are the patterns

observed based on all the variables in the datasets or are they driven by only a select few? This

is less relevant for gene expression, because removing one gene out of 19,539 is unlikely to

have much of an effect, but it matters for the clinical datasets, both continuous and

categorical (20 features in total), where each variable has a much stronger effect on the

pairwise similarities being calculated. Figure 6 illustrates that removing most of these variables

still gives a co-clustering probability higher than 0.90, with the exception of gender and lymphadenopathy,

which give co-clustering probabilities below 0.7 when removed. This does show up in our

clusters, as cluster number 2 is largely unique in both of these features, but the probabilities are

still fairly high and demonstrate that the patterns our clusters represent are driven by a

combination of variables rather than just a few single features. The effect of gender on our

cluster formation is to be expected, since our cohort is 60% male (Table 1) and there are inherent

differences between males and females in any dataset (87). Furthermore, KD is known to have a

higher occurrence in males, thus gender playing a key role in cluster formation in KD is

expected. Due to the confounding nature of the gender related genes in gene expression data and

for the purpose of extracting more meaningful results, we did, however, remove any gender

correlated variables from our gene expression data before gene enrichment analysis and feature

selection with FeaLect.

SNF was able to identify homogeneous clusters in our KD cohort, in a completely data-driven

way. Furthermore, the extracted clusters exhibit clinically meaningful results, both across the

clinical and gene expression datasets, in the context of KD. Our SNF analysis was able to discern

patients that appeared to be in different biologic stages of disease at presentation, namely in

innate (clusters 1 and 3) or adaptive immune responses (cluster 2). The importance of the innate and

adaptive immune systems in KD was established in previous studies and has been summarized

earlier (92).

The first line of evidence for different phases of the immune response can be observed in the

patterns across the clinical variables in Figure 7 and Figure 9, covering demographic and

laboratory test features, respectively. Looking at the clinical features more closely, cluster 2 has a

longer duration of fever and a lower incidence of classical KD symptoms (rash, lymphadenopathy,

oral changes, extremity changes, and conjunctivitis), which is consistent with a lower inflammatory

response (reflected by lower CRP values in Figure 9). The seemingly contradictory elevated ESR

values in these patients actually correlate with the longer duration of fever, as ESR is an indirect

measure of inflammation and has much slower kinetics than CRP (24). As would be expected,

the differential white blood cell counts are also in accordance with our claim – cluster 2 shows

decreased neutrophil and band percentages, but increased lymphocytes. Lastly, the increased

platelet count seen in cluster 2 also correlates with the increased duration of fever, as the two

have been previously linked, and is consistent with the natural history of KD moving into the

subacute phase with an IL-6 driven increase in platelets (26).

Cluster 1, in contrast to cluster 2, shows signs of an earlier response involving the innate immune

system. Cluster 1 still has a higher incidence of the classical symptoms of KD, and the duration of

fever indicates that the patients are earlier in the inflammatory response, as seen in Figure 7 and

Figure 9. The higher levels of neutrophils and lower levels of lymphocytes further support the innate

phase of the disease.

Based on the clinical patterns, cluster 3 does not stand out as strongly as the other 2

clusters. It is, however, most similar to cluster 1 in terms of fever duration at diagnosis,

presentation of the majority of classical KD symptoms, and most of the laboratory test variables

(Figure 7 and Figure 9). Consequently, the patients in this cluster also appear to be in the early

innate immune response stage of the disease, much like cluster 1. The 2 variables that most clearly

differentiate clusters 1 and 3 are gender and presence of lymphadenopathy, the 2 features that

were previously identified in Figure 6 as having the strongest effect on cluster formation relative

to all the other variables. However, despite the similarity across many of the features, cluster 1

may actually exhibit a slightly more pronounced inflammatory response: cluster 3 not only has

lower neutrophil percentages and ESR levels compared to cluster 1, but the

biologic patterns seen in Figure 11 (related to inflammation and innate immune response) also show

the same pattern with less intensity than in cluster 1.

Based on just the clinical results, patients in cluster 2 have been sick for longer than those in the

other groups, and are transitioning into the adaptive immune response, hence the lower

inflammation profile across both markers of inflammation and white blood cell differential counts.

Clusters 1 and 3, on the other hand, are still early in the innate immune response, which is

reflected in higher inflammation profiles and white blood cell counts skewed towards innate immune

cells.

The most interesting part of our SNF analysis came from the incorporation of biologic data. SNF

formed 3 clusters that were based not only on the clinical patterns described earlier, but

also on the underlying patterns in the thousands of features in the gene

expression dataset. The result was a ranking of genes that were significantly

different amongst the groups of patients (measured with the Kruskal-Wallis test and adjusted with

the Holm-Bonferroni multiple hypothesis testing correction). After removal of genes that strongly

correlated with gender, gene expression of the top remaining significant genes (411 genes)

revealed 2 clear patterns when taking the 3 clusters into account – the first group of genes had lower

expression in cluster 2 and higher in clusters 1 and 3, while the opposite pattern was observed for

the second group of genes (Figure 10). The gene enrichment analysis carried out on the two gene

sets revealed patterns in accordance with the clinical findings.
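
The gene ranking step described above (a Kruskal-Wallis test per gene across the 3 clusters, followed by Holm-Bonferroni adjustment) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used in this study: the function names and the toy expression values are hypothetical. It relies on the fact that, with exactly 3 groups, the chi-square approximation has 2 degrees of freedom, for which the upper tail probability is exp(-H/2), so no statistics library is needed.

```python
import math


def kruskal_wallis_3groups(g1, g2, g3):
    """Return (H, p) for one gene measured in three groups of patients."""
    groups = [g1, g2, g3]
    if any(len(g) == 0 for g in groups):
        raise ValueError("each group needs at least one value")
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    rank_sums = [0.0, 0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share the mean of their 1-based ranks
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    h = 12 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:  # all values identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, math.exp(-h / 2)  # chi-square upper tail, 2 degrees of freedom


def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted


# Hypothetical toy values: expression of two genes in three clusters of three patients.
toy_genes = {
    "gene_a": ([1, 2, 3], [4, 5, 6], [7, 8, 9]),
    "gene_b": ([1, 5, 9], [2, 6, 7], [3, 4, 8]),
}
pvals = {name: kruskal_wallis_3groups(*g)[1] for name, g in toy_genes.items()}
adjusted = dict(zip(pvals, holm_adjust(list(pvals.values()))))
print(adjusted)
```

Genes would then be kept if their adjusted p-value falls below the chosen cut-off (0.001 in the figure legends of this study).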

The GO terms representative of the genes that follow a pattern of decreased levels of expression

in cluster 2 (Figure 11 and Table 3), appear to correlate with the patterns of inflammation that we

have seen in the clinical variables across the 3 clusters. “Innate Immune response”, “Immune

Response”, “Protein Kinase Cascade”, and “Inflammatory Response” GO terms all communicate

the same thing and follow the same patterns we have seen in neutrophil % and CRP (Figure 9).

The pattern is that clusters 1 and 3 have the highest level of inflammation (cluster 1 slightly

higher than 3) and cluster 2 has the lowest. These correlations are expected because the genes

identified are the ones driving the inflammation. Although the list of genes that make up these

groups is still rather large, some specific genes that are of interest are IL-1 related (IL18R1,

IL1RAP, IL1R2 in Figure 11) and Fc-gamma receptor genes (FCGR1C in Figure 11), which

have been previously linked to Kawasaki Disease (69, 93). Fc-gamma receptors, which bind IgG

antibodies and transduce downstream signaling cascades (94), may potentially be involved in

IVIG response due to the abundance of IgG in IVIG preparations. Furthermore, a previously

mentioned GWAS has identified the FCGR2A gene, encoding one of the Fc-gamma receptors,

as linked to KD susceptibility (69). IL-1 secretion is linked to KD, with increased levels of the

cytokine found in KD patients during the acute phase of the disease (93). IL1RAP encodes part of the

IL-1 receptor and IL18R1 encodes part of the IL-18 receptor (the genes can be found under the

‘Innate Immune Response’ GO term in Figure 11); both are pro-inflammatory and show

increased expression in clusters 1 and 3, relative to cluster 2 (95). IL1R2 (under the

‘Immune Response’ GO term in Figure 11), though an inhibitory decoy receptor, is normally

expressed as well in order to modulate the inflammatory response, so its increased expression in

clusters 1 and 3 alongside the pro-inflammatory IL-1 and IL-18 components is not out of the

ordinary (95). Furthermore, a paper studying the same KD cohort against other pediatric

bacterial and viral infections, identified the IL-1 signaling pathway as a key signature in KD

compared to the other diseases, with implications for use in treatment (82). In fact, a case report

of a relapsing KD patient has previously demonstrated the beneficial effects of an IL-1 receptor

antagonist on disease outcome, further supporting the importance of IL-1 signaling in KD

pathogenesis (96). The remaining GO groups pertaining to cell fractions and plasma membrane

appear to be non-specific and do not clearly describe a molecular function, so it is hard to draw

any conclusions about their implications in our clusters.

GO terms describing group 2 genes in Figure 12 (full list in Table 4), such as “Translation”,

“RNA processing”, “Mitochondrion”, and “Ribonucleoprotein Complex”, are all terms related to

metabolism and protein synthesis. Higher relative expression in cluster 2 further supports the

initiation of adaptive immune response in these patients, while innate immune response is still

active in clusters 1 and 3 (cluster 1 expression appears to be slightly lower than cluster 3,

consistent with the opposite intensity for group 1 GO terms). Even though both innate and

adaptive immune responses undergo cell activation and subsequent proliferation, innate cells,

such as neutrophils, are more transient and short lived (97). Since prolonged inflammation is

damaging to the host, it is tightly regulated to be diminished following acute inflammation (98).

This is in concert with observations of increased lymphocytes and decreased neutrophils in

cluster 2, indicative of a transition to the adaptive immune response and increased metabolism to

support lymphocyte proliferation in adaptive immunity.

Aside from metabolism involvement, another important GO group describing a set of genes was

related to MHC protein binding – a component of the immune system that was previously

associated with KD (60, 70). Once again, cluster 2 here showed higher levels of expression in the

related genes, compared to clusters 1 and 3. The HLA region has been previously linked to

Kawasaki Disease, as in the GWAS published in 2012 that identified the 6p21.3 region as having a

genome-wide significant association with KD; this region includes HLA Class 2 genes,

such as HLA-DQA2 and HLA-DOB (70). Though HLA-DMA and HLA-DMB (encoding the

HLA-DM heterodimer) in Figure 12 are not part of this list, they are actually closely linked to

HLA-DO (one of the 2 chains is encoded by HLA-DOB), where HLA-DO is a modulator of

HLA-DM (99). HLA-DMA is an MHC-Class 2 protein that plays an important role in the

loading of peptides onto HLA molecules for antigen presentation, helping to exchange CLIP for

the antigen peptide (100). It is hypothesized to also impose a form of peptide selection that

creates a specific immune response and prevents cross-reactivity (100). HLA-DO is upregulated

in resting APCs, which in turn downregulates HLA-DM activity, consequently promoting a

broader low-abundance repertoire of antigens to be presented on the surface to diminish the

chance of a reaction against self-peptides (101). During an immune response, however, when

APCs need to present foreign antigens to the adaptive immune system, HLA-DM activity is

increased to facilitate an immunodominant response (101). The pattern observed seems to

correlate with the transition to the adaptive response in cluster 2, which both our clinical results and the role

of MHC-related molecules appear to support.

Clinical outcomes, shown in Figure 6, compare and contrast the 3 clusters across IVIG

responsiveness and coronary outcome in KD. Though not statistically significant, there is an apparent trend

between the patient groups that is consistent with the other findings presented

earlier. Cluster 1 has a lower proportion of patients that did not respond to IVIG, compared to

cluster 3, which may be attributed to the fact that this cluster was mostly composed of females.

This is plausible, as there is a gender bias in KD susceptibility, with higher incidence in males

(1.5-1.7:1 ratio) (7). For the same reason, it is quite possible that cluster 3 displays a relatively

higher proportion of patients with coronary enlargements because this cluster is 100%

male. Together, these clinical patterns are coherent, as IVIG non-responsiveness is often

associated with worse coronary outcome. However, taking into account most of the clinical and

laboratory test variables in Figure 7 and Figure 9, the differences observed between clusters 1

and 3 are small. Gene expression results, on the other hand, especially in Figure 11, appear to

show a more visible difference between the two clusters, relating to innate immune response and

inflammation gene signatures. Comparing this to the contrasting trend between outcome

measures in these groups of patients further supports the notion that the current AHA criteria for

diagnosis of KD are not sufficient for identifying at-risk patients without biological input.

Failure to identify and properly treat at-risk patients in time may well contribute to the

relatively worse coronary outcome and lower IVIG responsiveness of cluster 3 patients that was

observed in relation to cluster 1. In contrast to these observations, we have also observed a rather

unique pattern in IVIG non-responsiveness of our cluster 2 patients. This appears to be very

meaningful clinically and biologically with respect to IVIG non-responsiveness, as cluster 2 is

the cluster of patients with longer duration of fever at diagnosis. Prolonged fever has been

identified as a strong predictor of poor coronary outcome in numerous previous studies (88-90).

As a result, cluster 2 having a relatively lower incidence of patients with a z-worst score > 2.5

compared to clusters 1 and 3, and having a relatively lower IVIG non-responsiveness rate, is an

indication that the clinical variables are not able to capture the underlying biological differences

within these patients. That said, unfortunately we did not find statistical significance in the

outcome measures due to lack of statistical power, as the differences between the clusters were

small and would require a larger sample size to reject the null hypothesis.
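
The remark about statistical power can be made concrete with the standard normal-approximation formula for the sample size needed to compare two proportions. The sketch below is an illustration only: the function name `n_per_group` is an assumption, and the example rates (10% versus 20%) are hypothetical values chosen for the arithmetic, not outcome rates from this cohort.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Patients needed per group to detect p1 vs p2 with a two-sided test.

    Uses the usual normal approximation for two independent proportions
    with equal group sizes.
    """
    if p1 == p2:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical outcome rates in two clusters (not taken from this study).
print(n_per_group(0.10, 0.20))
```

The required size grows quickly as the difference between the two rates shrinks, which is why small differences between clusters call for a larger cohort.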

Analyzing gene patterns amongst the 3 clusters serves the purpose of describing the similarities

and differences between the groups of patients that we have in our cohort, but it shouldn’t be

used to make generalizations about any future patients. That is, out of all the significant hits we

found, we cannot draw too many conclusions about a feature’s significance in placing a given

patient into a given cluster, without doing some kind of supervised learning or validation. Using

FeaLect as a method for feature selection (79), we identified features that can be used as

classifiers for assigning one of the three clusters to any given patient. Since this is a multi-class

problem, the algorithm was used on 3 separate models, each for one of the clusters. In other

words, in each model, the output variable was converted to a binary classification of the patient

belonging to a given cluster or not. As a result, Figure 15 and Table 5 list the 3 sets of features

that are important in classifying a patient into one of the three clusters. The top clinical features

across the 3 sets of variables in Figure 15 and Table 5 are gender and the classical KD

symptoms. This is not surprising as these variables previously showed contrasting levels across

the 3 clusters (e.g. gender and lymphadenopathy are the two variables that differ the most

between clusters 1 and 3). A number of clinical laboratory test features also made the list, as

they too appeared to differ strongly amongst the different groups of patients. These include

‘Platelet Count’, ‘Duration of Fever’, and ‘Neutrophils’, amongst others, all of which play a big

role in drawing a line between adaptive and innate immunity that we observed in this cohort.
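
The conversion of the 3-cluster assignment into 3 separate binary problems, described earlier in this section, can be sketched as follows. This is a minimal Python illustration with a hypothetical function name and toy labels; it shows only how the binary output variables could be built, and does not reproduce the FeaLect feature scoring itself.

```python
def one_vs_rest_targets(cluster_labels):
    """Build one binary target vector per cluster.

    For each cluster, a patient is coded 1 if assigned to that cluster
    and 0 otherwise, giving one binary classification problem per cluster.
    """
    clusters = sorted(set(cluster_labels))
    return {
        cluster: [1 if label == cluster else 0 for label in cluster_labels]
        for cluster in clusters
    }


# Hypothetical cluster assignments for six patients.
assignments = [1, 2, 3, 1, 3, 2]
for cluster, target in one_vs_rest_targets(assignments).items():
    print(cluster, target)
```

Each binary vector would then serve as the output variable of its own model, so that a separate set of informative features is obtained for each cluster.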

Moving on to the biological variables that have been identified as informative, IFNAR1

(Interferon alpha and beta Receptor subunit 1) is one of the several genes that stands out in

relation to KD. This gene encodes one of the subunits of the IFNAR receptor, which is known to

bind type I interferons (102). Type I interferons, such as IFN-α and IFN-β, often induce anti-viral

and immune system modulating effects in response to viral or bacterial infections (102). This

subclass of cytokines also happens to be on the other end of the balance for autoinflammation

from IL-1 signaling (103), which we know is an important immune axis in KD (93). As they

counter-regulate each other, balance between Type I interferon activity and IL-1 signaling may

have implications for KD pathology. Figure 16 shows the relative expression of IFNAR1

between the clusters; clusters 1 and 3 show higher relative expression, which is in accordance

with their innate immune response state. Previous studies have also found increased interferon

type I induced gene regulation in coronary arteries of KD patients (104).

Another gene in Figure 15 and Table 5 (with a heatmap showing expression in Figure 16) that is

worth noting is GPD1L. GPD1L encodes a sodium channel interacting protein that is expressed

in cardiac tissues (105). It is thought to affect levels of a sodium channel SCN5A on the cell

membrane, with mutations in the GPD1L gene linked to inherited arrhythmias, such as Brugada

syndrome (105). Even though no exact association has been previously reported between this

gene and KD, low sodium concentrations were previously linked to IVIG unresponsiveness in

the Kobayashi score that was previously described (41).

Another notable feature identified by FeaLect is S100A12 (under cluster 3

classifiers in Figure 15 and Table 5). Its relative expression in cluster 2 is lower, compared to the

other two clusters, as seen in Figure 16. The importance of S100A12 in our list of classifiers is

not surprising as it was previously linked to the acute stages of KD (106). The S100 proteins

belong to the calcium binding family, including S100A8 and S100A9, which together with

S100A12, are found in phagocytes (106). S100A12 in particular, is known for binding the RAGE

protein (a pro-inflammatory pattern recognition receptor), and leading to downstream production

of pro-inflammatory markers such as TNF-α and IL-1β (106). S100A12 being a good indicator

of inflammation (106) is supported in our results, as cluster 2 does have decreased inflammation.

Lastly, POLR2G, MRPL2, and VARS, encoding polymerase (RNA) II polypeptide G,

mitochondrial ribosomal protein L2, and valyl-tRNA

synthetase, respectively, are worth noting as they appear under the cluster 2 set of classifiers in

Table 5. They show higher levels of expression and are most likely part of the features selected

by FeaLect due to the increased metabolic profile that was identified in cluster 2 earlier (Figure

12).

5 Study Limitations

Several components of the experimental design and analysis can be improved in future studies of

KD utilizing SNF. First and foremost, increasing the number of patients can increase the power

to detect significant patterns in coronary outcome and IVIG unresponsiveness in KD patients.

Another consideration during experimental design would be to include other datasets, such as

methylation data, protein expression, and other clinical variables in SNF analysis. Using other

large datasets may uncover relevant patterns related to KD pathology that SNF can incorporate in

its fused matrix. Acquiring gene expression data from coronary tissue, in contrast to whole blood

used in our study, may also provide more insight into KD. Similarly, adjusting the composition

of the smaller clinical datasets (where each variable may have a larger effect on the final

network, compared to a single feature from a large gene expression dataset) should be taken into

consideration. On the same note, removing effects of potentially confounding variables, such as

gender, may improve the discoverability of important new patterns. Removing gender effects

from the datasets before running SNF may aid in generating patient clusters with fewer

confounding effects. Lastly, validating findings in another KD cohort and carrying out laboratory

experiments to confirm effects of some of the genes identified can further improve our

understanding of KD in future studies.

6 Conclusions

Our hypothesis stated that SNF can combine clinical and biological data to identify homogeneous

groups of KD patients, and our analysis supports this. We have identified 3 unique

clusters that not only were based on the clinical variables, but were generated based on

contributions from the gene expression dataset as well. Taken together, our results showed that

cluster 2 patients were uniquely identified as being in the later stages of disease with transition to

the adaptive immune response phase, in the form of decreased inflammation markers, increase in

lymphocytes, and decrease in neutrophils. Gene expression results for cluster 2 reflected the

clinical manifestations with increase in metabolic profile, protein synthesis, and HLA expression,

as well as decrease in genes related to innate immune response and inflammation. Clusters 1 and

3 displayed a much earlier stage of the disease that was reflective of an innate immune response,

with higher neutrophil percentages and increased levels of inflammatory markers and genes.

Despite being similar to each other across most of the clinical and laboratory measures, clusters 1

and 3 differed across the intensity of their gene expression patterns, with cluster 1 showing a

slightly more pronounced inflammatory response. An even more pronounced difference was

observed across IVIG non-responsiveness and coronary outcome, exhibiting a higher risk trend

for cluster 3 patients. Overall, our findings have identified patterns that were previously

associated with KD and show the power of modern computational techniques in bringing

together multiple datasets to drive analysis. In a completely data-driven way, SNF identified

clinically meaningful homogeneous groups of patients in a KD cohort, further confirming the

heterogeneity of the disease and giving us a new toolset for future studies of KD.

7 References

1. Taubert, K. A., A. H. Rowley, and S. T. Shulman. 1991. Nationwide survey of Kawasaki

disease and acute rheumatic fever. The Journal of pediatrics 119: 279-282.

2. Kato, H., T. Sugimura, T. Akagi, N. Sato, K. Hashino, Y. Maeno, T. Kazue, G. Eto, and

R. Yamakawa. 1996. Long-term consequences of Kawasaki disease. A 10- to 21-year

follow-up study of 594 patients. Circulation 94: 1379-1385.

3. Uehara, R., and E. D. Belay. 2012. Epidemiology of Kawasaki disease in Asia, Europe,

and the United States. Journal of epidemiology / Japan Epidemiological Association 22:

79-85.

4. Joffe, A., A. Kabani, and T. Jadavji. 1995. Atypical and complicated Kawasaki disease in

infants. Do we need criteria? The Western journal of medicine 162: 322-327.

5. Tseng, C. F., Y. C. Fu, L. S. Fu, H. Betau, and C. S. Chi. 2001. Clinical spectrum of

Kawasaki disease in infants. Zhonghua yi xue za zhi = Chinese medical journal; Free

China ed 64: 168-173.

6. Witt, M. T., L. L. Minich, J. F. Bohnsack, and P. C. Young. 1999. Kawasaki disease:

more patients are being diagnosed who do not meet American Heart Association criteria.

Pediatrics 104: e10.

7. Newburger, J. W., M. Takahashi, M. A. Gerber, M. H. Gewitz, L. Y. Tani, J. C. Burns, S.

T. Shulman, A. F. Bolger, P. Ferrieri, R. S. Baltimore, W. R. Wilson, L. M. Baddour, M.

E. Levison, T. J. Pallasch, D. A. Falace, K. A. Taubert, and the Committee on Rheumatic

Fever, Endocarditis, and Kawasaki Disease, Council on Cardiovascular Disease in the Young, American Heart Association. 2004. Diagnosis, treatment, and

long-term management of Kawasaki disease: a statement for health professionals from

the Committee on Rheumatic Fever, Endocarditis, and Kawasaki Disease, Council on

Cardiovascular Disease in the Young, American Heart Association. Pediatrics 114: 1708-

1733.

8. Holman, R. C., A. T. Curns, E. D. Belay, C. A. Steiner, and L. B. Schonberger. 2003.

Kawasaki syndrome hospitalizations in the United States, 1997 and 2000. Pediatrics 112:

495-501.

9. Chang, R. K. 2003. The incidence of Kawasaki disease in the United States did not

increase between 1988 and 1997. Pediatrics 111: 1124-1125.

10. Kim, G. B., J. W. Han, Y. W. Park, M. S. Song, Y. M. Hong, S. H. Cha, D. S. Kim, and

S. Park. 2014. Epidemiologic features of Kawasaki disease in South Korea: data from

nationwide survey, 2009-2011. The Pediatric infectious disease journal 33: 24-27.

11. Lue, H. C., L. R. Chen, M. T. Lin, L. Y. Chang, J. K. Wang, C. Y. Lee, and M. H. Wu.

2014. Epidemiological features of Kawasaki disease in Taiwan, 1976-2007: results of

five nationwide questionnaire hospital surveys. Pediatrics and neonatology 55: 92-96.

12. Chen, J. J., X. J. Ma, F. Liu, W. L. Yan, M. R. Huang, M. Huang, G. Y. Huang, and

Shanghai Kawasaki Disease Research Group. 2016. Epidemiologic Features of Kawasaki

Disease in Shanghai From 2008 Through 2012. The Pediatric infectious disease journal

35: 7-12.

13. Holman, R. C., E. D. Belay, K. Y. Christensen, A. M. Folkema, C. A. Steiner, and L. B.

Schonberger. 2010. Hospitalizations for Kawasaki syndrome among children in the

United States, 1997-2007. The Pediatric infectious disease journal 29: 483-488.

14. Holman, R. C., A. T. Curns, E. D. Belay, C. A. Steiner, P. V. Effler, K. L. Yorita, J.

Miyamura, S. Forbes, L. B. Schonberger, and M. Melish. 2005. Kawasaki syndrome in

Hawaii. The Pediatric infectious disease journal 24: 429-433.

15. Fujita, Y., Y. Nakamura, K. Sakata, N. Hara, M. Kobayashi, M. Nagai, H. Yanagawa,

and T. Kawasaki. 1989. Kawasaki disease in families. Pediatrics 84: 666-669.

16. Harada, F., M. Sada, T. Kamiya, Y. Yanase, T. Kawasaki, and T. Sasazuki. 1986. Genetic

analysis of Kawasaki syndrome. American journal of human genetics 39: 537-539.

17. Park, Y. W., J. W. Han, Y. M. Hong, J. S. Ma, S. H. Cha, T. C. Kwon, S. B. Lee, C. H.

Kim, J. S. Lee, and C. H. Kim. 2011. Epidemiological features of Kawasaki disease in

Korea, 2006-2008. Pediatrics international : official journal of the Japan Pediatric

Society 53: 36-39.

18. Huang, W. C., L. M. Huang, I. S. Chang, L. Y. Chang, B. L. Chiang, P. J. Chen, M. H.

Wu, H. C. Lue, C. Y. Lee, and Kawasaki Disease Research Group. 2009. Epidemiologic

features of Kawasaki disease in Taiwan, 2003-2006. Pediatrics 123: e401-405.

19. Ma, X. J., C. Y. Yu, M. Huang, S. B. Chen, M. R. Huang, G. Y. Huang, and Shanghai

Kawasaki Research Group. 2010. Epidemiologic features of Kawasaki disease in Shanghai from

2003 through 2007. Chinese medical journal 123: 2629-2634.

20. Kawasaki, T. 2006. Kawasaki disease. Proceedings of the Japan Academy. Series B,

Physical and biological sciences 82: 59-71.

21. Gerding, R. 2011. Kawasaki disease: a review. Journal of pediatric health care : official

publication of National Association of Pediatric Nurse Associates & Practitioners 25:

379-387.

22. Son, M. B., and R. P. Sundel. 2016. Chapter 35 - Kawasaki Disease. In Textbook of Pediatric

Rheumatology (Seventh Edition). R. E. Petty, R. M. Laxer, C. B. Lindsley, and L. R. Wedderburn, eds.

W.B. Saunders, Philadelphia. 467-483.e466.

23. Tashiro, N., T. Matsubara, M. Uchida, K. Katayama, T. Ichiyama, and S. Furukawa.

2002. Ultrasonographic evaluation of cervical lymph nodes in Kawasaki disease.

Pediatrics 109: E77-77.

24. Anderson, M. S., J. Burns, T. A. Treadwell, B. A. Pietra, and M. P. Glode. 2001.

Erythrocyte sedimentation rate and C-reactive protein discrepancy and high prevalence of

coronary artery abnormalities in Kawasaki disease. The Pediatric infectious disease

journal 20: 698-702.

25. Burns, J. C., W. H. Mason, M. P. Glode, S. T. Shulman, M. E. Melish, C. Meissner, J.

Bastian, A. S. Beiser, H. M. Meyerson, and J. W. Newburger. 1991. Clinical and

epidemiologic characteristics of patients referred for evaluation of possible Kawasaki

disease. United States Multicenter Kawasaki Disease Study Group. The Journal of

pediatrics 118: 680-686.

26. Yeung, R. S. 2007. Phenotype and coronary outcome in Kawasaki's disease. Lancet 369:

85-87.

27. Suddleson, E. A., B. Reid, M. M. Woolley, and M. Takahashi. 1987. Hydrops of the

gallbladder associated with Kawasaki syndrome. Journal of pediatric surgery 22: 956-

959.

28. de Zorzi, A., S. D. Colan, K. Gauvreau, A. L. Baker, R. P. Sundel, and J. W. Newburger.

1998. Coronary artery dimensions may be misclassified as normal in Kawasaki disease.

The Journal of pediatrics 133: 254-258.

29. Kurotobi, S., T. Nagai, N. Kawakami, and T. Sano. 2002. Coronary diameter in normal

infants, children and patients with Kawasaki disease. Pediatrics international : official

journal of the Japan Pediatric Society 44: 1-4.

30. Bayry, J., V. S. Negi, and S. V. Kaveri. 2011. Intravenous immunoglobulin therapy in

rheumatic diseases. Nature reviews. Rheumatology 7: 349-359.

31. Gelfand, E. W. 2012. Intravenous immune globulin in autoimmune and inflammatory

diseases. The New England journal of medicine 367: 2015-2025.

32. Tremoulet, A. H., B. M. Best, S. Song, S. Wang, E. Corinaldesi, J. R. Eichenfield, D. D.

Martin, J. W. Newburger, and J. C. Burns. 2008. Resistance to intravenous

immunoglobulin in children with Kawasaki disease. The Journal of pediatrics 153: 117-

121.

33. Durongpisitkul, K., V. J. Gururaj, J. M. Park, and C. F. Martin. 1995. The prevention of

coronary artery aneurysm in Kawasaki disease: a meta-analysis on the efficacy of aspirin

and immunoglobulin treatment. Pediatrics 96: 1057-1061.

34. Tremoulet, A. H., J. Dutkowski, Y. Sato, J. T. Kanegaye, X. B. Ling, and J. C. Burns.

2015. Novel data-mining approach identifies biomarkers for diagnosis of Kawasaki

disease. Pediatric research.

35. Sudo, D., Y. Monobe, M. Yashiro, M. N. Mieno, R. Uehara, K. Tsuchiya, T. Sonobe, and

Y. Nakamura. 2012. Coronary artery lesions of incomplete Kawasaki disease: a

nationwide survey in Japan. European journal of pediatrics 171: 651-656.

36. Asai, T. 1983. Evaluation Method for the degree of seriousness in Kawasaki Disease.

Pediatrics International 25: 170-175.

37. Nakano, H., K. Ueda, A. Saito, Y. Tsuchitani, J. Kawamori, T. Miyake, and T. Yoshida.

1986. Scoring method for identifying patients with Kawasaki disease at high risk of

coronary artery aneurysms. The American journal of cardiology 58: 739-742.

38. Iwasa, M., K. Sugiyama, T. Ando, H. Nomura, T. Katoh, and Y. Wada. 1987. Selection

of high-risk children for immunoglobulin therapy in Kawasaki disease. Progress in

clinical and biological research 250: 543-544.

39. Harada, K. 1991. Intravenous gamma-globulin treatment in Kawasaki disease. Acta

paediatrica Japonica; Overseas edition 33: 805-810.

40. Beiser, A. S., M. Takahashi, A. L. Baker, R. P. Sundel, and J. W. Newburger. 1998. A

predictive instrument for coronary artery aneurysms in Kawasaki disease. US Multicenter

Kawasaki Disease Study Group. The American journal of cardiology 81: 1116-1120.

41. Kobayashi, T., Y. Inoue, K. Takeuchi, Y. Okada, K. Tamura, T. Tomomasa, T.

Kobayashi, and A. Morikawa. 2006. Prediction of intravenous immunoglobulin

unresponsiveness in patients with Kawasaki disease. Circulation 113: 2606-2612.

42. Kobayashi, T., T. Saji, T. Otani, K. Takeuchi, T. Nakamura, H. Arakawa, T. Kato, T.

Hara, K. Hamaoka, S. Ogawa, M. Miura, Y. Nomura, S. Fuse, F. Ichida, M. Seki, R.

Fukazawa, C. Ogawa, K. Furuno, H. Tokunaga, S. Takatsuki, S. Hara, A. Morikawa, and

RAISE study group investigators. 2012. Efficacy of immunoglobulin plus prednisolone for prevention

of coronary artery abnormalities in severe Kawasaki disease (RAISE study): a

randomised, open-label, blinded-endpoints trial. Lancet 379: 1613-1620.

43. Hui-Yuen, J. S., T. T. Duong, and R. S. Yeung. 2006. TNF-alpha is necessary for

induction of coronary artery inflammation and aneurysm formation in an animal model of

Kawasaki disease. Journal of immunology 176: 6294-6301.

44. Rowley, A. H., S. T. Shulman, B. T. Spike, C. A. Mask, and S. C. Baker. 2001.

Oligoclonal IgA response in the vascular wall in acute Kawasaki disease. Journal of

immunology 166: 1334-1343.

45. Brown, T. J., S. E. Crawford, M. L. Cornwall, F. Garcia, S. T. Shulman, and A. H.

Rowley. 2001. CD8 T lymphocytes and macrophages infiltrate coronary artery

aneurysms in acute Kawasaki disease. The Journal of infectious diseases 184: 940-943.

46. Lau, A. C., T. T. Duong, S. Ito, and R. S. Yeung. 2008. Matrix metalloproteinase 9

activity leads to elastin breakdown in an animal model of Kawasaki disease. Arthritis and

rheumatism 58: 854-863.

47. Kikuta, H., Y. Sakiyama, S. Matsumoto, I. Hamada, M. Yazaki, T. Iwaki, and M.

Nakano. 1993. Detection of Epstein-Barr virus DNA in cardiac and aortic tissues from

chronic, active Epstein-Barr virus infection associated with Kawasaki disease-like

coronary artery aneurysms. The Journal of pediatrics 123: 90-92.

48. Matsuno, S., E. Utagawa, and A. Sugiura. 1983. Association of rotavirus infection with

Kawasaki syndrome. The Journal of infectious diseases 148: 177.

49. Holm, J. M., L. K. Hansen, and H. Oxhoj. 1995. Kawasaki disease associated with

parvovirus B19 infection. European journal of pediatrics 154: 633-634.

50. Okano, M., G. M. Thiele, Y. Sakiyama, S. Matsumoto, and D. T. Purtilo. 1990.

Adenovirus infection in patients with Kawasaki disease. Journal of medical virology 32:

53-57.

51. Normann, E., J. Naas, J. Gnarpe, H. Backman, and H. Gnarpe. 1999. Demonstration of

Chlamydia pneumoniae in cardiovascular tissues from children with Kawasaki disease.

The Pediatric infectious disease journal 18: 72-73.

52. Lehman, T. J., S. M. Walker, V. Mahnovski, and D. McCurdy. 1985. Coronary arteritis in

mice following the systemic injection of group B Lactobacillus casei cell walls in

aqueous suspension. Arthritis and rheumatism 28: 652-659.

53. Duong, T. T., E. D. Silverman, M. V. Bissessar, and R. S. Yeung. 2003. Superantigenic

activity is responsible for induction of coronary arteritis in mice: an animal model of

Kawasaki disease. International immunology 15: 79-89.

54. Onouchi, Y. 2009. Molecular genetics of Kawasaki disease. Pediatric research 65: 46R-

54R.

55. Pulst, S. M. 1999. Genetic linkage analysis. Archives of neurology 56: 667-672.

56. Bush, W. S., and J. H. Moore. 2012. Chapter 11: Genome-wide association studies. PLoS

computational biology 8: e1002822.

57. Reich, D. E., and E. S. Lander. 2001. On the allelic spectrum of human disease. Trends in

genetics : TIG 17: 502-510.

58. Kuhn, K., S. C. Baker, E. Chudin, M. H. Lieu, S. Oeser, H. Bennett, P. Rigault, D.

Barker, T. K. McDaniel, and M. S. Chee. 2004. A novel, high-performance random array

platform for quantitative gene expression profiling. Genome research 14: 2347-2356.

59. Robinson, M. D., and T. P. Speed. 2007. A comparison of Affymetrix gene expression

arrays. BMC bioinformatics 8: 449.

60. Kato, S., M. Kimura, K. Tsuji, S. Kusakawa, T. Asai, T. Juji, and T. Kawasaki. 1978.

HLA antigens in Kawasaki disease. Pediatrics 61: 252-255.

61. Kamizono, S., A. Yamada, T. Higuchi, H. Kato, and K. Itoh. 1999. Analysis of tumor

necrosis factor-alpha production and polymorphisms of the tumor necrosis factor-alpha

gene in individuals with a history of Kawasaki disease. Pediatrics international : official

journal of the Japan Pediatric Society 41: 341-345.

62. Burns, J. C., C. Shimizu, H. Shike, J. W. Newburger, R. P. Sundel, A. L. Baker, T.

Matsubara, Y. Ishikawa, V. A. Brophy, S. Cheng, M. A. Grow, L. L. Steiner, N. Kono,

and R. M. Cantor. 2005. Family-based association analysis implicates IL-4 in

susceptibility to Kawasaki disease. Genes and immunity 6: 438-444.

63. Ohno, T., H. Igarashi, K. Inoue, K. Akazawa, K. Joho, and T. Hara. 2000. Serum

vascular endothelial growth factor: a new predictive indicator for the occurrence of

coronary artery lesions in Kawasaki disease. European journal of pediatrics 159: 424-

429.

64. Senzaki, H., S. Masutani, J. Kobayashi, T. Kobayashi, H. Nakano, H. Nagasaka, N.

Sasaki, H. Asano, S. Kyo, and Y. Yokote. 2001. Circulating matrix metalloproteinases

and their inhibitors in patients with Kawasaki disease. Circulation 104: 860-863.

65. Cheung, Y. F., G. Y. Huang, S. B. Chen, X. Q. Liu, L. Xi, X. C. Liang, M. R. Huang, S.

Chen, L. S. Huang, X. Q. Liu, K. W. Chan, and Y. L. Lau. 2008. Inflammatory gene

polymorphisms and susceptibility to Kawasaki disease and its arterial sequelae. Pediatrics

122: e608-614.

66. Onouchi, Y., M. Tamari, A. Takahashi, T. Tsunoda, M. Yashiro, Y. Nakamura, H.

Yanagawa, K. Wakui, Y. Fukushima, T. Kawasaki, Y. Nakamura, and A. Hata. 2007. A

genomewide linkage analysis of Kawasaki disease: evidence for linkage to chromosome

12. Journal of human genetics 52: 179-190.

67. Onouchi, Y., T. Gunji, J. C. Burns, C. Shimizu, J. W. Newburger, M. Yashiro, Y.

Nakamura, H. Yanagawa, K. Wakui, Y. Fukushima, F. Kishi, K. Hamamoto, M. Terai,

Y. Sato, K. Ouchi, T. Saji, A. Nariai, Y. Kaburagi, T. Yoshikawa, K. Suzuki, T. Tanaka,

T. Nagai, H. Cho, A. Fujino, A. Sekine, R. Nakamichi, T. Tsunoda, T. Kawasaki, Y.

Nakamura, and A. Hata. 2008. ITPKC functional polymorphism associated with

Kawasaki disease susceptibility and formation of coronary artery aneurysms. Nature

genetics 40: 35-42.

68. Onouchi, Y., K. Ozaki, J. C. Burns, C. Shimizu, H. Hamada, T. Honda, M. Terai, A.

Honda, T. Takeuchi, S. Shibuta, T. Suenaga, H. Suzuki, K. Higashi, K. Yasukawa, Y.

Suzuki, K. Sasago, Y. Kemmotsu, S. Takatsuki, T. Saji, T. Yoshikawa, T. Nagai, K.

Hamamoto, F. Kishi, K. Ouchi, Y. Sato, J. W. Newburger, A. L. Baker, S. T. Shulman,

A. H. Rowley, M. Yashiro, Y. Nakamura, K. Wakui, Y. Fukushima, A. Fujino, T.

Tsunoda, T. Kawasaki, A. Hata, Y. Nakamura, and T. Tanaka. 2010. Common variants in

CASP3 confer susceptibility to Kawasaki disease. Human molecular genetics 19: 2898-

2906.

69. Khor, C. C., S. Davila, W. B. Breunis, Y. C. Lee, C. Shimizu, V. J. Wright, R. S. Yeung,

D. E. Tan, K. S. Sim, J. J. Wang, T. Y. Wong, J. Pang, P. Mitchell, R. Cimaz, N. Dahdah,

Y. F. Cheung, G. Y. Huang, W. Yang, I. S. Park, J. K. Lee, J. Y. Wu, M. Levin, J. C.

Burns, D. Burgner, T. W. Kuijpers, M. L. Hibberd, Hong Kong-Shanghai Kawasaki

Disease Genetics Consortium, Korean Kawasaki Disease Genetics Consortium, Taiwan Kawasaki

Disease Genetics Consortium, International Kawasaki Disease Genetics Consortium, US Kawasaki

Disease Genetics Consortium, and Blue Mountains Eye Study. 2011. Genome-wide association study identifies FCGR2A as a

susceptibility locus for Kawasaki disease. Nature genetics 43: 1241-1246.

70. Onouchi, Y., K. Ozaki, J. C. Burns, C. Shimizu, M. Terai, H. Hamada, T. Honda, H.

Suzuki, T. Suenaga, T. Takeuchi, N. Yoshikawa, Y. Suzuki, K. Yasukawa, R. Ebata, K.

Higashi, T. Saji, Y. Kemmotsu, S. Takatsuki, K. Ouchi, F. Kishi, T. Yoshikawa, T.

Nagai, K. Hamamoto, Y. Sato, A. Honda, H. Kobayashi, J. Sato, S. Shibuta, M.

Miyawaki, K. Oishi, H. Yamaga, N. Aoyagi, S. Iwahashi, R. Miyashita, Y. Murata, K.

Sasago, A. Takahashi, N. Kamatani, M. Kubo, T. Tsunoda, A. Hata, Y. Nakamura, T.

Tanaka, Japan Kawasaki Disease Genome Consortium, and US Kawasaki Disease Genetics Consortium. 2012. A

genome-wide association study identifies three new risk loci for Kawasaki disease.

Nature genetics 44: 517-521.

71. Lee, Y. C., H. C. Kuo, J. S. Chang, L. Y. Chang, L. M. Huang, M. R. Chen, C. D. Liang,

H. Chi, F. Y. Huang, M. L. Lee, Y. C. Huang, B. Hwang, N. C. Chiu, K. P. Hwang, P. C.

Lee, L. C. Chang, Y. M. Liu, Y. J. Chen, C. H. Chen, Taiwan Pediatric ID Alliance, Y. T.

Chen, F. J. Tsai, and J. Y. Wu. 2012. Two new susceptibility loci for Kawasaki disease

identified through genome-wide association analysis. Nature genetics 44: 522-525.

72. Wang, B., A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and

A. Goldenberg. 2014. Similarity network fusion for aggregating data types on a genomic

scale. Nature methods 11: 333-337.

73. Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette,

A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. 2005. Gene

set enrichment analysis: a knowledge-based approach for interpreting genome-wide

expression profiles. Proceedings of the National Academy of Sciences of the United

States of America 102: 15545-15550.

74. Beissbarth, T., and T. P. Speed. 2004. GOstat: find statistically overrepresented Gene

Ontologies within a group of genes. Bioinformatics 20: 1464-1465.

75. Khatri, P., P. Bhavsar, G. Bawa, and S. Draghici. 2004. Onto-Tools: an ensemble of web-

accessible, ontology-based tools for the functional design and interpretation of high-

throughput gene expression experiments. Nucleic acids research 32: W449-456.

76. Huang, D. W., B. T. Sherman, and R. A. Lempicki. 2009. Systematic and integrative

analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 4:

44-57.

77. Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis,

K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A.

Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G.

Sherlock. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology

Consortium. Nature genetics 25: 25-29.

78. Bousquet, O., G. Raetsch, and U. von Luxburg. 2004. Advanced Lectures on Machine

Learning. Springer-Verlag Berlin Heidelberg.

79. Zare, H., G. Haffari, A. Gupta, and R. R. Brinkman. 2013. Scoring relevancy of features

based on combinatorial analysis of Lasso with application to lymphoma diagnosis. BMC

Genomics 14: 1-9.

80. Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the

Royal Statistical Society. Series B (Methodological): 267-288.

81. Bach, F. R. 2008. Bolasso: model consistent Lasso estimation through the bootstrap.

CoRR abs/0804.1302.

82. Hoang, L. T., C. Shimizu, L. Ling, A. N. Naim, C. C. Khor, A. H. Tremoulet, V. Wright,

M. Levin, M. L. Hibberd, and J. C. Burns. 2014. Global gene expression profiling

identifies new therapeutic targets in acute Kawasaki disease. Genome medicine 6: 541.

83. Bao, R., L. Huang, J. Andrade, W. Tan, W. A. Kibbe, H. Jiang, and G. Feng. 2014.

Review of current methods, applications, and data management for the bioinformatics

analysis of whole exome sequencing. Cancer informatics 13: 67-82.

84. von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17:

395-416.

85. Pruitt, K. D., T. Tatusova, and D. R. Maglott. 2005. NCBI Reference Sequence (RefSeq):

a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic

acids research 33: D501-504.

86. Saguil, A., M. Fargo, and S. Grogan. 2015. Diagnosis and management of Kawasaki

disease. American family physician 91: 365-371.

87. Hajian Tilaki, K. 2012. Methodological issues of confounding in analytical

epidemiologic studies. Caspian journal of internal medicine 3: 488-495.

88. Daniels, S. R., B. Specker, T. E. Capannari, D. C. Schwartz, M. J. Burke, and S. Kaplan.

1987. Correlates of coronary artery aneurysm formation in patients with Kawasaki

disease. American journal of diseases of children 141: 205-207.

89. Ichida, F., N. S. Fatica, M. A. Engle, J. E. O'Loughlin, A. A. Klein, M. S. Snyder, K. H.

Ehlers, and A. R. Levin. 1987. Coronary artery involvement in Kawasaki syndrome in

Manhattan, New York: risk factors and role of aspirin. Pediatrics 80: 828-835.

90. Koren, G., S. Lavi, V. Rose, and R. Rowe. 1986. Kawasaki disease: review of risk factors

for coronary aneurysms. The Journal of pediatrics 108: 388-392.

91. Foell, D., F. Ichida, T. Vogl, X. Yu, R. Chen, T. Miyawaki, C. Sorg, and J. Roth. 2003.

S100A12 (EN-RAGE) in monitoring Kawasaki disease. Lancet 361: 1270-1272.

92. Schulte, D. J., A. Yilmaz, K. Shimada, M. C. Fishbein, E. L. Lowe, S. Chen, M. Wong,

T. M. Doherty, T. Lehman, T. R. Crother, R. Sorrentino, and M. Arditi. 2009.

Involvement of innate and adaptive immunity in a murine model of coronary arteritis

mimicking Kawasaki disease. Journal of immunology 183: 5311-5318.

93. Leung, D. Y., R. S. Cotran, E. Kurt-Jones, J. C. Burns, J. W. Newburger, and J. S. Pober.

1989. Endothelial cell activation and high interleukin-1 secretion in the pathogenesis of

acute Kawasaki disease. Lancet 2: 1298-1302.

94. Nimmerjahn, F., and J. V. Ravetch. 2008. Fcgamma receptors as regulators of immune

responses. Nature reviews. Immunology 8: 34-47.

95. Boraschi, D., and A. Tagliabue. 2013. The interleukin-1 receptor family. Seminars in

immunology 25: 394-407.

96. Cohen, S., C. E. Tacke, B. Straver, N. Meijer, I. M. Kuipers, and T. W. Kuijpers. 2012. A

child with severe relapsing Kawasaki disease rescued by IL-1 receptor blockade and

extracorporeal membrane oxygenation. Annals of the rheumatic diseases 71: 2059-2061.

97. Ganeshan, K., and A. Chawla. 2014. Metabolic regulation of immune responses. Annual

review of immunology 32: 609-634.

98. Medzhitov, R., and T. Horng. 2009. Transcriptional control of the inflammatory

response. Nature reviews. Immunology 9: 692-703.

99. Mellins, E. D., and L. J. Stern. 2014. HLA-DM and HLA-DO, key regulators of MHC-II

processing and presentation. Current opinion in immunology 26: 115-122.

100. Sadegh-Nasseri, S., M. Chen, K. Narayan, and M. Bouvier. 2008. The convergent roles

of tapasin and HLA-DM in antigen presentation. Trends in immunology 29: 141-147.

101. Denzin, L. K. 2013. Inhibition of HLA-DM Mediated MHC Class II Peptide Loading by

HLA-DO Promotes Self Tolerance. Frontiers in immunology 4: 465.

102. McNab, F., K. Mayer-Barber, A. Sher, A. Wack, and A. O'Garra. 2015. Type I

interferons in infectious disease. Nature reviews. Immunology 15: 87-103.

103. van Kempen, T. S., M. H. Wenink, E. F. Leijten, T. R. Radstake, and M. Boes. 2015.

Perception of self: distinguishing autoimmunity from autoinflammation. Nature reviews.

Rheumatology 11: 483-492.

104. Rowley, A. H., K. M. Wylie, K. Y. Kim, A. J. Pink, A. Yang, R. Reindel, S. C. Baker, S.

T. Shulman, J. M. Orenstein, M. W. Lingen, G. M. Weinstock, and T. N. Wylie. 2015.

The transcriptional profile of coronary arteritis in Kawasaki disease. BMC Genomics 16:

1076.

105. London, B., M. Michalec, H. Mehdi, X. Zhu, L. Kerchner, S. Sanyal, P. C. Viswanathan,

A. E. Pfahnl, L. L. Shang, M. Madhusudanan, C. J. Baty, S. Lagana, R. Aleong, R.

Gutmann, M. J. Ackerman, D. M. McNamara, R. Weiss, and S. C. Dudley, Jr. 2007.

Mutation in glycerol-3-phosphate dehydrogenase 1 like gene (GPD1-L) decreases cardiac

Na+ current and causes inherited arrhythmias. Circulation 116: 2260-2268.

106. Foell, D., and J. Roth. 2004. Proinflammatory S100 proteins in arthritis and autoimmune

disease. Arthritis and rheumatism 50: 3762-3771.
