Integrating Biologic and Clinical Data towards Resolving Heterogeneity in Childhood Inflammatory Diseases
by
Andrey Mikhaylov
A thesis submitted in conformity with the requirements for the degree of Master of Science
Department of Immunology University of Toronto
© Copyright by Andrey Mikhaylov 2016
Integrating Biologic and Clinical Data towards Resolving
Heterogeneity in Childhood Inflammatory Diseases
Andrey Mikhaylov
Master of Science
Department of Immunology
University of Toronto
2016
Abstract
Kawasaki disease (KD) is the leading cause of acquired heart disease in children from the
developed world, with up to 25% risk of developing aneurysms if untreated. Diagnosis uses a set
of classical clinical symptoms, which fail to capture the heterogeneity in KD. One solution is to
incorporate new biomarkers and expanded biologic datasets to generate new predictive models
that can better discern homogeneous groups of patients. Using Similarity Network Fusion (SNF),
a novel computational technique, we uncovered 3 robust clusters of patients after fusing gene
expression and clinical datasets for 171 KD patients. The first cluster comprised older females with
marked activation of the innate immune response; the second comprised patients with prolonged fever
and markers of activation of the adaptive response; and the third comprised males without
lymphadenopathy and with a less severe innate immune response. SNF identified clinically meaningful
clusters of patients and is a promising new tool for future KD studies.
Acknowledgments
I would like to express my deepest gratitude to my supervisor Dr. Rae Yeung for all the guidance
and support during my time in the lab. I would also like to acknowledge and thank my committee
members Dr. Pamela Ohashi, Dr. Anna Goldenberg, and Dr. Shannon Dunn, for providing
invaluable feedback for my thesis project. A huge thanks to Dr. Trang Duong for having patience
with me and helping me at every step of this journey. Lastly, I am very grateful for meeting
every single member of the Yeung lab – thank you for all the help and the fun times!
Table of Contents
Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1. Introduction
1.1 Kawasaki Disease - overview and epidemiology
1.1.1 Overview
1.1.2 Incidence rates
1.1.3 Seasonal outbreaks
1.2 Kawasaki Disease - Diagnosis and Treatment
1.2.1 Clinical symptoms
1.2.2 Laboratory tests
1.2.3 Extra-cardiac findings
1.2.4 Cardiac findings
1.2.5 KD Treatment
1.2.6 AHA Diagnostic criteria sensitivity and specificity
1.2.7 Risk scoring systems
1.3 Etiology
1.3.1 Immune response
1.3.2 Environmental triggers
1.4 Translational studies
1.4.1 Linkage analysis
1.4.2 Genome-wide association studies (GWAS)
1.4.3 Gene expression
1.5 Post-translational studies in children
1.5.1 Candidate gene approach
1.5.2 ITPKC and CASP3
1.5.3 FCGR2A
1.5.4 MHCII, CD40, and BLK
1.5.5 Summary of findings
1.6 Computational Analysis
1.6.1 Introduction
1.6.2 Data aggregation
1.6.3 Approach to computational analysis
1.6.4 Similarity network fusion
1.6.5 Gene enrichment analysis
1.6.6 Feature selection and classifiers
1.6.7 Heterogeneity in KD
1.6.8 Rationale
1.6.9 Hypothesis and objectives
2 Methods
2.1 KD Cohort
2.2 Gene expression microarray
2.3 Datasets
2.4 Computational analysis workflow
2.5 Data pre-processing
2.6 Similarity network fusion
2.7 Gene enrichment analysis
2.8 Co-clustering probability
2.9 Statistical analysis
2.10 FeaLect feature selection
3 Results
3.1 KD cohort and data pre-processing
3.2 Three unique clusters were identified after aggregation of clinical and gene expression datasets with SNF
3.3 High robustness and low clinical feature sensitivity amongst the 3 clusters
3.4 Unique clinical profiles characterize the 3 clusters
3.5 Unique gene expression profiles characterize the 3 clusters
3.6 Variation in treatment response and coronary outcome across the 3 clusters
3.7 Unique clinical and biological classifiers for predicting cluster assignment
4 Discussion
5 Study Limitations
6 Conclusions
7 References
List of Tables
Table 1. Laboratory measures and clinical characteristics for the KD cohort.
Table 2. Biologic and clinical datasets used for SNF analysis.
Table 3. List of group 1 significant GO terms from the DAVID gene enrichment analysis.
Table 4. List of group 2 significant GO terms from the DAVID gene enrichment analysis.
Table 5. Description of FeaLect classifiers for predicting cluster assignment.
List of Figures
Figure 1. Similarity Network Fusion Algorithm.
Figure 2. Illustration of Gene Ontology hierarchy.
Figure 3. Steps involved in the computational analysis of the KD cohort.
Figure 4. Three distinct clusters of patients recovered using SNF.
Figure 5. SNF displays high robustness in identifying the three clusters in response to removal of patients.
Figure 6. The 3 clusters are most sensitive to removal of ‘Proportion Male’ and ‘Lymphadenopathy’ clinical variables.
Figure 7. Unique clinical and demographic profiles characterize the three clusters.
Figure 8. Ethnic group profiles are similar across the 3 clusters.
Figure 9. Unique laboratory test profiles characterize the three clusters.
Figure 10. The 3 clusters are characterized by unique gene expression profiles.
Figure 11. Group 1 genes are representative of inflammation and immune response related GO terms.
Figure 12. Group 2 genes are representative of metabolism related GO terms.
Figure 13. The 3 clusters vary with respect to treatment response and disease outcome.
Figure 14. FeaLect total feature scores.
Figure 15. Informative clinical and biologic features identified with FeaLect.
Figure 16. Relative gene expression profiles of biologic variables for each set of features extracted with FeaLect.
List of Abbreviations
AHA American Heart Association
ALT Alanine Transaminase
AST Aspartate Aminotransferase
BLK B Lymphocyte Kinase
CAA Coronary Artery Aneurysm
CAL Coronary Artery Lesions
CASP3 Caspase 3
CD40L CD40 Ligand
cDNA Complementary DNA
CRP C-Reactive Protein
DAVID Database for Annotation, Visualization and Integrated Discovery
DC Dendritic Cell
DC-SIGN Dendritic Cell-Specific Intercellular adhesion molecule-3-Grabbing Non-integrin
EBV Epstein-Barr Virus
ESR Erythrocyte Sedimentation Rate
FAM167A Family with Sequence Similarity 167, Member A
FCGR1C Fc Fragment Of IgG, High Affinity Ic, Receptor
FCGR2A Low affinity immunoglobulin gamma Fc region receptor II-a
FcγR Fc-Gamma Receptor
G-CSF Granulocyte-Colony Stimulating Factor
GGT Gammaglutamyl Transpeptidase
GO Gene Ontology
GPD1L Glycerol-3-Phosphate Dehydrogenase 1-Like
GSEA Gene Set Enrichment Analysis
GWAS Genome-Wide Association Study
HLA Human Leukocyte Antigen
HLA-DMA Major Histocompatibility Complex, Class II, DM Alpha
HLA-DMB Major Histocompatibility Complex, Class II, DM Beta
HLA-DOB Major Histocompatibility Complex, Class II, DO Beta
HLA-DQB2 Major Histocompatibility Complex, Class II, DQ Beta 2
IgG Immunoglobulin G
IL-17 Interleukin-17
IL-18R1 Interleukin-18 receptor 1
IL-1R2 Interleukin-1 receptor 2
IL-1RAP Interleukin 1 Receptor Accessory Protein
IL-1β Interleukin 1 Beta
IL-6 Interleukin 6
IP3 Inositol trisphosphate
ITPKC Inositol-1,4,5 Trisphosphate 3-Kinase C
IVIG Intravenous Immunoglobulin
KD Kawasaki Disease
LAD Left Anterior Descending artery
LASSO Least Absolute Shrinkage and Selection Operator
LMCA Left Main Coronary Artery
LSA Least Square Adaptive
MHCII Major Histocompatibility Complex II
MMP Matrix Metalloproteinase
MMP-9 Matrix Metalloproteinase-9
MRPL2 Mitochondrial Ribosomal Protein L2
PCA Principal Component Analysis
POLR2G Polymerase (RNA) II (DNA directed) polypeptide G
RCA Right Coronary Artery
S100A12 S100 Calcium Binding Protein A12
S100A8 S100 Calcium Binding Protein A8
S100A9 S100 Calcium Binding Protein A9
SCN5A Sodium Channel, Voltage Gated, Type V Alpha Subunit
SNF Similarity Network Fusion
SNP Single Nucleotide Polymorphism
Th17 T Helper 17 Cell
TNF-α Tumour Necrosis Factor Alpha
VARS Valyl-tRNA Synthetase
VEGF Vascular Endothelial Growth Factor
WBC White Blood Cell
1. Introduction
1.1 Kawasaki Disease - overview and epidemiology
1.1.1 Overview
Kawasaki Disease (KD) is an acute systemic vasculitis that predominantly occurs in children
under the age of 5 (1), with up to 25% of children developing aneurysms if left untreated (2). KD
is most common in Asian populations, with the highest incidence rate found in Japan, and shows
evidence of seasonal association with disease occurrence (3). KD is highly heterogeneous, and the
current set of clinical signs and symptoms used for diagnosis cannot distinguish homogeneous
groups of patients with respect to treatment response or disease outcome (4-6). The overlap of KD
with other infectious diseases necessitates better diagnostic and prognostic tools (7).
1.1.2 Incidence rates
KD incidence has a gender bias, with a male-to-female ratio of 1.5 to 1.7:1, and most of the
affected children (76%) are under the age of 5 (8, 9). Incidence of KD differs greatly around the
world. The highest rate is in Japan, with 239.6/100,000 children <5 years old (2009 nationwide
epidemiologic survey of KD) (3). The second highest incidence of KD in the world is in Korea, with
134.4/100,000 children <5 years of age (based on a nationwide survey in 2011) (10), followed by
Taiwan and Shanghai, China, with incidence rates of 66.24 (in 2006) and 30.3-71.9 (ranging from
2008 to 2012) per 100,000, respectively (11, 12). Incidence rates in North America, though lower
than in the Asian countries listed above, differ depending on ancestry. Children of Asian ethnicity
had the highest incidence, at 30.3/100,000 children <5 years old from 1997 to 2007 (13). Children
of African American and Hispanic ethnicities had the next highest incidence, at 17.5/100,000 and
15.7/100,000 respectively, while children of Caucasian origin had the lowest rate amongst all the
racial groups surveyed, at 12/100,000 (13). As further evidence supporting the genetic model of KD,
children of Japanese descent living outside of Japan had an incidence of 197.7/100,000 in a
retrospective analysis of KD patients in Hawaii from 1996 to 2001 (compared to 35.3/100,000 in
Caucasian children) (14). Furthermore, according to several Japanese studies, a 10-fold relative
risk (2.1%) was documented in siblings of KD patients, increasing to 13% if the siblings were
twins (15, 16).
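To make the comparison between groups concrete, the sketch below shows how an incidence rate per 100,000 and a rate ratio are computed. The two Hawaii rates are taken from the text above; the function names and the example case counts are hypothetical and serve only as illustration.

```python
def incidence_per_100k(cases: int, population: int) -> float:
    """Incidence expressed as cases per 100,000 children at risk."""
    return cases * 100_000 / population


def rate_ratio(rate_a: float, rate_b: float) -> float:
    """How many times higher rate_a is than rate_b."""
    return rate_a / rate_b


# Rates quoted in the text for Hawaii, 1996 to 2001 (per 100,000 children).
japanese_descent_rate = 197.7
caucasian_rate = 35.3
hawaii_ratio = rate_ratio(japanese_descent_rate, caucasian_rate)

# Hypothetical counts, for illustration only: 12 cases among 100,000 children.
example_rate = incidence_per_100k(12, 100_000)

print(f"Rate ratio (Japanese descent vs Caucasian): {hawaii_ratio:.1f}")
print(f"Example incidence: {example_rate:.1f} per 100,000")
```

Under these figures, the incidence in children of Japanese descent in Hawaii is between five and six times that of Caucasian children in the same survey.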
1.1.3 Seasonal outbreaks
Consistent with an infectious trigger of the disease, KD exhibits seasonal patterns of
occurrence. Japan, South Korea, and China have 2 peaks of disease incidence, in winter and
summer (3, 17). Shanghai, China, has peaks in spring and summer, while Taiwan, with the 3rd
highest incidence of KD, tends to peak in the summer months (18, 19). Winter and spring seasonal
peaks have also been observed in the US (3). Furthermore, epidemics of KD were recorded in Japan
in 1979, 1982, and 1986 (3). These reports, together with incidence rates, collectively
demonstrate the endemic and epidemic nature of Kawasaki disease.
1.2 Kawasaki Disease - Diagnosis and Treatment
1.2.1 Clinical symptoms
KD is diagnosed in North America using a set of non-specific symptoms, which include
prolonged fever and at least 4 of the 5 following principal clinical symptoms: bilateral
conjunctival injection, oral mucosal inflammation, polymorphous skin rash, extremity changes,
and cervical lymphadenopathy (7). In contrast, guidelines in Japan include fever in the list of
principal clinical symptoms, and as such, diagnosis is made based on 5 of 6 criteria (20). The
typical course of the disease can be divided into 3 stages: acute, subacute, and convalescent
(21). The acute stage is characterized by fever and presence of the classical KD symptoms
(lasting 1 to 2 weeks if not treated), followed by the subacute stage, in which fever and clinical
symptoms subside but biochemical evidence of inflammation persists (e.g., ESR and CRP) (21). The
convalescent stage, when the signs of illness have disappeared and inflammatory markers have
also subsided, typically starts 4 to 6 weeks after disease onset (21).
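The difference between the North American and Japanese criteria described above can be expressed as two small counting rules. This is a minimal sketch, not a clinical tool: the class and function names are hypothetical, and it ignores the exceptions discussed elsewhere in this chapter (such as diagnosis based on coronary abnormalities or incomplete KD).

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    """Presence or absence of each finding named in the text."""
    prolonged_fever: bool
    conjunctival_injection: bool
    oral_mucosal_inflammation: bool
    skin_rash: bool
    extremity_changes: bool
    cervical_lymphadenopathy: bool

    def principal_symptom_count(self) -> int:
        # Counts the 5 principal symptoms, excluding fever.
        return sum([
            self.conjunctival_injection,
            self.oral_mucosal_inflammation,
            self.skin_rash,
            self.extremity_changes,
            self.cervical_lymphadenopathy,
        ])


def meets_north_american_criteria(p: Presentation) -> bool:
    # Fever is mandatory, plus at least 4 of the 5 principal symptoms.
    return p.prolonged_fever and p.principal_symptom_count() >= 4


def meets_japanese_criteria(p: Presentation) -> bool:
    # Fever is one of 6 criteria; at least 5 of the 6 are required.
    return int(p.prolonged_fever) + p.principal_symptom_count() >= 5


# A patient with all 5 principal symptoms but no prolonged fever.
afebrile = Presentation(False, True, True, True, True, True)
print(meets_north_american_criteria(afebrile))
print(meets_japanese_criteria(afebrile))
```

The example patient satisfies the 5-of-6 rule but not the fever-plus-4 rule, which is the only case in which the two definitions, as summarized here, disagree.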
Fever is present in all KD patients at the onset of the disease and as such is a key diagnostic
criterion in North America. If untreated, it lasts on average for 11 days, but because its
duration is highly variable, it may extend up to 3 or 4 weeks in some patients (7).
Conjunctivitis is bilateral, with a frequency of 80 to 90% in KD patients (22). It may involve
various parts of the conjunctiva, but it typically spares the limbus (7). It is usually painless
and resolves fairly quickly (7). Frequency of oral mucosal inflammation in KD patients is also
around 80 to 90% (22). It manifests as swelling and cracking of the lips, a strawberry tongue,
and diffuse erythema of the oropharynx (22). Polymorphous skin rash occurs in more than 90% of
patients and appears within 5 days of the onset of fever (7, 22). Beginning at the trunk, it can
take many forms, such as a maculopapular eruption, a scarlatiniform rash, or an erythroderma, and
can be limited in distribution or generalized all over the body (7). The rash may also progress
to desquamation in the perineal region and in the extremities (7). Changes in the extremities
occur in 80% of patients (22). The first signs, including erythema and induration of the hands
and feet, last for a short time (1 to 3 days) within the acute phase of the disease, while
desquamation typically occurs 2 or 3 weeks after onset (7, 22). Cervical lymphadenopathy is
mostly unilateral, with varying number and size of affected nodes (23). It is the rarest of the
clinical symptoms, with an occurrence rate of around 50% (under-detection during palpation may
contribute to the low reported frequency) (22).
1.2.2 Laboratory tests
Since not all patients present with the full set of diagnostic symptoms, and there is overlap with
other febrile diseases due to the non-specificity of the diagnostic criteria, a number of laboratory
measures are used to aid KD diagnosis (7). About half of KD patients have increased white blood
cell (WBC) counts (> 15,000/mm3) due to inflammation (7). As fever is one of the key diagnostic
criteria for KD, the inflammation markers C-reactive protein (CRP) and Erythrocyte Sedimentation
Rate (ESR) are markedly increased for 6 to 10 weeks (7). Both markers are used because they
sometimes disagree, owing to differences in kinetics between the two measures and the potential
confounding effect of IVIG treatment on the sedimentation rate of erythrocytes, which affects the
ESR measure (7, 24). Increase in platelet count (thrombocytosis) is typically delayed, usually
starting 2 weeks after disease onset and often reaching 1,000,000/mm3 in the subacute phase (7).
Forty percent of patients exhibit increased levels of serum transaminases, such as
Gammaglutamyl Transpeptidase (GGT) and Alanine Transaminase (ALT) (25). Lower levels of
albumin, reflecting inflammation, are also common (7). Finally, anemia, with lower hemoglobin
levels, may develop in 50% of patients and is often correlated with prolonged fever duration
(7, 22).
1.2.3 Extra-cardiac findings
In addition to the classical KD diagnostic criteria and the array of laboratory measures, KD
patients often present with a number of other clinical findings. Over 7% of children develop
arthritis, involving multiple joints, at diagnosis (7, 26). Vomiting, diarrhea, abdominal pain, and
other gastrointestinal complaints are common (in 1/3 of patients) (7). Hepatic abnormalities, such
as liver enlargement, jaundice, and hydrops of the gallbladder (15% of patients), can also be
present (7, 27). Other common clinical manifestations may include aseptic meningitis,
colitis, urethritis and anterior uveitis (26).
1.2.4 Cardiac findings
Coronary aneurysms, which may develop during the acute phase or 2 weeks after it, are the hallmark
of the disease and pose a great risk to a patient, as they can clot or rupture, possibly leading to
myocardial infarction or death (22). As a result, having coronary abnormalities in the absence of
the rest of the classical KD symptoms is sufficient to diagnose a patient with KD (7). To account
for the variations in coronary artery measurements due to body size, coronary dimensions are
reported using z-scores (measures are made only for the Left Main Coronary Artery (LMCA), Left
Anterior Descending (LAD) artery, and Right Coronary Artery (RCA), where aneurysms are
most often associated with fatalities) (7). The measurements are adjusted for body surface area,
with a z-score greater than 2.5 statistically indicating an abnormality compared to the general
population (7, 28, 29). Aside from the main risk of coronary aneurysms, patients may also
develop a number of other cardiac conditions. Examples include myocarditis and
arrhythmias, both of which may also present in the acute stages of the disease (7).
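The z-score reporting described above can be illustrated with a short calculation. This is a sketch under stated assumptions: the expected diameter and standard deviation for a given body surface area would come from a published reference model, which is not reproduced here, so the numbers in the example are hypothetical placeholders. Only the 2.5 threshold comes from the text.

```python
ABNORMAL_Z_THRESHOLD = 2.5  # threshold quoted in the text


def coronary_z_score(measured_mm: float, expected_mm: float, sd_mm: float) -> float:
    """Standard deviations by which a measured coronary diameter differs
    from the diameter expected for the patient's body surface area."""
    if sd_mm <= 0:
        raise ValueError("standard deviation must be positive")
    return (measured_mm - expected_mm) / sd_mm


def is_abnormal(z: float) -> bool:
    return z > ABNORMAL_Z_THRESHOLD


# Hypothetical values: a 3.5 mm segment where 2.0 mm (SD 0.5 mm) is expected.
z = coronary_z_score(measured_mm=3.5, expected_mm=2.0, sd_mm=0.5)
print(z, is_abnormal(z))
```

Expressing the measurement in standard deviations rather than millimetres is what allows a single threshold to be applied to children of very different body sizes.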
1.2.5 KD Treatment
Intravenous Immunoglobulin (IVIG), administered at 2 g/kg together with high-dose aspirin,
is the main form of treatment in KD (7). Previous reports have attributed the effects of IVIG
in KD to reduction of key pro-inflammatory cytokines (IL-1β, IL-6, TNF-α), G-CSF, CRP, and
CD40L, amongst other targets (30, 31). Effects of IVIG on Treg expansion and Th17
downregulation, and its ability to neutralize superantigens, have also been implicated in some
studies (30, 31). The exact mechanism of action of IVIG in KD is still unknown. Because IVIG is
effective in many diseases other than KD, numerous modes of action have been proposed, including
complement binding, binding to DC-SIGN on DCs, FcγR interaction, and
neutralization/inhibition of soluble proteins or pathogens, to name a few (31). Unfortunately, up
to 20% of patients may not respond to IVIG treatment (32). Just as little is known about the
mechanism of action of IVIG in KD, the reasons for unresponsiveness are also poorly understood.
These patients undergo retreatment with IVIG, along with corticosteroids, in order to reduce the
prolonged fever duration and other symptoms (7). High-dose aspirin (80-100 mg/kg/day), the
efficacy of which is under debate (33), is part of the standard treatment for KD in North America
(during the acute stage of the disease) due to its anti-inflammatory effects (7). Once the
patient’s fever subsides, treatment is switched to a low-dose aspirin regimen (3-5 mg/kg/day),
which is used for its anti-platelet properties for 6-8 weeks after disease onset (7).
1.2.6 AHA Diagnostic criteria sensitivity and specificity
Due to the lack of specificity in the AHA set of diagnostic symptoms, KD may often be
mistaken for a number of other conditions. Examples cover a wide range of febrile
illnesses, including bacterial infections (e.g., scarlet fever), viral infections (e.g., measles
and EBV), and drug reactions (e.g., Stevens-Johnson syndrome) (34). The classical guidelines,
though effective, do not exhibit high specificity and sensitivity for diagnosing the disease (22).
Patients who do not meet the full AHA diagnostic criteria for KD have previously been linked to a
significantly increased chance of CAA development, both in a single center study (20% of 127
patients vs 7% with full diagnosis) (6) and in a nationwide survey in Japan (7.4% vs 2.5%) (35).
Such differences in disease outcome have been attributed to delay in IVIG treatment, because the
full AHA KD diagnostic criteria are not sensitive enough to identify KD patients in time (6).
Furthermore, the AHA criteria are not particularly good at discerning incomplete Kawasaki
disease, defined as febrile patients with fewer than 4 of the principal clinical features (7).
Incomplete KD is especially common in infants < 1 year old, with a previous study
reporting a 45% incidence rate amongst these patients compared to 12% for older KD patients (4).
Additional laboratory tests and the presence of other clinical findings may assist in diagnosing
incomplete KD, but as mentioned earlier, timing is of the essence in order to prevent aneurysm
formation.
1.2.7 Risk scoring systems
Over the years, a number of clinical scores have been devised in order to better identify KD
patients and predict disease outcome and treatment response. Early attempts were made by Asai
in 1983 (36), Nakano et al in 1986 (37), and Iwasa et al in 1987 (38). The first scoring system
did not utilize 2-D echocardiography, which is routinely used in diagnosis today, while the latter
two lacked statistical power (7). The Harada score was developed in Japan in 1991 in order to
determine whether a KD patient required IVIG treatment. Unlike under the current North
American treatment standard, not all KD patients received IVIG in Japan at that time, so a risk
stratification strategy and rational allocation of IVIG was needed (39). According to the score, in
order to qualify for IVIG treatment, patients had to fulfill 4 criteria within 9 days of disease onset
(39). The total set of criteria included elevated white blood cell count (> 12,000/mm3), increased
CRP levels (>3), platelet counts less than 350,000/mm3, albumin levels less than 3.5 g/dL, male
sex, and age less than 12 months (39).
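The Harada rule as summarized above amounts to counting how many criteria a patient fulfills. The sketch below encodes only the six criteria and thresholds listed in the text; the function and parameter names are hypothetical, the CRP unit is not stated in the text and is therefore left unspecified, and "4 criteria" is interpreted as at least 4. It is an illustration of the scoring logic, not a clinical tool.

```python
def harada_criteria_met(wbc_per_mm3: float, crp: float, platelets_per_mm3: float,
                        albumin_g_dl: float, is_male: bool, age_months: float) -> int:
    """Number of Harada criteria fulfilled, using the thresholds from the text."""
    checks = [
        wbc_per_mm3 > 12_000,         # elevated white blood cell count
        crp > 3,                      # increased CRP (unit as in the source)
        platelets_per_mm3 < 350_000,  # low platelet count
        albumin_g_dl < 3.5,           # low albumin
        is_male,                      # male sex
        age_months < 12,              # age under 12 months
    ]
    return sum(bool(c) for c in checks)


def qualifies_for_ivig(criteria_met: int, days_since_onset: int) -> bool:
    # Assumption: "4 criteria within 9 days" means at least 4, assessed by day 9.
    return days_since_onset <= 9 and criteria_met >= 4


# Hypothetical patient: a 30-month-old boy assessed on day 6 of illness.
n = harada_criteria_met(wbc_per_mm3=15_000, crp=5, platelets_per_mm3=300_000,
                        albumin_g_dl=3.8, is_male=True, age_months=30)
print(n, qualifies_for_ivig(n, days_since_onset=6))
```

In this example the patient fulfills four criteria (WBC, CRP, platelets, and sex), so the rule as described would allocate IVIG.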
Since IVIG treatment is recommended for all KD patients in North America, Beiser et al
developed a score aimed at predicting coronary outcome instead (40). The laboratory
measures used in the score consisted of baseline neutrophil and band counts, hemoglobin and
platelet levels, as well as body temperature on the day following IVIG administration (40).
However, although it was able to identify low-risk patients, the system did not perform well in
identifying individuals with a high risk of aneurysm formation (40).
One of the latest and currently best performing risk scoring systems was developed by
Kobayashi et al in 2006 (41). While developing the score, Kobayashi et al identified high
serum AST and low sodium concentrations as strongly predictive of IVIG unresponsiveness
(41). Points in the Kobayashi score are accumulated based on decrease in sodium concentration,
illness of 4 days or less at diagnosis, increased AST concentration, increase in neutrophil
percentage amongst white blood cells, baseline platelet count, increased CRP levels, and age less
than 12 months (41). This system showed high specificity and sensitivity in predicting IVIG
unresponsive patients in a Japanese cohort, but did not perform well in a North American
population (42). As a result, none of the existing risk scoring systems in KD generalizes to all
ethnic groups around the world, and there is currently no dependable method to predict
IVIG response and coronary outcome in North American patients.
1.3 Etiology
1.3.1 Immune response
KD’s etiology is still unknown, with evidence for both genetic and environmental factors. On the
one hand, linkage and GWAS studies suggest a genetic basis; on the other, seasonal
occurrences and community outbreaks suggest an infectious cause of the disease (7). Taken
together, a popular model of the disease suggests that KD occurs in genetically predisposed
children via some common environmental trigger (7).
Early stages of Kawasaki disease begin as systemic inflammation, orchestrated by dissociation of
smooth muscle cells within the media of blood vessels throughout the body and an influx of
neutrophils within 7 to 9 days after onset of disease (7). TNF-α, the key inflammatory cytokine
in KD pathology, is increased to promote inflammation and recruitment of immune cells (43).
The innate immune response quickly transitions to proliferation of large mononuclear cells
(CD8+ T and IgA plasma cells), leading to destruction of internal elastic lamina (44, 45).
Destruction of elastin is mainly driven by Matrix Metalloproteinases (MMPs), most notably
MMP-9, which was previously shown to be necessary for aneurysm formation in a KD mouse
model (46). MMP-9, an elastolytic enzyme, is produced by coronary smooth muscle cells in
response to TNF-α (46).
1.3.2 Environmental triggers
Although a specific infectious agent has not been identified, many studies in the literature
link the cause of KD to viruses and bacteria, including EBV (47), rotavirus (48), parvovirus
B19 (49), adenovirus (50), and Chlamydia pneumoniae (51), amongst others. Much of the support
for the infectious model of KD also comes from animal models, the best known being the
Lactobacillus casei Cell Wall Extract model developed by Lehman et al (52) in 1985. The
induced disease model shares many features with its human counterpart in terms of disease
kinetics, histologic changes, Vβ skewing (Vβ2, 4 and 6 in mice; Vβ2 and 8 in humans), and
response to IVIG treatment (7, 53).
1.4 Translational studies
1.4.1 Linkage analysis
Before the popularization of large scale techniques like GWAS, early steps in finding disease
susceptibility at the genetic level relied on locating the chromosomal regions via linkage analysis
in families of patients (54). The idea is based on the principle that traits encoded by genes are
often inherited together due to genes being close to each other on a chromosome (55). By
analyzing patterns of inheritance in families of affected individuals, microsatellite genetic
markers across the genome can be sequenced and used to identify chromosomal loci linked to
disease (55). Further linkage mapping studies with SNPs restricted to these regions of interest
can help identify specific genes that may be involved (54).
1.4.2 Genome-wide association studies (GWAS)
Advancements in technology now allow us to detect hundreds of thousands to millions of SNPs
at a genome-wide scale, giving us the power to genotype individuals and detect genetic
variation between them. A genome-wide association study (GWAS) applies this
concept to a group of affected individuals with the purpose of finding common genetic variants
that may be risk factors for a given disease (56). An underlying principle for GWAS is the
common disease/common variant hypothesis which states that complex genetic diseases can be
caused by small additive effects of many common variants that exist in the general population,
instead of a single risk allele, as found in Mendelian diseases (57). Illumina (San
Diego, CA) and Affymetrix (Santa Clara, CA), two popular microarray platforms,
both provide products that are common choices for GWAS (56). Illumina uses
BeadArray technology (described in the next section) to identify SNPs, while Affymetrix
products print oligonucleotide sequences on a chip (instead of beads) that can specifically
recognize the SNPs via hybridization (56).
1.4.3 Gene expression
Microarray technologies have come a long way, now allowing scientists to quantify gene
expression on a genome-wide scale. Gene expression studies can now be performed on a small
number of individuals by profiling mRNA abundance (58). An example of such technology is
BeadArray, developed by Illumina (San Diego, CA) and used in their gene expression array
products (e.g. Illumina HT-12v4 used in our study) (58). It allows for such a large scale of gene
expression profiling by randomly assembling arrays of beads across the wells of a microplate,
with a bead identifier sequence and a gene-specific probe attached to each bead (58). Since each
bead type has hundreds of thousands of the same oligonucleotide probe sequences tethered to it,
hybridization with a target cDNA sequence can then be measured and quantified via
fluorescence intensity (58). Similar gene expression microarrays are manufactured by
Affymetrix (Santa Clara, CA), including their Human Genome U133 (U133), Human Exon
(HuEx), and Human Gene (HuGene) product series (59).
1.5 Post-translational studies in children
1.5.1 Candidate gene approach
Early attempts to link KD to genetic susceptibility were candidate-gene studies, where genes are
selected for association analysis based on prior knowledge of their function and how it may
relate to the disease (54). As such, several research groups have looked at associations between
KD susceptibility and/or CAL formation and polymorphisms in HLA (60), TNF-α (61), IL-4
(62), VEGF (63), MMPs (64), CRP (65), and other genes. However, this approach to identifying
associations at the genetic level suffered from conflicting results between studies, lack of
validation, and small patient numbers in these cohorts (54).
1.5.2 ITPKC and CASP3
The first large-scale attempt to find genetic susceptibility in KD was the 2007 genome-wide
linkage analysis of 78 sibling pairs from Japan (66). The study did not pinpoint any exact genes
that may be involved, but it did identify possible linkage in the 12q24 region, along with 9 other
chromosomal regions that may relate to KD susceptibility (66). Using these regions as a
roadmap, further studies have identified polymorphisms in ITPKC and CASP3 genes as
associated with KD susceptibility and CAL formation (67, 68). Both hits were relevant to KD as
ITPKC is a kinase of IP3, acting as a negative regulator of the downstream Ca2+ pathway during
T cell activation, while caspase-3 is an enzyme involved in apoptosis and, as a result, in duration
of T cell immune response (68).
1.5.3 FCGR2A
Results of a GWAS study in a European population (405 KD patients, 6,252 controls) were
published in 2011, identifying 2 loci at genome-wide significance (69). The first one was related
to the FCGR2A gene, which encodes an IgG receptor, and the second one was in the 19q13 region,
with a SNP in the ITPKC gene (69). Polymorphisms affecting IgG binding by Fc receptors may
potentially have an effect on IVIG response and, as such, may be a contributing factor to
explaining IVIG unresponsiveness in KD patients (69).
1.5.4 MHCII, CD40, and BLK
In 2012, the same group conducted another GWAS study to find more susceptibility genes in
addition to CASP3 and ITPKC (70). This time, a Japanese cohort of 428 subjects (3,379
controls), validated on 754 cases (947 controls), identified additional significant associations in
the FAM167A-BLK, HLA-DQB2 – HLA-DOB, and CD40 regions (with FCGR2A association,
from the previous study, replicated as well) (70). BLK is a known kinase associated with B-cell
receptor signaling and recently implicated in development of IL-17 producing cells (70). HLA
genes as a GWAS hit also appear to be relevant to KD due to the known ethnic differences in
KD susceptibility that were discussed in detail earlier (54). At the same time as the Japanese
group, a GWAS study was performed in Taiwan with 622 KD patients (1,107 controls) (71). The
results independently identified loci in the BLK and CD40 genes at genome-wide significance
(71).
1.5.5 Summary of findings
The genes identified in previous GWAS studies, while relatively dispersed in terms of function,
are all related to immune function. Associations with antigen-presentation (MHCII and CD40), T
cell response (ITPKC and CASP3), and IgG binding (FCGR2A), are all related to the
components of the immune system that we believe to be involved in KD.
1.6 Computational Analysis
1.6.1 Introduction
The risk scoring systems detailed earlier used clinical information as their input in generating
their models. However, they failed to come up with a formula that can identify homogenous
groups of patients, let alone predict coronary outcome and/or IVIG unresponsiveness effectively.
Considering the complex nature of KD etiology, which is still poorly understood, it is not
surprising that using only clinical variables may not help us solve the problem of classifying
patients. Given that the phenotypes we see are driven by the underlying genetic patterns, the next
step would be to incorporate the large amount of biological data into creating models for
identifying homogenous groups of patients. While this may not have been possible decades ago,
high-throughput gene expression tools and the computational power needed to analyze such
large datasets have now caught up and can be used to further our understanding of complex diseases
such as KD.
1.6.2 Data aggregation
Though we may now have access to the resources for handling these big datasets, making
meaningful inference from analyzing multiple layers of data is far from trivial. Aggregation of
multiple datasets is a challenging task due to the heterogeneity found between datasets, both in
terms of size and types of variables (continuous, categorical, or binary). Some common
approaches to analyzing multiple datasets have trouble overcoming these obstacles. Appending
the datasets together risks diluting potentially important variables when they are merged with a
dataset as large as gene expression. Similarly, analyzing each dataset separately (prior to integration) makes
it hard to combine the individual patterns afterwards (72). In all cases, extracting information
that is representative of all the datasets equally, and incorporating it into a single model that
describes the patients, is difficult.
1.6.3 Approach to computational analysis
Our approach to analyzing our KD cohort is to incorporate the vast gene expression data from
patients with their corresponding clinical information and laboratory tests. As discussed above,
integration and subsequent analysis of such different datasets poses a lot of problems using
conventional techniques. To that end, we are using a novel computational technique called
Similarity Network Fusion, described in more detail below, that helps to identify homogenous
groups of patients while using multiple datasets as input (72). Comparing features between the
retrieved clusters helps to lay out the differences that exist between the groups of patients. While
the clinical patterns are readily interpretable owing to the small number of clinical variables,
making sense of emerging patterns amongst thousands of genes from biological datasets is more
complicated. For that reason, gene enrichment analysis is used to better summarize the results. Last,
but not least, to further narrow down important features, supervised learning methods are used
for feature selection to identify the variables that may potentially be used as classifiers for
assigning new patients to the newly discovered clusters.
1.6.4 Similarity network fusion
Similarity Network Fusion (SNF), a novel multiple data integration tool that was already
successfully used in discerning cancer subtypes and predicting patient survival, is able to
effectively address the problems associated with aggregation of heterogeneous data (72). As
conceptually demonstrated in Figure 1, it accomplishes this by independently expressing each
dataset as a network of patients, with the edges connecting each patient representing pairwise
similarity across all the features in each dataset. SNF is based on iterative fusion of such patient
similarity networks into a single shared network that integrates the patterns from all the data
layers (72). SNF tackles the challenges outlined above as it is effective with both
small and large datasets, can integrate datasets with varying numbers of features, and is robust to
noise (72).
Figure 1. Similarity Network Fusion Algorithm.
(Adapted from Wang et al. (72)) (A) Graphical representation of 2 datasets (biologic and clinical) describing a
cohort of patients. (B) Each dataset is independently converted to a patient-by-patient similarity matrix, which can
be visualized as a network, where each patient is represented as a node and pairwise similarities are depicted by an
edge/line. (C) The SNF algorithm iteratively updates each network using information derived from networks of the
other datasets. Every update cycle makes the networks more similar to one another, until they all converge to the
final fused network (D). The edges have been color-coded to represent the source of information for each network.
1.6.5 Gene enrichment analysis
Working with large datasets, such as gene expression microarrays, allows for high-throughput
analysis that can greatly benefit a study by returning long lists of gene results. However, this in
itself becomes a problem because summarizing the findings from such big lists is difficult. To
address this challenge, many functional gene enrichment analysis tools have been created, such
as GSEA (73), GOstat (74), and Onto-Express (75), among many others. The way they generate
results differs from tool to tool, but the principle remains the same – map the input list of genes
to biological databases, sorting the results by annotations that are statistically the most enriched
with these genes (76). One such tool is the Database for Annotation, Visualization and Integrated
Discovery (DAVID) online computational tool created in 2003 (76). It compares the enrichment
of genes in a particular annotation database against a population background for a given species
(77). Amongst the many biological annotation databases that DAVID uses, one widely known
repository was formed by the Gene Ontology (GO) Consortium for the purpose of creating a
dynamic vocabulary of gene roles and products in any organism (77). The GO term, which is a
well-defined description of the gene relationships it contains, is in turn linked to many other
databases, such as SwissPROT, EMBL, etc, allowing the system to be dynamic and up to date
with rapidly changing biological knowledge (77). To better associate the GO terms together by
function, GO terms are further split into 3 categories: biological process, molecular function, and
cellular component (77). As illustrated in Figure 2, within each category, the GO terms form a
directed acyclic graph, where each term is a node connected to other terms through
pre-defined parent and child relationships (77). Each GO term may have more than one parent
node and the genes associated with each term are not exclusive.
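The enrichment principle described above can be sketched with a plain hypergeometric over-representation test. This is only an illustration under stated assumptions: it is not DAVID's own scoring procedure, the function name enrichment_p_value is hypothetical, and the gene counts are invented for the example.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes in the population background
    n_annotated:  background genes carrying the annotation (e.g. one GO term)
    n_selected:   genes in the input list
    n_overlap:    input genes that carry the annotation
    """
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 of them annotated with a
# term, and an input list of 100 genes. By chance about 1 overlap is expected.
p_enriched = enrichment_p_value(20000, 200, 100, 10)  # 10 overlaps observed
p_expected = enrichment_p_value(20000, 200, 100, 1)   # 1 overlap observed
print(p_enriched, p_expected)
```

An annotation with far more overlapping genes than expected by chance receives a small p-value and would rank near the top of the enrichment results.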
Figure 2. Illustration of Gene Ontology hierarchy.
(Adapted from Ashburner et al. (77)). The figure is for illustrative purposes only – it has been simplified and may
not be accurate due to the constantly changing nature of GO terms. The subset of terms shown belongs to the
biological process category and depicts the kind of connections and hierarchical structure that GO terms exhibit
(examples of genes listed are from the species Saccharomyces cerevisiae). Furthermore, each node (GO term) can
have multiple parents (e.g. “DNA ligation” is a sub-node of both “DNA recombination” and “DNA repair”) and
genes represented by these nodes are not exclusive (e.g. CDC9 is part of both “DNA recombination” and “DNA
ligation”).
1.6.6 Feature selection and classifiers
A typical dataset is a collection of entries (such as patients), each with an array of values that
correspond to a set of features in that dataset (eg. age, gender, presence of symptoms, etc). The
values can be either continuous variables, categorical, or binary. If the learning done on a dataset
by a machine learning algorithm considers only the features, without any labels for each entry
(example of a label is disease outcome for each patient), then it falls under a branch of machine
learning called Unsupervised Learning, popular examples of which include Principal Component
Analysis (PCA) and K-means clustering (78). SNF falls into this category because its purpose is
to aggregate multiple datasets into a single patient network, without taking into consideration
patient outcome during the learning process (e.g. coronary outcome or IVIG responsiveness in
the case of KD) (72). As a result, even if the resulting patient network identifies clusters of
patients that correlate with clinical outcome measures, the features that contribute
to cluster formation cannot be used directly as classifiers that predict cluster assignment for new
patients. In order to further narrow down and identify features that can later be used as
classifiers, feature selection methods can be used.
One promising method of feature selection is FeaLect, developed by Zare et al. (79). It is based on
the Least Absolute Shrinkage and Selection Operator (LASSO), which is a regularization
technique (preventing overfitting by penalizing the absolute size of the coefficients when learning a
model) for linear regression. Unlike the common ridge regularization method, LASSO tends to
shrink some coefficients to exactly 0, thus also effectively acting as a subset selection
algorithm for predictors (80). Simply put, it includes only a subset of input variables when
returning a fitted model, thus making the results much easier to interpret and helping to narrow
down the variables that may best serve as classifiers. To increase consistency and reliability of
LASSO in selection of important features, there are modifications of the algorithm, such as
Bolasso (81), which run the algorithm on many subsamples of the data and consequently select
features based on the multiple models that were generated. FeaLect is a modification of LASSO as
well, but unlike Bolasso, which is often too strict when it comes to feature selection, FeaLect is
less strict and has been demonstrated to perform well at selecting clinically relevant features in
real datasets (79).
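To make the subset-selection behaviour of LASSO concrete, the following minimal sketch fits a LASSO model by cyclic coordinate descent on simulated data. It illustrates only the LASSO building block, not the FeaLect or Bolasso scoring schemes; the function names, the penalty value, and the simulated data are assumptions made for the example.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimise (1/2n) * ||y - Xb||^2 + penalty * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - Xb while all coefficients are zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that ignores feature j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * coef[j]) for i in range(n)) / n
            new_coef = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_coef - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new_coef
    return coef


# Simulated data (hypothetical): only the first two of six features drive y.
rng = random.Random(0)
n_samples, n_features = 200, 6
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

coef = lasso_coordinate_descent(X, y, penalty=0.3)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(coef, selected)
```

The coefficients of the uninformative features are set to exactly zero by the soft-thresholding step, which is the property that lets LASSO double as a feature selector; a ridge penalty would shrink them towards zero without removing them.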
1.6.7 Heterogeneity in KD
Heterogeneity exists in KD based on the clear differences between patients in coronary outcome
and IVIG resistance. However, it also manifests itself in clinical and biochemical measures. The
current set of AHA diagnostic criteria is far from being clear-cut and exhibits a lot of
heterogeneity, both in presentation and as variation in intensity, which is difficult to measure
objectively. Prime examples of this are polymorphous rash (which as mentioned before can take
many forms), fever (due to its high variability in duration), and cervical lymphadenopathy (the
number and size of affected nodes may vary with some being so dramatically enlarged and
inflamed that they are diagnosed as cervical adenitis) (7, 23). Additionally, each one of these
clinical features varies both in intensity/severity and in the dichotomous measure of presence or
absence of the symptom (26). Aside from the clinical diagnostic criteria, additional layers of
information exist in the form of presence and severity of additional symptoms and supporting
features that KD patients may have. The fact that myocarditis, gastrointestinal complaints, and
hydrops of the gall bladder, just to name a few, manifest themselves only in a subset of patients,
further demonstrates the heterogeneity amongst KD patients due to the combinatorial
presentation of these symptoms (7). The same goes for laboratory tests – patients may or may
not have increased white blood cell counts or serum transaminases, and there may be varying
levels of elevated CRP and ESR at diagnosis (7). As a result, even though KD patients may
appear, by presence of diagnostic criteria, to belong to a homogeneous group having the classic
diagnostic features, this is far from the truth as they differ across many of the other aspects that
may reflect underlying pathobiology. The inability to resolve this heterogeneity
may contribute to differences in treatment response and the remaining risk
of aneurysm formation.
1.6.8 Rationale
The clinical differences that we see in terms of coronary outcome, IVIG responsiveness, and
presentation at diagnosis, demonstrate that heterogeneity exists within KD, but the clinical
measures alone are not able to capture it. Previous attempts at coming up with a scoring system
using only clinical features to capture heterogeneity in KD, such as the Kobayashi score
described earlier, failed to generalize to KD patients around the world. As has been suggested in
earlier studies (42), the solution to this problem is to develop new predictive models that are
more accurate and utilize new biomarkers and/or expanded biologic datasets that will bridge
ethnicities. After all, phenotype is driven by the underlying gene expression patterns; therefore,
incorporating the large amounts of biologic data that we are now able to extract due to advances
in technology may help us better identify homogenous groups of patients and further our
understanding of KD overall. In recent times, genetic data has already proven its usefulness
in making new discoveries in KD via the several GWAS studies that have been conducted (66,
69, 70, 82). Since the first sequencing of the genome, there has been a huge effort and much progress
in developing the computational power and tools necessary for analyzing such big datasets. We
are now capable of processing such large amounts of data by identifying, annotating, and linking
variants to diseases with computational tools that are widely used by scientists today (83).
Combining clinical information with gene expression data, together with the ability and computational
power to process and analyze such a huge amount of data, can greatly contribute to gaining
better insight into KD and help develop better diagnostic tools.
In summary, though we may now have access to the resources to handle these big datasets,
making meaningful inference from analyzing multiple layers of data is far from trivial. Aggregation
of multiple datasets is a challenging task due to the heterogeneity found between datasets.
Similarity Network Fusion (SNF) is a multiple data integration tool that effectively addresses
these problems and can therefore prove extremely useful in identification of homogenous groups
of patients in KD.
1.6.9 Hypothesis and objectives
Our hypothesis is that SNF can combine clinical and biologic data to identify homogenous
groups of KD patients. To test this hypothesis, our study objectives are:
1. To identify and determine robustness of homogenous clusters of KD patients based
on the SNF-mediated fusion of clinical and biologic datasets
2. To characterize the unique gene expression and clinical profiles that define the
discovered clusters
3. To identify the subset of features that can be used as classifiers
2 Methods
2.1 KD Cohort
Children were included in these studies if they satisfied the American Heart Association (AHA)
diagnostic criteria for Kawasaki Disease (7). Informed consent for participation was obtained
from parents and informed consent or assent was obtained from patients as appropriate. The
patient cohort consisted of 171 children with KD from Rady Children’s Hospital.
Standardized clinical data and echocardiographic measurements were prospectively collected for
all patients according to protocol. Clinical summaries of the KD cohort are detailed in Table 1
(representing 159 patients used for analysis after missing data removal outlined below). Whole
blood RNA was collected in PAXgene tubes during the acute phase, before administration of
IVIG. A contemporaneous blood sample was used for complete blood counts and the rest of the
laboratory testing. Z-worst, defined as the maximal z-score of the internal diameter of the left
anterior descending and right coronary arteries normalized to body surface area during the first 6
weeks after onset of illness, was used to describe coronary artery dimensions. See Table 1 for
summary of clinical characteristics and laboratory tests.
2.2 Gene expression microarray
Gene expression data was measured using the Illumina HumanHT-12 V4.0 expression beadchip
(47 000 probes targeting gene transcripts, where a given gene may have multiple transcripts),
scanned using the Illumina Bead Array Reader confocal scanner, and checked for quality using
the Illumina QC kit, as described in a previously published protocol in greater detail (82). Gene
expression data was normalized by a log10 transformation, followed by Z score conversion (82).
Both the raw and normalized datasets are publicly available at the GEO database (Accession
number GSE63881).
2.3 Datasets
Three datasets were used for SNF: gene expression, continuous clinical data, and categorical (binary) clinical data.
The Illumina HT-12v4 chip was used to generate gene expression data covering 19,539 genes across
the genome. Continuous clinical features include age, duration of fever at diagnosis, and
laboratory tests (platelet count, Hb z-scores (haemoglobin concentrations normalized for age),
WBC, bands%, neutrophils%, lymphocytes%, monocytes%, eosinophils%, ESR, CRP, ALT,
GGT, and urinalysis WBC). The categorical clinical dataset includes gender and the
presence of the classical KD diagnostic features (conjunctivitis, oral changes,
rash, extremity changes, and lymphadenopathy).
2.4 Computational analysis workflow
Steps involved in the analysis of the KD cohort are summarized in Figure 3. Data is first
pre-processed via outlier removal and standardization of variables within the datasets. The datasets
are then fused using SNF and clusters are determined using spectral clustering of the shared
network. Spectral clustering is performed on the final fused network and is not part of the actual
SNF algorithm – it is just one of many methods of finding clusters from similarity matrices. The
algorithm itself is based on performing dimensionality reduction prior to running a clustering
method, such as K-means, which contributes to its high performance as a clustering
technique (84). The resulting clusters are annotated by comparing them across clinical variables
and functional groups of genes. Clinical and top biologic features are also used in FeaLect
analysis to determine classifiers that can predict cluster assignment. Lastly, performance of SNF
in analyzing the KD cohort is measured via robustness and feature sensitivity analysis. Each
method is described in more detail in sections below.
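The spectral clustering step mentioned above can be sketched as follows. This is a generic illustration assuming NumPy is available, not the routine shipped with SNFtool: the similarity matrix is embedded using the eigenvectors of a normalized graph Laplacian (the dimensionality reduction), and a plain k-means is run on the embedded points. The function name, the deterministic initialization, and the toy matrix are assumptions.

```python
import numpy as np


def spectral_clustering(similarity, n_clusters, n_iter=50):
    """Cluster the nodes of a symmetric, non-negative similarity matrix."""
    W = np.asarray(similarity, dtype=float)
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalised Laplacian: L = I - D^(-1/2) W D^(-1/2)
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Dimensionality reduction: eigenvectors of the smallest eigenvalues
    # (numpy.linalg.eigh returns eigenvalues in ascending order)
    _, eigvecs = np.linalg.eigh(laplacian)
    embedding = eigvecs[:, :n_clusters]
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

    # Plain k-means on the embedded points, farthest-point initialisation
    centres = [embedding[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((embedding - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(embedding[np.argmax(dist)])
    centres = np.array(centres)
    for _ in range(n_iter):
        dist = ((embedding[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centres[k] = embedding[labels == k].mean(axis=0)
    return labels


# Toy similarity matrix (hypothetical): two groups of 4 patients that are
# strongly similar within a group and only weakly similar across groups.
toy = np.full((8, 8), 0.01)
toy[:4, :4] = 1.0
toy[4:, 4:] = 1.0
labels = spectral_clustering(toy, n_clusters=2)
print(labels)
```

Cluster labels are arbitrary integers, so only the grouping of patients, not the label values, carries meaning.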
Figure 3. Steps involved in the computational analysis of the KD cohort.
Prior to the integration of clinical and biologic data using SNF, the datasets were pre-processed by removing outliers
and standardizing the features. Clusters of patients in the fused matrix were determined using spectral clustering.
The patterns between the groups of patients were then annotated by comparing the clusters across clinical variables
and functional gene groups (gene enrichment analysis). Clinical and top biologic features were also used for FeaLect
analysis to identify classifiers that can predict cluster assignment. Lastly, performance of SNF with respect to our
KD cohort was measured using robustness and sensitivity analysis.
2.5 Data pre-processing
Data were prepared for analysis by first removing any patients with >20% missing data across all
the variables within a dataset, followed by removal of extreme outliers (>3 × interquartile range).
The variables were then standardized to have a mean of 0 and a standard deviation of 1.
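The three pre-processing steps can be sketched as below, assuming NumPy. Two details are assumptions rather than statements of the protocol: an "extreme outlier" is taken to be a value lying more than 3 × IQR beyond the first or third quartile, and outlying values are set to missing rather than the whole patient being dropped. The function name and toy values are hypothetical.

```python
import numpy as np


def preprocess(data, max_missing=0.20, iqr_factor=3.0):
    """Return the standardised matrix and a mask of the patients that were kept."""
    X = np.array(data, dtype=float)
    # 1. Drop patients (rows) with more than 20% of their values missing
    keep = np.isnan(X).mean(axis=1) <= max_missing
    X = X[keep]
    # 2. Mask extreme outliers per variable (assumed: beyond 3 x IQR from the quartiles)
    q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    outlier = (X < q1 - iqr_factor * iqr) | (X > q3 + iqr_factor * iqr)
    X[outlier] = np.nan
    # 3. Standardise each variable to mean 0 and (population) standard deviation 1
    X = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)
    return X, keep


nan = float("nan")
toy = [
    [1.0, 10.0, 5.0, 2.0, 7.0],
    [2.0, 11.0, 6.0, 3.0, 8.0],
    [3.0, 12.0, 7.0, 4.0, 9.0],
    [4.0, 13.0, 8.0, 5.0, 10.0],
    [5.0, 14.0, 900.0, 6.0, 11.0],  # extreme outlier in the third variable
    [nan, nan, 9.0, 7.0, 12.0],     # 40% missing, so this patient is dropped
]
clean, kept = preprocess(toy)
print(kept)
print(clean)
```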
2.6 Similarity network fusion
We used R statistical software v3.1.1 (www.r-project.org) with the “SNFtool” package v2.2
installed (cran.r-project.org/web/packages/SNFtool/) for running the SNF algorithm (72).
Patient-by-patient similarity networks were constructed for each individual dataset using a
Euclidean distance measure for continuous numerical data, or a chi-square distance for
categorical data, for every pairwise patient combination (72). The resulting matrices were used
as input for the network fusion algorithm where each network is simultaneously updated with
information from the other networks. In this manner, over the course of multiple iterations, all
the networks converge to a single shared network that integrates patterns derived from each of
the networks. Alternatively, these matrices can be visualized as similarity networks where nodes
correspond to individual patients and the edges represent patient-patient similarities (See Figure
1). Clustering of patients was done using spectral clustering.
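The fusion procedure described in this section can be illustrated with a simplified sketch, assuming NumPy. It follows the general scheme published for SNF (a scaled exponential similarity kernel, a full and a local kernel per dataset, and cross-network updates) but it is not the SNFtool code used in this study; the function names, the parameter values (k, mu, n_iter), and the toy data are assumptions, only the Euclidean case is shown, and at least two datasets are required.

```python
import numpy as np


def affinity(data, k=3, mu=0.5):
    """Scaled exponential similarity from pairwise Euclidean distances."""
    X = np.asarray(data, dtype=float)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Mean distance from each patient to its k nearest neighbours (self excluded)
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + dist) / 3.0
    return np.exp(-dist ** 2 / (mu * eps))


def full_kernel(W):
    """Normalise each row: off-diagonal entries sum to 1/2, diagonal is 1/2."""
    off = W - np.diag(np.diag(W))
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, k=3):
    """Keep only each patient's k most similar neighbours, then row-normalise."""
    off = W - np.diag(np.diag(W))
    nearest = np.argsort(-off, axis=1)[:, :k]
    rows = np.arange(len(W))[:, None]
    S = np.zeros_like(off)
    S[rows, nearest] = off[rows, nearest]
    return S / S.sum(axis=1, keepdims=True)


def fuse(affinities, k=3, n_iter=20):
    """Update every network with the average of the other networks, then average."""
    P = [full_kernel(W) for W in affinities]
    S = [local_kernel(W, k) for W in affinities]
    for _ in range(n_iter):
        updated = []
        for v in range(len(P)):
            others = np.mean([P[u] for u in range(len(P)) if u != v], axis=0)
            updated.append(full_kernel(S[v] @ others @ S[v].T))
        P = updated
    fused = np.mean(P, axis=0)
    return (fused + fused.T) / 2.0  # symmetrise the final network


# Toy cohort (hypothetical): 8 patients in two groups of 4, two datasets
rng = np.random.default_rng(0)
group_shift = np.repeat([0.0, 5.0], 4)[:, None]
expression = group_shift + rng.normal(scale=0.3, size=(8, 3))
clinical = group_shift + rng.normal(scale=0.3, size=(8, 2))
fused = fuse([affinity(expression), affinity(clinical)])
print(np.round(fused, 3))
```

In the fused matrix, patients from the same simulated group end up with much higher similarity to each other than to patients from the other group, and it is a matrix of this kind that is passed on to spectral clustering.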
2.7 Gene enrichment analysis
Before proceeding with further analysis, genes strongly correlated with gender were removed
(Mann-Whitney test p-value<0.05), since there are inherent physical differences between males
and females which may otherwise confound the results. Gene enrichment was performed using
the following analysis: top genes, ranked by Kruskal-Wallis test between the clusters with an
adjusted p-value (Holm-Bonferroni) < 0.001, were hierarchically clustered. The identified gene clusters
were then used as input in the Database for Annotation, Visualization and Integrated Discovery
(DAVID) online computational tool to find enriched Gene Ontology (GO) terms. GO is a
collaborative initiative for creating consistent gene role and product descriptions using GO
terms, where the connections between the terms and their hierarchy are annotated (77). Genes
belonging to these groups were annotated using the RefSeq database (85).
2.8 Co-clustering probability
Co-clustering probability was used as a measure of robustness to removal of patients and
individual features. In lay terms, it is the likelihood that any given pair of patients remains in the
same cluster after changes to the dataset. It is defined as the fraction of the original pairwise
co-clustering relationships, across every pairwise combination of patients, that remained intact after
re-running SNF and spectral clustering. Co-clustering probability was measured for subsets of
data with a percentage of patients removed (5-60%). Each percent removal was repeated 10,000 times.
A similar method was applied for analyzing sensitivity of the clusters to each feature, where
co-clustering probability was also measured for subsets of data with each feature removed, one at a
time.
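One way to compute the measure defined above is sketched here. The interpretation is an assumption: among pairs of retained patients that shared a cluster in the original analysis, the function reports the fraction that still share a cluster after the re-run. The function name, patient identifiers, and cluster labels are hypothetical.

```python
from itertools import combinations


def co_clustering_probability(original, rerun):
    """original, rerun: dicts mapping patient id -> cluster label.

    Only patients present in both clusterings are compared, and only the
    grouping matters (label values may differ between runs).
    """
    shared = sorted(set(original) & set(rerun))
    together = [
        (a, b) for a, b in combinations(shared, 2) if original[a] == original[b]
    ]
    if not together:
        return float("nan")
    intact = sum(rerun[a] == rerun[b] for a, b in together)
    return intact / len(together)


# Original clustering of six patients, and a re-run after removing patient p6
original = {"p1": 1, "p2": 1, "p3": 1, "p4": 2, "p5": 2, "p6": 2}
rerun = {"p1": "a", "p2": "a", "p3": "b", "p4": "b", "p5": "b"}

prob = co_clustering_probability(original, rerun)
print(prob)  # 2 of the 4 original co-clustered pairs stay together -> 0.5
```

In a robustness analysis, this value would be averaged over many random subsamples at each removal percentage.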
2.9 Statistical analysis
The Kruskal-Wallis test was used for comparing continuous clinical variables between the
three clusters, while Fisher’s exact test was used for categorical variables. Ranking of genes was
done with the Kruskal-Wallis test and p-values were adjusted using the Holm-Bonferroni method.
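For reference, the Holm-Bonferroni step-down adjustment can be written in a few lines. This is a generic sketch of the standard procedure with a hypothetical function name and made-up p-values, not code taken from the study.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(raw)
print(adjusted)  # approximately [0.03, 0.06, 0.06, 0.02]
```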
2.10 FeaLect feature selection
To identify possible classifiers, FeaLect, a method developed by Zare et al (79), was used to
narrow down informative features. FeaLect, based on the Least Absolute Shrinkage and
Selection Operator (LASSO) method for linear regression (80), assigns scores to each feature
based on their relevance in the models being generated (79). The logs of scores are plotted in
increasing order, producing a 3 segment graph, with the right-most non-linear portion containing
informative features. As a modification from the original protocol, features in this portion of the
graph were identified using the significance threshold of p<0.05 based on the frequency
distribution of the feature scores. FeaLect was carried out once for each cluster separately, where the
cluster labels were converted to a binary form. The “FeaLect” package v1.10
(https://cran.r-project.org/web/packages/FeaLect/) was used for running the FeaLect algorithm. Missing data
was imputed using the Least Square Adaptive (LSA) computation method (83). Any identified genes
were annotated using the RefSeq database (85).
3 Results
3.1 KD cohort and data pre-processing
The pre-processing steps, along with the summary of the rest of the analysis, can be seen in
Figure 3. Prior to running SNF analysis, the KD cohort had to undergo several steps of data
pre-processing. The data from the 171 KD patient cohort used for SNF analysis comprises
biologic and clinical variables (Table 2). The clinical variables were further split into two
categories – clinical laboratory-based
features and the classical AHA KD symptoms (along with gender). The former set
of features is comprised of continuous variables, whereas the latter is categorical, each requiring
different distance measures for calculating pairwise similarity during similarity matrix
construction in SNF (72). Splitting the data into relevant datasets for fusion is crucial to ensure
the SNF matrix produces meaningful results, since converting multi-dimensional data to a single
patient-patient similarity matrix compresses all the variables into a single similarity measure.
The next step in pre-processing was outlier removal (any value greater than three times the
interquartile range) and subsequent removal of patients with too much data missing. As a result,
12 patients with more than 20% of their data missing were excluded from
any further analysis. The remaining dataset of 159 patients is clinically representative of KD (86)
and is summarized in Table 1. The cohort is also comprised of patients of different self-reported
ethnicities, with 17% Asian, 25% Caucasian, 30% Hispanic, and a relatively large mixed
proportion (24%), amongst others (Table 1). Unlike the Kobayashi score that was effective only
in the Japanese population (42), running analysis on a mixed dataset may help uncover patterns
and generate statistical models that can be generalized to all ethnicities. As the last step in
preparing the data for SNF analysis, each feature was standardized to have a mean of 0 and a
standard deviation of 1.
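The pre-processing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: outliers are taken to be values lying more than three interquartile ranges outside the quartiles and are treated as missing, patients with 20% or more missing values are dropped, and standardization uses the sample standard deviation. The function names are hypothetical.

```python
import statistics

def mask_outliers(values, k=3.0):
    """Set values more than k * IQR outside the quartiles to None (missing)."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if v is not None and low <= v <= high else None for v in values]

def drop_sparse_patients(rows, max_missing=0.20):
    """Keep patients (rows) whose fraction of missing values is below max_missing."""
    return [r for r in rows if sum(v is None for v in r) / len(r) < max_missing]

def standardize(values):
    """Scale one feature to mean 0 and standard deviation 1, ignoring missing values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)  # sample SD; an assumption
    return [None if v is None else (v - mean) / sd for v in values]
```

Here `mask_outliers` and `standardize` operate on one feature (column) at a time, while `drop_sparse_patients` operates on patient rows.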
Table 1. Laboratory measures and clinical characteristics for the KD cohort.
a Values are based on the cohort of 159 patients used for analysis, after patients with more than 20% missing data
have been removed from the initial 171 patient KD cohort
b Values are provided as number of patients (percent of cohort)
c Values are provided as median (range)
Table 2. Biologic and clinical datasets used for SNF analysis.
a Gene expression data is derived from a normalized Illumina HT-12 V4 chip that represents 19,539 genes
3.2 Three unique clusters were identified after aggregation of clinical and gene expression datasets with SNF
After constructing similarity matrices for each dataset (gene expression dataset, continuous
clinical dataset with laboratory test variables, and categorical clinical dataset with classic AHA
KD criteria and gender), spectral clustering of each matrix showed varying patterns within each
data layer. Similarity matrices for the numerical clinical and gene expression data had 2 and 4
clusters respectively, while the categorical clinical had 5 distinct clusters of patients (Figure 4A).
Because each dataset alone shows a different pattern, with no clear visual overlap between
the datasets, it is difficult to draw any conclusions about the overall pattern across the
combination of all the datasets. SNF aggregation of the 3 datasets, in which the similarity matrices for
each dataset undergo simultaneous updates through several iterations until convergence to a
single network, produced a unified similarity network that incorporates the
patterns found in both the clinical and gene expression datasets. As a result, 3 clearly defined
clusters were identified, comprising 78, 33 and 48 patients respectively (Figure 4B and C).
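The iterative update can be illustrated with a simplified sketch of the cross-diffusion step described for SNF (72). It takes one similarity matrix per dataset, builds a row-normalized "full" kernel and a sparse nearest-neighbour "local" kernel for each, and repeatedly diffuses each network through the average of the others. The neighbourhood size `k` and the number of iterations are illustrative values, not those used in this study, and published implementations include additional normalization steps that are omitted here.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def full_kernel(w):
    """Row-normalize similarities, fixing each self-similarity at 1/2."""
    n = len(w)
    p = []
    for i in range(n):
        off = sum(w[i][k] for k in range(n) if k != i)
        p.append([0.5 if j == i else w[i][j] / (2 * off) for j in range(n)])
    return p

def local_kernel(w, k):
    """Keep each patient's k most similar patients (self included), row-normalized."""
    n = len(w)
    s = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: w[i][j], reverse=True)[:k]
        total = sum(w[i][j] for j in nearest)
        s.append([w[i][j] / total if j in nearest else 0.0 for j in range(n)])
    return s

def fuse(similarities, k=2, iterations=10):
    """Cross-diffuse two or more networks, then average them into one fused network."""
    status = [full_kernel(w) for w in similarities]
    local = [local_kernel(w, k) for w in similarities]
    m, n = len(similarities), len(similarities[0])
    for _ in range(iterations):
        updated = []
        for v in range(m):
            others = [status[u] for u in range(m) if u != v]
            mean_others = [[sum(p[i][j] for p in others) / len(others)
                            for j in range(n)] for i in range(n)]
            # Each network is updated through its own local structure.
            updated.append(matmul(matmul(local[v], mean_others), transpose(local[v])))
        status = updated
    return [[sum(p[i][j] for p in status) / m for j in range(n)] for i in range(n)]
```

Similarities supported by several datasets are reinforced at each iteration, while those present in only one dataset are progressively averaged away. Spectral clustering would then be applied to the fused matrix.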
Figure 4. Three distinct clusters of patients recovered using SNF.
(A) A similarity matrix was generated from each individual dataset, showing a different number of clusters for each
(2, 5, and 4 clusters respectively). Subsequent fusion of the 3 networks using SNF yielded a (B) fused matrix with 3
distinct clusters. (C) The outline representation of the fused matrix illustrates composition of each cluster - 78, 33,
and 48 patients for clusters 1 to 3 respectively. In all instances, patients were grouped using spectral clustering.
3.3 High robustness and low clinical feature sensitivity amongst the 3 clusters
To examine the robustness of the 3 identified clusters (in other words, how stable these
clusters are, and whether they are driven by patterns across the whole cohort or by only a
select few patients), we ran SNF on subsets of patients with increasing percentages of individuals
removed (5 to 60%, in increments of 5%). The measure of robustness was co-clustering
probability, which can be described as the likelihood that any pair of patients remains in the same
cluster. Each percent removal was repeated 10,000 times to best represent the different
combinations of patients left over. Our analysis, summarized in Figure 5, showed that SNF is
highly robust in identifying the 3 clusters, with 80% co-clustering probability maintained
even when 40% of patients were removed.
Figure 5. SNF displays high robustness in identifying the three clusters in response to removal of patients.
Whiskers represent 2.5 and 97.5 percentiles. Each percent removal of patients was repeated 10,000 times and co-
clustering probability was measured relative to the original fused matrix. Removal of 40% of patients maintained
80% co-clustering probability.
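One way to compute the co-clustering probability for a single round of patient removal is sketched below. It assumes the measure is the share of patient pairs that were in the same cluster originally and are still in the same cluster after re-clustering the retained subset; the exact definition used in the study may differ, and the function names are hypothetical.

```python
import random
from itertools import combinations

def subsample(patient_ids, fraction_removed, rng):
    """Randomly drop a fraction of patients and return the ones that remain."""
    n_keep = round(len(patient_ids) * (1 - fraction_removed))
    return rng.sample(patient_ids, n_keep)

def co_clustering_probability(original, reclustered):
    """Share of originally co-clustered pairs that remain together.

    Both arguments map patient id -> cluster label. Only patients present in
    `reclustered` (the retained subset) are considered, and cluster labels do
    not need to match between the two runs.
    """
    together = [(a, b) for a, b in combinations(sorted(reclustered), 2)
                if original[a] == original[b]]
    if not together:
        return float("nan")
    kept = sum(reclustered[a] == reclustered[b] for a, b in together)
    return kept / len(together)
```

Repeating this for many random subsamples at each removal percentage gives the distribution summarized by the whiskers in Figure 5.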
To identify the sensitivity of our 3 clusters to removal of clinical features, a similar analysis was
performed, but by separately removing each one of the clinical variables and calculating co-
clustering probability after re-running SNF. As summarized in Figure 6, a majority of the
features did not impact the formation of the 3 clusters when removed, with the exception of
‘Proportion Male’ and ‘Lymphadenopathy’. These results show that our clusters were mostly
influenced by these 2 clinical variables, compared to the rest. That gender plays an important role
in cluster formation is not surprising given the inherent differences between males and females,
and its effect is therefore difficult to remove fully. However, we did attempt to remove the gender bias
when analyzing the clusters by removing any gene expression variables that are strongly
correlated with gender in our dataset (elaborated further in later sections).
Figure 6. The 3 clusters are most sensitive to removal of ‘Proportion Male’ and ‘Lymphadenopathy’ clinical
variables.
In order to assess the sensitivity of the 3 clusters to clinical variables, one at a time, variables were withheld from
SNF analysis and co-clustering probability was measured relative to the original fused matrix. ‘Proportion Male’
and ‘Lymphadenopathy’ appear to have the biggest impact on cluster formation.
3.4 Unique clinical profiles characterize the 3 clusters
Comparison of the three clusters across an array of clinical variables shows that the clusters are
clinically distinct from each other. According to the clinical and demographic features (Figure
7), cluster 1 appears to be composed of mostly older patients with a higher proportion of females.
The patients in cluster 2 appear to have longer duration of fever and lower incidence of the
principal diagnostic features of KD, including lower frequency of rash, oral changes, extremities
changes, and conjunctivitis (Figure 7) (7). Cluster 3 patients are all boys and are also distinguished
by the absence of cervical lymphadenopathy. Lastly, various ethnic profiles are
represented across the three clusters (Figure 8). Cluster 1 has a higher relative proportion of
children of Asian descent, while cluster 2 has a higher proportion of patients of African
American descent and a lower proportion of children of mixed ethnicities.
Figure 7. Unique clinical and demographic profiles characterize the three clusters.
Whiskers represent 2.5-97.5 percentiles. Clinical variables which are statistically significant (p < 0.05) are marked
with an asterisk (*). Based on clinical and demographic features, cluster 1 appears to be composed mostly of older
females, cluster 2 patients were diagnosed with a longer duration of fever with lower incidence of clinical
symptoms, and patients in cluster 3 were 100% male with no incidence of lymphadenopathy.
Figure 8. Ethnic group profiles across the 3 clusters
Patients in clusters 1, 2, and 3 represent groups of Asian, African American, Caucasian, Hispanic, and Mixed
ethnicities. Cluster 1 appears to have a higher proportion of patients of Asian descent, relative to the other 2 clusters.
Cluster 2 has a higher proportion of African American patients and a lower proportion of mixed ethnicity patients,
relative to clusters 1 and 3.
Patterns across the routine clinical laboratory test measures in Figure 9 also appear to show
marked differences between the 3 clusters. Cluster 1 displays higher levels of CRP, but lower
levels of ESR, in contrast to the other groups of patients. Even though there are no significant
differences in total white blood cell count, the cellular composition is distinct, with cluster 1
having considerably lower percentages of monocytes and lymphocytes, but higher percentages of
neutrophils with respect to the other clusters. Inflammatory markers for cluster 2, in contrast,
show opposite patterns, with higher ESR, but decreased CRP. Discrepancies in CRP and ESR
measurements have been previously described in KD and have been attributed to CRP being a
direct measure of inflammation with a faster onset, while ESR is indirect and has much slower
kinetics (24). The patients in cluster 2 also presented with lowest hemoglobin levels, but highest
platelet counts compared to the other groups. The composition of white blood cells also showed
marked differences with lowest neutrophil percentages, but highest percentages of lymphocytes
with respect to the other clusters. All these features for cluster 2 are consistent with the longer
duration of disease (as measured by fever duration) in this cluster. Lastly, cluster 3 patients
displayed lower levels of ESR, but higher levels of CRP, much like that of cluster 1. Patterns
across other panels, such as neutrophil and lymphocyte percentages, were intermediary to
clusters 1 and 2. Cluster 3 patients did, however, appear to have increased ALT levels and
slightly higher urinalysis white blood cell counts and eosinophil percentages compared to the other
clusters (though the latter 2 were not significant).
Figure 9. Unique laboratory test profiles characterize the three clusters.
Whiskers represent 2.5-97.5 percentiles. Clinical variables which are statistically significant (p < 0.05) are marked
with an asterisk (*). According to laboratory tests, cluster 1 had a high level of CRP, a lower percentage of
lymphocytes, and a higher number of neutrophils relative to the other groups. Cluster 2 had the highest platelet count
and lowest hemoglobin z-scores compared to the other 2 clusters. Cluster 2 also displayed lower levels of CRP,
lower neutrophils, but higher lymphocyte percentages. Cluster 3 had mostly intermediary levels relative to clusters 1
and 2 across most of the variables. Patients in this cluster did however have higher urinalysis WBC, ALT levels, and
slightly higher eosinophil percentages compared to the other 2 groups.
From the patterns seen in Figure 9, cluster 3 appears to be the most distinct cluster across most of
the variables. The contrasting ESR and CRP patterns between cluster 2 and clusters 1 and 3,
along with other laboratory measures such as lymphocyte and neutrophil percentages, appear to
correlate with the observed pattern of early stages of inflammation (innate) in clusters 1 and 3,
and later stages of disease (adaptive) in cluster 2.
3.5 Unique gene expression profiles characterize the 3 clusters
Unlike the clinical features, which comprise only a few variables that can be easily summarized on a
single page with boxplots, the clusters cannot be compared across every gene
expression variable in the same way. Our dataset contains 19,539 genes, and results at that scale
would be overwhelming and hard to interpret. To aid us in extracting meaningful
results, we removed from further analysis 1,355 genes that were strongly correlated with
gender (Mann-Whitney test p<0.05), as gender can often introduce considerable bias and confounding
effects into a dataset due to inherent differences between males and females in any mixed cohort
of patients (87). If these effects are not removed, the gene expression results may be diluted
with genes that are linked to gender, potentially masking important patterns in the
cohort. To proceed with the analysis, we identified the most significant genes based on
their ability to differentiate the clusters, using the Kruskal-Wallis test for each variable. Using this
method, we isolated 411 genes with a Holm-Bonferroni adjusted p-value of less than 0.001.
Because the Kruskal-Wallis test does not indicate whether a gene’s
expression goes up or down, we performed hierarchical clustering on these genes and identified 2
very distinct groups based on their patterns of expression in each patient cluster (Figure 10).
Group 1, the top part of the heatmap, contains genes where cluster 2 has a marked decrease in
expression relative to clusters 1 and 3. Genes in the group below display an opposite pattern
where clusters 1 and 3 have relatively lower expression in the genes that make up the group,
while cluster 2 is higher. To better characterize the genes that make up these 2 groups, each
group was used as input to the DAVID online computational tool (a tool that identifies publicly
available gene group annotations that best describe a given list of genes) to identify GO terms
that are enriched in each of these gene lists. It is important to note that GO terms vary in how
specifically they describe a particular biological process or molecular function, so their usefulness
in the analysis of our data is only as good as their curation. Tables 3 and 4 list the top GO
terms for the two groups respectively, with a p-value < 0.05 (Benjamini, a correction for multiple
hypothesis testing). For illustrative purposes, a subset of these functional groups (representative
of the overall patterns seen in these genes) is displayed as heatmaps in Figure 11 and Figure
12, the results of which are described in detail in the following sections.
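The Holm-Bonferroni adjustment applied to the per-gene Kruskal-Wallis p-values can be sketched as below. This is an illustrative implementation of the standard step-down procedure, not the code used in the study; the function names and the gene labels in the usage example are hypothetical.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

def significant_genes(p_by_gene, alpha=0.001):
    """Genes whose adjusted p-value falls below alpha, in their original order."""
    genes = list(p_by_gene)
    adjusted = holm_adjust([p_by_gene[g] for g in genes])
    return [g for g, p in zip(genes, adjusted) if p < alpha]
```

For example, `significant_genes({"GENE_A": 1e-6, "GENE_B": 0.2, "GENE_C": 2e-4})` keeps `GENE_A` and `GENE_C`, whose adjusted p-values (3e-6 and 4e-4) fall below 0.001.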
Figure 10. The 3 clusters are characterized by unique gene expression profiles.
(A) Heatmap showing the hierarchical clustering of the top 411 genes (after removal of genes highly correlated with
gender; Kruskal-Wallis adjusted p-value (Holm-Bonferroni) < 0.001), which separated the genes into 2 groups based on
patterns of gene expression across the clusters.
In the first group of genes identified by hierarchical clustering (Figure 11 and Table 3), the
functional groups related to inflammatory and innate immune response, and protein kinase
cascade, display a relative increase in expression in clusters 1 and 3 (more intense in cluster 1),
which leads us to believe that these patients have a gene expression profile pointing to an active
innate immune response. The opposite is true for cluster 2, which appears to correlate with the
longer duration of fever in these patients; the innate immune response may thus have decreased
in activity, already transitioning to an adaptive response profile. The rest of the GO terms, such as
‘Plasma Membrane’ and ‘Insoluble Fraction’, are significantly enriched but are too generic to
draw meaningful conclusions from. These terms belong to the Cellular Component category of
GO terms (which, by themselves, are less useful than the Molecular Function and Biological Process
GO categories when it comes to comparing groups of patients) and are also located closer to the
root of the GO term hierarchy (the terms at the top have a much broader description and are
therefore more generic). As a result, even though they do tell us that gene expression patterns
differ amongst genes in the plasma membrane per se, without any functional descriptions in this
case, not much else can be inferred.
Figure 11. Group 1 genes are representative of inflammation and immune response related GO terms.
The genes in the first group were further analyzed using the DAVID online computational tool to find enriched
GO terms, which are annotated groups of genes with similar roles and descriptions. Cluster 2 had notably lower
levels of gene expression in GO terms related to innate immune response compared to the other clusters, while
cluster 1 had slightly higher levels compared to cluster 3. See Table 3 for the full list of significant GO terms.
Table 3. List of group 1 significant GO terms from the DAVID gene enrichment analysis.
Hierarchical clustering of the top 411 genes (Kruskal-Wallis adjusted p-value (Holm-Bonferroni) < 0.001) separated the genes into 2 groups (see Figure 10). GO terms
with an adjusted p-value <0.05 (Benjamini) are listed for group 1 genes.
Term Count P-Value List Total Benjamini
GO:0045087~innate immune response 11 4.59E-07 123 6.06E-04
IL18R1, CR1, NCF2, FCGR1C, CXCL16, FCGR1A, IL1RAP, VNN1, TLR5, TLR6, TLR8, RAB27A
GO:0006952~defense response 18 3.62E-05 123 2.36E-02
IL18R1, CR1, HIST1H2BC, NCF2, HIST1H2BE, STAT5B, FPR2, TLR5, TLR6, TLR8, MMP25, S100A12, HDAC4, NLRC4, FCGR1C, CXCL16, FCGR1A, IL1RAP, VNN1, RAB27A
GO:0000267~cell fraction 24 8.18E-05 116 1.61E-02
ATP6V0E1, CYP1B1, STX3, AQP9, LIMK2, FLOT1, MAN1A1, GYG1, GCLM, SOD2, S100A12, MCTP1, LIN7A, ACSL1, DGAT2, SH3GLB1, CD59, FAS, CEACAM4, ACSL4, LRRK2, CEACAM1, PSTPIP2, HIP1
GO:0009611~response to wounding 16 8.32E-05 123 3.60E-02
CR1, STAT5B, FPR2, TLR5, TLR6, TLR8, MMP25, S100A12, SOD2, HDAC4, NLRC4, CD59, IL1RAP, SERPINB2, VNN1, RAB27A
GO:0007243~protein kinase cascade 13 1.27E-04 123 4.12E-02
SOCS3, STAT5B, FPR1, TLR5, TLR6, TLR8, TANK, IFNAR1, TRIB1, OSM, TGFA, GADD45B, LRRK2
GO:0006955~immune response 18 1.49E-04 123 3.86E-02
IL1R2, IL18R1, CR1, AQP9, NCF2, BST1, NCF4, TLR5, TLR6, TLR8, OSM, FCGR1C, CXCL16, FCGR1A, IL1RAP, FCGR1B, VNN1, FAS, RAB27A
GO:0006954~inflammatory response 12 1.70E-04 123 3.67E-02
HDAC4, CR1, NLRC4, IL1RAP, STAT5B, VNN1, TLR5, FPR2, TLR6, TLR8, MMP25, S100A12
GO:0005626~insoluble fraction 19 4.81E-04 116 4.65E-02
ATP6V0E1, CYP1B1, STX3, AQP9, FLOT1, MAN1A1, S100A12, MCTP1, LIN7A, ACSL1, DGAT2, SH3GLB1, CD59, CEACAM4, ACSL4, LRRK2, CEACAM1, PSTPIP2, HIP1
GO:0005886~plasma membrane 51 1.00E-03 116 4.84E-02
GPR84, AQP9, TLR5, TLR6, MMP25, SLC2A3, IL1RAP, VNN1, TGFA, SV2A, CEACAM4, FAS, CEACAM1, RAB27A, PTPRJ, GPR97, STX3, NCF2, BST1, NCF4, FLOT1, IFNAR1, OSM, ARRB2, MGAM, LRRK2, SLC40A1, SLC2A14, FPR1, FPR2, KCNJ2, GPR141, ITGAM, ACSL1, FCGR1C, CD177, FCGR1A, FCGR1B, NUMB, ACSL4, IL18R1, CR1, TRIM25, S100A12, LIN7A, P2RY13, CXCL16, GNG10, CD59, RIT1, GK, FCGR2A
The second group of genes (Figure 12 and Table 4), where cluster 2 has relatively higher levels
of expression (whereas cluster 3 has slightly lower levels compared to cluster 1), corresponds to a
number of GO terms that relate to metabolism, including translation, RNA processing, as well as
mitochondrial and ribonucleoprotein complex related genes. The MHC protein binding group of
genes was also identified as significant and has previously been associated with KD via the
GWAS described earlier (70). Taken together, these GO terms imply that cells in
these patients are undergoing increased levels of protein synthesis and MHC-mediated
presentation, both of which are consistent with an increasingly active adaptive immune
response.
Figure 12. Group 2 genes are representative of metabolism related GO terms.
The genes in the second group were further analyzed using the DAVID online computational tool to find
enriched GO terms, which are annotated groups of genes with similar roles and descriptions. Due to the large list of
GO terms, a representative sample was picked for illustrative purposes (see Table 4 for the full list). GO terms
related to metabolism displayed an increased pattern of expression in cluster 2, relative to clusters 1 and 3. Lower
levels of expression were observed in cluster 1 relative to cluster 3.
Table 4. List of group 2 significant GO terms from the DAVID gene enrichment analysis.
Hierarchical clustering of the top 411 genes (Kruskal-Wallis adjusted p-value (Holm-Bonferroni) < 0.001) separated the genes into 2 groups (see Figure 10). GO terms
with an adjusted p-value <0.05 (Benjamini) are listed for group 2 genes.
Term Count P-Value List Total Benjamini
GO:0070013~intracellular organelle lumen 60 6.89E-14 157 1.63E-11
MMS19, ATP5D, RNMT, LYAR, QARS, CDC16, PDHB, TMEM109, LONP1, MCCC1, SMARCD1, CIRH1A, PRPF31, ELP2, ANAPC5, BYSL, ERP29, RING1, MTA1, POLR1C, LAS1L, MCM3, RSL1D1, RPS15, EDF1, PCCB, NHP2, PMPCA, AARS2, GLTSCR2, POLR2G, SDAD1, TH1L, RPL36, C14ORF169, TRRAP, BOP1, BMS1, PRPF19, HNRNPM, RPA2, DDX47, REXO4, NAT10, GTF3C2, APEX1, SHMT2, TSR1, PHB, MPHOSPH10, SMAD3, ILF3, SF3A3, CDC25B, ILF2, SUMF2, ATP5A1, DDX54, PARP1, DAP3
GO:0031974~membrane-enclosed lumen 61 1.19E-13 157 1.41E-11
MMS19, ATP5D, RNMT, LYAR, QARS, CDC16, PDHB, TMEM109, LONP1, MCCC1, SMARCD1, TIMM9, CIRH1A, PRPF31, ELP2, ANAPC5, BYSL, ERP29, RING1, MTA1, POLR1C, LAS1L, MCM3, RSL1D1, RPS15, EDF1, PCCB, NHP2, PMPCA, AARS2, GLTSCR2, POLR2G, SDAD1, TH1L, RPL36, C14ORF169, TRRAP, BOP1, BMS1, PRPF19, HNRNPM, RPA2, DDX47, REXO4, NAT10, GTF3C2, APEX1, SHMT2, TSR1, PHB, MPHOSPH10, SMAD3, ILF3, SF3A3, CDC25B, ILF2, SUMF2, ATP5A1, DDX54, PARP1, DAP3
GO:0043233~organelle lumen 60 1.89E-13 157 1.50E-11
MMS19, ATP5D, RNMT, LYAR, QARS, CDC16, PDHB, TMEM109, LONP1, MCCC1, SMARCD1, CIRH1A, PRPF31, ELP2, ANAPC5, BYSL, ERP29, RING1, MTA1, POLR1C, LAS1L, MCM3, RSL1D1, RPS15, EDF1, PCCB, NHP2, PMPCA, AARS2, GLTSCR2, POLR2G, SDAD1, TH1L, RPL36, C14ORF169, TRRAP, BOP1, BMS1, PRPF19, HNRNPM, RPA2, DDX47, REXO4, NAT10, GTF3C2, APEX1, SHMT2, TSR1, PHB, MPHOSPH10, SMAD3, ILF3, SF3A3, CDC25B, ILF2, SUMF2, ATP5A1, DDX54, PARP1, DAP3
GO:0031981~nuclear lumen 47 5.20E-10 157 3.08E-08
MMS19, GLTSCR2, POLR2G, SDAD1, RNMT, LYAR, TH1L, RPL36, C14ORF169, TRRAP, BOP1, CDC16, BMS1, PRPF19, HNRNPM, RPA2, TMEM109, DDX47, REXO4, SMARCD1, NAT10, CIRH1A, GTF3C2, APEX1, PRPF31, ELP2, TSR1, ANAPC5, PHB, BYSL, RING1, MPHOSPH10, SMAD3, MTA1, POLR1C, LAS1L, ILF3, MCM3, SF3A3, CDC25B, RSL1D1, ILF2, RPS15, EDF1, DDX54, PARP1, NHP2
GO:0005730~nucleolus 29 2.12E-08 157 1.01E-06
GLTSCR2, SDAD1, LYAR, C14ORF169, RPL36, BOP1, BMS1, HNRNPM, DDX47, TMEM109, REXO4, SMARCD1, NAT10, CIRH1A, ELP2, TSR1, BYSL, RING1, MPHOSPH10, MTA1, ILF3, LAS1L, POLR1C, MCM3, RSL1D1, ILF2, DDX54, PARP1, NHP2
GO:0030529~ribonucleoprotein complex 23 2.87E-07 157 1.13E-05
MRPL2, MRPS27, SNRPA1, PRPF31, RPL19, PABPC4, MPHOSPH10, RPL36, EEF2, ILF3, BOP1, SF3A3, PRPF19, RSL1D1, HNRNPM, TARBP2, ILF2, RPS15, MRPL38, SNRNP40, APEX1, NHP2, DAP3
GO:0006412~translation 18 6.55E-07 167 8.13E-04
MRPL2, YARS, RPL19, PABPC4, RPL36, EEF2, QARS, VARS, RSL1D1, EIF3D, EIF3H, RPS15, EIF3F, EIF4A1, EIF3K, LGTN, AARS2, EIF2B4
GO:0006396~RNA processing 23 9.17E-07 167 5.70E-04
POLR2G, SNRPA1, PRPF31, RNMT, TSR2, PABPC4, MPHOSPH10, SMAD3, BOP1, RNMTL1, SF3A3, PRPF19, RSL1D1, HNRNPM, TARBP2, DDX39, DNAJC8, RPS15, SNRNP40, CPSF4, DDX54, NHP2, TYW1B
GO:0005654~nucleoplasm 28 7.79E-06 157 2.64E-04
MMS19, POLR2G, RNMT, TH1L, C14ORF169, TRRAP, BOP1, CDC16, PRPF19, RPA2, SMARCD1, GTF3C2, APEX1, PRPF31, ELP2, ANAPC5, PHB, RING1, SMAD3, MTA1, POLR1C, MCM3, CDC25B, SF3A3, RPS15, EDF1, PARP1, NHP2
GO:0003723~RNA binding 24 1.24E-05 155 4.37E-03
POLR2G, SNRPA1, PRPF31, YARS, RPUSD4, RNMT, RPL19, RPUSD2, PABPC4, ILF3, RNMTL1, RSL1D1, HNRNPM, TARBP2, DDX47, LONP1, DDX18, ILF2, EIF4A1, CPSF4, LGTN, DDX10, DDX54, NHP2
GO:0022613~ribonucleoprotein complex biogenesis 12 1.35E-05 167 5.56E-03
PRPF31, TARBP2, SDAD1, TSR1, TSR2, BYSL, RPS15, MPHOSPH10, BOP1, BMS1, NHP2, SF3A3
GO:0031967~organelle envelope 21 6.55E-05 157 1.94E-03
ATP5D, NXT1, SHMT2, NDUFB11, GIMAP5, SAMM50, NDUFB8, COX10, PHB, SMAD3, IPO9, TMEM109, NPIP, NUP205, MCCC1, TIMM9, ATP5A1, NDUFS3, PARP1, PMPCA, SLC25A17
GO:0031975~envelope 21 6.85E-05 157 1.80E-03
ATP5D, NXT1, SHMT2, NDUFB11, GIMAP5, SAMM50, NDUFB8, COX10, PHB, SMAD3, IPO9, TMEM109, NPIP, NUP205, MCCC1, TIMM9, ATP5A1, NDUFS3, PARP1, PMPCA, SLC25A17
GO:0003743~translation initiation factor activity 7 8.19E-05 155 1.44E-02
EIF3D, EIF3H, EIF4A1, EIF3F, EIF3K, LGTN, EIF2B4
GO:0044429~mitochondrial part 20 1.16E-04 157 2.73E-03
ATP5D, SHMT2, NDUFB11, GIMAP5, SAMM50, NDUFB8, COX10, PHB, QARS, PDHB, LONP1, MCCC1, TIMM9, ATP5A1, NDUFS3, AARS2, PCCB, PMPCA, DAP3, SLC25A17
GO:0005739~mitochondrion 29 1.17E-04 157 0.0025095
ATP5D, SAMM50, COX10, NDUFB8, QARS, VARS, PDHB, LONP1, MCCC1, TIMM9, MRPL38, NDUFS3, GTF3C2, RTN4IP1, MRPS27, MRPL2, NDUFB11, SHMT2, GIMAP5, PHB, ILF3, GLOD4, SMCR7L, ATP5A1, PMPCA, AARS2, PCCB, SLC25A17, DAP3
GO:0042254~ribosome biogenesis 9 1.28E-04 167 0.038887
SDAD1, TSR1, TSR2, BYSL, RPS15, MPHOSPH10, BOP1, BMS1, NHP2
GO:0008135~translation factor activity, nucleic acid binding 8 1.62E-04 155 0.0188477
EIF3D, EIF3H, EIF4A1, EIF3F, EIF3K, EEF2, LGTN, EIF2B4
GO:0042287~MHC protein binding 5 1.98E-04 155 0.0173521
TARP, ATP5A1, HLA-DMB, HLA-DMA, CD74
GO:0031980~mitochondrial lumen 11 4.68E-04 157 0.0092094
ATP5D, LONP1, SHMT2, MCCC1, QARS, ATP5A1, AARS2, PMPCA, PCCB, PDHB, DAP3
GO:0005759~mitochondrial matrix 11 4.68E-04 157 0.0092094
ATP5D, LONP1, SHMT2, MCCC1, QARS, ATP5A1, AARS2, PMPCA, PCCB, PDHB, DAP3
GO:0005852~eukaryotic translation initiation factor 3 complex 4 7.28E-04 157 0.013198
EIF3D, EIF3H, EIF3F, EIF3K
GO:0031966~mitochondrial membrane 14 0.001078 157 0.0180872
ATP5D, SHMT2, NDUFB11, GIMAP5, SAMM50, COX10, NDUFB8, PHB, MCCC1, TIMM9, ATP5A1, NDUFS3, PMPCA, SLC25A17
GO:0005740~mitochondrial envelope 14 0.001869 157 0.0291238
ATP5D, SHMT2, NDUFB11, GIMAP5, SAMM50, COX10, NDUFB8, PHB, MCCC1, TIMM9, ATP5A1, NDUFS3, PMPCA, SLC25A17
GO:0043228~non-membrane-bounded organelle 48 0.002223 157 0.0324264
GLTSCR2, SDAD1, RPL19, LYAR, RPL36, C14ORF169, BOP1, CDC16, BMS1, SUMO3, HNRNPM, RPA2, TMEM109, DDX47, LONP1, REXO4, CENPB, SMARCD1, NAT10, MRPL38, CIRH1A, APEX1, ZW10, MRPS27, MRPL2, SHMT2, ELP2, TSR1, BYSL, RING1, MPHOSPH10, MTA1, POLR1C, LAS1L, ILF3, MCM3, MPRIP, CDC25B, KLHDC3, RSL1D1, CCDC6, ILF2, RPS15, DDX54, BIN1, PARP1, NHP2, DAP3
GO:0043232~intracellular non-membrane-bounded organelle 48 0.002223 157 0.0324264
GLTSCR2, SDAD1, RPL19, LYAR, RPL36, C14ORF169, BOP1, CDC16, BMS1, SUMO3, HNRNPM, RPA2, TMEM109, DDX47, LONP1, REXO4, CENPB, SMARCD1, NAT10, MRPL38, CIRH1A, APEX1, ZW10, MRPS27, MRPL2, SHMT2, ELP2, TSR1, BYSL, RING1, MPHOSPH10, MTA1, POLR1C, LAS1L, ILF3, MCM3, MPRIP, CDC25B, KLHDC3, RSL1D1, CCDC6, ILF2, RPS15, DDX54, BIN1, PARP1, NHP2, DAP3
GO:0019866~organelle inner membrane 12 0.002347 157 0.0322239
ATP5D, NDUFB11, SHMT2, NDUFB8, PHB, MCCC1, TIMM9, SMAD3, ATP5A1, NDUFS3, PMPCA, SLC25A17
3.6 Variation in treatment response and coronary outcome across the 3 clusters
Figure 13 compares two clinical outcome measures highly relevant in KD: responsiveness to
IVIG treatment and coronary outcome. Both variables show a trend across the 3 clusters.
Looking first at clusters 1 and 3 (which appeared similar across most of the
features), cluster 1, which was earlier described as having a higher proportion of females, had the
higher fraction of patients that responded to treatment and relatively better coronary outcomes.
Cluster 3, which was exclusively male, had higher IVIG non-responsiveness
than the other clusters and also had the largest proportion of patients with a z-worst
score > 2.5. Cluster 2, in contrast, had the lowest proportion of patients that were IVIG
resistant, compared to clusters 1 and 3. This result was surprising, considering the previously
reported association between longer duration of fever and relatively worse coronary outcome
(88-90). Despite the clear differences in clinical and gene expression patterns between the
3 clusters, these trends in outcome measures for KD did not reach statistical
significance.
Figure 13. The 3 clusters vary with respect to treatment response and disease outcome.
All patients were treated with the identical therapeutic protocol, which included first line treatment with IVIG, the
current standard of care. IVIG non-responsiveness exhibits a trend, though not statistically significant (Fisher’s
exact test, p-value of 0.051), where cluster 2 had the lowest proportion of children that were IVIG non-responsive, while
cluster 3 had a higher rate of IVIG non-responsiveness compared to clusters 1 and 2. A reversed trend was seen for
clusters 1 and 3 in terms of coronary outcome (Fisher’s exact test, p-value of 0.73), where cluster 3 showed an
increase in patients with poor coronary outcome (Z-worst score > 2.5), while cluster 1 showed a decreased
proportion.
3.7 Unique clinical and biological classifiers for predicting cluster assignment
The clinical and gene expression profiles described in the previous figures describe only the three
groups of patients within our cohort. Further analysis is required to identify the specific
features that can classify a new patient into one of the 3 clusters. The supervised learning
approach used to select informative features in our study is called FeaLect, which is based on the
LASSO method for linear regression and scores the features based on their relevance when
generating the models (79). To limit the number of features being tested, we used all the clinical
variables and the top 411 genes (as mentioned earlier) as input. FeaLect validates the results by
training on 100 randomly generated subsamples (without replacement, each ¾ the size of
the original cohort). As part of the analysis, FeaLect generates a list of scores for each feature
from the input data, which, if plotted as logs of total scores and arranged in increasing order,
produce a 3-segment graph with curved ends and a linear middle portion (79). The authors
hypothesized that the linear portion represents irrelevant features that most likely contribute to
overfitting, while the non-linear exponential curve represents the most informative variables that
can be used as classifiers (79). Figure 14 shows the feature score graphs that were generated for
our clusters, with the vertical line at the end of each plot pointing to the part of the exponential
curve we used for identifying our informative features. Unlike the
spline-construction method in the original paper (79), the lines were placed using the
standard significance threshold of p<0.05 based on the frequency distribution of the feature
scores. Figure 15 and Table 5 illustrate the informative features (for each cluster) that were
identified using FeaLect. Based on these results, the classifiers that appear to be the most important
in identifying cluster 1 patients were gender, lymphadenopathy, extremity changes, and interferon
alpha and beta receptor subunit 1 (IFNAR1), amongst others. Cluster 2 patients were
differentiated by extremity changes, conjunctivitis, and rash, all of which showed the largest
contrasting differences in the clinical profiles described earlier in Figure 7, as well as genes
relating to metabolism (e.g., polymerase (RNA) II (DNA directed) polypeptide G (POLR2G) and
mitochondrial ribosomal protein L2 (MRPL2), which encode a subunit of RNA polymerase II
and a mitochondrial ribosomal protein, respectively) that correlate with the transition to an adaptive
immune response in these patients. Lastly, for cluster 3 patients, gender and
lymphadenopathy were identified as informative features, both of which had very contrasting clinical profiles in
Figure 7 for this group of patients, as well as some genes previously linked to KD, namely
S100A12 (S100A calcium binding protein family) (91), amongst others.
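The resampling scheme described above (100 subsamples drawn without replacement, each three quarters of the cohort) and the ordered log-score curve can be sketched as follows. The per-subsample scoring is delegated to a hypothetical `score_features` callback that stands in for the LASSO-based scoring performed by FeaLect; nothing here reproduces the package's internals, and all names are illustrative.

```python
import math
import random

def total_feature_scores(n_patients, score_features, rounds=100, fraction=0.75, seed=0):
    """Sum per-feature scores over random subsamples drawn without replacement.

    `score_features` is a hypothetical callback: it receives the list of sampled
    patient indices and returns a dict mapping feature name -> score.
    """
    rng = random.Random(seed)
    size = int(n_patients * fraction)
    totals = {}
    for _ in range(rounds):
        subset = rng.sample(range(n_patients), size)
        for feature, score in score_features(subset).items():
            totals[feature] = totals.get(feature, 0.0) + score
    return totals

def log_score_curve(totals):
    """(feature, log total score) pairs in increasing order, as plotted in Figure 14."""
    positive = {f: math.log(s) for f, s in totals.items() if s > 0}
    return sorted(positive.items(), key=lambda item: item[1])
```

In this sketch the informative features would be those at the upper end of the returned curve; where to place the cut-off is a separate decision, made here with the p<0.05 threshold described in the text.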
Figure 14. FeaLect total feature scores.
FeaLect feature scoring algorithm, using the top 411 genes from the gene expression dataset (Kruskal-Wallis
adjusted p-value (Holm-Bonferroni) < 0.001) and all the clinical variables, was performed 3 times, as a set of binary
regressions for each cluster assignment. Total feature scores (log-scale) were plotted for each cluster and
informative features were picked based on a p<0.05 statistical significance, as denoted by the vertical line at the end
of each graph. See Figure 15 and Table 5 for the list of extracted features.
Figure 15. Informative clinical and biologic features identified with FeaLect.
FeaLect feature scoring algorithm, using the top 411 genes from the gene expression dataset (Kruskal-Wallis
adjusted p-value (Holm-Bonferroni) < 0.001) and all the clinical variables, was performed 3 times, as a set of binary
regressions for each cluster assignment. The bar graphs represent the total feature scores (log-scale) of the informative
features (based on a p<0.05 statistical significance) that were extracted via FeaLect. See Table 5 for a list of all these
features along with a description of some biologic variables.
Table 5. Description of FeaLect classifiers for predicting cluster assignment.
Annotation of the features presented in Figure 15. Variables listed for each cluster represent the informative
features selected with FeaLect (based on a p<0.05 statistical significance) for each cluster. Relative expression of
some of the biologic variables between the clusters is displayed in Figure 16.
Figure 16. Relative gene expression profiles of biologic variables for each set of features extracted with
FeaLect.
These heatmaps represent the relative expression of genes from the informative features identified with FeaLect in
Table 5 and Figure 15.
4 Discussion
Our SNF analysis (Figure 4B and C) yielded 3 distinct clusters of 78, 33, and 48 patients by
merging the biologic and clinical datasets. Most importantly, these clusters are the fused product
of the patterns across the individual datasets. The problem with analyzing multiple datasets
separately is apparent in Figure 4A, which shows the similarity matrix for each individual
dataset. Converting the original datasets to patient similarity matrices removes the large contrast
in the number of features and allows easier comparison between gene expression (19,539 genes)
and clinical data (20 variables), yet each dataset still shows different patterns across the cohort.
With spectral clustering, the gene expression dataset has 2 clusters, the clinical categorical
dataset has 5, and the clinical numerical dataset has 4. This disagreement between the datasets
makes it difficult to draw inferences about the patients in the cohort. Fusion of the similarity
networks, however, let us clearly identify the 3 clusters while incorporating the patterns from
all 3 datasets in the final network. SNF achieves this by strengthening patterns that are common
between datasets, while weakening patterns that are not shared.
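The strengthening and weakening of patterns described above can be illustrated with a minimal sketch of the cross-diffusion update at the core of SNF. This is a simplified illustration in plain Python and not the pipeline used in this study: the two toy affinity matrices, the neighbourhood size k, the number of iterations, and all function names are assumptions made for the example, and the additional normalization steps found in published implementations are omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def full_kernel(w):
    """Row-normalise an affinity matrix; the diagonal is fixed at 1/2."""
    n = len(w)
    p = []
    for i in range(n):
        off_diag = sum(w[i][j] for j in range(n) if j != i)
        p.append([0.5 if j == i else w[i][j] / (2.0 * off_diag) for j in range(n)])
    return p


def local_kernel(w, k):
    """Keep each patient's k strongest affinities (self included), row-normalised."""
    s = []
    for row in w:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        total = sum(row[j] for j in keep)
        s.append([row[j] / total if j in keep else 0.0 for j in range(len(row))])
    return s


def fuse(affinities, k=2, iterations=5):
    """Simplified SNF: each network is diffused through the mean of the others."""
    m, n = len(affinities), len(affinities[0])
    kernels = [local_kernel(w, k) for w in affinities]
    status = [full_kernel(w) for w in affinities]
    for _ in range(iterations):
        updated = []
        for v in range(m):
            others = [status[u] for u in range(m) if u != v]
            mean_other = [[sum(p[i][j] for p in others) / len(others)
                           for j in range(n)] for i in range(n)]
            s_t = [list(col) for col in zip(*kernels[v])]
            updated.append(matmul(matmul(kernels[v], mean_other), s_t))
        status = updated
    # The fused network is the average of the diffused networks.
    return [[sum(p[i][j] for p in status) / m for j in range(n)] for i in range(n)]


# Toy affinities for 4 hypothetical patients. Both data types agree that
# patients 0-1 and patients 2-3 are similar; only the second data type
# contains an extra link between patients 1 and 2.
w_expression = [[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]]
w_clinical = [[1.0, 0.8, 0.1, 0.1],
              [0.8, 1.0, 0.7, 0.1],
              [0.1, 0.7, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]]

fused = fuse([w_expression, w_clinical])
print("shared link 0-1:", round(fused[0][1], 3))
print("unshared link 1-2:", round(fused[1][2], 3))
```

In this toy example the similarity between patients 0 and 1, which is present in both matrices, remains high in the fused network, while the link between patients 1 and 2, which is present in only one matrix, is weakened.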
As with any other clustering method, an important question is whether the patterns we are seeing
are based on the entire cohort or only a small subset of patients that are driving the cluster
formation. Co-clustering probability in Figure 5 shows that even when 40% of patients were
removed, we were still able to maintain 80% co-clustering probability. In other words, 80% of
the original pairwise patient-patient co-clustering relationships remained the same after re-
running the analysis with trimmed datasets. This is evidence that SNF is highly robust with
respect to our KD study and the 3 clusters can be re-constructed even with smaller sample sizes.
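One way to read the co-clustering probability described above is as the fraction of patient pairs whose "same cluster or not" relationship is unchanged after re-clustering a trimmed cohort. The sketch below follows that reading; the exact formula used for Figure 5 is not restated here, so the function name, the patient identifiers, and the labels are hypothetical.

```python
from itertools import combinations


def coclustering_agreement(original, rerun):
    """Fraction of patient pairs whose 'same cluster or not' status is unchanged.

    original, rerun: dicts mapping patient id -> cluster label. Only patients
    present in both runs are compared, so removed patients are ignored and the
    result does not depend on how the clusters happen to be numbered.
    """
    shared = [p for p in rerun if p in original]
    pairs = list(combinations(shared, 2))
    unchanged = sum(
        (original[a] == original[b]) == (rerun[a] == rerun[b]) for a, b in pairs
    )
    return unchanged / len(pairs)


# Hypothetical labels: six patients originally, one removed before the re-run.
original_labels = {"p1": 1, "p2": 1, "p3": 2, "p4": 2, "p5": 3, "p6": 3}
rerun_labels = {"p1": "a", "p2": "a", "p3": "b", "p4": "b", "p5": "b"}

agreement = coclustering_agreement(original_labels, rerun_labels)
print(f"co-clustering agreement: {agreement:.2f}")
```

A robustness estimate of this kind would normally be averaged over many random subsamples at each removal fraction rather than computed from a single re-run.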
A similar question was asked about the effect of variables on cluster formation: are the observed
patterns based on all the variables in the datasets, or are they driven by only a select few? This
matters little for gene expression, because removing one gene out of 19,539 has a negligible
effect, but it plays a role in the clinical datasets, both continuous and categorical (20 features
in total), where any single variable has a much stronger effect on the pairwise similarities being
calculated. Figure 6 illustrates that removing most of these variables leaves the co-clustering
probability higher than 0.90, with the exception of gender and lymphadenopathy, whose removal
lowers the co-clustering probability below 0.7. This does show up in our clusters, as cluster
number 2 is largely unique in both of these features, but the probabilities are still relatively
high and demonstrate that the patterns our clusters represent are driven by a combination of
variables rather than a few single features. The effect of gender on our cluster formation is to
be expected, since our cohort is 60% male (Table 1) and there are inherent differences between
males and females in any dataset (87). Furthermore, KD is known to have a higher occurrence in
males, so a key role for gender in cluster formation is expected. Because gender-related genes
confound the gene expression data, and in order to extract more meaningful results, we did,
however, remove any gender-correlated variables from our gene expression data before gene
enrichment analysis and feature selection with FeaLect.
SNF was able to identify homogenous clusters in our KD cohort, in a completely data-driven
way. Furthermore, the extracted clusters exhibit clinically meaningful results, both across the
clinical and gene expression datasets, in the context of KD. Our SNF analysis was able to discern
patients that appeared to be in different biologic stages of disease at presentation, namely in
innate (clusters 1 and 3) or adaptive immune responses (cluster 2). The importance of the innate
and adaptive immune systems in KD was established in previous studies and has been summarized
earlier (92).
The first line of evidence for different phases of the immune response can be observed in the
patterns across the clinical variables in Figure 7 and Figure 9, covering demographic and
laboratory test features, respectively. Looking at the clinical features more closely, cluster 2 has a
longer duration of fever and a lower incidence of classical KD symptoms (rash, lymphadenopathy,
oral changes, extremity changes, and conjunctivitis), which is consistent with a lower inflammatory
response (reflected by lower CRP values in Figure 9). The seemingly contradictory elevated ESR
values in these patients actually correlate with the longer duration of fever, as ESR is an indirect
measure of inflammation and has much slower kinetics than CRP (24). As would be expected,
the differential white blood cell counts are also in accordance with our claim – cluster 2 shows
decreased neutrophil and bands percentages, but increased lymphocytes. Lastly, the increased
platelet count that is seen in cluster 2 also correlates with increased duration of fever, as the 2
have been previously linked, and is consistent with the natural history of KD moving into the
subacute phase with an IL-6 driven increase in platelets (26).
Cluster 1, in contrast to cluster 2, shows signs of an earlier response involving the innate immune
system. Cluster 1 has a higher incidence of the classical symptoms of KD, and the duration of
fever indicates that the patients are earlier in the inflammatory response, as seen in Figure 7 and
Figure 9. The higher levels of neutrophils and lower levels of lymphocytes further support the
innate phase of the disease.
Based on the clinical patterns, cluster 3 does not stand out as strongly as the other 2
clusters. It is, however, most similar to cluster 1 in terms of fever duration at diagnosis,
presentation of the majority of classical KD symptoms, and most of the laboratory test variables
(Figure 7 and Figure 9). Consequently, the patients in this cluster also appear to be in the early
innate immune response stage of the disease, much like cluster 1. The 2 variables that best
differentiate clusters 1 and 3 are gender and presence of lymphadenopathy, the 2 features that
were previously identified in Figure 6 as having the strongest effect on cluster formation relative
to all the other variables. However, despite the similarity across many of the features, cluster 1
may actually exhibit a slightly more pronounced inflammatory response: cluster 3 has lower
neutrophil percentages and ESR levels than cluster 1, and the biologic patterns seen in Figure 11
(related to inflammation and innate immune response) follow the same pattern as in cluster 1,
but with less intensity.
Based on just the clinical results, patients in cluster 2 have been sick for longer than those in the
other groups, and are transitioning into the adaptive immune response, hence the lower
inflammation profile across both markers of inflammation and white blood cell differential counts.
Clusters 1 and 3, on the other hand, are still early in the innate immune response, which is
reflected in higher inflammation profiles and white blood cell counts skewed towards innate
immune cells.
The most interesting part of our SNF analysis came from the incorporation of biologic data. SNF
formed the 3 clusters based not only on the clinical patterns described earlier, but also on the
underlying patterns in the thousands of features in the gene expression dataset. The result was a
ranking of genes that differed significantly amongst the groups of patients (measured with the
Kruskal-Wallis test and adjusted with the Holm-Bonferroni multiple hypothesis testing
correction). After removal of genes that strongly correlated with gender, expression of the top
remaining significant genes (411 genes) revealed 2 clear patterns across the 3 clusters: the first
group of genes had lower expression in cluster 2 and higher expression in clusters 1 and 3, while
the opposite pattern was observed for the second group of genes (Figure 10). The gene enrichment
analysis carried out on the two gene sets revealed patterns in accordance with the clinical
findings.
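The multiple testing step mentioned above can be sketched as follows. This is a minimal illustration of the Holm-Bonferroni step-down adjustment in plain Python, assuming that one raw p-value per gene is already available from a Kruskal-Wallis test across the 3 clusters (for example from scipy.stats.kruskal). The gene identifiers and p-values are made up for the example and are not results from this study; the 0.001 cut-off is the one quoted in the captions of Figures 14 and 15.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values are forced to be non-decreasing along the ranking.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, one per gene.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
raw_p = [0.00001, 0.0004, 0.03, 0.2]

adjusted_p = holm_adjust(raw_p)
selected = [g for g, p in zip(genes, adjusted_p) if p < 0.001]

for gene, raw, adj in zip(genes, raw_p, adjusted_p):
    print(f"{gene}: raw={raw:.5f} adjusted={adj:.5f}")
print("selected:", selected)
```

Here the second hypothetical gene has a raw p-value below 0.001 but no longer passes the threshold after adjustment, which is the kind of filtering the correction is meant to provide.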
The GO terms representative of the genes that follow a pattern of decreased levels of expression
in cluster 2 (Figure 11 and Table 3), appear to correlate with the patterns of inflammation that we
have seen in the clinical variables across the 3 clusters. “Innate Immune response”, “Immune
Response”, “Protein Kinase Cascade”, and “Inflammatory Response” GO terms all communicate
the same thing and follow the same patterns we have seen in neutrophil % and CRP (Figure 4B).
The pattern is that clusters 1 and 3 have the highest level of inflammation (cluster 1 slightly
higher than 3) and cluster 2 has the lowest. These correlations are expected because the genes
identified are the ones driving the inflammation. Although the list of genes that make up these
groups is still rather large, some specific genes that are of interest are IL-1 related (IL18R1,
IL1RAP, IL1R2 in Figure 11) and Fc-gamma receptor genes (FCGR1C in Figure 11), which
have been previously linked to Kawasaki Disease (69, 93). Fc-gamma receptors, which bind IgG
antibodies and transduce downstream signaling cascades (94), may potentially be involved in
IVIG response due to the abundance of IgG in IVIG preparations. Furthermore, a previously
mentioned GWAS study has identified the FCGR2A gene, encoding one of the Fc-gamma receptors,
as linked to KD susceptibility (69). IL-1 secretion is linked to KD, with increased levels of the
cytokine found in KD patients during the acute phase of the disease (93). IL1RAP encodes part of
the IL-1 receptor and IL18R1 encodes part of the IL-18 receptor (the genes can be found under the
‘Innate Immune Response’ GO term in Figure 11); both are pro-inflammatory and show
increased expression in clusters 1 and 3, relative to cluster 2 (95). IL1R2 (under the
‘Immune Response’ GO term in Figure 11), though an inhibitory decoy receptor, is normally
expressed as well in order to modulate the inflammatory response, so its increased expression in
clusters 1 and 3 alongside the pro-inflammatory IL-1 and IL-18 components is not out of the
ordinary (95). Furthermore, a paper studying the same KD cohort against other pediatric
bacterial and viral infections, identified the IL-1 signaling pathway as a key signature in KD
compared to the other diseases, with implications for use in treatment (82). In fact, a case report
of a relapsing KD patient has previously demonstrated the beneficial effects of an IL-1 receptor
antagonist on disease outcome, further supporting the importance of IL-1 signaling in KD
pathogenesis (96). The remaining GO groups pertaining to cell fractions and plasma membrane
appear to be non-specific and do not clearly describe a molecular function, so it is hard to draw
any conclusions about their implications in our clusters.
GO terms describing group 2 genes in Figure 12 (full list in Table 4), such as “Translation”,
“RNA processing”, “Mitochondrion”, and “Ribonucleoprotein Complex”, are all terms related to
metabolism and protein synthesis. Higher relative expression in cluster 2 further supports the
initiation of adaptive immune response in these patients, while innate immune response is still
active in clusters 1 and 3 (cluster 1 expression appears to be slightly lower than cluster 3,
consistent with the opposite intensity for group 1 GO terms). Even though both innate and
adaptive immune responses undergo cell activation and subsequent proliferation, innate cells,
such as neutrophils, are more transient and short lived (97). Since prolonged inflammation is
damaging to the host, it is tightly regulated to be diminished following acute inflammation (98).
This is in concert with observations of increased lymphocytes and decreased neutrophils in
cluster 2, indicative of a transition to the adaptive immune response and increased metabolism to
support lymphocyte proliferation in adaptive immunity.
Aside from metabolism involvement, another important GO group describing a set of genes was
related to MHC protein binding, a component of the immune system that was previously
associated with KD (60, 70). Once again, cluster 2 showed higher levels of expression in the
related genes, compared to clusters 1 and 3. The HLA region has been previously linked to
Kawasaki Disease: a GWAS published in 2012 identified the 6p21.3 region as having a
genome-wide significant association with KD, and this region includes HLA Class 2 genes
such as HLA-DQA2 and HLA-DOB (70). Though HLA-DMA and HLA-DMB (encoding the
HLA-DM heterodimer) in Figure 12 are not part of this list, they are actually closely linked to
HLA-DO (one of the 2 chains is encoded by HLA-DOB), where HLA-DO is a modulator of
HLA-DM (99). HLA-DMA is an MHC-Class 2 protein that plays an important role in the
loading of peptides onto HLA molecules for antigen presentation, helping to exchange CLIP for
the antigen peptide (100). It is hypothesized to also impose a form of peptide selection that
creates a specific immune response and prevents cross-reactivity (100). HLA-DO is upregulated
in resting APCs, which in turn downregulates HLA-DM activity, consequently promoting a
broader low-abundance repertoire of antigens to be presented on the surface to diminish the
chance of a reaction against self-peptides (101). During an immune response, however, when
APCs need to present foreign antigens to the adaptive immune system, HLA-DM activity is
increased to facilitate an immunodominant response (101). The observed pattern is consistent
with a transition to the adaptive response in cluster 2, as supported by both our clinical results
and the role of MHC-related molecules.
Clinical outcomes, shown in Figure 6, compare and contrast the 3 clusters across IVIG
responsiveness and coronary outcome in KD. Though not statistically significant, there is an
apparent trend between the patient groups that is consistent with the other findings presented
earlier. Cluster 1 has a lower proportion of patients who did not respond to IVIG, compared to
cluster 3, which can be attributed to the fact that cluster 1 was mostly composed of females.
This is consistent with the gender bias in KD susceptibility, with higher incidence in males
(1.5-1.7:1 ratio) (7). For the same reason, it is quite possible that cluster 3 displays a relatively
higher proportion of patients with coronary enlargements because this cluster is 100%
male. Together, these clinical patterns are coherent, as IVIG non-responsiveness is often
associated with worse coronary outcome. However, taking into account most of the clinical and
laboratory test variables in Figure 7 and Figure 9, the differences observed between clusters 1
and 3 are small. Gene expression results, on the other hand, especially in Figure 11, appear to
show a more visible difference between the two clusters, relating to innate immune response and
inflammation gene signatures. Comparing this to the contrasting trend between outcome
measures in these groups of patients further supports the notion that the current AHA criteria for
diagnosis of KD are not sufficient for identifying at-risk patients without biological input.
Failure to identify and properly treat at-risk patients in time may very well contribute to the
relatively worse coronary outcome and lower IVIG responsiveness of cluster 3 patients that was
observed in relation to cluster 1. In contrast to these observations, we also observed a rather
unique pattern in IVIG non-responsiveness of our cluster 2 patients. This appears to be
clinically and biologically meaningful, as cluster 2 is
the cluster of patients with longer duration of fever at diagnosis. Prolonged fever has been
identified as a strong predictor of poor coronary outcome in numerous previous studies (88-90).
As a result, the fact that cluster 2 has a relatively lower incidence of patients with a z-worst
score > 2.5 compared to clusters 1 and 3, and a relatively lower IVIG non-responsiveness rate,
indicates that the clinical variables are not able to capture the underlying biological differences
within these patients. That said, we did not find statistical significance in the outcome measures,
likely due to lack of statistical power: the differences between the clusters were small and would
require a larger sample size to reject the null hypothesis.
Analyzing gene patterns amongst the 3 clusters serves the purpose of describing the similarities
and differences between the groups of patients in our cohort, but it should not be used to make
generalizations about future patients. That is, out of all the significant hits we found, we cannot
draw conclusions about a feature’s importance in placing a given patient into a given cluster
without some form of supervised learning or validation. Using FeaLect as a method for feature
selection (79), we identified features that can be used as classifiers for assigning one of the
three clusters to any given patient. Since this is a multi-class problem, the algorithm was run on
3 separate models, one for each cluster. In other words, in each model, the output variable was
converted to a binary classification of whether the patient belongs to a given cluster or not. As a
result, Figure 15 and Table 5 list the 3 sets of features that are important in classifying a
patient into each of the clusters. The top clinical features across the 3 sets of variables in
Figure 15 and Table 5 are gender and the classical KD symptoms. This is not surprising, as these
variables previously showed contrasting levels across the 3 clusters (e.g., gender and
lymphadenopathy are the two variables that differ the most between clusters 1 and 3). A number
of clinical laboratory test features have also made the list, as they also appeared to differ
strongly amongst the different groups of patients. These include
‘Platelet Count’, ‘Duration of Fever’, and ‘Neutrophils’, amongst others, all of which play a big
role in drawing a line between adaptive and innate immunity that we observed in this cohort.
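The conversion of the 3-cluster assignment into 3 binary outcomes described above can be sketched as follows. This shows only the construction of the one-versus-rest target vectors; the feature scoring itself was done with FeaLect and is not reproduced here. The function name and the example cluster labels are hypothetical.

```python
def one_vs_rest_targets(cluster_labels):
    """Build one binary target vector per cluster (1 = patient is in that cluster)."""
    clusters = sorted(set(cluster_labels))
    return {
        cluster: [1 if label == cluster else 0 for label in cluster_labels]
        for cluster in clusters
    }


# Hypothetical cluster assignments for six patients.
labels = [1, 3, 2, 1, 3, 1]
targets = one_vs_rest_targets(labels)

for cluster, target in targets.items():
    print(f"cluster {cluster} vs rest: {target}")
```

Each of the resulting vectors would serve as the outcome of one binary model, so that a separate set of informative features is obtained for each cluster.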
Moving on to the biological variables that have been identified as informative, IFNAR1
(Interferon alpha and beta Receptor subunit 1) is one of several genes that stand out in
relation to KD. This gene encodes one of the subunits of the IFNAR receptor, which is known to
bind type I interferons (102). Type I interferons, such as IFN-α and IFN-β, often induce anti-viral
and immune system modulating effects in response to viral or bacterial infections (102). This
subclass of cytokines also happens to be on the other end of the balance for autoinflammation
from IL-1 signaling (103), which we know is an important immune axis in KD (93). As they
counter-regulate each other, balance between Type I interferon activity and IL-1 signaling may
have implications for KD pathology. Figure 16 shows the relative expression of IFNAR1
between the clusters; clusters 1 and 3 show higher relative expression, in accordance
with their innate immune response state. Previous studies have also found increased type I
interferon-induced gene regulation in coronary arteries of KD patients (104).
Another gene in Figure 15 and Table 5 (with heatmap showing expression in Figure 16), that is
worth noting is GPD1L. GPD1L encodes a sodium channel interacting protein that is expressed
in cardiac tissues (105). It is thought to affect levels of a sodium channel SCN5A on the cell
membrane, with mutations in the GPD1L gene linked to inherited arrhythmias, such as Brugada
syndrome (105). Even though no exact association has been previously reported between this
gene and KD, low sodium concentrations were previously linked to IVIG unresponsiveness in
the Kobayashi score that was previously described (41).
Another important feature identified by FeaLect is S100A12 (under cluster 3
classifiers in Figure 15 and Table 5). Its relative expression in cluster 2 is lower, compared to the
other two clusters, as seen in Figure 16. The importance of S100A12 in our list of classifiers is
not surprising as it was previously linked to the acute stages of KD (106). The S100 proteins
belong to the calcium binding family, including S100A8 and S100A9, which together with
S100A12, are found in phagocytes (106). S100A12 in particular, is known for binding the RAGE
protein (a pro-inflammatory pattern recognition receptor), and leading to downstream production
of pro-inflammatory markers such as TNF-α and IL-1β (106). S100A12 being a good indicator
of inflammation (106) is supported in our results, as cluster 2 does have decreased inflammation.
Lastly, POLR2G, MRPL2, and VARS, encoding RNA polymerase II polypeptide G,
mitochondrial ribosomal protein L2, and valyl-tRNA
synthetase, respectively, are worth noting as they appear under the cluster 2 set of classifiers in
Table 5. They show higher levels of expression and were most likely selected
by FeaLect due to the increased metabolic profile that was identified in cluster 2 earlier (Figure
12).
5 Study Limitations
Several components of the experimental design and analysis can be improved in future studies of
KD utilizing SNF. First and foremost, increasing the number of patients can increase the power
to detect significant patterns in coronary outcome and IVIG unresponsiveness in KD patients.
Another consideration during experimental design would be to include other datasets, such as
methylation data, protein expression, and other clinical variables in SNF analysis. Using other
large datasets may uncover relevant patterns related to KD pathology that SNF can incorporate in
its fused matrix. Acquiring gene expression data from coronary tissue, in contrast to whole blood
used in our study, may also provide more insight into KD. Similarly, adjusting the composition
of the smaller clinical datasets (where each variable may have a larger effect on the final
network, compared to a single feature from a large gene expression dataset) should be taken into
consideration. On the same note, removing the effects of potentially confounding variables, such
as gender, from the datasets before running SNF may improve the discoverability of important
new patterns and generate patient clusters with fewer confounding effects. Lastly, validating
findings in another KD cohort and carrying out laboratory
experiments to confirm effects of some of the genes identified can further improve our
understanding of KD in future studies.
6 Conclusions
Our hypothesis stated that SNF can combine clinical and biological data to identify homogenous
groups of KD patients, and our analysis showed this to be true. We have identified 3 unique
clusters that not only were based on the clinical variables, but were generated based on
contributions from the gene expression dataset as well. Taken together, our results showed that
cluster 2 patients were uniquely identified as being in the later stages of disease with transition to
the adaptive immune response phase, in the form of decreased inflammation markers, increase in
lymphocytes, and decrease in neutrophils. Gene expression results for cluster 2 reflected the
clinical manifestations with increase in metabolic profile, protein synthesis, and HLA expression,
as well as decrease in genes related to innate immune response and inflammation. Clusters 1 and
3 displayed a much earlier stage of the disease that was reflective of an innate immune response,
with higher neutrophil percentages and increased levels of inflammatory markers and genes.
Despite being similar to each other across most of the clinical and laboratory measures, clusters 1
and 3 differed across the intensity of their gene expression patterns, with cluster 1 showing a
slightly more pronounced inflammatory response. An even more pronounced difference was
observed across IVIG non-responsiveness and coronary outcome, exhibiting a higher risk trend
for cluster 3 patients. Overall, our findings have identified patterns that were previously
associated with KD and show the power of modern computational techniques in bringing
together multiple datasets to drive analysis. In a completely data-driven way, SNF identified
clinically meaningful homogenous groups of patients in a KD cohort, further confirming the
heterogeneity of the disease and giving us a new toolset for future studies of KD.
7 References
1. Taubert, K. A., A. H. Rowley, and S. T. Shulman. 1991. Nationwide survey of Kawasaki
disease and acute rheumatic fever. The Journal of pediatrics 119: 279-282.
2. Kato, H., T. Sugimura, T. Akagi, N. Sato, K. Hashino, Y. Maeno, T. Kazue, G. Eto, and
R. Yamakawa. 1996. Long-term consequences of Kawasaki disease. A 10- to 21-year
follow-up study of 594 patients. Circulation 94: 1379-1385.
3. Uehara, R., and E. D. Belay. 2012. Epidemiology of Kawasaki disease in Asia, Europe,
and the United States. Journal of epidemiology / Japan Epidemiological Association 22:
79-85.
4. Joffe, A., A. Kabani, and T. Jadavji. 1995. Atypical and complicated Kawasaki disease in
infants. Do we need criteria? The Western journal of medicine 162: 322-327.
5. Tseng, C. F., Y. C. Fu, L. S. Fu, H. Betau, and C. S. Chi. 2001. Clinical spectrum of
Kawasaki disease in infants. Zhonghua yi xue za zhi = Chinese medical journal; Free
China ed 64: 168-173.
6. Witt, M. T., L. L. Minich, J. F. Bohnsack, and P. C. Young. 1999. Kawasaki disease:
more patients are being diagnosed who do not meet American Heart Association criteria.
Pediatrics 104: e10.
7. Newburger, J. W., M. Takahashi, M. A. Gerber, M. H. Gewitz, L. Y. Tani, J. C. Burns, S.
T. Shulman, A. F. Bolger, P. Ferrieri, R. S. Baltimore, W. R. Wilson, L. M. Baddour, M.
E. Levison, T. J. Pallasch, D. A. Falace, K. A. Taubert, and Committee on Rheumatic
Fever, Endocarditis, and Kawasaki Disease, Council on Cardiovascular Disease in the
Young, American Heart Association. 2004. Diagnosis, treatment, and
long-term management of Kawasaki disease: a statement for health professionals from
the Committee on Rheumatic Fever, Endocarditis, and Kawasaki Disease, Council on
Cardiovascular Disease in the Young, American Heart Association. Pediatrics 114: 1708-
1733.
8. Holman, R. C., A. T. Curns, E. D. Belay, C. A. Steiner, and L. B. Schonberger. 2003.
Kawasaki syndrome hospitalizations in the United States, 1997 and 2000. Pediatrics 112:
495-501.
9. Chang, R. K. 2003. The incidence of Kawasaki disease in the United States did not
increase between 1988 and 1997. Pediatrics 111: 1124-1125.
10. Kim, G. B., J. W. Han, Y. W. Park, M. S. Song, Y. M. Hong, S. H. Cha, D. S. Kim, and
S. Park. 2014. Epidemiologic features of Kawasaki disease in South Korea: data from
nationwide survey, 2009-2011. The Pediatric infectious disease journal 33: 24-27.
11. Lue, H. C., L. R. Chen, M. T. Lin, L. Y. Chang, J. K. Wang, C. Y. Lee, and M. H. Wu.
2014. Epidemiological features of Kawasaki disease in Taiwan, 1976-2007: results of
five nationwide questionnaire hospital surveys. Pediatrics and neonatology 55: 92-96.
12. Chen, J. J., X. J. Ma, F. Liu, W. L. Yan, M. R. Huang, M. Huang, G. Y. Huang, and
Shanghai Kawasaki Disease Research Group. 2016. Epidemiologic Features of Kawasaki
Disease in Shanghai From 2008 Through 2012. The Pediatric infectious disease journal
35: 7-12.
13. Holman, R. C., E. D. Belay, K. Y. Christensen, A. M. Folkema, C. A. Steiner, and L. B.
Schonberger. 2010. Hospitalizations for Kawasaki syndrome among children in the
United States, 1997-2007. The Pediatric infectious disease journal 29: 483-488.
14. Holman, R. C., A. T. Curns, E. D. Belay, C. A. Steiner, P. V. Effler, K. L. Yorita, J.
Miyamura, S. Forbes, L. B. Schonberger, and M. Melish. 2005. Kawasaki syndrome in
Hawaii. The Pediatric infectious disease journal 24: 429-433.
15. Fujita, Y., Y. Nakamura, K. Sakata, N. Hara, M. Kobayashi, M. Nagai, H. Yanagawa,
and T. Kawasaki. 1989. Kawasaki disease in families. Pediatrics 84: 666-669.
16. Harada, F., M. Sada, T. Kamiya, Y. Yanase, T. Kawasaki, and T. Sasazuki. 1986. Genetic
analysis of Kawasaki syndrome. American journal of human genetics 39: 537-539.
17. Park, Y. W., J. W. Han, Y. M. Hong, J. S. Ma, S. H. Cha, T. C. Kwon, S. B. Lee, C. H.
Kim, J. S. Lee, and C. H. Kim. 2011. Epidemiological features of Kawasaki disease in
Korea, 2006-2008. Pediatrics international : official journal of the Japan Pediatric
Society 53: 36-39.
18. Huang, W. C., L. M. Huang, I. S. Chang, L. Y. Chang, B. L. Chiang, P. J. Chen, M. H.
Wu, H. C. Lue, C. Y. Lee, and Kawasaki Disease Research Group. 2009. Epidemiologic
features of Kawasaki disease in Taiwan, 2003-2006. Pediatrics 123: e401-405.
19. Ma, X. J., C. Y. Yu, M. Huang, S. B. Chen, M. R. Huang, G. Y. Huang, and Shanghai
Kawasaki Research Group. 2010. Epidemiologic features of Kawasaki disease in Shanghai from
2003 through 2007. Chinese medical journal 123: 2629-2634.
20. Kawasaki, T. 2006. Kawasaki disease. Proceedings of the Japan Academy. Series B,
Physical and biological sciences 82: 59-71.
21. Gerding, R. 2011. Kawasaki disease: a review. Journal of pediatric health care : official
publication of National Association of Pediatric Nurse Associates & Practitioners 25:
379-387.
22. Son, M. B., and R. P. Sundel. 2016. Chapter 35 - Kawasaki Disease. In Textbook of Pediatric
Rheumatology (Seventh Edition). R. E. Petty, R. M. Laxer, C. B. Lindsley, and L. R.
Wedderburn, eds. W.B. Saunders, Philadelphia. 467-483.e466.
23. Tashiro, N., T. Matsubara, M. Uchida, K. Katayama, T. Ichiyama, and S. Furukawa.
2002. Ultrasonographic evaluation of cervical lymph nodes in Kawasaki disease.
Pediatrics 109: E77-77.
24. Anderson, M. S., J. Burns, T. A. Treadwell, B. A. Pietra, and M. P. Glode. 2001.
Erythrocyte sedimentation rate and C-reactive protein discrepancy and high prevalence of
coronary artery abnormalities in Kawasaki disease. The Pediatric infectious disease
journal 20: 698-702.
25. Burns, J. C., W. H. Mason, M. P. Glode, S. T. Shulman, M. E. Melish, C. Meissner, J.
Bastian, A. S. Beiser, H. M. Meyerson, and J. W. Newburger. 1991. Clinical and
epidemiologic characteristics of patients referred for evaluation of possible Kawasaki
disease. United States Multicenter Kawasaki Disease Study Group. The Journal of
pediatrics 118: 680-686.
26. Yeung, R. S. 2007. Phenotype and coronary outcome in Kawasaki's disease. Lancet 369:
85-87.
27. Suddleson, E. A., B. Reid, M. M. Woolley, and M. Takahashi. 1987. Hydrops of the
gallbladder associated with Kawasaki syndrome. Journal of pediatric surgery 22: 956-
959.
28. de Zorzi, A., S. D. Colan, K. Gauvreau, A. L. Baker, R. P. Sundel, and J. W. Newburger.
1998. Coronary artery dimensions may be misclassified as normal in Kawasaki disease.
The Journal of pediatrics 133: 254-258.
29. Kurotobi, S., T. Nagai, N. Kawakami, and T. Sano. 2002. Coronary diameter in normal
infants, children and patients with Kawasaki disease. Pediatrics international : official
journal of the Japan Pediatric Society 44: 1-4.
30. Bayry, J., V. S. Negi, and S. V. Kaveri. 2011. Intravenous immunoglobulin therapy in
rheumatic diseases. Nature reviews. Rheumatology 7: 349-359.
31. Gelfand, E. W. 2012. Intravenous immune globulin in autoimmune and inflammatory
diseases. The New England journal of medicine 367: 2015-2025.
32. Tremoulet, A. H., B. M. Best, S. Song, S. Wang, E. Corinaldesi, J. R. Eichenfield, D. D.
Martin, J. W. Newburger, and J. C. Burns. 2008. Resistance to intravenous
immunoglobulin in children with Kawasaki disease. The Journal of pediatrics 153: 117-
121.
33. Durongpisitkul, K., V. J. Gururaj, J. M. Park, and C. F. Martin. 1995. The prevention of
coronary artery aneurysm in Kawasaki disease: a meta-analysis on the efficacy of aspirin
and immunoglobulin treatment. Pediatrics 96: 1057-1061.
34. Tremoulet, A. H., J. Dutkowski, Y. Sato, J. T. Kanegaye, X. B. Ling, and J. C. Burns.
2015. Novel data-mining approach identifies biomarkers for diagnosis of Kawasaki
disease. Pediatric research.
35. Sudo, D., Y. Monobe, M. Yashiro, M. N. Mieno, R. Uehara, K. Tsuchiya, T. Sonobe, and
Y. Nakamura. 2012. Coronary artery lesions of incomplete Kawasaki disease: a
nationwide survey in Japan. European journal of pediatrics 171: 651-656.
36. Asai, T. 1983. Evaluation method for the degree of seriousness in Kawasaki disease.
Pediatrics International 25: 170-175.
37. Nakano, H., K. Ueda, A. Saito, Y. Tsuchitani, J. Kawamori, T. Miyake, and T. Yoshida.
1986. Scoring method for identifying patients with Kawasaki disease at high risk of
coronary artery aneurysms. The American journal of cardiology 58: 739-742.
38. Iwasa, M., K. Sugiyama, T. Ando, H. Nomura, T. Katoh, and Y. Wada. 1987. Selection
of high-risk children for immunoglobulin therapy in Kawasaki disease. Progress in
clinical and biological research 250: 543-544.
39. Harada, K. 1991. Intravenous gamma-globulin treatment in Kawasaki disease. Acta
paediatrica Japonica; Overseas edition 33: 805-810.
40. Beiser, A. S., M. Takahashi, A. L. Baker, R. P. Sundel, and J. W. Newburger. 1998. A
predictive instrument for coronary artery aneurysms in Kawasaki disease. US Multicenter
Kawasaki Disease Study Group. The American journal of cardiology 81: 1116-1120.
41. Kobayashi, T., Y. Inoue, K. Takeuchi, Y. Okada, K. Tamura, T. Tomomasa, T.
Kobayashi, and A. Morikawa. 2006. Prediction of intravenous immunoglobulin
unresponsiveness in patients with Kawasaki disease. Circulation 113: 2606-2612.
42. Kobayashi, T., T. Saji, T. Otani, K. Takeuchi, T. Nakamura, H. Arakawa, T. Kato, T.
Hara, K. Hamaoka, S. Ogawa, M. Miura, Y. Nomura, S. Fuse, F. Ichida, M. Seki, R.
Fukazawa, C. Ogawa, K. Furuno, H. Tokunaga, S. Takatsuki, S. Hara, A. Morikawa, and
RAISE study group investigators. 2012. Efficacy of immunoglobulin plus prednisolone for prevention
of coronary artery abnormalities in severe Kawasaki disease (RAISE study): a
randomised, open-label, blinded-endpoints trial. Lancet 379: 1613-1620.
43. Hui-Yuen, J. S., T. T. Duong, and R. S. Yeung. 2006. TNF-alpha is necessary for
induction of coronary artery inflammation and aneurysm formation in an animal model of
Kawasaki disease. Journal of immunology 176: 6294-6301.
44. Rowley, A. H., S. T. Shulman, B. T. Spike, C. A. Mask, and S. C. Baker. 2001.
Oligoclonal IgA response in the vascular wall in acute Kawasaki disease. Journal of
immunology 166: 1334-1343.
45. Brown, T. J., S. E. Crawford, M. L. Cornwall, F. Garcia, S. T. Shulman, and A. H.
Rowley. 2001. CD8 T lymphocytes and macrophages infiltrate coronary artery
aneurysms in acute Kawasaki disease. The Journal of infectious diseases 184: 940-943.
46. Lau, A. C., T. T. Duong, S. Ito, and R. S. Yeung. 2008. Matrix metalloproteinase 9
activity leads to elastin breakdown in an animal model of Kawasaki disease. Arthritis and
rheumatism 58: 854-863.
47. Kikuta, H., Y. Sakiyama, S. Matsumoto, I. Hamada, M. Yazaki, T. Iwaki, and M.
Nakano. 1993. Detection of Epstein-Barr virus DNA in cardiac and aortic tissues from
chronic, active Epstein-Barr virus infection associated with Kawasaki disease-like
coronary artery aneurysms. The Journal of pediatrics 123: 90-92.
48. Matsuno, S., E. Utagawa, and A. Sugiura. 1983. Association of rotavirus infection with
Kawasaki syndrome. The Journal of infectious diseases 148: 177.
49. Holm, J. M., L. K. Hansen, and H. Oxhoj. 1995. Kawasaki disease associated with
parvovirus B19 infection. European journal of pediatrics 154: 633-634.
50. Okano, M., G. M. Thiele, Y. Sakiyama, S. Matsumoto, and D. T. Purtilo. 1990.
Adenovirus infection in patients with Kawasaki disease. Journal of medical virology 32:
53-57.
51. Normann, E., J. Naas, J. Gnarpe, H. Backman, and H. Gnarpe. 1999. Demonstration of
Chlamydia pneumoniae in cardiovascular tissues from children with Kawasaki disease.
The Pediatric infectious disease journal 18: 72-73.
52. Lehman, T. J., S. M. Walker, V. Mahnovski, and D. McCurdy. 1985. Coronary arteritis in
mice following the systemic injection of group B Lactobacillus casei cell walls in
aqueous suspension. Arthritis and rheumatism 28: 652-659.
53. Duong, T. T., E. D. Silverman, M. V. Bissessar, and R. S. Yeung. 2003. Superantigenic
activity is responsible for induction of coronary arteritis in mice: an animal model of
Kawasaki disease. International immunology 15: 79-89.
54. Onouchi, Y. 2009. Molecular genetics of Kawasaki disease. Pediatric research 65: 46R-
54R.
55. Pulst, S. M. 1999. Genetic linkage analysis. Archives of neurology 56: 667-672.
56. Bush, W. S., and J. H. Moore. 2012. Chapter 11: Genome-wide association studies. PLoS
computational biology 8: e1002822.
57. Reich, D. E., and E. S. Lander. 2001. On the allelic spectrum of human disease. Trends in
genetics : TIG 17: 502-510.
58. Kuhn, K., S. C. Baker, E. Chudin, M. H. Lieu, S. Oeser, H. Bennett, P. Rigault, D.
Barker, T. K. McDaniel, and M. S. Chee. 2004. A novel, high-performance random array
platform for quantitative gene expression profiling. Genome research 14: 2347-2356.
59. Robinson, M. D., and T. P. Speed. 2007. A comparison of Affymetrix gene expression
arrays. BMC bioinformatics 8: 449.
60. Kato, S., M. Kimura, K. Tsuji, S. Kusakawa, T. Asai, T. Juji, and T. Kawasaki. 1978.
HLA antigens in Kawasaki disease. Pediatrics 61: 252-255.
61. Kamizono, S., A. Yamada, T. Higuchi, H. Kato, and K. Itoh. 1999. Analysis of tumor
necrosis factor-alpha production and polymorphisms of the tumor necrosis factor-alpha
gene in individuals with a history of Kawasaki disease. Pediatrics international : official
journal of the Japan Pediatric Society 41: 341-345.
62. Burns, J. C., C. Shimizu, H. Shike, J. W. Newburger, R. P. Sundel, A. L. Baker, T.
Matsubara, Y. Ishikawa, V. A. Brophy, S. Cheng, M. A. Grow, L. L. Steiner, N. Kono,
and R. M. Cantor. 2005. Family-based association analysis implicates IL-4 in
susceptibility to Kawasaki disease. Genes and immunity 6: 438-444.
63. Ohno, T., H. Igarashi, K. Inoue, K. Akazawa, K. Joho, and T. Hara. 2000. Serum
vascular endothelial growth factor: a new predictive indicator for the occurrence of
coronary artery lesions in Kawasaki disease. European journal of pediatrics 159: 424-
429.
64. Senzaki, H., S. Masutani, J. Kobayashi, T. Kobayashi, H. Nakano, H. Nagasaka, N.
Sasaki, H. Asano, S. Kyo, and Y. Yokote. 2001. Circulating matrix metalloproteinases
and their inhibitors in patients with Kawasaki disease. Circulation 104: 860-863.
65. Cheung, Y. F., G. Y. Huang, S. B. Chen, X. Q. Liu, L. Xi, X. C. Liang, M. R. Huang, S.
Chen, L. S. Huang, X. Q. Liu, K. W. Chan, and Y. L. Lau. 2008. Inflammatory gene
polymorphisms and susceptibility to Kawasaki disease and its arterial sequelae. Pediatrics
122: e608-614.
66. Onouchi, Y., M. Tamari, A. Takahashi, T. Tsunoda, M. Yashiro, Y. Nakamura, H.
Yanagawa, K. Wakui, Y. Fukushima, T. Kawasaki, Y. Nakamura, and A. Hata. 2007. A
genomewide linkage analysis of Kawasaki disease: evidence for linkage to chromosome
12. Journal of human genetics 52: 179-190.
67. Onouchi, Y., T. Gunji, J. C. Burns, C. Shimizu, J. W. Newburger, M. Yashiro, Y.
Nakamura, H. Yanagawa, K. Wakui, Y. Fukushima, F. Kishi, K. Hamamoto, M. Terai,
Y. Sato, K. Ouchi, T. Saji, A. Nariai, Y. Kaburagi, T. Yoshikawa, K. Suzuki, T. Tanaka,
T. Nagai, H. Cho, A. Fujino, A. Sekine, R. Nakamichi, T. Tsunoda, T. Kawasaki, Y.
Nakamura, and A. Hata. 2008. ITPKC functional polymorphism associated with
Kawasaki disease susceptibility and formation of coronary artery aneurysms. Nature
genetics 40: 35-42.
68. Onouchi, Y., K. Ozaki, J. C. Burns, C. Shimizu, H. Hamada, T. Honda, M. Terai, A.
Honda, T. Takeuchi, S. Shibuta, T. Suenaga, H. Suzuki, K. Higashi, K. Yasukawa, Y.
Suzuki, K. Sasago, Y. Kemmotsu, S. Takatsuki, T. Saji, T. Yoshikawa, T. Nagai, K.
Hamamoto, F. Kishi, K. Ouchi, Y. Sato, J. W. Newburger, A. L. Baker, S. T. Shulman,
A. H. Rowley, M. Yashiro, Y. Nakamura, K. Wakui, Y. Fukushima, A. Fujino, T.
Tsunoda, T. Kawasaki, A. Hata, Y. Nakamura, and T. Tanaka. 2010. Common variants in
CASP3 confer susceptibility to Kawasaki disease. Human molecular genetics 19: 2898-
2906.
69. Khor, C. C., S. Davila, W. B. Breunis, Y. C. Lee, C. Shimizu, V. J. Wright, R. S. Yeung,
D. E. Tan, K. S. Sim, J. J. Wang, T. Y. Wong, J. Pang, P. Mitchell, R. Cimaz, N. Dahdah,
Y. F. Cheung, G. Y. Huang, W. Yang, I. S. Park, J. K. Lee, J. Y. Wu, M. Levin, J. C.
Burns, D. Burgner, T. W. Kuijpers, M. L. Hibberd, Hong Kong-Shanghai Kawasaki
Disease Genetics Consortium, Korean Kawasaki Disease Genetics Consortium, Taiwan
Kawasaki Disease Genetics Consortium, International Kawasaki Disease Genetics
Consortium, US Kawasaki Disease Genetics Consortium, and Blue Mountains Eye
Study. 2011. Genome-wide association study identifies FCGR2A as a
susceptibility locus for Kawasaki disease. Nature genetics 43: 1241-1246.
70. Onouchi, Y., K. Ozaki, J. C. Burns, C. Shimizu, M. Terai, H. Hamada, T. Honda, H.
Suzuki, T. Suenaga, T. Takeuchi, N. Yoshikawa, Y. Suzuki, K. Yasukawa, R. Ebata, K.
Higashi, T. Saji, Y. Kemmotsu, S. Takatsuki, K. Ouchi, F. Kishi, T. Yoshikawa, T.
Nagai, K. Hamamoto, Y. Sato, A. Honda, H. Kobayashi, J. Sato, S. Shibuta, M.
Miyawaki, K. Oishi, H. Yamaga, N. Aoyagi, S. Iwahashi, R. Miyashita, Y. Murata, K.
Sasago, A. Takahashi, N. Kamatani, M. Kubo, T. Tsunoda, A. Hata, Y. Nakamura, T.
Tanaka, Japan Kawasaki Disease Genome Consortium, and US Kawasaki Disease Genetics Consortium. 2012. A
genome-wide association study identifies three new risk loci for Kawasaki disease.
Nature genetics 44: 517-521.
71. Lee, Y. C., H. C. Kuo, J. S. Chang, L. Y. Chang, L. M. Huang, M. R. Chen, C. D. Liang,
H. Chi, F. Y. Huang, M. L. Lee, Y. C. Huang, B. Hwang, N. C. Chiu, K. P. Hwang, P. C.
Lee, L. C. Chang, Y. M. Liu, Y. J. Chen, C. H. Chen, Taiwan Pediatric ID Alliance, Y. T.
Chen, F. J. Tsai, and J. Y. Wu. 2012. Two new susceptibility loci for Kawasaki disease
identified through genome-wide association analysis. Nature genetics 44: 522-525.
72. Wang, B., A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and
A. Goldenberg. 2014. Similarity network fusion for aggregating data types on a genomic
scale. Nature methods 11: 333-337.
73. Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette,
A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. 2005. Gene
set enrichment analysis: a knowledge-based approach for interpreting genome-wide
expression profiles. Proceedings of the National Academy of Sciences of the United
States of America 102: 15545-15550.
74. Beissbarth, T., and T. P. Speed. 2004. GOstat: find statistically overrepresented Gene
Ontologies within a group of genes. Bioinformatics 20: 1464-1465.
75. Khatri, P., P. Bhavsar, G. Bawa, and S. Draghici. 2004. Onto-Tools: an ensemble of web-
accessible, ontology-based tools for the functional design and interpretation of high-
throughput gene expression experiments. Nucleic acids research 32: W449-456.
76. Huang, D. W., B. T. Sherman, and R. A. Lempicki. 2009. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 4:
44-57.
77. Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis,
K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A.
Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G.
Sherlock. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nature genetics 25: 25-29.
78. Bousquet, O., G. Raetsch, and U. von Luxburg. 2004. Advanced Lectures on Machine
Learning. Springer-Verlag Berlin Heidelberg.
79. Zare, H., G. Haffari, A. Gupta, and R. R. Brinkman. 2013. Scoring relevancy of features
based on combinatorial analysis of Lasso with application to lymphoma diagnosis. BMC
Genomics 14: 1-9.
80. Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological) 58: 267-288.
81. Bach, F. R. 2008. Bolasso: model consistent Lasso estimation through the bootstrap.
CoRR abs/0804.1302.
82. Hoang, L. T., C. Shimizu, L. Ling, A. N. Naim, C. C. Khor, A. H. Tremoulet, V. Wright,
M. Levin, M. L. Hibberd, and J. C. Burns. 2014. Global gene expression profiling
identifies new therapeutic targets in acute Kawasaki disease. Genome medicine 6: 541.
83. Bao, R., L. Huang, J. Andrade, W. Tan, W. A. Kibbe, H. Jiang, and G. Feng. 2014.
Review of current methods, applications, and data management for the bioinformatics
analysis of whole exome sequencing. Cancer informatics 13: 67-82.
84. von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17:
395-416.
85. Pruitt, K. D., T. Tatusova, and D. R. Maglott. 2005. NCBI Reference Sequence (RefSeq):
a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic
acids research 33: D501-504.
86. Saguil, A., M. Fargo, and S. Grogan. 2015. Diagnosis and management of Kawasaki
disease. American family physician 91: 365-371.
87. Hajian Tilaki, K. 2012. Methodological issues of confounding in analytical
epidemiologic studies. Caspian journal of internal medicine 3: 488-495.
88. Daniels, S. R., B. Specker, T. E. Capannari, D. C. Schwartz, M. J. Burke, and S. Kaplan.
1987. Correlates of coronary artery aneurysm formation in patients with Kawasaki
disease. American journal of diseases of children 141: 205-207.
89. Ichida, F., N. S. Fatica, M. A. Engle, J. E. O'Loughlin, A. A. Klein, M. S. Snyder, K. H.
Ehlers, and A. R. Levin. 1987. Coronary artery involvement in Kawasaki syndrome in
Manhattan, New York: risk factors and role of aspirin. Pediatrics 80: 828-835.
90. Koren, G., S. Lavi, V. Rose, and R. Rowe. 1986. Kawasaki disease: review of risk factors
for coronary aneurysms. The Journal of pediatrics 108: 388-392.
91. Foell, D., F. Ichida, T. Vogl, X. Yu, R. Chen, T. Miyawaki, C. Sorg, and J. Roth. 2003.
S100A12 (EN-RAGE) in monitoring Kawasaki disease. Lancet 361: 1270-1272.
92. Schulte, D. J., A. Yilmaz, K. Shimada, M. C. Fishbein, E. L. Lowe, S. Chen, M. Wong,
T. M. Doherty, T. Lehman, T. R. Crother, R. Sorrentino, and M. Arditi. 2009.
Involvement of innate and adaptive immunity in a murine model of coronary arteritis
mimicking Kawasaki disease. Journal of immunology 183: 5311-5318.
93. Leung, D. Y., R. S. Cotran, E. Kurt-Jones, J. C. Burns, J. W. Newburger, and J. S. Pober.
1989. Endothelial cell activation and high interleukin-1 secretion in the pathogenesis of
acute Kawasaki disease. Lancet 2: 1298-1302.
94. Nimmerjahn, F., and J. V. Ravetch. 2008. Fcgamma receptors as regulators of immune
responses. Nature reviews. Immunology 8: 34-47.
95. Boraschi, D., and A. Tagliabue. 2013. The interleukin-1 receptor family. Seminars in
immunology 25: 394-407.
96. Cohen, S., C. E. Tacke, B. Straver, N. Meijer, I. M. Kuipers, and T. W. Kuijpers. 2012. A
child with severe relapsing Kawasaki disease rescued by IL-1 receptor blockade and
extracorporeal membrane oxygenation. Annals of the rheumatic diseases 71: 2059-2061.
97. Ganeshan, K., and A. Chawla. 2014. Metabolic regulation of immune responses. Annual
review of immunology 32: 609-634.
98. Medzhitov, R., and T. Horng. 2009. Transcriptional control of the inflammatory
response. Nature reviews. Immunology 9: 692-703.
99. Mellins, E. D., and L. J. Stern. 2014. HLA-DM and HLA-DO, key regulators of MHC-II
processing and presentation. Current opinion in immunology 26: 115-122.
100. Sadegh-Nasseri, S., M. Chen, K. Narayan, and M. Bouvier. 2008. The convergent roles
of tapasin and HLA-DM in antigen presentation. Trends in immunology 29: 141-147.
101. Denzin, L. K. 2013. Inhibition of HLA-DM Mediated MHC Class II Peptide Loading by
HLA-DO Promotes Self Tolerance. Frontiers in immunology 4: 465.
102. McNab, F., K. Mayer-Barber, A. Sher, A. Wack, and A. O'Garra. 2015. Type I
interferons in infectious disease. Nature reviews. Immunology 15: 87-103.
103. van Kempen, T. S., M. H. Wenink, E. F. Leijten, T. R. Radstake, and M. Boes. 2015.
Perception of self: distinguishing autoimmunity from autoinflammation. Nature reviews.
Rheumatology 11: 483-492.
104. Rowley, A. H., K. M. Wylie, K. Y. Kim, A. J. Pink, A. Yang, R. Reindel, S. C. Baker, S.
T. Shulman, J. M. Orenstein, M. W. Lingen, G. M. Weinstock, and T. N. Wylie. 2015.
The transcriptional profile of coronary arteritis in Kawasaki disease. BMC Genomics 16:
1076.
105. London, B., M. Michalec, H. Mehdi, X. Zhu, L. Kerchner, S. Sanyal, P. C. Viswanathan,
A. E. Pfahnl, L. L. Shang, M. Madhusudanan, C. J. Baty, S. Lagana, R. Aleong, R.
Gutmann, M. J. Ackerman, D. M. McNamara, R. Weiss, and S. C. Dudley, Jr. 2007.
Mutation in glycerol-3-phosphate dehydrogenase 1 like gene (GPD1-L) decreases cardiac
Na+ current and causes inherited arrhythmias. Circulation 116: 2260-2268.
106. Foell, D., and J. Roth. 2004. Proinflammatory S100 proteins in arthritis and autoimmune
disease. Arthritis and rheumatism 50: 3762-3771.