Segmentation-Based Detection of Mosaic Chromosomal
Abnormality in Bladder Cancer Cells Using Whole Genome SNP Arrays
Weiyin Zhou
Cancer Genomics Research Laboratory (CGR)
Leidos Biomedical Research, Inc
Texas A&M Masters of Science Degree Candidate – Statistics
Texas A&M Department of Statistics College Station, TX
February 2014
Acknowledgements
I would like to express my deepest gratitude to my advisor, Dr. Alan Dabney, for his excellent guidance,
understanding, patience, and encouragement. I would also like to thank my other committee members: Dr. William
B. Smith and Dr. Bruce Lowe for taking time from their busy schedule to review my work; and to thank Ms. Penny
Jackson and Kim Ritchie for their procedural and technical support. My gratitude extends to my current employer,
Leidos Biomedical Research, Inc., which paid the majority of my tuition under its education assistance program.
I would like to thank my husband Dr. Lisheng Cai, my son, Robert, and my daughter, Kimberly, for their love and
support. They always cheer me up during the entire process. Quite naturally, my parents laid the foundation for all
this effort through years of caring and teaching.
Abstract
The purpose of this project is to investigate the relationship between mosaic chromosomal abnormality and bladder
cancer. DNA from 3,239 individuals, comprising 1,673 bladder cancer cases and 1,566 cancer-free controls, was
examined for evidence of mosaicism of the autosomes using genome-wide SNP array data generated for bladder
cancer genome-wide association analysis. DNA was extracted from blood or buccal (mouth) samples and
genotyped on the Illumina Infinium HumanHap 610 Quad SNP array. Two segmentation-based methods were
used to detect three types of mosaic events: mosaic duplication (gain), mosaic deletion (loss), and mosaic
copy-neutral loss of heterozygosity (CNLOH). A total of 193 mosaic events, defined as being of > 0.5 Mb in size,
were observed in the autosomes of 163 individuals (5%), with abnormal cell proportions between 3.84% and 96.64%.
Mosaic autosomal abnormalities were more common in individuals with bladder cancer (5.86%) than in cancer-free
individuals (4.12%). Mosaic chromosomal abnormalities were statistically significantly and positively associated
with bladder cancer in males (OR = 1.55; P = 0.014). In cancer-free males, the frequency of mosaic chromosomal
abnormalities increased with age, from 3.39% under 60 years to 7.51% between 76 and 89 years (P = 0.035).
Contents
1. Introduction
2. Data Description
3. Illumina SNP Genotyping and Normalization
4. Mosaic Chromosomal Abnormalities Detection Methods
4.1. Log R Ratio and B Allele Frequency
4.2. Re-estimate Log R Ratio and B Allele Frequency
4.3. Segmentation Methods
4.4. Mosaic Events Calling Methods
4.5. Proportion of Mosaicism
4.6. Examples of Mosaic Events
5. Data Analysis
6. Results and Discussion
6.1. Characteristics of Mosaic Events
6.2. Mosaic Chromosomal Abnormalities and Age at DNA
6.3. Mosaic Chromosomal Abnormalities and Gender
6.4. Mosaic Chromosomal Abnormalities and Cancer Risk
6.4.1. Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects
6.4.2. Mosaic Chromosomal Abnormalities and Cancer Risk for Male Only
6.4.3. Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only
7. Conclusions and Further Studies
8. References
9. Appendix A: SAS Code
Appendix B: Normalization Pipeline
Appendix C: GADA Segmentation pipeline
Appendix D: BAF Segmentation Pipeline
1. Introduction
Genetic mosaicism is defined as the coexistence of cells with different genetic compositions within an individual,
despite that individual being the product of a single fertilization. It is caused by postzygotic events during
development and can occur in both somatic cells (affecting only non-reproductive cells) and germline cells (with
the potential of being passed on to any offspring). Mosaicism can be caused by DNA mutations, epigenetic
alterations of DNA, chromosomal abnormalities, and the spontaneous reversion of inherited mutations [1,2]. Somatic
mosaicism has been established as a cause of mental retardation, birth defects, spontaneous abortion, and cancer
[3-7]. The unequal distribution of DNA to daughter cells upon mitosis (chromosome instability) may lead to
aneuploidy, the duplication or deletion of chromosomes or segments of chromosomes, and reciprocal duplication and
deletion events that appear as copy-neutral loss of heterozygosity or acquired uniparental disomy. Mosaic
chromosomal abnormalities have been defined as the presence of both cells with normal karyotypes and cells with
large structural genomic events that result in alteration of copy number or loss of heterozygosity in distinct and
detectable subpopulations of cells [8].
The development of microarray technology has had a significant impact on the genetic analysis of human disease.
Whole-genome single nucleotide polymorphism (SNP) genotyping arrays have become an important tool for
discovering variants that contribute to human diseases and phenotypes. The two most common applications of this
technology are genome-wide association studies (GWAS) and copy number variant (CNV) analysis. SNP arrays
let researchers genotype samples at hundreds of thousands to millions of markers, providing dense genome-wide
coverage for both association testing and copy number detection. Data from genome-wide association studies have
been used to test for association between single SNPs and disease status. They also provide an opportunity to
detect chromosomal variation and to investigate the association of mosaicism with disease status.
In this project, the SNP microarray data generated on the Illumina Infinium HumanHap 610 Quad SNP array for the
bladder cancer GWAS were used to uncover mosaic genomic copy number gains, losses, and copy-neutral loss of
heterozygosity in the autosomes of 5% of subjects. Two segmentation-based algorithms were used to detect 193
mosaic events of > 0.5 Mb in size. The types of chromosomal abnormalities detected were characterized, and the
relationships between chromosomal abnormalities and cancer risk, age at DNA collection, and gender were
investigated. The frequency of mosaic chromosomal abnormalities was positively associated with bladder cancer in
male subjects, and it increased with age in cancer-free individuals.
2. Data Sets Description
The data used for this project consist of 3,239 individuals in bladder cancer genome-wide association studies
(GWASs) from three cohorts: the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), the Prostate,
Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), and the Cancer Prevention Study-II (CPS-II). The study
cohorts by case and control status are summarized in Table 1. There were 1,673 cases diagnosed with urothelial
cell carcinoma of the bladder and 1,566 cancer-free controls. There were 2,734 males and 505 females; gender by
case and control status is summarized in Table 2. The mean age at DNA withdrawal was 67 years for all subjects
(range 21-89, s.d. = 8.83). The mean age was 67 years for males (range 21-89, s.d. = 8.94) and 69 years for females
(range 24-87, s.d. = 7.89). The mean age was 66 years for cases (range 21-87, s.d. = 10.2) and 68 years for controls
(range 54-89, s.d. = 6.99). The study was approved by the institutional ethics committees of each participating
hospital and the institutional review board (IRB) of the National Cancer Institute (NCI, USA). Written informed
consent was obtained from all individuals. DNA was extracted from peripheral blood (58.6%) and mouthwash samples
(41.4%). Genomic DNA was screened and analyzed at the National Cancer Institute according to the sample handling
process of the Cancer Genomics Research Laboratory (CGR), Division of Cancer Epidemiology and Genetics (DCEG),
before being genotyped on the HumanHap 610 Quad BeadChip (Illumina, Inc.) via the Infinium Assay.
Overall, 3.3% of samples were run in duplicate to check reproducibility, with a SNP genotype concordance rate
greater than 99.98% between technical duplicates. The completion rate, defined as the proportion of non-missing
genotypes for a sample, was calculated by dividing the number of called SNP probes by the total number of SNP
probes on the array, using the GLU qc.summary module (http://code.google.com/p/glu-genetics/). The overall
completion rate for the study samples was 97.87%. The distribution of the completion rates by sample and by locus
is shown in Figure 1. A brief summary of the sample/locus counts at the 100th, 99th, 95th, 90th and 50th quantiles
is provided as an inset in Figure 1.
Ancestry was estimated for the 3,239 study subjects using a set of population-informative SNPs [9] and data from
HapMap build 27. These SNPs are common to the commercially available Affymetrix 500K, Illumina 317K, and
Illumina 550K chips. Admixture coefficients were estimated for each subject using the GLU struct.admix module,
with the HapMap CEU, YRI, and ASA (JPT+CHB) samples as the fixed reference populations. A total of 3,205 subjects
were estimated to have European ancestry. The remaining 34 subjects were estimated to have less than 80% European
ancestry, as shown in Figure 2 and summarized in Table 3.
Table 1 Summary of Study Cohorts by Cancer Status
Cancer Status     ATBC   PLCO   CPS-II   Total
Bladder Cancer     416    563      694    1673
Control            722    114      730    1566
Total             1138    677     1424    3239
Table 2 Summary of Gender by Cancer Status
Cancer Status     Male   Female   Total
Bladder Cancer    1436      237    1673
Control           1298      268    1566
Total             2734      505    3239
Table 3 Summary of Population Structure by Cancer Status
                                ------------------ Imputed Ancestry ------------------
Study Cohort  Cancer Status      CEU  ADMIXED CEU  ASA  ASA,CEU  CEU,YRI  YRI  Total
ATBC          Bladder Cancer     416            0    0        0        0    0    416
ATBC          Control            722            0    0        0        0    0    722
PLCO          Bladder Cancer     540            1    7        5        7    3    563
PLCO          Control            110            1    3        0        0    0    114
CPS-II        Bladder Cancer     687            1    3        1        1    1    694
CPS-II        Control            730            0    0        0        0    0    730
Total                           3205            3   13        6        8    4   3239
Figure 1 Bladder completion rate by sample (left) and by locus (right).
Figure 2 Population structure
3. Illumina SNP Genotyping and Normalization
SNP genotyping is the measurement of genetic variation at single nucleotide polymorphisms (SNPs) between
members of a species. SNPs are one of the most common types of genetic variation. A SNP is a single base pair
mutation at a specific site in DNA, usually with two alleles, which together make up the individual’s genotype.
Illumina DNA Analysis BeadChips using the Infinium Assay give researchers genome-wide access for analyzing genetic
variation. The Infinium assay is a two-color-channel assay, with the data consisting of two intensity values (X, Y)
for each SNP. There is one intensity channel for each of the two fluorescent dyes associated with the two alleles
of the SNP. The alleles measured by the X channel (Cy5 dye) are called the A allele, whereas the alleles measured
by the Y channel (Cy3 dye) are called the B allele. Each SNP is analyzed independently to identify genotypes.
Illumina’s standard normalization algorithm is the first step in SNP genotyping data analysis. The intensity data
are normalized using Illumina’s self-normalization algorithm, which draws on information contained in the array
itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. The normalized
values are then used for standard genotype calling and for Loss of Heterozygosity (LOH) and Copy Number (CN)
analysis.
In a diploid genome without CNVs, the three possible genotype calls are AA, AB, and BB. The raw signal
intensity values measured for the A and B alleles are subjected to Illumina’s standard five-step
normalization procedure, which determines six parameters: offset_X, offset_Y, theta, shear, scale_X, and scale_Y.
The normalization algorithm is designed to adjust for nominal intensity variations observed in the two color
channels, background differences between the two channels, possible crosstalk between the dyes, and global
intensity differences, and to scale the data [10]. Figure 3 depicts the five steps of the normalization process.
Step 1: Outlier removal (Figure 3-A): Outlier SNPs are removed from consideration during normalization
parameter estimation. They are not excluded from downstream analysis.
Step 2: Background estimation (offset_X, offset_Y) (Figure 3-B): Candidate homozygote A alleles are identified
along the X-axis and candidate homozygote B alleles along the Y-axis. A straight line is fit to each of the two
groups. The offset_X and offset_Y parameters are the intercepts of these two lines. The points are corrected
for translation.
Step 3: Rotational estimation (theta) (Figure 3-C): A set of control points along the X-axis is identified and a
straight line is fit to them. The theta parameter is the angle between this line and the X-axis and defines the
amount of rotation in the data. The points are corrected for rotation.
Step 4: Shear estimation (shear) (Figure 3-D): A set of control points along the Y-axis is identified and a
straight line is fit to them. The shear parameter is the angle between this line and the Y-axis. The points are
corrected for shear.
Step 5: Scaling estimation (scale_X, scale_Y) (Figure 3-E): A statistical method is used to determine the scale_X
and scale_Y parameters.
Figure 3-F shows the final set of normalized data points. The points along the X-axis represent AA genotypes,
points along the Y-axis represent BB genotypes, and points along the 45-degree line represent AB genotypes.
Illumina then uses these 6 estimated parameter values to convert raw coordinates (X raw and Y raw) to normalized
coordinates (X normalized and Y normalized) for each SNP, representing the experiment-wide normalized signal
intensity on the A and B alleles, respectively.
Figure 3 Five-step normalization procedure. Figure 3-F shows the final normalized data points for a particular
SNP. The points along the X-axis represent AA genotypes. The points along the Y-axis represent BB genotypes. The
points along the approximately 45-degree line represent AB genotypes.
To visualize the data after normalization, the genotyping data are transformed from Cartesian coordinates (Figure 4
left) to polar coordinates (Figure 4 right). The Cartesian plot uses the X axis to represent the intensity of the A
allele and the Y axis to represent the intensity of the B allele. The polar plot uses the X axis to represent
normalized theta (the angle deviation from pure A signal, where 0 represents pure A signal and 1.0 represents pure
B signal) and the Y axis to represent R, the distance of the point from the origin. Theta and R are calculated as:
Theta = (2/pi) * arctan(Ynorm/Xnorm)
R = Xnorm + Ynorm
where Xnorm and Ynorm represent the transformed normalized signals from alleles A and B for a particular locus (SNP).
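The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not GenomeStudio code; the function name to_polar is hypothetical, and atan2 is used in place of arctan(Ynorm/Xnorm) only so that a pure B signal (Xnorm = 0) does not cause a division by zero.

```python
import math

def to_polar(x_norm, y_norm):
    """Return (theta, r) for one SNP from normalized A and B intensities.

    theta = (2/pi) * arctan(y_norm / x_norm), so 0 is pure A and 1 is pure B.
    r     = x_norm + y_norm, the total normalized intensity.
    """
    # For non-negative intensities atan2(y, x) equals arctan(y / x),
    # and it is still defined when x_norm is exactly 0.
    theta = (2.0 / math.pi) * math.atan2(y_norm, x_norm)
    r = x_norm + y_norm
    return theta, r

# Illustrative intensities only: a heterozygous-looking SNP with equal signals.
theta, r = to_polar(0.8, 0.8)
```

A SNP with equal A and B signal lands at theta = 0.5, which is where the AB cluster is expected in the polar plot.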
Figure 4 SNP Graphs: Cartesian Coordinates (left) and Polar Coordinates (right). Each graph displays all samples
for the currently selected SNP. Samples are colored according to their genotype. In the right graph, for this
particular SNP (ID = rs17159012), 420 samples (red cluster) are called as AA, 124 samples (purple cluster) are
called as AB, and 13 samples (blue cluster) are called as BB.
4. Mosaic Chromosomal Abnormalities Detection Methods
4.1. Log R Ratio and B Allele Frequency
The main goal of this project is to investigate the relationship between mosaic chromosomal abnormalities and
bladder cancer risk by first identifying regions of the genome that are aberrant in copy number, more specifically,
mosaic copy number variation on the autosomes of bladder cancer and cancer-free subjects. The detection of
autosomal mosaic events was based on assessment of allelic imbalance and copy number changes. The chromosomal
abnormalities were detected using two Infinium high-density assay outputs: the log R ratio (LRR) and the B
allele frequency (BAF). The LRR and BAF metrics were originally developed for the Illumina platform. For Illumina
SNP arrays, the LRR and BAF values can be calculated and exported directly from Illumina’s GenomeStudio software.
The log R ratio (LRR) is the normalized measure of total signal intensity and provides data on relative copy
number. For each SNP, let the normalized signal intensities for the A and B alleles be denoted Xnorm and Ynorm,
respectively. The R value is then calculated as Robserved = Xnorm + Ynorm, a normalized measure of total signal
intensity. The log R ratio is calculated as LRR = log2(Robserved / Rexpected), where Rexpected is computed by linear
interpolation of the genotype clusters (Figure 5 Left). The three cluster positions are generated from a large set
of samples that passed the completion rate cutoff. The LRR value for a SNP is thus a measure of the difference
between the signal intensity of the test sample and that of a pool of reference samples with the same SNP
genotype. Since LRR is the log ratio of observed probe intensity to expected intensity, deviation from zero is
evidence of copy number change.
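The LRR definition can be illustrated with a short Python sketch. It assumes that each genotype cluster is summarized by a (theta, R) center and that Rexpected is read off the straight line joining the two cluster centers that bracket the sample's theta; clamping outside the AA and BB centers is an assumption made here for completeness. The function names and the cluster values in the usage lines are hypothetical.

```python
import math

def expected_r(theta, clusters):
    """Interpolate the expected R at a given theta.

    clusters: ((theta_AA, r_AA), (theta_AB, r_AB), (theta_BB, r_BB)),
    ordered by increasing theta.
    """
    (t_aa, r_aa), (t_ab, r_ab), (t_bb, r_bb) = clusters
    if theta <= t_aa:          # assumption: clamp below the AA cluster
        return r_aa
    if theta >= t_bb:          # assumption: clamp above the BB cluster
        return r_bb
    if theta < t_ab:
        (t_lo, r_lo), (t_hi, r_hi) = (t_aa, r_aa), (t_ab, r_ab)
    else:
        (t_lo, r_lo), (t_hi, r_hi) = (t_ab, r_ab), (t_bb, r_bb)
    frac = (theta - t_lo) / (t_hi - t_lo)
    return r_lo + frac * (r_hi - r_lo)

def log_r_ratio(r_observed, r_expected):
    """LRR = log2(Robserved / Rexpected)."""
    return math.log2(r_observed / r_expected)

# Made-up cluster centers for one SNP.
clusters = ((0.0, 1.0), (0.5, 1.2), (1.0, 1.0))
lrr = log_r_ratio(1.3, expected_r(0.5, clusters))  # positive: more signal than expected
```

An observed intensity equal to the expected intensity gives LRR = 0, and a positive or negative LRR points toward a gain or a loss, respectively.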
The B allele frequency (BAF), derived from the ratio of allelic probe intensities, is the proportion of hybridized
sample that carries the B allele as designated by the Infinium Assay. The B allele frequency is also referred to
as “copy angle” or “allelic composition”. It shows the relative presence of each of the two alternative
nucleotides, A and B, at each SNP locus profiled. The BAF for a sample is the theta value for a SNP corrected for
cluster position, where θ = (2/pi)*arctan(Ynorm/Xnorm). The BAF value is calculated from θ and the cluster
positions θAA, θAB, and θBB, which are the θ values of the three genotype clusters generated from a large set of
samples that passed the completion rate cutoff (Figure 5 Right). In the right panel of Figure 5, D1 = (θ - θAB) and
D2 = (θAB - θBB). In a normal sample, discrete BAFs of 0.0, 0.5, and 1.0 are expected at each locus, representing
the AA, AB, and BB genotypes. Deviations from this expectation are indicative of aberrant copy number. For example,
a BAF of 1/3 at a locus might indicate that 1 copy of the B allele and 2 copies of the A allele are present in the
sample, because 1/(1+2) = 1/3.
Analyzing both the LRR and BAF metrics provides strong resolution for detecting true copy number changes and
allelic imbalance (Table 4).
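The BAF equation itself is not reproduced in this text. As a hedged illustration, the sketch below uses the piecewise-linear interpolation between cluster positions that is commonly used for Illumina arrays; whether it matches the exact form intended here (including the sign convention of D1 and D2) is an assumption. The function name and the cluster positions in the usage lines are hypothetical.

```python
def b_allele_frequency(theta, theta_aa, theta_ab, theta_bb):
    """Map a sample's theta onto the 0..1 BAF scale using the cluster positions.

    Assumed form: BAF is 0 at or below the AA cluster, 1 at or above the BB
    cluster, and linearly interpolated so that the AB cluster maps to 0.5.
    """
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)

# Made-up cluster positions for one SNP.
baf_het = b_allele_frequency(0.50, 0.05, 0.50, 0.95)   # sits on the AB cluster
baf_mid = b_allele_frequency(0.275, 0.05, 0.50, 0.95)  # halfway between AA and AB
```

Under this form, a sample sitting exactly on a cluster center receives the discrete value 0, 0.5, or 1, and intermediate theta values produce the intermediate BAFs that signal allelic imbalance.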
Figure 5 Log R Ratios (LRR) and Allelic Intensity Ratio (BAF).
Table 4 Summary of Copy Numbers, Genotypes, Expected LRR, and Expected BAF.
Total Copy Numbers                 CNV Genotypes                                 Expected LRR  Expected BAFs
Deletion of Two Copies             Null                                          < 0           N/A
Deletion of One Copy               A, B                                          < 0           0, 1
Normal Copy                        AA, AB, BB                                    0             0, 0.5, 1
Copy-Neutral LOH                   AA, BB                                        0             0, 1
Single Copy Duplication (Trisomy)  AAA, AAB, ABB, BBB                            > 0           0, 1/3, 2/3, 1
Double Copy Duplication            AAAA, AAAB, AABB, ABBB, BBBB                  > 0           0, 1/4, 2/4, 3/4, 1
Mosaic Deletion                    mixed (AA, AB, BB) and (A, B)                 < 0           4 BAF bands
Mosaic Copy-Neutral LOH            mixed (AA, AB, BB) and (AA, BB)               0             4 BAF bands
Mosaic Duplication                 mixed (AA, AB, BB) and (AAA, AAB, ABB, BBB)   > 0           0, > 1/3, < 2/3, 1
Genetic mosaicism is the presence of cells within an organism that have a different genetic composition despite
being the product of a single fertilization event. For this project, three mosaic types were investigated: mosaic
deletion (loss), mosaic copy-neutral LOH, and mosaic duplication (gain), as defined below:
Mosaic deletion is the coexistence of cells with normal copy number and cells with deletion of one copy. It is
characterized by LRR < 0 and two heterozygous BAF bands.
Mosaic copy-neutral LOH is the coexistence of cells with normal copy number and cells with copy-neutral LOH. It is
characterized by LRR = 0 and two heterozygous BAF bands.
Mosaic duplication is the coexistence of cells with normal copy number and cells with duplication of one copy. It
is characterized by LRR > 0 and two heterozygous BAF bands within the interval (1/3, 2/3). Note that if the two
heterozygous BAF bands lie at 1/3 (AAB) and 2/3 (ABB), the event is a pure duplication of one copy (trisomy).
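The three mosaic definitions can be made concrete with a small numerical sketch. The helper below is hypothetical and not part of the project's pipeline: for an assumed fraction p of abnormal cells, it averages allele copies over the normal and abnormal cell populations to find where the two heterozygous BAF bands would sit, and it reports an idealized LRR equal to log2(average copy number / 2). Measured LRR values need not follow this idealization.

```python
import math

# (A copies, B copies) carried by the abnormal cells at a locus that is AB in normal cells.
# The mirror-image case (B lost or A duplicated) gives the lower band by symmetry.
ABNORMAL_HET = {"deletion": (0, 1), "cnloh": (0, 2), "duplication": (1, 2)}

def expected_mosaic_signal(event, p):
    """Return (upper het BAF band, lower het BAF band, idealized LRR).

    event: "deletion", "cnloh", or "duplication"
    p:     fraction of cells carrying the event, between 0 and 1
    """
    a_abn, b_abn = ABNORMAL_HET[event]
    b_copies = (1 - p) * 1 + p * b_abn            # average B copies per cell
    total = (1 - p) * 2 + p * (a_abn + b_abn)     # average total copies per cell
    upper = b_copies / total
    lower = 1.0 - upper
    return upper, lower, math.log2(total / 2.0)

upper, lower, lrr = expected_mosaic_signal("duplication", 0.3)
```

With p = 1 the duplication case returns bands at 2/3 and 1/3, which is the non-mosaic trisomy pattern, while smaller p moves both bands toward 0.5.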
4.2. Re-estimate LRR and BAF
LRR and BAF were estimated by the GenomeStudio software. However, two sources of bias are not removed by
Illumina’s five-step normalization method: dye bias and GC/CpG wave bias. There is an asymmetry in the detection
of the two alleles for each SNP, caused by a bias between the two dyes used in the Infinium II assay that remains
after Illumina’s normalization. The dye intensity bias can reduce precision in estimating copy number and allelic
imbalance. GC/CpG waves can be present when incorrectly quantified DNA is used in the Infinium assay, or in regions
of high or low GC content. The presence of GC/CpG waves creates artificial gains and losses in signal intensities
for SNPs and may lead to spurious copy-number variation calls. A four-step custom software pipeline was therefore
applied to the data exported from GenomeStudio, which contain the called genotype, genotype call quality score,
genotype probe intensities (Xnorm, Ynorm), log R ratio (LRR), and B allele frequency (BAF) for each assay, to
further normalize the data and re-estimate LRR and BAF.
Step 1: Quantile normalization [11] was applied to the Xnorm and Ynorm values generated by Illumina’s
GenomeStudio software, resulting in Xqnorm and Yqnorm. This procedure removes dye bias and reduces the
asymmetry in the detection of the two alleles for each SNP, which influences both allelic proportions and copy
number estimates.
Step 2: Genotype-specific cluster centers (AA, AB, BB) were re-estimated for each SNP using Xqnorm and Yqnorm
values from assays with completion rate and genotype quality score greater than predefined thresholds, so that only
data from high-quality samples were used to generate each cluster position.
Step 3: A GC/CpG wave correction model was applied to each genotyped sample to obtain the GC/CpG-corrected allelic
composition theta = (2/pi) * arctan (Yqnorm / Xqnorm) and total intensity R, which was estimated as a linear
combination of (Xqnorm, Yqnorm, GC content in probes) [12]. GC/CpG correction reduces the wavy patterns in signal
intensities and improves the accuracy of copy-number variation detection.
Step 4: Finally, LRR and BAF were recomputed using the resulting quantile-normalized and GC/CpG-corrected values,
as described in [13].
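Step 1 can be illustrated with a generic quantile normalization of the two channels of a single assay. The exact procedure of [11] is not spelled out in the text, so this sketch only shows the general idea under that assumption: both channels are forced to share one reference distribution, taken here as the rank-by-rank mean of the two sorted intensity vectors. The function name is hypothetical and ties are not treated specially.

```python
def quantile_normalize_pair(x, y):
    """Give two equal-length intensity vectors the same empirical distribution.

    x, y: normalized A-channel and B-channel intensities across the SNPs of one assay.
    Returns (xq, yq) in the original SNP order.
    """
    n = len(x)
    order_x = sorted(range(n), key=lambda i: x[i])  # SNP indices from lowest to highest x
    order_y = sorted(range(n), key=lambda i: y[i])
    # Reference distribution: mean of the k-th smallest x and the k-th smallest y.
    reference = [(x[i] + y[j]) / 2.0 for i, j in zip(order_x, order_y)]
    xq = [0.0] * n
    yq = [0.0] * n
    for rank, (i, j) in enumerate(zip(order_x, order_y)):
        xq[i] = reference[rank]  # the SNP holding rank `rank` in x gets the reference value
        yq[j] = reference[rank]
    return xq, yq

xq, yq = quantile_normalize_pair([1.0, 3.0, 2.0], [4.0, 2.0, 6.0])
```

After the transformation the sorted values of the two channels are identical, so a systematic intensity difference between the dyes no longer shifts one allele relative to the other.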
The reduction in variance of the LRR values after applying the four steps above is demonstrated in Figure 6 for one
cluster group of Illumina HumanHap610 assays from this project.
Figure 6 Variance of log2 R ratio (LRR) before and after normalization procedure for one cluster group of Illumina
HumanHap610 assays.
The reduction in GC/CpG waves is evident in Figure 7.1 (a sample without any chromosomal abnormality) and Figure
7.2 (a sample with a duplication), which plot the signal intensity patterns before and after wave adjustment for
two samples from this project.
Figure 7.1 Pre-normalization (left) and post-normalization (right) for a subject without chromosomal abnormality.
Each dot in the figure represents one SNP. Red dots represent B-allele frequency (BAF, scale on the right side), while
black dots show LRR values (LRR, scale on the left side). Three red bands represent BAF values for AA, AB, BB
genotypes along the entire chromosome. One black band in middle (overlap with red AB BAF band) represents LRR
values along the entire chromosome. There are wavy patterns with peaks and troughs for the LRR values across
entire chromosome 13 for pre-normalized data.
Figure 7.2 Pre-normalization (left) and post-normalization (right) for subject with duplication. There are wavy
patterns with peaks and troughs for the LRR values across entire chromosome 17 for pre-normalized data.
4.3. Segmentation Methods
In this project, two open-source packages, Genomic Alteration Detection Analysis (GADA) [14] and BAF
Segmentation [15], were applied to the same data set to detect breakpoints on each chromosome.
GADA uses the Sparse Bayesian Learning (SBL) segmentation algorithm, and BAF Segmentation uses the
Circular Binary Segmentation (CBS) algorithm [16]. The mosaic events detected in samples by the two methods were
then combined.
Two large mosaicism studies were conducted by two independent research groups, the Gene-Environment
Association Studies consortium (GENEVA) and the Cancer Genomics Research Laboratory (CGR), and the results from
both groups were published in Nature Genetics in May 2012 [8,17]. Two lung cancer data sets, from the Environment
and Genetics in Lung Cancer Etiology Study (EAGLE) and the Prostate, Lung, Colorectal, Ovarian Cancer Screening
Trial (PLCO), were used by both groups. The GENEVA group used the CBS algorithm and the CGR group used the SBL
algorithm. The resulting mosaic events were then compared between the groups. The concordance rate was 75%, and
each group detected mosaic events that the other group missed. To minimize the false negative rate, this
project used both segmentation algorithms to detect mosaic chromosomal abnormalities and then combined the
results.
The main steps implemented in the Genomic Alteration Detection Analysis (GADA) software [14] are:
Breakpoint detection is based on the SBL (Sparse Bayesian Learning) algorithm.
The method detects segments where the B deviation is different from 0. The B deviation is the deviation of the
observed BAF value from the expected BAF value of 0.5 for heterozygous SNPs.
Essential steps:
o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.
o The Sparse Bayesian Learning (SBL) model is used to discover the most likely genomic locations and
magnitudes for a CNV segment. The sparseness hyperparameter aAlpha controls the SBL prior distribution,
which is uninformative about the location and amplitude of the CNV breakpoints but imposes a penalty
on the number of CNV breakpoints. A higher aAlpha implies that fewer breakpoints are expected a priori,
which results in fewer true CNVs being detected but also fewer false positives.
o Backward Elimination (BE) is used to rank the statistical significance of each breakpoint obtained from
SBL and to sequentially remove the least significant breakpoints, using two parameters: T and
MinSegLen. The T argument is the critical value of the BE algorithm for the statistical score tm
associated with breakpoint m. Breakpoints with tm lower than T are discarded. The score tm is the
difference between the sample averages of the probes falling on the left and right segments, divided by a
pooled estimate of the standard error. T can be adjusted to control the False Discovery
Rate (FDR). The MinSegLen argument indicates the minimum number of consecutive probes (SNP markers) with a
BAF deviation different from 0 that each CNV segment must contain. As T and MinSegLen increase,
the number of CNV breakpoints decreases.
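The breakpoint score used by Backward Elimination can be sketched as a two-sample statistic. This is an illustration of the description above, not GADA's implementation: the pooled standard error is computed here from the two flanking segments only, which is an assumption, and the function name and the threshold value in the usage lines are hypothetical.

```python
import math
from statistics import mean, variance

def breakpoint_score(left, right):
    """Score a breakpoint from the probe values on its left and right segments.

    Returns |mean(left) - mean(right)| divided by a pooled standard error.
    Each segment needs at least two probes so that its variance is defined.
    """
    n_left, n_right = len(left), len(right)
    pooled_var = ((n_left - 1) * variance(left) + (n_right - 1) * variance(right)) / (
        n_left + n_right - 2
    )
    std_err = math.sqrt(pooled_var * (1.0 / n_left + 1.0 / n_right))
    return abs(mean(left) - mean(right)) / std_err

# Made-up B-deviation values on either side of a candidate breakpoint.
left_segment = [0.00, 0.10, -0.10, 0.00]
right_segment = [0.30, 0.40, 0.20, 0.30]
T = 4.0  # illustrative critical value only
keep_breakpoint = breakpoint_score(left_segment, right_segment) >= T
```

A breakpoint separating two segments with clearly different means scores high and survives, whereas a breakpoint inside a homogeneous region scores near 0 and is among the first to be eliminated.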
The main steps implemented in the BAF Segmentation software [15] are:
Breakpoint detection is based on the CBS (Circular Binary Segmentation) algorithm [16].
The method detects segments where mBAF is different from 0.5, since the expected BAF is 0.5 for the AB genotype in
a diploid genome without CNVs. BAF data are reflected into mBAF along the 0.5 axis by the transformation
mBAF = abs(BAF - 0.5) + 0.5, where abs denotes the absolute value.
Essential steps:
o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.
o Convert BAF data to mBAF, so that homozygous SNPs (AA and BB) are positioned at 1 and heterozygous
SNPs without CNVs are positioned at 0.5.
o The homozygous SNPs are uninformative for determination of the total copy number. Remove
homozygous SNPs from the mBAF profile based on a fixed mBAF threshold. SNPs above the threshold are
considered non-informative and removed.
o Triplet filtering is next applied to the mBAF-threshold-filtered data to further improve the removal. For
each SNP, the absolute differences in mBAF between the investigated SNP and its preceding and
succeeding SNPs are summed and added to the SNP’s distance from the 0.5 baseline. For a SNP with
index i: triplet sum[i] = abs(mBAF[i - 1] - mBAF[i]) + abs(mBAF[i + 1] - mBAF[i]) + mBAF[i] - 0.5
o Triplet sums are compared against a threshold. SNPs with triplet sums above the threshold are
considered outliers and removed. The triplet filtering is designed to remove non-informative
homozygous SNPs that, due to experimental noise, have mBAF values lower than the mBAF threshold.
o Apply the Circular Binary Segmentation model to the mBAF profiles after removal of non-informative
homozygous SNPs to discover the most likely genomic locations and magnitudes for a CNV segment. The
total number of breakpoints is controlled by alpha, the significance level for accepting change-points.
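The mBAF preprocessing steps listed above can be sketched as follows. This is a minimal illustration rather than the BAF Segmentation package's own code: the function names are hypothetical, the two threshold defaults are placeholders chosen for the example, and leaving the first and last SNP unfiltered in the triplet step is an assumption.

```python
def to_mbaf(baf_values):
    """Reflect BAF about 0.5: mBAF = abs(BAF - 0.5) + 0.5."""
    return [abs(b - 0.5) + 0.5 for b in baf_values]

def drop_homozygous(mbaf, mbaf_threshold=0.97):
    """Remove SNPs whose mBAF is above a fixed threshold (treated as homozygous)."""
    return [m for m in mbaf if m <= mbaf_threshold]

def triplet_filter(mbaf, triplet_threshold=0.8):
    """Remove interior SNPs whose triplet sum exceeds the threshold.

    triplet_sum[i] = abs(mBAF[i-1] - mBAF[i]) + abs(mBAF[i+1] - mBAF[i]) + mBAF[i] - 0.5
    Neighbors are taken from the input profile; the two end SNPs are kept as-is.
    """
    kept = []
    for i, m in enumerate(mbaf):
        if 0 < i < len(mbaf) - 1:
            triplet_sum = abs(mbaf[i - 1] - m) + abs(mbaf[i + 1] - m) + m - 0.5
            if triplet_sum > triplet_threshold:
                continue  # isolated high value: likely a noisy homozygous SNP
        kept.append(m)
    return kept

# Made-up BAF profile: two homozygous SNPs (0.0 and 1.0) are dropped by the threshold step.
profile = drop_homozygous(to_mbaf([0.0, 0.48, 0.55, 1.0, 0.1]))
# Made-up mBAF profile: the isolated 0.9 stands out from its neighbors and is removed.
cleaned = triplet_filter([0.52, 0.51, 0.9, 0.53, 0.5])
```

The triplet sum is large only when a SNP is both far from its two neighbors and far from the 0.5 baseline, so a run of consistently elevated mBAF values (a real allelic imbalance) is not removed.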
4.4. Mosaic Event Type Calling Method
Each event was assigned a copy-number state based on the median LRR value for the segment:
State = mosaic duplication (gain) if median(LRR) > 0.2 sLRR
State = mosaic deletion (loss) if median(LRR) < -0.2 sLRR
State = mosaic copy-neutral LOH otherwise
where sLRR is the standard deviation of the segment LRR values.
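The calling rule can be written as a short function. The sketch assumes that the loss threshold is the negative of the gain threshold (otherwise the copy-neutral case could never be reached) and that sLRR is the sample standard deviation of the segment; both the function name and the use of the sample rather than population standard deviation are assumptions.

```python
from statistics import median, stdev

def call_event_state(segment_lrr, k=0.2):
    """Classify a segment as "gain", "loss", or "cnloh" from its LRR values.

    gain  if median(LRR) >  k * s_LRR
    loss  if median(LRR) < -k * s_LRR
    cnloh otherwise
    """
    med = median(segment_lrr)
    cutoff = k * stdev(segment_lrr)  # stdev needs at least two probes
    if med > cutoff:
        return "gain"
    if med < -cutoff:
        return "loss"
    return "cnloh"

# Made-up LRR values for three segments.
state_gain = call_event_state([0.30, 0.32, 0.28, 0.31, 0.29])
state_loss = call_event_state([-0.30, -0.32, -0.28, -0.31, -0.29])
state_neutral = call_event_state([0.01, -0.20, 0.20, 0.00, -0.01])
```

Scaling the cutoff by the segment's own standard deviation means that a noisy segment needs a larger median shift before it is called a gain or a loss.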
After each segmentation method was applied to the same data set, the output file contained the chromosome, the
start and end of each detected segment, the median LRR, and the standard deviation of the LRR within the segment.
For each sample, adjacent events were merged if the event types were identical and the distance between segments
was less than 1 Mb. After merging, events smaller than 0.5 Mb were excluded, as the false-positive
rate increased rapidly for events of smaller size. Most of the false positives were due to noisy data (high LRR and
BAF variance) or to non-mosaic CNVs being detected as potentially mosaic.
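The merging and size filtering can be sketched as below for the events of a single sample. The record layout (dictionaries with chrom, start, end, and state keys, positions in base pairs) and the function name are hypothetical; the 1 Mb gap and 0.5 Mb minimum size are the values stated in the text.

```python
def merge_and_filter(events, max_gap=1_000_000, min_size=500_000):
    """Merge adjacent same-type events closer than max_gap, then drop small events.

    events: list of dicts with keys "chrom", "start", "end", "state" for one sample.
    Returns a new list; the input records are not modified.
    """
    merged = []
    for ev in sorted(events, key=lambda e: (e["chrom"], e["start"])):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["chrom"] == ev["chrom"]
            and prev["state"] == ev["state"]
            and ev["start"] - prev["end"] < max_gap
        ):
            prev["end"] = max(prev["end"], ev["end"])  # extend the previous event
        else:
            merged.append(dict(ev))  # copy so the caller's record stays untouched
    return [e for e in merged if e["end"] - e["start"] >= min_size]

# Made-up events: two nearby gains on chromosome 1 and one small loss.
sample_events = [
    {"chrom": "1", "start": 1_000_000, "end": 1_400_000, "state": "gain"},
    {"chrom": "1", "start": 1_900_000, "end": 2_300_000, "state": "gain"},
    {"chrom": "1", "start": 10_000_000, "end": 10_200_000, "state": "loss"},
]
final_events = merge_and_filter(sample_events)
```

In the usage lines the two gains are 0.5 Mb apart and merge into a single 1.3 Mb event, while the 0.2 Mb loss falls below the size cutoff and is dropped.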
4.5. Proportion of Mosaicism
For each segment identified by SBL/CBS, Gaussian mixture models with 2-4 components were fit to the normalized BAF
values of the segment, and the best-fitting model was chosen using the Akaike information criterion (AIC). The 2-4
components represent 2-4 possible BAF bands. A two-component model (2 BAF bands, representing AA and BB or A and B)
fits best for segments that have complete loss of heterozygosity, that is, copy-neutral LOH or loss with mosaic
proportions of nearly 100%. A three-component model (3 BAF bands for AA, AB, and BB) should be the best fit for
segments that are normal or have very low mosaic proportions. For segments where a two- or three-component model
was chosen, mosaic proportions were assigned manually when there was sufficient evidence of mosaicism after manual
review of the combined LRR and BAF plot.
Segments where the four-component model was the best fit (4 BAF bands: AA/A, BB/B, AB/A, and AB/B for mosaic
deletion; AA/AA, BB/BB, AB/AA, and AB/BB for mosaic CNLOH; AA/AAA, BB/BBB, AB/AAB, and AB/ABB for mosaic
duplication, see the last three rows of Table 4) were assigned mosaic proportions based on the inferred state and
the location of the estimated heterozygote BAF bands (mu1, mu2), where mu1 and mu2 are the means of the BAF values
across the segment for each of the two heterozygote BAF bands.
The mosaic proportions were calculated based on the inferred mosaic state and location of the estimated
heterozygote BAF (mu1, mu2) with formulas similar to [18]:
D = mu1 - mu2
Proportion of cells with a deletion = 2D / (1 + D)
Proportion of cells with a duplication = 2D / (1 - D)
Proportion of cells with copy number neutral loss of heterozygosity = D
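The three formulas above translate directly into code. In this sketch the function name and the state labels are hypothetical, and D is taken as the absolute difference between the two band means so that the order of mu1 and mu2 does not matter (an assumption; the text writes D = mu1 - mu2).

```python
def mosaic_proportion(state, mu1, mu2):
    """Estimate the proportion of abnormal cells from the two heterozygous BAF band means.

    state: "loss" (mosaic deletion), "gain" (mosaic duplication), or "cnloh"
    """
    d = abs(mu1 - mu2)
    if state == "loss":
        return 2.0 * d / (1.0 + d)
    if state == "gain":
        return 2.0 * d / (1.0 - d)
    if state == "cnloh":
        return d
    raise ValueError("unknown state: " + repr(state))

# Made-up band means for a mosaic CNLOH segment.
p_cnloh = mosaic_proportion("cnloh", 0.70, 0.30)
```

As a consistency check, heterozygous bands at 2/3 and 1/3 give D = 1/3, so the duplication formula returns 2(1/3) / (1 - 1/3) = 1, that is, a non-mosaic trisomy in which every cell carries the extra copy.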
4.6. Example of Mosaic Duplication, Deletion, and CNLOH
Figure 8a is the LRR and BAF plot for a normal sample. Figures 8b-g are LRR and BAF plots of six representative
examples of different types of mosaic chromosomal rearrangements selected from this project. The
plots show the signal intensity Log R ratio (LRR) (black dots, scale on the left side) and B allele frequency (BAF)
(red dots, scale on the right side) values along the entire chromosome carrying the rearrangement in each selected
sample.
Figure 8a Example of one subject with normal copy number for chromosome 13. Each dot in the figure represents one SNP.
Red dots represent B allele frequency (BAF, scale on the right side), while black dots show Log R ratio values (LRR,
scale on the left side). The three red bands represent BAF values for the AA (bottom band), AB (middle band), and BB
(top band) genotypes across the entire chromosome 13. The single black band in the middle (overlapping the red AB
BAF band) represents LRR values (around 0) along the entire chromosome 13.
Figure 8b Interstitial mosaic duplication on the p arm of chromosome 16, characterized by increased Log R ratio (mean of
LRR within the segment (blue line) > 0) and abnormal heterozygous BAF. The vertical gray lines indicate the
breakpoint(s) of the event segment. A non-mosaic trisomy would have a wider BAF split, at 1/3 (AAB) and 2/3 (ABB),
and a larger elevation of LRR.
Figure 8c Mosaic duplication of the entire chromosome 8, characterized by increased Log R ratio (mean of LRR
within segment > 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8c is lower than in Figure 8b
because it has a narrower split in the intermediate heterozygous BAF bands along with a smaller increase in LRR.
Figure 8d Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire q arm of chromosome 1, characterized
by unchanged Log R ratio (mean of LRR within segment close to 0) and abnormal heterozygous BAF. The p arm is in
a normal state. A non-mosaic CNLOH would have only two BAF bands (AA and BB) and LRR close to 0.
Figure 8e Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire chromosome 14, characterized by
unchanged Log R ratio (mean of LRR within segment close to 0) and abnormal heterozygous BAF. The degree of
mosaicism in Figure 8e is greater than in Figure 8d because it has a wider split in the intermediate BAF bands.
Figure 8f Two small interstitial mosaic heterozygous deletions on the p arm of chromosome 2, characterized by
decreased Log R ratio (mean of LRR within segment < 0) and abnormal heterozygous BAF. A non-mosaic
heterozygous deletion would have no intermediate BAF bands and a larger decrease in LRR.
Figure 8g Large mosaic heterozygous deletion on the q arm of chromosome 9, characterized by decreased Log R
ratio (mean of LRR within segment < 0) and abnormal heterozygous BAF. The mosaic deletion in Figure 8g has a smaller
proportion of cells containing the deletion than those in Figure 8f because it has a narrower split in the
intermediate BAF bands along with a smaller decrease in LRR.
5. Data Analysis
The analysis started by loading the sample intensity files (two files per sample, for the red and green channels) into
Illumina’s GenomeStudio software. The intensity data were normalized using Illumina’s five-step self-normalization
procedure (see section 3, Illumina SNP Genotyping and Normalization), which draws on information
contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values.
The called genotypes, genotype call quality scores, raw and normalized probe intensities, LRR, and BAF
for each assay were exported from GenomeStudio in its “Genotype Final Report” (GFR) format. The
GFR file was then converted into a high-performance binary file format (GDAT)
using the GLU software package (http://code.google.com/p/glu-genetics/ ), which was developed at the Cancer Genomics
Research Laboratory (CGR).
A GC/CpG model file (GCM file) was generated from a copy of the UCSC hg18 reference genome and the Illumina binary
manifest file “Human610-Quadv1_B.bpm”. Within GDAT, a four-step custom software pipeline (see
section 4.2, Re-estimate Log R Ratio and B Allele Frequency) was implemented. The information in the GCM file was
used for GC/CpG correction. LRR and BAF were re-estimated from the quantile-normalized and GC/CpG-corrected
values and written directly into the GDAT file as a new data table. All of these procedures were implemented using
the GLU software package. The renormalized LRR and BAF values from qualifying assays (completion rate >= 90%) were
then analyzed using two custom software pipelines built on the GADA and BAF Segmentation packages to detect
whole-chromosome and large segmental events. Only events greater than 0.5 Mb in size were considered, to minimize
false discoveries (see section 4.3, Segmentation Methods).
We applied the GADA method with the following parameter settings: the SBL sparseness hyperparameter
that controls the total number of breakpoints discovered, aAlpha = 0.85; the critical value of the backward
elimination algorithm for the statistical score associated with a breakpoint, T statistic = 10; and the minimum
number of SNPs each CNV segment must contain, MinSegLen = 200. We applied the BAF Segmentation method with the
following parameter settings: the mBAF threshold for calling regions of mosaic events based on segmented mBAF
values, ai_threshold = 0.56 (default); the minimum number of SNPs a segmented region must contain to be
called as a mosaic event, ai_size = 45; the mBAF threshold for removing putatively non-informative SNPs,
informative_threshold = 0.97 (default); the threshold for the triplet filtering used to improve removal of putatively
non-informative homozygous SNPs, triplet_threshold = 0.8 (default); and the significance level for accepting
change-points when CBS is used to identify breakpoints of genomic regions, alpha = 0.001.
For each sample, adjacent events were merged if their event types were identical and the distance between the segments
was less than 1 Mb. After merging, events shorter than 0.5 Mb were excluded. All remaining events were then plotted.
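The merge-and-filter rule can be sketched as below. This is an illustration under stated assumptions rather than the code used in the project: the `Event` record and function name are hypothetical, the input is taken to be the events of one sample, merging is assumed to apply only to consecutive events on the same chromosome, and coordinates are in base pairs.

```python
from dataclasses import dataclass

MERGE_GAP_BP = 1_000_000   # merge same-type events closer than 1 Mb
MIN_EVENT_BP = 500_000     # drop events shorter than 0.5 Mb after merging


@dataclass
class Event:
    chrom: str
    start: int
    end: int
    kind: str  # "gain", "loss" or "cnloh" (labels chosen here)


def merge_and_filter(events):
    """Merge adjacent same-type events of one sample, then apply the size cut."""
    merged = []
    for ev in sorted(events, key=lambda e: (e.chrom, e.start)):
        last = merged[-1] if merged else None
        if (last is not None and last.chrom == ev.chrom
                and last.kind == ev.kind
                and ev.start - last.end < MERGE_GAP_BP):
            last.end = max(last.end, ev.end)  # extend the previous event
        else:
            merged.append(Event(ev.chrom, ev.start, ev.end, ev.kind))
    return [e for e in merged if e.end - e.start >= MIN_EVENT_BP]


calls = [
    Event("chr1", 1_000_000, 1_300_000, "gain"),
    Event("chr1", 1_800_000, 2_100_000, "gain"),   # 0.5 Mb gap: merged
    Event("chr1", 5_000_000, 5_200_000, "loss"),   # 0.2 Mb: dropped
    Event("chr2", 1_000_000, 1_400_000, "gain"),
    Event("chr2", 3_000_000, 3_400_000, "gain"),   # 1.6 Mb gap: not merged
]
kept = merge_and_filter(calls)
print(kept)
```

Merging before filtering matters: the two chr1 gains are each below 0.5 Mb on their own, but the merged 1.1 Mb event passes the size cut.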
Based on manual review of each plot, false positive calls due to noisy assay data were excluded, as were non-mosaic
copy-number variants and loss of heterozygosity due to hemizygous deletion (deletion of one copy), identity by
descent (IBD), or uniparental disomy (UPD), because these are not mosaic events. The segment boundaries were
manually corrected for some of the events. Each event
detected was classified as mosaic duplication (gain), mosaic deletion (loss), or mosaic copy-neutral loss of
heterozygosity. The mosaic proportion of abnormal cells was then estimated (see section 4.5). The magnitude of the
BAF difference for single-copy duplication events is one-third of that for copy-neutral LOH or single-copy deletion
events, which reduces the sensitivity for calling duplication events. For
mosaic duplication events, only those with a proportion of abnormal cells <= 0.9 were kept, because above 0.9
it is difficult to distinguish reliably between mosaic and non-mosaic duplication.
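The one-third relationship can be checked with an idealized allele-counting model. This sketch is not part of the original analysis: the function names are hypothetical, and it assumes that a fraction p of cells carries a single-copy event at a heterozygous (AB) SNP while the remaining cells are normal, with no measurement noise.

```python
def expected_het_baf(p, event_type):
    """Expected (upper, lower) BAF bands at AB SNPs when a fraction p of
    cells carries the event (idealized, noise-free model)."""
    if event_type == "deletion":
        total = 2 - p                      # abnormal cells keep one allele
        return 1 / total, (1 - p) / total
    if event_type == "duplication":
        total = 2 + p                      # abnormal cells carry three alleles
        return (1 + p) / total, 1 / total
    if event_type == "cnloh":
        return (1 + p) / 2, (1 - p) / 2    # copy number stays at two
    raise ValueError(f"unknown event type: {event_type}")


def baf_split(p, event_type):
    """Distance D between the two heterozygote BAF bands."""
    upper, lower = expected_het_baf(p, event_type)
    return upper - lower


for kind in ("deletion", "duplication", "cnloh"):
    print(kind, [round(baf_split(p, kind), 3) for p in (0.25, 0.5, 1.0)])
```

At p = 1 the split is 1/3 for duplication (bands at 2/3 and 1/3) and 1 for deletion and CNLOH, which is the factor of one-third quoted above. Inverting these expressions for p gives the proportion formulas based on D.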
To view the characteristics of the mosaic events, they were plotted by proportion of abnormal cells and LRR
using Microsoft Office Excel (Figure 9). Two circular genomic plots, one for bladder cancer cases and one for
controls, each with three tracks of mosaic events for autosomes 1 to 22, were generated using the Circos software
(http://circos.ca/software/ ) (Figure 10). The plots of the frequency of mosaic events by age and cancer status for
all subjects and for males only were generated using Microsoft Office Excel (Figure 12). Logistic regression models
were fit using the SAS software
package to determine the relationship between individuals having mosaic event(s) and their age at DNA collection,
gender, and cancer diagnosis.
6. Results and Discussion
6.1. Characteristics of Mosaic Events
We observed 193 mosaic segments larger than 0.5 Mb on autosomal chromosomes in 163 individuals, for an overall
frequency of individuals with mosaicism of 5%. 118 mosaic events were from bladder cancer
individuals (61.14%) and 75 mosaic events were from cancer-free controls (38.86%). Mosaic autosomal
abnormalities were more common in the bladder cancer individuals (98/1673 = 5.86%) than in cancer-free
persons (65/1577 = 4.12%). The chromosome most frequently carrying an event was chromosome 17 for bladder
cancer individuals (6.74% of all events) and chromosomes 2 and 6 for control individuals (4.15% each). Combining
cases and controls, the chromosomes most frequently carrying events were chromosomes 2, 10, and 17 (8.29% each)
(Table 5), which may imply instability of these three chromosomes. The most frequent type of event observed was mosaic
duplication (55.96%), whereas mosaic deletion and mosaic CNLOH constituted 12.44% and 31.61% of mosaic
events, respectively (Table 6). CNLOH segments were the largest and mosaic duplication segments the smallest:
median lengths were 0.82 Mb for mosaic duplications, 2.32 Mb for mosaic deletions, and 21.05 Mb for mosaic
CNLOHs. The abnormal cell proportions ranged from 20.88% to 89.86% for mosaic duplication, from 25.45% to
96.64% for mosaic deletion, and from 3.84% to 95% for mosaic CNLOH (Figure 9).
Table 5 Frequency of Mosaic Chromosomal Events by Chromosome and Case-Control status.
Columns: Chromosome; Mosaic Chromosome Count (Bladder Cancer, Control, Total); Mosaic Chromosome Frequency in % (Bladder Cancer, Control, Total)
1 5 4 9 2.59 2.07 4.66
2 8 8 16 4.15 4.15 8.29
3 3 2 5 1.55 1.04 2.59
4 4 4 8 2.07 2.07 4.15
5 3 2 5 1.55 1.04 2.59
6 1 8 9 0.52 4.15 4.66
7 7 1 8 3.63 0.52 4.15
8 7 4 11 3.63 2.07 5.70
9 7 3 10 3.63 1.55 5.18
10 10 6 16 5.18 3.11 8.29
11 2 2 4 1.04 1.04 2.07
12 4 3 7 2.07 1.55 3.63
13 8 4 12 4.15 2.07 6.22
14 6 3 9 3.11 1.55 4.66
15 5 4 9 2.59 2.07 4.66
16 9 3 12 4.66 1.55 6.22
17 13 3 16 6.74 1.55 8.29
18 3 3 6 1.55 1.55 3.11
19 3 1 4 1.55 0.52 2.07
20 3 1 4 1.55 0.52 2.07
21 5 2 7 2.59 1.04 3.63
22 2 4 6 1.04 2.07 3.11
Total 118 75 193 61.14 38.86 100
Table 6 Frequency of Mosaic Chromosomal Events by Event Type and Location.
                     Mosaic chromosome count           Mosaic chromosome frequency (%)
Event location       Gain   Loss   CN LOH   Total      Gain    Loss    CN LOH   Total
Whole chromosome        4      0        5       9      2.07    0.00      2.59    4.66
Telomeric p             7      1       10      18      3.63    0.52      5.18    9.33
Telomeric q            10      2       17      29      5.18    1.04      8.81   15.03
Interstitial           87     21       29     137     45.08   10.88     15.03   70.98
Telomeric (p + q)      17      3       27      47      8.81    1.55     13.99   24.35
Total                 108     24       61     193     55.96   12.44     31.61     100
Figure 9 Characteristics of mosaic events. Mosaic events plotted by proportion of abnormal cells (P) and LRR for 193
events in 163 individuals. A blue dot represents P and LRR values for a mosaic duplication event. A green dot is for
mosaic CNLOH event. A red dot is for mosaic deletion events.
Of the mosaic chromosomal events detected by the GADA and BAF Segmentation methods, 4.66% spanned an entire
chromosome. These included 4 whole-chromosome mosaic trisomy events on chromosomes 8, 12, 18, and 21, 3 of which
were carried by one subject, and 5 whole-chromosome mosaic CNLOH events on chromosomes 6, 9, 18, and 19 (2 events),
which were carried by 4 different subjects. No whole-chromosome mosaic deletion event was detected
(Figure 10). We found that 9.33% of mosaic chromosomal events began at the p-arm telomere and 15.03%
ended at the q-arm telomere. Most mosaic chromosomal events were interstitial (70.98%),
spanning no telomere. The majority of telomeric events (p + q) were mosaic copy-neutral LOH (27 / 47 = 57.45%),
followed by mosaic duplication (17 / 47 = 36.17%). The majority of interstitial events were mosaic duplication (87 /
137 = 63.5%), followed by mosaic CNLOH (29 / 137 = 21%) (Table 6).
Sixteen individuals (9 bladder cancer cases and 7 controls) had mosaic events on at least two chromosomes.
Among control individuals, the greatest number of mosaic chromosomal events observed in a single subject was 4.
This subject was from the ATBC cohort study, and the events covered whole chromosomes 12, 18, and 19 and the entire
q arm of chromosome 21, all of them mosaic duplications (Figure 11). Among bladder cancer individuals, the greatest
number of mosaic chromosomal events observed in a single subject was 11. This subject was from the PLCO cohort
study, and the events were located on 9 different chromosomes, including 2 events each on chromosomes 12 and 13,
all of them mosaic copy-neutral LOH with a very high degree of mosaicism.
Figure 10 Circular plots displaying the genomic location of mosaic events. The outer rings are autosomes 1 to 22.
The yellow track shows mosaic copy-neutral LOH events, the blue track mosaic duplication events, and the red track
mosaic deletion events. The left plot shows events detected in bladder cancer subjects; the right plot shows events
detected in cancer-free controls.
Figure 11 Mosaic duplication across the entire chromosomes 12, 18, and 19 and the entire q arm of chromosome 21 in a
control individual. It is characterized by increased Log R ratio (mean of LRR within segment > 0) and abnormal
heterozygous BAF.
6.2. Mosaic Chromosomal Abnormalities and Age at DNA Collection
We examined the effect of increasing age on the frequency of mosaic events across the three cohort studies, which
predominantly included individuals over the age of 60. Age was missing for 27 individuals. For the remaining 3212
subjects, the frequency of control individuals with mosaic events increased with age, from 3.21% for those under 60
to 4.53% for those aged 66-70 and 5.96% for those aged 76-89 (P = 0.11).
The frequency of bladder cancer individuals with mosaic events was almost constant with age, from 6.95% for those
under 60 to 7.24% for those aged 66-70 and 6.14% for those aged 76-89 (P =
0.32). The frequency of mosaic events was higher in bladder cancer individuals than in control individuals in the
first four age groups. However, the frequency of mosaic events was very similar for bladder cancer individuals and
control individuals among those aged 76-89 (Figure 12 Top).
The frequency of male control individuals with mosaic events increased with age, from 3.39% for those under 60 to
4.41% for those aged 66-70 and 7.51% for those aged 76-89 (P = 0.035). The
frequency of male bladder cancer individuals with mosaic events was almost constant with age, from 7.75% for those
under 60 to 6.81% for those aged 66-70 and 6.81% for those aged 76-89 (P =
0.26). The frequency of mosaic events was higher in male bladder cancer individuals than in male control individuals
in the first four age groups. However, the frequency of mosaic events was lower for male bladder cancer individuals
than for male control individuals among those aged 76-89 (Figure 12 Bottom).
For females, there were no mosaic events among those under age 60 and very few events in the other four age
categories, so reliable summaries cannot be provided for each age category (Table 7).
Figure 12 Frequency of mosaic events by age and cancer status for all individuals (Top) and male only (Bottom).
Table 7 Mosaic Event Counts by Five Age Categories and Cancer Status.
                         Male + Female                            Male                          Female
        Bladder Cancer         Control  Bladder Cancer         Control  Bladder Cancer         Control
Age        Yes      No     Yes      No     Yes      No     Yes      No     Yes      No     Yes      No
<=60        21     302       8     249      21     271       8     236       0      31       0      13
61-65       17     310      10     322      14     263       7     283       3      47       3      39
66-70       27     373      16     353      23     321      13     295       4      52       3      58
71-75       19     341      17     336      17     283      12     253       2      58       5      83
76-89       14     228      14     235      13     191      13     173       1      37       1      62
Total       98    1554      65    1495      88    1329      53    1240      10     225      12     255
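The age-specific percentages can be recomputed from the Male + Female counts in Table 7. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, and the third row is taken to be the 66-70 group, in line with the age groups named in the text. It prints both Yes / (Yes + No) and Yes / No, because the percentages quoted in section 6.2 appear to match the second ratio.

```python
# (yes, no) counts of subjects with and without a mosaic event, from Table 7.
table7_all = {
    "<=60":  {"case": (21, 302), "control": (8, 249)},
    "61-65": {"case": (17, 310), "control": (10, 322)},
    "66-70": {"case": (27, 373), "control": (16, 353)},
    "71-75": {"case": (19, 341), "control": (17, 336)},
    "76-89": {"case": (14, 228), "control": (14, 235)},
}


def proportion_pct(yes, no):
    """Percentage of subjects in the group that carry a mosaic event."""
    return 100 * yes / (yes + no)


def ratio_pct(yes, no):
    """Carriers per 100 non-carriers (an odds, expressed as a percentage)."""
    return 100 * yes / no


for age, groups in table7_all.items():
    for status, (yes, no) in groups.items():
        print(f"{age:6s} {status:8s} "
              f"yes/(yes+no) = {proportion_pct(yes, no):5.2f}%  "
              f"yes/no = {ratio_pct(yes, no):5.2f}%")
```

The two measures differ by only a few tenths of a percentage point here, so the qualitative age trends are the same under either definition.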
6.3. Mosaic Chromosomal Abnormalities and Gender
We also examined the effect of gender on the frequency of mosaic events across the three cohort studies, separately
for bladder cancer cases and controls. Among bladder cancer cases, mosaic events were more frequent in males (6.13%)
than in females (4.22%). Among controls, the frequency of mosaic events was almost the same for males (4.08%) and
females (4.48%). Logistic regression models of mosaic status on gender, adjusting for age, were fit separately for
(1) controls and (2) cases. For controls, OR = 0.994 with P-value = 0.986. For cases,
OR = 1.470 with P-value = 0.260. A logistic regression model of mosaic status on gender, adjusting for age and cancer
status, was also fit to all subjects and gave OR = 1.173 with P-value = 0.498. We therefore did not observe a
significant gender effect in any of the three models (Table 8).
Table 8 Frequency of Mosaic Events by Gender and Cancer Status.
                  Mosaic event frequency (%)    Adjusted Logistic Model
                  Male      Female              OR       95% CI           P-value
Bladder Cancer    6.13      4.22                1.470    (0.752-2.872)    0.260
Control           4.08      4.48                0.994    (0.519-1.904)    0.986
Overall           5.16      4.36                1.173    (0.739-1.862)    0.498
6.4. Mosaic Chromosomal Abnormalities and Cancer Risk
6.4.1. Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects
To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer, several logistic
regression models were fit to the data with mosaic status as the response variable and (1) cancer status; (2) cancer
status + gender; (3) cancer status + age; (4) cancer status + age + gender as predictors, to test the partial
independence of cancer status and mosaic status, controlling for gender and/or age. In each model, cancer status
= 1 for bladder cancer and 0 for control, gender = 1 for male and 0 for female, and age is a continuous variable.
There is modest evidence of a positive relationship between mosaic events and bladder cancer: OR = 1.44 (95% CI =
1.04–1.98; P = 0.027) for model (1); OR = 1.43 (95% CI = 1.04–1.97; P = 0.029) for model (2); OR = 1.45 (95% CI =
1.05–2.00; P = 0.025) for model (3); and OR = 1.44 (95% CI = 1.05–1.99; P = 0.026) for model (4). All four models
show very similar P-values for cancer status. There is no significant evidence of a gender effect or an age effect
in the models that include gender and/or age as predictors (Table 9.1).
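For model (1), which has cancer status as its only predictor, the odds ratio equals the cross-product ratio of the 2 x 2 table of cancer status by mosaic status, and its Wald interval follows from the four cell counts. The sketch below is a check under stated assumptions, not the SAS code used in the project: the function name is hypothetical, and the counts assume 98 of 1,673 cases and 65 of 1,566 controls with a mosaic event, as given in the abstract and conclusion.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2 x 2 table.

    a: cases with event      b: cases without event
    c: controls with event   d: controls without event
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


cases_total, cases_mosaic = 1673, 98
controls_total, controls_mosaic = 1566, 65

or_, low, high = odds_ratio_wald(
    cases_mosaic, cases_total - cases_mosaic,
    controls_mosaic, controls_total - controls_mosaic,
)
print(f"OR = {or_:.2f}, 95% CI = {low:.2f}-{high:.2f}")
```

With these counts the result rounds to OR = 1.44 with a 95% CI of 1.04 to 1.98, matching the values reported for model (1).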
All four main-effect models fit the data adequately: P = 0.298 from Pearson’s goodness-of-fit test for
model (2), and P = 0.808 and 0.575 from the Hosmer-Lemeshow goodness-of-fit test for models (3) and (4). Age is a
continuous variable, so models (3) and (4) have a very large number of unique covariate profiles (Table 9.2). In this
situation the data are too sparse for the Pearson and deviance goodness-of-fit tests, and the Hosmer-Lemeshow
goodness-of-fit test is more suitable because it is designed for a large number of settings of the predictors. All
four models have very similar AIC values. The difference in -2 Log L between model (1) and each of the other three
models is less than 3.84, the 0.05 critical value of the chi-square distribution with 1 degree of freedom,
so none of the more complex models significantly improves upon the simplest model, model (1). However,
we are interested not only in whether there is evidence of a cancer status effect but also in the age and gender
effects on mosaic status, so model (4) (Table 9.3) gives more meaningful and interpretable results for answering our
scientifically as well as statistically important questions.
To test for equality of the odds ratios between cancer status and mosaic status across ages, we added an interaction
term, using cancer_status + age + gender + cancer_status*age as predictors. There is borderline evidence
that the odds ratio for bladder cancer and mosaic events differs across ages (P = 0.054). It is therefore
reasonable to assume that the log(OR) between the cancer status levels at a given age x is close to a linear
function of x. This model fits the data very well and has the smallest AIC value among all tested models (Table 9.4).
We also examined whether DNA source and study were significant predictors of mosaic status using
univariate analysis for each nominal variable. Notably, DNA source (58.6% from blood, 41.4% from buccal cells)
was not a significant predictor (P-value = 0.578). The study (ATBC, PLCO, CPS) was also not a significant predictor
(P-value = 0.377).
Table 9.1 Summary Logistic Regression Models on All Subjects.
Setting: presence of mosaic event = 1; case vs control; male vs female
                                                  STATUS                             GENDER       AGE          STATUS*AGE
Model                                             OR     95% Wald CI   Pr > ChiSq    Pr > ChiSq   Pr > ChiSq   Pr > ChiSq
mosaic_event=status+error                         1.44   (1.04-1.98)   0.027
mosaic_event=status+gender+error                  1.43   (1.04-1.97)   0.029         0.503
mosaic_event=status+age+error                     1.45   (1.05-2.00)   0.025                      0.931
mosaic_event=status+gender+age+error              1.44   (1.05-1.99)   0.026         0.498        0.98
mosaic_event=status+age+gender+status*age+error                        0.031         0.408        0.095        0.054
Table 9.2 Summary Goodness-of-Fit Results for Logistic Regression Models on All Subjects.
Setting: presence of mosaic event = 1; case vs control; male vs female
                                                                        Pearson GOF   HL GOF
Model                                             -2 Log L   AIC        Pr > ChiSq    Pr > ChiSq
mosaic_event=status+error                         1287.16    1291.19    .             .
mosaic_event=status+gender+error                  1286.72    1292.72    0.298
mosaic_event=status+age+error                     1284.12    1290.12                  0.808
mosaic_event=status+gender+age+error              1283.64    1291.64                  0.575
mosaic_event=status+age+gender+status*age+error   1279.96    1289.96                  0.958
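The nested-model comparisons can be reproduced from the -2 Log L column of Table 9.2 with a likelihood-ratio test. The sketch below is an illustration, not the original SAS analysis: the function names and dictionary keys are hypothetical, and the chi-square tail probability is coded in closed form only for 1 and 2 degrees of freedom, which is all these comparisons need.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch supports only df = 1 or 2")


# -2 Log L values copied from Table 9.2.
neg2_log_l = {
    "status": 1287.16,
    "status+gender": 1286.72,
    "status+age": 1284.12,
    "status+gender+age": 1283.64,
    "status+age+gender+status*age": 1279.96,
}


def lr_test(reduced, full, df):
    """Likelihood-ratio statistic and P-value; df = number of added parameters."""
    stat = neg2_log_l[reduced] - neg2_log_l[full]
    return stat, chi2_sf(stat, df)


for reduced, full, df in [
    ("status", "status+gender", 1),
    ("status", "status+age", 1),
    ("status", "status+gender+age", 2),
    ("status+gender+age", "status+age+gender+status*age", 1),
]:
    stat, p = lr_test(reduced, full, df)
    print(f"{reduced} -> {full}: LR = {stat:.2f}, df = {df}, P = {p:.3f}")
```

Note that the comparison of model (1) with model (4) adds two parameters, so its reference distribution has 2 degrees of freedom (0.05 critical value 5.99). The statistic of 3.52 is below both 3.84 and 5.99, so the conclusion is unchanged.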
Table 9.3 SAS Output From Logistic Regression Main Effect Model on All Subjects.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -3.2544 0.6667 23.8264 <.0001
AGE_DNA 1 -0.00022 0.00891 0.0006 0.9802
SEX_Code 1 0.1598 0.2357 0.4597 0.4978
STATUS_Code 1 0.367 0.1649 4.957 0.026
Table 9.4 SAS Output From Logistic Regression Model With Interaction Term on All Subjects.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -5.3793 1.299 17.1479 <.0001
STATUS_Code 1 3.0518 1.4109 4.6788 0.0305
AGE_DNA 1 0.0303 0.0181 2.7921 0.0947
SEX_Code 1 0.1957 0.2366 0.6844 0.4081
STATUS_Code*AGE_DNA 1 -0.0395 0.0205 3.7226 0.0537
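With the interaction term in the model, the odds ratio for cancer status is no longer a single number: at age x it is exp(b_status + b_interaction * x). The sketch below evaluates this using the point estimates in Table 9.4. The function and variable names are hypothetical, and confidence intervals are omitted because they would require the covariance between the two estimates, which is not reported.

```python
import math

# Point estimates from Table 9.4 (all subjects).
B_STATUS = 3.0518          # STATUS_Code
B_STATUS_X_AGE = -0.0395   # STATUS_Code*AGE_DNA


def status_odds_ratio(age):
    """Estimated case-vs-control odds ratio for a mosaic event at a given age."""
    return math.exp(B_STATUS + B_STATUS_X_AGE * age)


# Age at which the estimated odds ratio equals 1.
crossover_age = -B_STATUS / B_STATUS_X_AGE

for age in (55, 60, 65, 70, 75, 80, 85):
    print(f"age {age}: OR = {status_odds_ratio(age):.2f}")
print(f"OR = 1 at about age {crossover_age:.1f}")
```

The estimated odds ratio falls from about 2 at age 60 to below 1 after roughly age 77, which agrees with the pattern in Figure 12, where cases and controls have similar frequencies in the oldest age group.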
To reduce the number of unique covariate profiles when age was in the model, we divided age into five age groups:
<=60, 61-65, 66-70, 71-75, and >=76. We then fit a logistic regression of mosaic status on cancer status, gender, and
age group to test the partial independence of mosaic status and cancer status, controlling for gender and the five
age groups. Again there is modest evidence of a positive association between bladder cancer and mosaic status
(P-value = 0.025), only slightly smaller than the P-value when age is a continuous variable (P-value = 0.026). None
of the last four age groups differed significantly from the first age group (Table 9.5).
Table 9.5 SAS Output From Logistic Regression Model with Age Split to Five Groups on All Subjects.
Type 3 Analysis of Effects
Effect DF Wald Chi-Square Pr > ChiSq
STATUS_Code 1 5.0302 0.0249
SEX_Code 1 0.55 0.4583
AGE_GROUP 4 2.108 0.7159
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -3.3269 0.3066 117.7724 <.0001
STATUS_Code 1 0.3686 0.1643 5.0302 0.0249
SEX_Code 1 0.1753 0.2364 0.55 0.4583
AGE_GROUP 61--65 1 -0.1693 0.2744 0.3808 0.5372
AGE_GROUP 66--70 1 0.1418 0.2476 0.3282 0.5667
AGE_GROUP 71--75 1 0.0507 0.2576 0.0388 0.8439
AGE_GROUP >=76 1 0.182 0.2739 0.4416 0.5063
Odds Ratio Estimates
Effect Point Estimate 95% Wald Confidence Limits
STATUS_Code 1.446 1.048 1.995
SEX_Code 1.192 0.75 1.894
AGE_GROUP 61--65 vs <=60 0.844 0.493 1.446
AGE_GROUP 66--70 vs <=60 1.152 0.709 1.872
AGE_GROUP 71--75 vs <=60 1.052 0.635 1.743
AGE_GROUP >=76 vs <=60 1.2 0.701 2.052
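The odds ratio estimates in Table 9.5 follow directly from the maximum likelihood estimates and their standard errors: the point estimate is exp(estimate) and the Wald limits are exp(estimate ± 1.96 × standard error). A minimal sketch, with a hypothetical function name, using the STATUS_Code and SEX_Code rows:

```python
import math


def wald_odds_ratio(estimate, std_error, z=1.96):
    """Odds ratio and Wald 95% limits from a logistic coefficient and its SE."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


# Estimates and standard errors copied from Table 9.5.
status_or = wald_odds_ratio(0.3686, 0.1643)
sex_or = wald_odds_ratio(0.1753, 0.2364)

print("STATUS_Code:", [round(v, 3) for v in status_or])
print("SEX_Code:   ", [round(v, 3) for v in sex_or])
```

For STATUS_Code this gives 1.446 (1.048, 1.995), the same values as in the Odds Ratio Estimates table.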
6.4.2. Mosaic Chromosomal Abnormalities and Cancer Risk for Men Only
Bladder cancer is the fourth most common cancer diagnosed in men. Men are about 3 to 4 times more likely to get
bladder cancer during their lifetime than women. Overall, the chance men will develop this cancer during their life
is about 1 in 26. For women, the chance is about 1 in 90 [19].
To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in males only, two
logistic regression models were fit to the data with mosaic status as the response variable and (1) cancer status; (2)
cancer status + age as predictors, to test the partial independence of mosaic status and cancer status, controlling
for age. In each model, cancer status = 1 for bladder cancer and 0 for control, and age is a continuous variable.
There is modest evidence of a positive relationship between mosaic events and bladder cancer risk in male
individuals: OR = 1.53 (95% CI = 1.08–2.17; P = 0.017) for model (1) and OR = 1.55 (95% CI = 1.09–2.20; P = 0.014) for
model (2) (Table 10.1). There is no significant evidence of an age effect in model (2) (Table 10.2).
To test for equality of the odds ratios between mosaic events and bladder cancer across ages in male
individuals, we added an interaction term, using cancer status + age + cancer status*age as predictors. There is
significant evidence that the odds ratio between mosaic events and bladder cancer differs across ages (P =
0.017). This model fits the data very well and has the smallest AIC value among all tested models (Table 10.3).
Table 10.1 Summary Logistic Regression Models On Male Subjects.
Setting: presence of mosaic event = 1; case vs control
                                            STATUS                             AGE          STATUS*AGE   HL GOF
Model                                       OR     95% Wald CI   Pr > ChiSq    Pr > ChiSq   Pr > ChiSq   Pr > ChiSq
Mosaic_event=status+error                   1.53   (1.08-2.17)   0.017                                   .
Mosaic_event=status+age+error               1.55   (1.09-2.20)   0.014         0.998                     0.547
Mosaic_event=status+age+status*age+error                         0.008         0.035        0.017        0.906
Table 10.2 SAS Output From Logistic Regression Main Effect Model on Male Subjects.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -3.1538 0.6499 23.5469 <.0001
STATUS_Code 1 0.4378 0.1788 5.992 0.0144
AGE_DNA 1 0.000018 0.0094 0 0.9985
Table 10.3 SAS Output From Logistic Regression Model With Interaction Term on Male Subjects.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -6.0032 1.3769 19.008 <.0001
STATUS_Code 1 4.0163 1.5229 6.9552 0.0084
AGE_DNA 1 0.0416 0.0197 4.453 0.0348
STATUS_Code*AGE_DNA 1 -0.0527 0.0221 5.7159 0.0168
6.4.3. Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only
To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in females only, two
logistic regression models were fit to the data with mosaic status as the response variable and (1) cancer status;
(2) cancer status + age as predictors, to test the partial independence of mosaic status and cancer status,
controlling for age. In each model, cancer status = 1 for bladder cancer and 0 for control, and age is a continuous
variable.
There is no significant evidence of an association between mosaic events and bladder cancer in female individuals:
OR = 0.94 (95% CI = 0.40–2.22; P = 0.887) for model (1) and OR = 0.93 (95% CI = 0.38–2.23; P = 0.865) for model (2)
(Table 11.1). There is no significant evidence of an age effect in model (2) (Table 11.2).
To test for equality of the odds ratios between mosaic events and cancer status across ages in female
individuals, we added an interaction term, using cancer status + age + cancer status*age as predictors. There is no
significant evidence that the odds ratio between mosaic events and bladder cancer differs across ages (P = 0.27)
(Table 11.3).
Table 11.1 Summary Logistic Models On Female Subjects.
Setting: presence of mosaic event = 1; case vs. control
                                            STATUS                             AGE          STATUS*AGE   HL GOF
Model                                       OR     95% Wald CI   Pr > ChiSq    Pr > ChiSq   Pr > ChiSq   Pr > ChiSq
mosaic_event=status+error                   0.94   (0.40-2.22)   0.887
mosaic_event=status+age+error               0.93   (0.38-2.23)   0.865         0.839                     0.318
mosaic_event=status+age+status*age+error                         0.264         0.309        0.27         0.064
Table 11.2 SAS Output From Logistic Regression Main Effect Model on Female Subjects.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -2.6583 1.9801 1.8023 0.1794
STATUS_Code 1 -0.0766 0.449 0.0291 0.8645
AGE_DNA 1 -0.0056 0.0277 0.0412 0.8391
Table 11.3 SAS Output From Logistic Regression Model with Interaction Term on Female Subjects.
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 0.319 3.2881 0.0094 0.9227
STATUS_Code 1 -4.8062 4.3008 1.2489 0.2638
AGE_DNA 1 -0.0482 0.0473 1.0368 0.3086
STATUS_Code*AGE_DNA 1 0.0683 0.062 1.2167 0.27
7. Conclusion and Further Studies
In this project, DNA from 3,239 individuals (1,673 bladder cancer cases and 1,566 cancer-free controls) was
examined for evidence of mosaicism of the autosomes using Illumina’s genome-wide SNP array data
generated for a bladder cancer genome-wide association analysis. We observed 193 mosaic duplication, mosaic
deletion, and mosaic copy-neutral loss of heterozygosity (CNLOH) events larger than 0.5 Mb in the autosomes of 163
study subjects (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal
abnormalities were statistically significantly positively associated with bladder cancer in males (OR = 1.55; P =
0.014) but not in females. The frequency of mosaicism increased with age in male control subjects, from 3.39% in
individuals under age 60 to 7.51% in those between 76 and 89 years old (P = 0.035). Mosaic autosomal abnormalities
were more common in bladder cancer individuals (5.86%) than in cancer-free persons (4.12%). Mosaic
events were more frequent in males than in females among bladder cancer individuals (6.13% vs. 4.22%) but similar
among cancer-free persons (4.08% vs. 4.48%). The most frequent class of
autosomal abnormality detected was mosaic duplication, representing 55.96% of mosaic events. The autosome most
frequently carrying a mosaic event was chromosome 17 for bladder cancer individuals (6.74%) and chromosomes 2
and 6 for control individuals (4.15% each). Combining cases and controls, the chromosomes most frequently carrying
mosaic events were chromosomes 2, 10, and 17 (8.29% each).
Mosaicism in older cancer-free male individuals suggests that age-related genomic instability could be due to
increased rates of somatic mutation or diminished capacity for genomic maintenance, such as with telomere
attrition, leading to proliferation of somatically altered cell populations [20].
This project could be extended in several directions if additional information becomes available. For all
subjects with mosaic events, it would be very interesting to assess the characteristics and behavior of the mosaic
events over time and to determine whether an individual's number of mosaic events or proportion of mosaicism
changes over time. In this project, we calculated the proportion of mosaicism for each mosaic event but did not go
further. If additional data points were collected at specific times relative to diagnosis in cases, we could
investigate the hypothesis of a positive association between the proportion of mosaicism and the severity of the
cancer (early to later stage).
Very recently, we happened to identify one subject who participated in two studies (Non-Hodgkin’s Lymphoma (NHL)
and ovarian cancer). We had her genotype data from two DNA samples, drawn at age 59 (NHL) and at age 63
(NHL + ovarian cancer). We performed mosaic chromosomal abnormality analysis on both samples and found mosaic
events on multiple chromosomes. At age 59, mosaic events were observed on chromosomes 3, 8, 10, 13, 20,
and X (Figure 13 left). At age 63, mosaic events were observed on chromosomes 3, 4, 8, 9, 13, 20, and X (Figure
13 right). All of the mosaic events seen at age 59 were also observed at age 63, with an increased proportion of
mosaicism, except the event on chromosome 10, which was not observed at age 63. The disappearance of this event
may be a result of cancer treatment. Two new events were detected on chromosomes 4 and 9 at age 63,
which may be due to later-stage NHL or to the ovarian cancer.
Figure 13 Mosaic events observed on multiple chromosomes for a female subject with Non-Hodgkin’s Lymphoma
and ovarian cancer. The left panels are for the DNA sample drawn at age 59 (NHL). The right panels are for the DNA
sample drawn at age 63 (NHL + ovarian cancer).
It may also be possible to determine the developmental origin (germline or somatic) of observed mosaic events if
blood, tumor tissue, and normal tissue data are available. A germline mutation is one that can be passed on to
offspring. Examples of germline mutations linked to cancer are those that occur in cancer susceptibility genes,
increasing a person's risk for the disease. Somatic mutations are not passed on to the next generation. By
distinguishing the origin of a mutation, we may be able to discover cancer susceptibility genes such as the well-known
BRCA1 and BRCA2 genes for breast cancer. We recently ran the mosaic analysis on several TCGA (The Cancer Genome
Atlas, http://cancergenome.nih.gov/) SNP data sets. Figure 14 shows a possible germline deletion at chr7: 54500000 –
55200000 for subject 1087, who had glioblastoma multiforme (GBM), with blood and tumor samples taken at
the same time. The same mutation was present in both the blood and tumor samples, which means it is not a tumor-specific
mutation. We also observed some mutations that were present in the blood sample but not in the tumor sample, which implies
that the mutation in blood is a somatic mutation rather than a germline mutation (Figure 15).
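The reasoning above can be written as a small decision rule. The following Python sketch is only an illustration of that rule: the function name and the labels are hypothetical, and the tumor-only case is an assumption that the text does not discuss.

```python
def classify_origin(in_blood, in_tumor):
    """Label an event by the DNA sources in which it was detected."""
    if in_blood and in_tumor:
        # Same event in both sources: not tumor specific, possibly germline.
        return "possible germline"
    if in_blood:
        # Present in blood only: somatic in blood rather than germline.
        return "somatic in blood"
    if in_tumor:
        # Assumption: an event seen only in tumor is treated as tumor specific.
        return "tumor specific"
    return "not detected"


# Figure 14: the chromosome 7 deletion is seen in both blood and tumor.
print(classify_origin(in_blood=True, in_tumor=True))
# Figure 15: the chromosome 3 CNLOH is seen in blood but not in tumor.
print(classify_origin(in_blood=True, in_tumor=False))
```

Deciding whether two detected events are "the same" would in practice require comparing chromosome, position, and event type, which this sketch leaves out.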
Figure 14 Deletion on chromosome 7 (54500000 – 55200000) detected in the blood sample (top) and the primary solid
tumor sample (bottom) for one GBM subject from TCGA data. The deletion in the blood sample is very likely a
germline mutation because the same event was present in both DNA sources.
Figure 15 Mosaic CNLOH across the whole of chromosome 3 detected in the blood sample (top) for ovarian cancer subject 1877
from TCGA data. The bottom panel shows the mutation detected in the primary solid tumor sample of the same subject. The
mutation in the blood sample is a somatic mutation because the mutation types differ markedly between the two DNA sources.
8. References
1. Youssoufian, H. & Pyeritz, R.E. Mechanisms and consequences of somatic mosaicism in humans. Nat. Rev. Genet. 3, 748–758 (2002).
2. Notini, A.J., Craig, J.M. & White, S.J. Copy number variation and mosaicism. Cytogenet. Genome Res. 123, 270–277 (2008).
3. Menten, B. et al. Emerging patterns of cryptic chromosomal imbalance in patients with idiopathic mental retardation and multiple congenital anomalies: a new series of 140 patients and review of published reports. J. Med. Genet. 43, 625–633 (2006).
4. Lu, X.Y. et al. Genomic imbalances in neonates with birth defects: high detection rates by using chromosomal microarray analysis. Pediatrics 122, 1310–1318 (2008).
5. Conlin, L.K. et al. Mechanisms of mosaicism, chimerism and uniparental disomy identified by single nucleotide polymorphism array analysis. Hum. Mol. Genet. 19, 1263–1275 (2010).
6. Vorsanova, S. et al. Evidence for high frequency of chromosomal mosaicism in spontaneous abortions revealed by interphase FISH analysis. J. Histochem. Cytochem. 53, 375–380 (2005).
7. Solomon, D.A. et al. Mutational inactivation of STAG2 causes aneuploidy in human cancer. Science 333, 1039–1043 (2011).
8. Jacobs, K.B., Yeager, M., Zhou, W. et al. Detectable clonal mosaicism and its relationship to aging and cancer. Nat. Genet. 44, 651–658 (2012).
9. Yu, K. et al. Population substructure and control selection in genome-wide association studies. PLoS ONE (2008).
10. http://dnatech.genomecenter.ucdavis.edu/documents/illumina_gt_normalization.pdf
11. Staaf, J. et al. Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics 9, 409 (2008).
12. Diskin, S.J. et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36, e126 (2008).
13. Peiffer, D.A. et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res. 16, 1136–1148 (2006).
14. Pique-Regi, R., Caceres, A. & Gonzalez, J.R. R-Gada: a fast and flexible pipeline for copy number analysis in association studies. BMC Bioinformatics 11, 380 (2010).
15. Staaf, J. et al. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol. 9, R136 (2008).
16. Olshen, A.B. et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).
17. Laurie, C.C. et al. Detectable clonal mosaicism from birth to old age and its relationship to cancer. Nat. Genet. 44, 642–650 (2012).
18. Rodríguez-Santiago, B. et al. Mosaic uniparental disomies and aneuploidies as large structural variants of the human genome. Am. J. Hum. Genet. 87, 129–138 (2010).
19. http://www.cancer.org/cancer/bladdercancer/detailedguide/bladder-cancer-key-statistics
20. Sahin, E. & DePinho, R.A. Linking functional decline of telomeres, mitochondria and stem cells during ageing. Nature 464, 520–528 (2010).
9. Appendix A: SAS Code
/* The analysis was done on 7 data sets:
(1) All subjects.
(2) Male subjects.
(3) Female subjects.
(4) Bladder cancer subjects.
(5) Control subjects.
(6) Male bladder cancer subject.
(7) Male control subjects. */
/* Alternative library location (disabled):
libname Bladder
'T:\DCEG\Home\zhouw\CNV_Analysis\Bladder\GADA+BAF\association'; */
libname Bladder 'C:\Users\wyzhou22\Desktop\TAMU\STAT685\association';
proc format; value agefmt
1='<=60' 2='61--65' 3='66--70' 4='71--75' 5='>=76';
run;
data Bladder.gada_baf_master_all;
set Bladder.gada_baf_master_1_3_1;
if AGE_DNA=. then AGE_GROUP=.;
else if AGE_DNA <= 60 then AGE_GROUP=1;
else if AGE_DNA <= 65 then AGE_GROUP=2;
else if AGE_DNA <= 70 then AGE_GROUP=3;
else if AGE_DNA <= 75 then AGE_GROUP=4;
else if AGE_DNA >= 76 then AGE_GROUP=5;
format AGE_GROUP agefmt.;
run;
proc print data=Bladder.gada_baf_master_all;
run;
/* (1) All Subjects */
/* SEX_Code: P-value=0.450 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=SEX_Code /lackfit aggregate scale=none;
title 'Gender Main Effect Model on All Subjects';
run;
/* AGE_DNA P-value=0.745 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=AGE_DNA /lackfit aggregate scale=none;
title 'Age Main Effect Model on All Subjects';
run;
/* AGE_DNA to 5 groups: overall P-value=0.922 */
proc logistic desc data=Bladder.gada_baf_master_all;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=AGE_GROUP/lackfit aggregate scale=none;
title 'Age_group Main Effect Model on All Subjects';
run;
/* DNA_source: P-value=0.578 */
proc logistic desc data=Bladder.gada_baf_master_all;
class DNA_source;
model new_MOSAIC_EVENT_Y=DNA_source /lackfit aggregate scale=none;
title 'DNA_source Main Effect Model on All Subjects';
run;
/* STUDY_ID: P-value=0.377 */
proc logistic desc data=Bladder.gada_baf_master_all;
class STUDY_ID;
model new_MOSAIC_EVENT_Y=STUDY_ID /lackfit aggregate scale=none;
title 'Study Cohort Main Effect Model on All Subjects';
run;
/* STATUS_Code: P-value=0.027 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code /lackfit aggregate scale=none;
title 'Cancer Status Main Effect Model on All Subjects';
run;
/* STATUS_Code: P-value=0.029; SEX_code: P-value=0.503 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code SEX_Code /lackfit aggregate scale=none;
title 'Cancer Status and Gender Main Effects Model on All Subjects';
run;
/* STATUS_Code: P-value=0.025; AGE_DNA: P-value=0.931 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA /lackfit aggregate scale=none;
title 'Cancer Status and Age Main Effects Model on All Subjects';
run;
/* STATUS_Code: P-value=0.026; SEX_code: P-value=0.498; AGE_DNA: P-value=0.980
*/
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA SEX_Code /lackfit aggregate
scale=none;
title 'Cancer Status and Gender and Age Main Effects Model on All Subjects';
run;
/* STATUS_Code: P-value=0.036; AGE_DNA: P-value=0.11; STATUS_Code*AGE_DNA:
P-value=0.062 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA STATUS_Code*AGE_DNA/lackfit
aggregate scale=none;
title 'Cancer Status and Age and 2-WAY Interaction Model on All Subjects';
run;
/* STATUS_Code*AGE_DNA: P-value = 0.054, best fitted model */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA SEX_Code
STATUS_Code*AGE_DNA/lackfit aggregate scale=none;
title 'Cancer Status and Age and Gender and 2-WAY Interaction Model on All Subjects';
run;
/* AGE_DNA to 5 groups: STATUS_Code: P-value=0.025; AGE_GROUP: P-value=0.716;
SEX_Code: P-value=0.456 */
proc logistic desc data=Bladder.gada_baf_master_all;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=STATUS_Code SEX_code AGE_GROUP/lackfit aggregate
scale=none;
title 'Cancer Status and Gender and Age_group Main Effects Model on All Subjects';
run;
/* AGE_DNA to 5 groups: STATUS_Code: P-value=0.0658; AGE_GROUP: P-value=0.434;
SEX_Code: P-value=0.392; STATUS_Code* AGE_GROUP: P-value=0.596 */
proc logistic desc data=Bladder.gada_baf_master_all;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=STATUS_Code SEX_Code AGE_GROUP
STATUS_Code*AGE_GROUP/lackfit aggregate scale=none;
title 'STATUS and Gender and AGE_GROUP and 2-WAY Interaction Model on All Subjects';
run;
/* (2) Male Only*/
data Bladder.gada_baf_master_male;
set Bladder.gada_baf_master_all;
if GENDER="MALE";
run;
proc print data=Bladder.gada_baf_master_male;
run;
/* STATUS_Code: P-value=0.017 */
proc logistic des data=Bladder.gada_baf_master_male;
model new_MOSAIC_EVENT_Y=STATUS_Code/lackfit aggregate scale=none;
title 'Cancer Status Main Effect Model on Male Subjects';
run;
/* STATUS_Code: P-value=0.014; AGE_DNA: P-value=0.998 */
proc logistic des data=Bladder.gada_baf_master_male;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA/lackfit aggregate scale=none;
title 'Cancer Status and Age Main Effects Model on Male Subjects';
run;
/* STATUS_Code: P-value=0.084; AGE_DNA: P-value=0.035; STATUS_Code*AGE_DNA:
P-value=0.0168 */
proc logistic des data=Bladder.gada_baf_master_male;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA STATUS_Code*AGE_DNA/lackfit
aggregate scale=none;
title 'Cancer Status and Age and 2-WAY Interaction Model on Male Subjects';
run;
/* STATUS_Code: P-value=0.052; AGE_GROUP: P-value=0.179; STATUS_Code*AGE_GROUP:
P-value=0.495 */
proc logistic des data=Bladder.gada_baf_master_male;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_GROUP STATUS_Code*AGE_GROUP/lackfit
aggregate scale=none;
title 'Cancer Status and Age_group and 2-WAY Interaction Model on Male Subjects';
run;
/* (3) Female Only*/
data Bladder.gada_baf_master_female;
set Bladder.gada_baf_master_all;
if GENDER="FEMALE";
run;
proc print data=Bladder.gada_baf_master_female;
run;
/* STATUS_Code: P-value=0.887 */
proc logistic des data=Bladder.gada_baf_master_female;
model new_MOSAIC_EVENT_Y=STATUS_Code/lackfit aggregate scale=none;
title 'Cancer Status Main Effect Model on Female Subjects';
run;
/* STATUS_Code: P-value=0.865; AGE_DNA: P-value=0.839 */
proc logistic des data=Bladder.gada_baf_master_female;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA/lackfit aggregate scale=none;
title 'Cancer Status and Age Main Effects Model on Female Subjects';
run;
/* STATUS_Code: P-value=0.264; AGE_DNA: P-value=0.309; STATUS_Code*AGE_DNA:
P-value=0.27 */
proc logistic des data=Bladder.gada_baf_master_female;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA STATUS_Code*AGE_DNA/lackfit
aggregate scale=none;
title 'Cancer Status and Age and 2-WAY Interaction Model on Female Subjects';
run;
/* (4) Bladder Cancer only */
data Bladder.gada_baf_master_case;
set Bladder.gada_baf_master_all;
if STATUS="CASE";
run;
proc print data=Bladder.gada_baf_master_case;
run;
/* AGE_DNA: P-value=0.32 */
proc logistic des data=Bladder.gada_baf_master_case;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Cancer Subjects';
run;
/* AGE_DNA: P-value=0.351; SEX_Code: P-value=0.260 */
proc logistic des data=Bladder.gada_baf_master_case;
model new_MOSAIC_EVENT_Y=AGE_DNA SEX_Code/lackfit aggregate scale=none;
title 'Age and Gender Main Effects Model on Cancer Subjects';
run;
/* (5) Control only */
data Bladder.gada_baf_master_control;
set Bladder.gada_baf_master_all;
if STATUS="CONTROL";
run;
proc print data=Bladder.gada_baf_master_control;
run;
/* AGE_DNA: P-value=0.11 */
proc logistic des data=Bladder.gada_baf_master_control;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Control Subjects';
run;
/* AGE_DNA: P-value=0.119; SEX_Code: P-value=0.986 (table 7 control) */
proc logistic des data=Bladder.gada_baf_master_control;
model new_MOSAIC_EVENT_Y=AGE_DNA SEX_Code /lackfit aggregate scale=none;
title 'Age and Gender Main Effects Model on Control Subjects';
run;
/* AGE_DNA: P-value=0.309; SEX_Code: P-value=0.0761; AGE_DNA*SEX_Code:
P-value=0.0799 */
proc logistic des data=Bladder.gada_baf_master_control;
model new_MOSAIC_EVENT_Y=AGE_DNA SEX_Code AGE_DNA*SEX_Code/lackfit aggregate
scale=none;
title 'Age and Gender and 2-Way Interaction Model on Control Subjects';
run;
/* (6) Male Bladder Cancer Only */
data Bladder.gada_baf_master_male_case;
set Bladder.gada_baf_master_1_3_1;
if GENDER="MALE" and STATUS="CASE";
run;
proc print data=Bladder.gada_baf_master_male_case;
run;
/* AGE_DNA: P-value=0.26 */
proc logistic des data=Bladder.gada_baf_master_male_case;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Male Cancer Subjects';
run;
/* (7) Male Control Only */
data Bladder.gada_baf_master_male_control;
set Bladder.gada_baf_master_all;
if STATUS="CONTROL" and GENDER="MALE";
run;
proc print data=Bladder.gada_baf_master_male_control;
run;
/* AGE_DNA: P-value=0.0348 */
proc logistic des data=Bladder.gada_baf_master_male_control;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Male Control Subjects';
run;
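For readers who do not use SAS, the AGE_GROUP derivation at the start of this appendix can be mirrored in a few lines of Python. This is a sketch only: the function and dictionary names are hypothetical, and it reproduces the SAS conditions literally, so a non-integer age between 75 and 76 falls into no group, exactly as in the data step.

```python
# Labels taken from the agefmt format in the SAS code.
AGE_LABELS = {1: "<=60", 2: "61--65", 3: "66--70", 4: "71--75", 5: ">=76"}


def age_group(age_dna):
    """Return the age group code (1 to 5), or None when no group applies."""
    if age_dna is None:
        return None  # missing age stays missing
    if age_dna <= 60:
        return 1
    if age_dna <= 65:
        return 2
    if age_dna <= 70:
        return 3
    if age_dna <= 75:
        return 4
    if age_dna >= 76:
        return 5
    return None  # ages strictly between 75 and 76 match no condition


for age in (59, 63, 76, None):
    group = age_group(age)
    print(age, group, AGE_LABELS.get(group))
```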
Appendix B: Normalization Pipeline
## (1) Create GC model for Human610-Quadv1_B.bpm
qrun -V -l walltime=24:00:00 glu cnv.make_gcmodel -r $GENOME/hg19.fa \
    $MANIFEST/Human610-Quadv1_B.bpm \
    -o $GCMODELS/HumanOmni25M-8v1-1_B.gcm
## (2) Create GDAT file from Illumina’s genome file report file (gfr)
qrun -V -l walltime=24:0:0 glu -v convert.from_gfr $MANIFEST/Human610-Quadv1_B.bpm \
    $GFR/ATBC_Hap610_vi_working_FinalReport.txt.bz2 \
    -o $GDAT/ATBC_OmniEx_vi_working.gdat &
qrun -V -l walltime=24:0:0 glu -v convert.from_gfr $MANIFEST/Human610-Quadv1_B.bpm \
    $GFR/PLCO_Hap610_vi_working_FinalReport.txt.bz2 \
    -o $GDAT/PLCO_OmniEx_vi_working.gdat &
qrun -V -l walltime=24:0:0 glu -v convert.from_gfr $MANIFEST/Human610-Quadv1_B.bpm \
    $GFR/CPS_Hap610_vi_working_FinalReport.txt.bz2 \
    -o $GDAT/CPS_OmniEx_vi_working.gdat &
## (3) Normalization and re-estimate LRR and BAF
glu cnv.normalize --gcmodeldir=/CGF/Resources/Data/Infinium/GCmodels \
    $GDAT/ATBC_OmniEx_vi_working.gdat &
glu cnv.normalize --gcmodeldir=/CGF/Resources/Data/Infinium/GCmodels \
    $GDAT/PLCO_OmniEx_vi_working.gdat &
glu cnv.normalize --gcmodeldir=/CGF/Resources/Data/Infinium/GCmodels \
    $GDAT/CPS_OmniEx_vi_working.gdat &
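For orientation, the log R ratio (LRR) re-estimated in step (3) is commonly defined for Infinium arrays as the base-2 logarithm of the observed total intensity over the intensity expected for a normal two-copy state. The Python sketch below only illustrates that definition. It is not the glu cnv.normalize implementation, and the function and argument names are hypothetical.

```python
import math


def log_r_ratio(r_observed, r_expected):
    """LRR = log2(observed total intensity / expected total intensity)."""
    if r_observed <= 0 or r_expected <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(r_observed / r_expected)


# An observed intensity equal to the expected one gives LRR = 0;
# half the expected intensity gives LRR = -1.
print(log_r_ratio(1.0, 1.0))
print(log_r_ratio(0.5, 1.0))
```

Measured LRR shifts for real gains and losses are usually smaller in magnitude than these idealized values, and smaller still for mosaic events that affect only a fraction of cells.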
Appendix C: GADA Segmentation Pipeline
## (1) run.sh
export GLU_PYTHON_VERSION="2.7"
export PYTHONPATH="/home/zhouw/src"
BASE=/home/zhouw/CNV_Analysis/Bladder/GADA
INCLUDE="--includes=$BASE/include_list/include.lst"
EXCLUDE="--excludes=$DEFS/redacted_samples.lst"
BATCHES=$BASE/work
mkdir -p $BATCHES 2> /dev/null
rm -Rf $BATCHES
# Generate sample lists used for each batch with size = 20 samples per batch
python2.7 setup_batches.py $GDAT/index.db gdat.lst --outdir=$BATCHES --size=20 $EXCLUDE $INCLUDE
# Submit batches
for i in $BATCHES/*/*; do
  if [ \! -f $i/segments.txt ]; then
    echo $i
    qrun --nowait -l walltime=72:0:0 sh run_batch_GADA.sh $i
    #sleep 3
  fi
done
## (2) run.sh calls run_batch_GADA.sh to run GADA for each batch
#!/bin/bash
BPATH=$1
BATCH=`dirname $BPATH`
GDAT=$GDAT/`basename $BATCH`.gdat
BATCH=`basename $BPATH`
BASE=`pwd`
cd $BPATH
mkdir rawData SBL 2> /dev/null
# Generate files that can be used for GADA-R
glu convert.to_gada $GDAT --includesamples=sample.lst -o rawData/.txt
# Call GADA-R package
R --vanilla < $BASE/GADAKBJ.R
# Clean the data folders
rm -Rf rawData SBL
## (3) run_batch_GADA.sh calls GADAKBJ.R to run GADA segmentation
library(gadaKBJ)
# Select the working directory (it must contain a folder called rawData with the samples)
# in our case ...
setwd(".")
# Import data
data <- setupParGADA.B.deviation(NumCols=6, GenoCol=6, BAFcol=5, log2ratioCol=4, dev.type="all")
# Segmentation procedure and calling
parSBL(data, estim.sigma2=TRUE, aAlpha=0.85)
parBE.B.deviation(data, T=10, MinSegLen=100)
# Writes segments to file
exportSegments2File(data, file="segments.txt")
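The batch set-up in run.sh relies on setup_batches.py, which is not listed in this appendix. The Python sketch below shows one plausible way to split a sample list into batches of 20 after applying include and exclude lists. It is an assumption about the general approach, not the actual script, and all names in it are hypothetical.

```python
def make_batches(samples, size=20, includes=None, excludes=None):
    """Filter a sample list, then split it into consecutive batches."""
    excluded = set(excludes or [])
    included = set(includes) if includes is not None else None
    kept = [
        s for s in samples
        if s not in excluded and (included is None or s in included)
    ]
    # Consecutive slices of at most `size` samples; the last may be shorter.
    return [kept[i:i + size] for i in range(0, len(kept), size)]


# Hypothetical sample identifiers, for illustration only.
samples = ["S%03d" % i for i in range(45)]
batches = make_batches(samples, size=20, excludes=["S000"])
print([len(b) for b in batches])
```

Keeping batches small means one failed job only requires re-running 20 samples, which fits the per-batch check for segments.txt in run.sh.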
Appendix D: BAF Segmentation Pipeline
## (1) config.sh
# GLU source tree
source /home/zhouw/src
# Tool location
export BSEG=/home/zhouw/BAFsegmentation-1.2.0
# Batch size
export SIZE=20
# GDAT files
export GDATS="$GDAT/ATBC_Hap610_v1_working.gdat $GDAT/PLCO_Hap610_v1_working.gdat $GDAT/CPS_Hap610_v1_working.gdat"
export INCLUDE=/home/zhouw/CNV_Analysis/Bladder/BAF/include_list/include.lst
export EXCLUDE=/home/zhouw/CNV_Analysis/Bladder/BAF/redacted_samples.lst
## (2) run.sh calls run_batch_BAF.sh to run BAF Segmentation for each batch
source config.sh
# Generate the batch sample list files
# Combine multiple excludes into one master list file
SAMLIST=/home/zhouw/CNV_Analysis/Bladder/BAF/sample_list
for GDAT in $GDATS; do
  DIR=`basename $GDAT | cut -f1 -d.`
  echo $DIR
  WD=/home/zhouw/CNV_Analysis/Bladder/BAF/$DIR
  SAMPLE=$WD/SAMPLE
  mkdir -p $WD/SAMPLE 2>/dev/null
  # Generate sample lists used for each batch with size = 20 samples per batch
  python2.7 split.py --directory $WD/SAMPLE --size $SIZE --excludes $EXCLUDE --includes \
    $INCLUDE $SAMLIST/$DIR.lst
  # Submit batches
  for i in $WD/SAMPLE/*; do
    BATCH=`basename $i`
    mkdir $WD/$BATCH 2>/dev/null
    ./run_batch_BAF.sh $WD $BATCH $GDAT &
  done
done
## (3) run_batch_BAF.sh calls run_BAF.sh to run BAF Segmentation
source config.sh
export INPUT=$DIR/$BATCH/extracted
export PLOTS=$DIR/$BATCH/plots
export OUTPUT=$DIR/$BATCH/segmented
mkdir $INPUT $PLOTS $OUTPUT 2>/dev/null
if [ \! -s $OUTPUT/*.txt ]; then
  # Generate the per sample input file
  $RGLU_S convert.to_bseg $GDAT --includesamples $SAMPLE/$BATCH \
    -o $INPUT/.txt > $WD/convert.log 2>&1 &
  wait
  # Generate sample_names.txt table based on files in extracted folder
  echo "Assay Filename IGV_index" | tr ' ' "\t" > $INPUT/sample_names.txt
  index=1
  (for j in $INPUT/*.txt; do
    SAM=`basename $j | cut -f1 -d.`
    if [ "$SAM" != "sample_names" ]; then
      echo "$SAM $SAM.txt $index" | tr ' ' '\t' >> $INPUT/sample_names.txt
      index=`expr $index + 1`
    fi
  done) &
  wait
  # Submit batch
  qrun -l walltime=24:0:0 ./run_BAF.sh $INPUT $OUTPUT $PLOTS &
fi
# Take a quick nap
sleep 1
wait
## clean the data folders
rm -Rf $INPUT/* &
## (4) run_BAF.sh to run BAF Segmentation
source ./config.sh
INPUT=$1
OUTPUT=$2
PLOTS=$3
cd $BSEG
perl BAF_segment_samples.pl \
  --input_directory=$INPUT \
  --output_directory=$OUTPUT \
  --plot_directory=$PLOTS \
  --do_plot=FALSE --ai_size=45
cd -
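The shell loop in run_batch_BAF.sh that builds sample_names.txt can be easier to follow as a function. The Python sketch below reproduces the same table layout (tab-separated Assay, Filename, IGV_index) under the assumption that sorted file names match the shell glob order. The function name is hypothetical and it is not part of the pipeline.

```python
import os


def build_sample_names(input_dir):
    """Return the tab-separated sample_names.txt content for a directory."""
    rows = [("Assay", "Filename", "IGV_index")]
    index = 1
    for filename in sorted(os.listdir(input_dir)):
        if not filename.endswith(".txt"):
            continue
        # Same rule as `cut -f1 -d.`: keep the text before the first dot.
        sample = filename.split(".")[0]
        if sample == "sample_names":
            continue  # skip the table itself
        rows.append((sample, sample + ".txt", str(index)))
        index += 1
    return "\n".join("\t".join(row) for row in rows) + "\n"
```

Writing the returned string to sample_names.txt in the extracted folder would give the same file that the shell loop produces, with one row per per-sample input file.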