Segmentation-Based Detection of Mosaic Chromosomal

Abnormality in Bladder Cancer Cells Using Whole Genome SNP Arrays

Weiyin Zhou

Cancer Genomics Research Laboratory (CGR)

Leidos Biomedical Research, Inc

Texas A&M Masters of Science Degree Candidate – Statistics

Texas A&M Department of Statistics College Station, TX

February 2014

Acknowledgements

I would like to express my deepest gratitude to my advisor, Dr. Alan Dabney, for his excellent guidance,

understanding, patience, and encouragement. I would also like to thank my other committee members: Dr. William

B. Smith and Dr. Bruce Lowe for taking time from their busy schedule to review my work; and to thank Ms. Penny

Jackson and Kim Ritchie for their procedural and technical support. My gratitude extends to my current employer,

Leidos Biomedical Research, Inc., which paid the majority of my tuition under its education assistance program.

I would like to thank my husband, Dr. Lisheng Cai, my son, Robert, and my daughter, Kimberly, for their love and support; they cheered me up throughout the entire process. Finally, my parents laid the foundation for all of this effort through years of caring and teaching.

Abstract

The purpose of this project is to investigate the relationship between mosaic chromosomal abnormalities and bladder cancer. DNA from 3,239 individuals, comprising 1,673 bladder cancer cases and 1,566 cancer-free controls, was examined for evidence of mosaicism of the autosomes using genome-wide SNP array data generated for a bladder cancer genome-wide association analysis. DNA was extracted from blood or buccal (mouth) samples and genotyped on the Illumina Infinium HumanHap 610 Quad SNP array. Two segmentation-based methods were used to detect three types of mosaic events: mosaic duplication (gain), mosaic deletion (loss), and mosaic copy-neutral loss of heterozygosity (CNLOH). A total of 193 mosaic events larger than 0.5 Mb were observed in the autosomes of 163 individuals (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal abnormalities were more common in individuals with bladder cancer (5.86%) than in cancer-free individuals (4.12%). Mosaic chromosomal abnormalities were significantly and positively associated with bladder cancer in males (OR = 1.55; P = 0.014). In cancer-free males, the frequency of mosaic chromosomal abnormalities increased with age, from 3.39% under 60 years to 7.51% between 76 and 89 years (P = 0.035).

Contents

1. Introduction
2. Data Description
3. Illumina SNP Genotyping and Normalization
4. Mosaic Chromosomal Abnormalities Detection Methods
    4.1. Log R Ratio and B Allele Frequency
    4.2. Re-estimate Log R Ratio and B Allele Frequency
    4.3. Segmentation Methods
    4.4. Mosaic Events Calling Methods
    4.5. Proportion of Mosaicism
    4.6. Examples of Mosaic Events
5. Data Analysis
6. Results and Discussion
    6.1. Characteristics of Mosaic Events
    6.2. Mosaic Chromosomal Abnormalities and Age at DNA
    6.3. Mosaic Chromosomal Abnormalities and Gender
    6.4. Mosaic Chromosomal Abnormalities and Cancer Risk
        6.4.1. Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects
        6.4.2. Mosaic Chromosomal Abnormalities and Cancer Risk for Male Only
        6.4.3. Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only
7. Conclusions and Further Studies
8. References
9. Appendix A: SAS Code
Appendix B: Normalization Pipeline
Appendix C: GADA Segmentation Pipeline
Appendix D: BAF Segmentation Pipeline

1. Introduction

Genetic mosaicism is defined as the coexistence of cells with different genetic compositions within an individual, despite that individual being the product of a single fertilization. It is caused by postzygotic events during development and can occur in both somatic cells (affecting only non-reproductive cells) and germline cells (with the potential of being passed on to offspring). Mosaicism can be caused by DNA mutations, epigenetic alterations of DNA, chromosomal

abnormalities, and the spontaneous reversion of inherited mutations [1,2]. Somatic mosaicism has been established

as a cause of mental retardation, birth defects, spontaneous abortion, and cancer [3-7]. The unequal distribution of

DNA to daughter cells upon mitosis (chromosome instability) may lead to aneuploidy, the duplication or deletion of

chromosomes or segments of chromosomes, and reciprocal duplication and deletion events that appear as copy-

neutral loss of heterozygosity or acquired uniparental disomy. Mosaic chromosomal abnormalities have been

defined as the presence of both normal karyotypes as well as those with large structural genomic events resulting in

alteration of copy number or loss of heterozygosity in distinct and detectable subpopulations of cells [8].

The development of microarray technology has had a significant impact on the genetic analysis of human disease. Whole-genome single nucleotide polymorphism (SNP) genotyping arrays have become an important tool for discovering variants that contribute to human diseases and phenotypes. The two most common applications of this technology are genome-wide association studies (GWAS) and copy number variant (CNV) analysis. SNP arrays let researchers genotype samples at hundreds of thousands to millions of markers, providing dense genome-wide coverage for both association testing and copy number detection. Data from genome-wide association studies have mainly been used to test for association between single SNPs and disease status. They also provide an opportunity to detect chromosomal variation and to investigate the association of mosaicism with disease status.

In this project, the SNP microarray data generated on the Illumina Infinium HumanHap 610 Quad SNP array for the bladder cancer GWAS were used to uncover mosaic genomic copy number gains, losses, and copy-neutral loss of heterozygosity in the autosomes of 5% of subjects. Two segmentation-based algorithms were used to detect 193 mosaic events larger than 0.5 Mb. The types of chromosomal abnormality detected were characterized, and the relationship between chromosomal abnormalities and cancer risk, age at DNA collection, and gender was investigated. The frequency of mosaic chromosomal abnormalities was positively associated with bladder cancer in male subjects and increased with age in cancer-free individuals.

2. Data Sets Description

The data used for this project consist of 3,239 individuals in bladder cancer genome-wide association studies (GWASs) from three cohorts: the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), and the Cancer Prevention Study-II (CPS-II). The subjects in each cohort are summarized by case and control status in Table 1. There were 1,673 cases diagnosed with urothelial cell carcinoma of the bladder and 1,566 cancer-free controls. There were 2,734 males and 505 females; gender by case and control status is summarized in Table 2. The mean age at DNA collection was 67 years for all subjects (range 21-89 years, s.d. = 8.83). The mean age was 67 years for males (range 21-89, s.d. = 8.94) and 69 years for females (range 24-87, s.d. = 7.89). The mean age was 66 years for cases (range 21-87, s.d. = 10.2) and 68 years for controls (range 54-89, s.d. = 6.99). The study was approved by the institutional ethics committees of each participating hospital and by the institutional review board (IRB) of the National Cancer Institute (NCI, USA). Written informed consent was obtained from all individuals. DNA was extracted from peripheral blood (58.6%) and mouthwash samples (41.4%). Genomic DNA was screened and analyzed at the National Cancer Institute according to the sample handling process of the Cancer Genomics Research Laboratory (CGR), Division of Cancer Epidemiology and Genetics (DCEG), before being genotyped on the HumanHap 610 Quad BeadChip (Illumina, Inc.) using the Infinium assay.

Overall, 3.3% of samples were genotyped in duplicate to check reproducibility; the SNP genotype concordance rate between technical duplicates was greater than 99.98%. The completion rate for a sample, defined as the proportion of non-missing genotypes, was calculated by dividing the number of SNP probes with a called genotype by the total number of SNP probes on the array, using the GLU qc.summary module (http://code.google.com/p/glu-genetics/). The overall completion rate for the study samples was 97.87%. The distribution of completion rates by sample and by locus is shown in Figure 1. A brief summary of the sample and locus counts at the 100th, 99th, 95th, 90th, and 50th percentiles is provided as an inset in Figure 1.
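The completion rate calculation described above can be illustrated with a minimal sketch. It does not reproduce the GLU qc.summary module; the function name `completion_rates` and the use of `None` to mark a missing genotype are assumptions made for illustration only.

```python
def completion_rates(genotypes, missing=None):
    """Return (by_sample, by_locus) completion rates.

    genotypes: list of per-sample lists of genotype calls, one entry per SNP.
    missing:   value that marks a no-call (hypothetical convention).
    """
    n_samples = len(genotypes)
    n_loci = len(genotypes[0])
    # Per sample: called SNP probes / total SNP probes on the array
    by_sample = [sum(call != missing for call in row) / n_loci
                 for row in genotypes]
    # Per locus: samples with a call at that SNP / total samples
    by_locus = [sum(row[j] != missing for row in genotypes) / n_samples
                for j in range(n_loci)]
    return by_sample, by_locus
```

For two samples typed at four SNPs, `completion_rates([["AA", "AB", None, "BB"], ["AA", None, None, "BB"]])` gives per-sample rates of 0.75 and 0.5.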

Ancestry was estimated for the 3,239 study subjects using a set of population-informative SNPs [9] and data from HapMap build 27. These SNPs are common to the commercially available Affymetrix 500K, Illumina 317K, and Illumina 550K chips. Admixture coefficients were estimated for each subject using the GLU struct.admix module, with the HapMap CEU, YRI, and ASA (JPT+CHB) samples as the fixed reference populations. A total of 3,205 subjects were estimated to have European ancestry. The remaining 34 subjects were estimated to have less than 80% European ancestry, as shown in Figure 2 and summarized in Table 3.

Table 1 Summary of Study Cohorts by Cancer Status

Cancer Status     ATBC   PLCO   CPSII   Total
Bladder Cancer     416    563     694    1673
Control            722    114     730    1566
Total             1138    677    1424    3239

Table 2 Summary of Gender by Cancer Status

Cancer Status     Male   Female   Total
Bladder Cancer    1436      237    1673
Control           1298      268    1566
Total             2734      505    3239

Table 3 Summary of Population Structure by Cancer Status

                                 Imputed Ancestry
Study Cohort   Cancer Status     CEU    ADMIXED CEU   ASA   ASA,CEU   CEU,YRI   YRI   Total
ATBC           Bladder Cancer     416        -          -       -         -       -     416
ATBC           Control            722        -          -       -         -       -     722
PLCO           Bladder Cancer     540        1          7       5         7       3     563
PLCO           Control            110        1          3       -         -       -     114
CPS            Bladder Cancer     687        1          3       1         1       1     694
CPS            Control            730        -          -       -         -       -     730
Total                            3205        3         13       6         8       4    3239

Figure 1 Bladder completion rate by sample (left) and by locus (right).

Figure 2 Population structure

3. Illumina SNP Genotyping and Normalization

SNP genotyping is the measurement of genetic variation at single nucleotide polymorphisms (SNPs) between members of a species. SNPs are one of the most common types of genetic variation. A SNP is a single base pair variation at a specific site in DNA, usually with two alleles, which together make up the individual’s genotype at that site. Illumina DNA Analysis BeadChips using the Infinium assay give researchers genome-wide access for analyzing genetic variation. The Infinium assay is a two-color assay, and its data consist of two intensity values (X, Y) for each SNP, one for each of the two fluorescent dyes associated with the two alleles of the SNP. The allele measured by the X channel (Cy5 dye) is called the A allele, and the allele measured by the Y channel (Cy3 dye) is called the B allele. Each SNP is analyzed independently to identify genotypes. Illumina’s standard normalization algorithm is the first step in SNP genotyping data analysis. It is a self-normalization algorithm that draws on information contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. The normalized values are then used for standard genotype calling and for the analysis of loss of heterozygosity (LOH) and copy number (CN).

In a diploid genome without CNVs, the three possible genotype calls are AA, AB, and BB. The raw signal intensity values measured for the A and B alleles are subjected to Illumina’s standard five-step normalization procedure, which determines six parameters: offset_X, offset_Y, theta, shear, scale_X, and scale_Y. The normalization algorithm is designed to adjust for nominal intensity variations observed in the two color channels, background differences between the two channels, possible crosstalk between the dyes, and global intensity differences, and to scale the data [10]. Figure 3 depicts the five steps of the normalization process.

Step 1: Outlier removal (Figure 3-A). Outlier SNPs are excluded while the normalization parameters are estimated. They are not excluded from downstream analysis.

Step 2: Background estimation (offset_X, offset_Y) (Figure 3-B). Candidate homozygote A alleles are identified along the X-axis and candidate homozygote B alleles along the Y-axis. A straight line is fit to each of the two groups. The offset_X and offset_Y parameters are the intercepts of these two lines. The points are then corrected for translation.

Step 3: Rotation estimation (theta) (Figure 3-C). A set of control points is identified along the X-axis, and a straight line is fit to them. The theta parameter is the angle between this line and the X-axis and defines the amount of rotation in the data. The points are then corrected for rotation.

Step 4: Shear estimation (shear) (Figure 3-D). A set of control points is identified along the Y-axis, and a straight line is fit to them. The shear parameter is the angle between this line and the Y-axis. The points are then corrected for shear.

Step 5: Scaling estimation (scale_X, scale_Y) (Figure 3-E). A statistical method is used to determine the scale_X and scale_Y parameters.

Figure 3-F shows the final set of normalized data points. Points along the X-axis represent the AA genotype, points along the Y-axis represent the BB genotype, and points along the 45-degree line represent the AB genotype.

Illumina then uses these 6 estimated parameter values to convert raw coordinates (X raw and Y raw) to normalized

coordinates (X normalized and Y normalized) for each SNP, representing the experiment-wide normalized signal

intensity on the A and B alleles, respectively.
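How six such parameters could be applied to one raw intensity pair is sketched below. This is an illustration of a generic translate, rotate, shear, and scale sequence, not Illumina's implementation: the order of operations, the treatment of `shear` as a slope-like coefficient rather than an angle, and the function name `normalize_intensity` are all assumptions.

```python
import math

def normalize_intensity(x_raw, y_raw, offset_x, offset_y,
                        theta, shear, scale_x, scale_y):
    """Apply an assumed affine normalization to one (X, Y) intensity pair."""
    # Translation: remove the background offsets
    x = x_raw - offset_x
    y = y_raw - offset_y
    # Rotation by -theta: a point lying at angle theta above the X-axis
    # is brought back onto the X-axis
    xr = math.cos(theta) * x + math.sin(theta) * y
    yr = -math.sin(theta) * x + math.cos(theta) * y
    # Shear correction (shear used here as a coefficient; an assumption)
    xs = xr - shear * yr
    # Scaling of each channel
    return xs / scale_x, yr / scale_y
```

With zero rotation and shear, `normalize_intensity(110, 60, 10, 10, 0.0, 0.0, 100, 50)` reduces to offset removal followed by scaling, and both normalized values come out as 1.0.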

Figure 3 Five-step normalization procedure. Figure 3-F shows the final normalized data points for a particular SNP. Points along the X-axis represent the AA genotype, points along the Y-axis represent the BB genotype, and points along the approximately 45-degree line represent the AB genotype.

To visualize the data after normalization, the genotyping data are transformed from Cartesian coordinates (Figure 4

left) to a polar coordinate plot (Figure 4 right). Cartesian coordinates use the X axis to represent the intensity of A

allele and the Y axis to represent the intensity of B allele. The polar coordinates use the X axis to represent

normalized theta (the angle deviation from pure A signal, where 0 represents pure A signal and 1.0 represents pure

B signal), and Y axis to represent the distance of point to origin. The theta and R are calculated by equations:

Theta = (2/pi) * arctan(Ynorm/Xnorm)

R = Xnorm + Ynorm

Where Xnorm and Ynorm represent transformed normalized signals from alleles A and B for a particular locus (SNP).
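The two equations above translate directly into code. The sketch below uses `atan2` instead of `arctan(Ynorm/Xnorm)` so that a SNP with Xnorm = 0 maps to theta = 1 without a division by zero; for non-negative intensities the two forms agree. The function name `to_polar` is illustrative.

```python
import math

def to_polar(x_norm, y_norm):
    """Convert normalized (X, Y) intensities to (theta, R)."""
    theta = (2 / math.pi) * math.atan2(y_norm, x_norm)  # 0 = pure A, 1 = pure B
    r = x_norm + y_norm                                 # total signal intensity
    return theta, r
```

A pure A signal such as (1, 0) gives theta = 0, a pure B signal (0, 1) gives theta = 1, and equal signals (1, 1) give theta = 0.5 with R = 2.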

Figure 4 SNP Graphs: Cartesian Coordinates (left) & Polar Coordinates (Right). It displays all samples for the

currently selected SNP. Samples are colored according to their genotype. From right graph, for this particular SNP

(ID = rs17159012), 420 samples (red cluster) are called as AA alleles, 124 samples (purple cluster) called as AB alleles,

and 13 samples (blue cluster) called as BB alleles.

4. Mosaic Chromosomal Abnormalities Detection Methods

4.1. Log R Ratio and B Allele Frequency

The main goal of this project is to investigate the relationship between mosaic chromosomal abnormalities and bladder cancer risk by first identifying regions of the genome that are aberrant in copy number, specifically mosaic copy number variation on the autosomes of bladder cancer cases and cancer-free subjects. The detection of autosomal mosaic events was based on assessment of allelic imbalance and copy number changes. The chromosomal abnormalities were detected using two outputs of the Infinium high-density assay: the log R ratio (LRR) and the B allele frequency (BAF). The LRR and BAF metrics were originally developed for the Illumina platform. For Illumina SNP arrays, the LRR and BAF values can be calculated by and exported directly from Illumina’s GenomeStudio software.

The Log R ratio (LRR) value is the normalized measure of total signal intensity and provides data on relative copy

number. For each SNP, let the normalized signal intensities for the A and B alleles be denoted as Xnorm and Ynorm,

respectively. We can then calculate the R-value as Robserved = Xnorm + Ynorm as a normalized measure of total signal

intensity. Log R ratio is then calculated as LRR = log2(Robserved / Rexpected), where Rexpected is computed from linear

interpolation of the genotype clusters (Figure 5 Left). The three cluster positions are generated from a large set of

samples that passed completion rate cutoff. The LRR value for a SNP is a measure of the difference between the

signal intensity of the test sample and a pool of reference samples of the same SNP genotype. Since LRR is the

logged ratio of observed probe intensity to expected intensity, deviation from zero is evidence for copy number

change.
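The LRR calculation can be sketched as follows. The expected R is interpolated linearly in theta between the two nearest genotype cluster centers; clamping to the AA or BB cluster outside their range is an assumption, as are the function names and the representation of each cluster as a (theta, R) pair.

```python
import math

def expected_r(theta, clusters):
    """clusters: (theta, R) centers for AA, AB, BB, in that order."""
    (t_aa, r_aa), (t_ab, r_ab), (t_bb, r_bb) = clusters
    if theta <= t_aa:          # assumed: clamp below the AA cluster
        return r_aa
    if theta >= t_bb:          # assumed: clamp above the BB cluster
        return r_bb
    if theta < t_ab:
        t0, r0, t1, r1 = t_aa, r_aa, t_ab, r_ab
    else:
        t0, r0, t1, r1 = t_ab, r_ab, t_bb, r_bb
    # Linear interpolation between the two neighbouring cluster centers
    return r0 + (r1 - r0) * (theta - t0) / (t1 - t0)

def log_r_ratio(x_norm, y_norm, clusters):
    theta = (2 / math.pi) * math.atan2(y_norm, x_norm)
    r_observed = x_norm + y_norm
    return math.log2(r_observed / expected_r(theta, clusters))
```

A sample that sits exactly on the AB cluster center has LRR = 0; a sample with the same theta but half the total intensity has LRR = -1.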

The B allele frequency (BAF), derived from the ratio of allelic probe intensities, is the proportion of hybridized sample that carries the B allele as designated by the Infinium assay. The B allele frequency is also referred to as “copy angle” or “allelic composition”. It shows the relative presence of each of the two alternative nucleotides, A and B, at each SNP locus profiled. The BAF for a sample is the theta value for a SNP corrected for cluster position, where θ = (2/pi)*arctan(Ynorm/Xnorm). The BAF value is obtained by linearly interpolating θ between the cluster positions θAA, θAB, and θBB, which are the θ values of the three genotype clusters generated from a large set of samples that passed the completion rate cutoff (Figure 5 Right). In the right figure, D1 = (θ - θAB) and D2 = (θAB - θBB). In a normal sample, discrete BAFs of 0.0, 0.5, and 1.0 are expected at each locus, representing the AA, AB, and BB genotypes. Deviations from this expectation are indicative of aberrant copy number. For example, a BAF of 1/3 at a locus might indicate that 1 copy of the B allele and 2 copies of the A allele are present in the sample, because 1/(1+2) = 1/3.
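The correction of theta for cluster position can be sketched as a piecewise-linear interpolation. The form below is the one commonly used for Illumina arrays and is assumed here, since the equation itself is not reproduced in the text; the function name and argument order are illustrative.

```python
def b_allele_frequency(theta, t_aa, t_ab, t_bb):
    """Map theta to BAF using the three cluster positions (assumed form)."""
    if theta <= t_aa:
        return 0.0
    if theta >= t_bb:
        return 1.0
    if theta < t_ab:
        # Between the AA and AB clusters: BAF runs from 0 to 0.5
        return 0.5 * (theta - t_aa) / (t_ab - t_aa)
    # Between the AB and BB clusters: BAF runs from 0.5 to 1
    return 0.5 + 0.5 * (theta - t_ab) / (t_bb - t_ab)
```

With cluster positions 0.05, 0.5, and 0.95, a SNP at theta = 0.5 has BAF 0.5, and a SNP halfway between the AB and BB clusters (theta = 0.725) has BAF 0.75.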

Analyzing the LRR and BAF metrics together provides strong resolution for detecting true copy number changes and allelic imbalance (Table 4).

Figure 5 Log R Ratios (LRR) and Allelic Intensity Ratio (BAF).

Table 4 Summary of Copy Numbers, Genotypes, Expected LRR, and Expected BAFs.

Total Copy Numbers                  CNV Genotypes                                 Expected LRR   Expected BAFs
Deletion of Two Copies              Null                                          < 0            N/A
Deletion of One Copy                A, B                                          < 0            0, 1
Normal Copy                         AA, AB, BB                                    0              0, 0.5, 1
Copy-Neutral LOH                    AA, BB                                        0              0, 1
Single Copy Duplication (Trisomy)   AAA, AAB, ABB, BBB                            > 0            0, 1/3, 2/3, 1
Double Copy Duplication             AAAA, AAAB, AABB, ABBB, BBBB                  > 0            0, 1/4, 2/4, 3/4, 1
Mosaic Deletion                     mixed (AA, AB, BB) and (A, B)                 < 0            4 BAF bands
Mosaic Copy-Neutral LOH             mixed (AA, AB, BB) and (AA, BB)               0              4 BAF bands
Mosaic Duplication                  mixed (AA, AB, BB) and (AAA, AAB, ABB, BBB)   > 0            0, > 1/3, < 2/3, 1
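The expected BAF values in the non-mosaic rows of Table 4 follow from counting B alleles: with n total copies and b copies of the B allele, the expected BAF is b/n. A short sketch, using exact fractions and a hypothetical function name, enumerates them; it does not cover copy-neutral LOH, where the copy number is 2 but the heterozygous state is absent.

```python
from fractions import Fraction

def expected_bafs(copy_number):
    """Expected BAF values for every genotype at a given total copy number."""
    if copy_number == 0:
        return []  # homozygous deletion: no signal, BAF undefined
    # b copies of the B allele out of copy_number total copies
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]
```

For example, `expected_bafs(3)` yields 0, 1/3, 2/3, and 1, matching the trisomy row of the table.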

Genetic mosaicism is the presence of cells within an organism that have a different genetic composition despite

being the product of a single fertilization event. For this project, three mosaic types were investigated: mosaic

deletion (loss), mosaic copy-neutral LOH, and mosaic duplication (gain) as defined below:

Mosaic deletion is the coexistence of cells with normal copy number and cells with a deletion of one copy. It is characterized by LRR < 0 and two heterozygous BAF bands.

Mosaic copy-neutral LOH is the coexistence of cells with normal copy number and cells with copy-neutral LOH. It is characterized by LRR = 0 and two heterozygous BAF bands.

Mosaic duplication is the coexistence of cells with normal copy number and cells with a duplication of one copy. It is characterized by LRR > 0 and two heterozygous BAF bands lying between 1/3 and 2/3. Note that if the two heterozygous BAF bands lie at exactly 1/3 (AAB) and 2/3 (ABB), the event is a pure (non-mosaic) duplication of one copy (trisomy).

4.2. Re-estimate LRR and BAF

LRR and BAF were estimated by the GenomeStudio software. However, two sources of bias are not removed by Illumina’s five-step normalization method: dye bias and GC/CpG wave bias. A residual bias between the two dyes used in the Infinium II assay remains after Illumina’s normalization and causes an asymmetry in the detection of the two alleles of each SNP. This dye intensity bias can reduce precision in estimating copy number and allelic imbalance. GC/CpG waves can be present when incorrectly quantified DNA is used in the Infinium assay, or in regions of high or low GC content. GC/CpG waves create artificial gains and losses in the signal intensities of SNPs and may lead to spurious copy-number variation calls. A four-step custom software pipeline was therefore applied to the data exported from GenomeStudio, which contain the called genotype, genotype call quality score, genotype probe intensities (Xnorm, Ynorm), log R ratio (LRR), and B allele frequency (BAF) for each assay, to normalize the data further and re-estimate LRR and BAF.

Step 1: Quantile normalization [11] was applied to the Xnorm and Ynorm values generated by Illumina’s GenomeStudio software, resulting in Xqnorm and Yqnorm. This procedure removes dye bias and reduces the asymmetry in the detection of the two alleles of each SNP, which influences both allelic proportions and copy number estimates.

Step 2: Genotype-specific cluster centers (AA, AB, BB) were re-estimated for each SNP using Xqnorm and Yqnorm values from assays with completion rate and genotype quality score greater than predefined thresholds, so that only data from high-quality samples were used to generate each cluster position.

Step 3: A GC/CpG wave correction model was applied to each genotyped sample to obtain the GC/CpG-corrected allelic composition theta = (2/pi) * arctan(Yqnorm / Xqnorm) and the total intensity R, which was estimated as a linear combination of (Xqnorm, Yqnorm, GC content in probes) [12]. GC/CpG correction reduces the wavy patterns in signal intensities and improves the accuracy of copy-number variation detection.

Step 4: Finally, LRR and BAF were recomputed from the resulting quantile-normalized and GC/CpG-corrected values, as described in [13].
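The idea behind Step 1 can be illustrated with a generic two-channel quantile normalization, in which the X and Y intensities of one assay are forced to share the same distribution. This is a sketch of the general technique only; the exact variant used in [11] may differ, ties are not given special treatment, and the function name is hypothetical.

```python
def quantile_normalize(x, y):
    """Give two equal-length intensity vectors a common distribution."""
    n = len(x)
    order_x = sorted(range(n), key=lambda i: x[i])  # SNP indices by X rank
    order_y = sorted(range(n), key=lambda i: y[i])  # SNP indices by Y rank
    xq = [0.0] * n
    yq = [0.0] * n
    for i, j in zip(order_x, order_y):
        # Reference value for this rank: mean of the k-th smallest X and Y
        reference = (x[i] + y[j]) / 2
        xq[i] = reference
        yq[j] = reference
    return xq, yq
```

After normalization both channels contain the same set of values, and each SNP keeps its rank within its own channel, so a systematic intensity difference between the two dyes is removed.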

The reduction in the variance of the LRR values after applying the four steps above is demonstrated in Figure 6 for one cluster group of Illumina HumanHap610 assays from this project.

Figure 6 Variance of log2 R ratio (LRR) before and after normalization procedure for one cluster group of Illumina

HumanHap610 assays.

The reduction in GC/CpG waves is evident in Figure 7.1 (a sample without any chromosomal abnormality) and Figure 7.2 (a sample with a duplication), which plot the signal intensity patterns before and after wave adjustment for two samples from this project.

Figure 7.1 Pre-normalization (left) and post-normalization (right) for a subject without chromosomal abnormality.

Each dot in the figure represents one SNP. Red dots represent B-allele frequency (BAF, scale on the right side), while

black dots show LRR values (LRR, scale on the left side). Three red bands represent BAF values for AA, AB, BB

genotypes along the entire chromosome. One black band in middle (overlap with red AB BAF band) represents LRR

values along the entire chromosome. There are wavy patterns with peaks and troughs for the LRR values across

entire chromosome 13 for pre-normalized data.

Figure 7.2 Pre-normalization (left) and post-normalization (right) for subject with duplication. There are wavy

patterns with peaks and troughs for the LRR values across entire chromosome 17 for pre-normalized data.

4.3. Segmentation Methods

In this project, two open-source packages, Genomic Alteration Detection Analysis (GADA) [14] and BAF Segmentation [15], were applied to the same data set to detect breakpoints on each chromosome. GADA uses the Sparse Bayesian Learning (SBL) segmentation algorithm, and BAF Segmentation uses the Circular Binary Segmentation (CBS) algorithm [16]. The mosaic events detected by the two methods were then combined.

Two large mosaicism studies were conducted by two independent research groups, the Gene-Environment Association Studies consortium (GENEVA) and the Cancer Genomics Research Laboratory (CGR), and the results from both groups were published in Nature Genetics in May 2012 [8,17]. Two lung cancer data sets, from the Environment and Genetics in Lung Cancer Etiology Study (EAGLE) and the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO), were used by both groups. The GENEVA group used the CBS algorithm and the CGR group used the SBL algorithm. When the resulting mosaic events were compared between the groups, the concordance rate was 75%: each group detected some mosaic events that the other group missed. To minimize the false negative rate, this project used both segmentation algorithms to detect mosaic chromosomal abnormalities and then combined the results.

The main steps implemented with the Genomic Alteration Detection Analysis (GADA) software [14] are as follows:

Breakpoint detection is based on the SBL (Sparse Bayesian Learning) algorithm.

The method detects segments where the B deviation is different from 0. The B deviation is the deviation of the observed BAF value from the expected BAF value of 0.5 for heterozygous SNPs.

Essential steps:

o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.

o The Sparse Bayesian Learning (SBL) model is used to discover the most likely genomic locations and magnitudes of CNV segments. The sparseness hyperparameter aAlpha controls the SBL prior distribution, which is uninformative about the location and amplitude of the CNV breakpoints but imposes a penalty on the number of CNV breakpoints. A higher aAlpha implies that fewer breakpoints are expected a priori, which results in fewer true CNVs being detected but also fewer false positives.

o Backward Elimination (BE) is used to rank the statistical significance of each breakpoint obtained from SBL and to sequentially remove the least significant breakpoints, using two parameters, T and MinSegLen. T is the critical value of the BE algorithm for the statistical score tm associated with breakpoint m; breakpoints with tm lower than T are discarded. The score tm is the difference between the sample averages of the probes falling in the left and right segments, divided by a pooled estimate of the standard error. T can be adjusted to control the False Discovery Rate (FDR). MinSegLen is the minimum number of consecutive probes (SNP markers) with a B deviation different from 0 that each CNV segment must contain. As T and MinSegLen increase, the number of CNV breakpoints decreases.
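The breakpoint score tm used in Backward Elimination can be sketched as a two-sample statistic. The version below pools the variance of the two neighbouring segments; GADA itself may estimate the noise variance differently (for example, from the whole array), so this is an illustration of the described quantity rather than the package's code, and the function name is hypothetical.

```python
import math
from statistics import mean

def breakpoint_score(left, right):
    """Absolute difference of segment means over a pooled standard error."""
    n_l, n_r = len(left), len(right)
    m_l, m_r = mean(left), mean(right)
    # Pooled variance of the probes in the two segments (assumed estimator)
    ss = sum((v - m_l) ** 2 for v in left) + sum((v - m_r) ** 2 for v in right)
    pooled_var = ss / (n_l + n_r - 2)
    se = math.sqrt(pooled_var * (1 / n_l + 1 / n_r))
    return abs(m_l - m_r) / se
```

A breakpoint would be retained when its score is at least the chosen critical value T and discarded otherwise; raising T therefore removes more breakpoints.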

The main steps implemented with the BAF Segmentation software [15] are as follows:

Breakpoint detection is based on the CBS (Circular Binary Segmentation) algorithm [16].

The method detects segments where mBAF is different from 0.5, since the expected BAF is 0.5 for the AB genotype in a diploid genome without CNVs. BAF data are reflected along the 0.5 axis into mBAF by the transformation mBAF = abs(BAF - 0.5) + 0.5, where abs denotes the absolute value.

Essential steps:

o Load the quantile-normalized and GC/CpG wave-corrected LRR and BAF.

o Convert BAF data to mBAF, so that homozygous SNPs (AA and BB) are positioned at 1 and heterozygous SNPs without CNVs are positioned at 0.5.

o Homozygous SNPs are uninformative for determining the total copy number. Remove homozygous SNPs from the mBAF profile using a fixed mBAF threshold: SNPs above the threshold are considered non-informative and removed.

o Triplet filtering is then applied to the threshold-filtered mBAF data to further improve the removal. For each SNP, the sum of the absolute differences in mBAF between that SNP and its preceding and succeeding SNPs is calculated and added to the SNP’s distance from the 0.5 baseline. For a SNP with index i: triplet sum[i] = abs(mBAF[i - 1] - mBAF[i]) + abs(mBAF[i + 1] - mBAF[i]) + mBAF[i] - 0.5

o Triplet sums are compared against a threshold, and SNPs with triplet sums above the threshold are considered outliers and removed. Triplet filtering is designed to remove non-informative homozygous SNPs that, because of experimental noise, have mBAF values lower than the mBAF threshold.

o The Circular Binary Segmentation model is applied to the mBAF profiles after removal of the non-informative homozygous SNPs to discover the most likely genomic locations and magnitudes of CNV segments. The total number of breakpoints is controlled by alpha, the significance level for accepting change-points.
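The mBAF transformation and the two filtering steps that precede segmentation can be sketched as follows. The threshold values 0.97 and 0.8 are placeholders chosen for illustration, not the settings used in this project, and the handling of the first and last SNP (a missing neighbour contributes 0) is an assumption. The function names are hypothetical.

```python
def mbaf(baf):
    """Reflect BAF along the 0.5 axis."""
    return abs(baf - 0.5) + 0.5

def informative_snps(bafs, mbaf_threshold=0.97, triplet_threshold=0.8):
    """Return indices of SNPs kept after threshold and triplet filtering."""
    m = [mbaf(b) for b in bafs]
    # Fixed threshold: drop SNPs that look homozygous (mBAF near 1)
    kept = [i for i, v in enumerate(m) if v <= mbaf_threshold]
    vals = [m[i] for i in kept]
    result = []
    for k, i in enumerate(kept):
        prev_diff = abs(vals[k - 1] - vals[k]) if k > 0 else 0.0
        next_diff = abs(vals[k + 1] - vals[k]) if k < len(vals) - 1 else 0.0
        triplet_sum = prev_diff + next_diff + vals[k] - 0.5
        # Triplet filter: drop isolated high-mBAF SNPs (noisy homozygotes)
        if triplet_sum <= triplet_threshold:
            result.append(i)
    return result
```

In the BAF profile `[0.5, 0.52, 0.99, 0.48, 0.95, 0.5, 0.01]`, the SNPs at 0.99 and 0.01 are removed by the fixed threshold, and the SNP at 0.95, which slips under that threshold, is removed by the triplet filter because it stands far above both of its neighbours.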

4.4. Mosaic Event Type Calling Method

Each event was assigned a copy-number state based on the median LRR value of the segment:

State = mosaic duplication (gain) if median(LRR) > 0.2 s_LRR

State = mosaic deletion (loss) if median(LRR) < -0.2 s_LRR

State = mosaic copy-neutral LOH otherwise

where s_LRR is the standard deviation of the segment LRR values.
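The calling rule can be written as a small function. The sketch assumes the thresholds are symmetric, at +0.2 and -0.2 standard deviations of the segment LRR, and uses the sample standard deviation; both choices, and the function and label names, are assumptions for illustration.

```python
from statistics import median, stdev

def call_state(segment_lrr, k=0.2):
    """Assign a mosaic state from the LRR values of one segment."""
    m = median(segment_lrr)
    s = stdev(segment_lrr)  # needs at least two LRR values
    if m > k * s:
        return "gain"    # mosaic duplication
    if m < -k * s:
        return "loss"    # mosaic deletion
    return "CNLOH"       # mosaic copy-neutral LOH
```

A segment whose LRR values scatter around 0.2 is called a gain, one scattering around -0.3 a loss, and one centered on 0 a copy-neutral LOH.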

After each segmentation method was applied to the same data set, the output file contained the start and end of each detected segment, the chromosome, the median LRR, and the standard deviation of the LRR within the segment. For each sample, adjacent events were merged if their event types were identical and the distance between the segments was less than 1 Mb. After merging, events shorter than 0.5 Mb were excluded, because the false-positive rate increased rapidly for smaller events. Most of the false positives were due to noisy data (high LRR and BAF variance) or to non-mosaic CNVs being detected as potentially mosaic.
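The merging and size filtering can be sketched for the events of one sample on one chromosome. The tuple layout (start, end, state), the function name, and the decision to keep events of exactly 0.5 Mb are assumptions; positions are in base pairs.

```python
def merge_and_filter(events, max_gap=1_000_000, min_size=500_000):
    """Merge adjacent same-state events, then drop events below min_size."""
    merged = []
    for start, end, state in sorted(events):
        last = merged[-1] if merged else None
        if last and last[2] == state and start - last[1] < max_gap:
            # Same event type and gap under 1 Mb: extend the previous event
            merged[-1] = (last[0], max(last[1], end), state)
        else:
            merged.append((start, end, state))
    return [e for e in merged if e[1] - e[0] >= min_size]
```

Two gains of 0.3 Mb separated by 0.5 Mb are merged into a single 1.1 Mb gain and kept, whereas an isolated 0.2 Mb loss is dropped.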

4.5. Proportion of Mosaicism

For each segment identified by SBL or CBS, Gaussian mixture models with two to four components were fit to the normalized BAF values of the segment, and the best-fitting model was chosen using the Akaike information criterion (AIC). The two to four components represent the two to four possible BAF bands. A two-component model (2 BAF bands, representing AA and BB, or A and B) fits best for segments that have complete loss of heterozygosity, or a copy-neutral LOH or loss event with a mosaic proportion of nearly 100%. A three-component model (3 BAF bands for AA, AB, and BB) should fit best for segments that are normal or have very low mosaic proportions. For segments where a two- or three-component model was chosen, mosaic proportions were assigned manually when a manual review of the combined LRR and BAF plot showed sufficient evidence of mosaicism.

Segments where the four-component model was the best fit (4 BAF bands: AA/A, BB/B, AB/A, and AB/B for mosaic deletion; AA/AA, BB/BB, AB/AA, and AB/BB for mosaic CNLOH; AA/AAA, BB/BBB, AB/AAB, and AB/ABB for mosaic duplication; see the last three rows of Table 4) were assigned mosaic proportions based on the inferred state and the location of the estimated heterozygote BAF bands (mu1, mu2), where mu1 and mu2 are the means of the BAF values across the segment for the two heterozygote BAF bands.

The mosaic proportions were calculated based on the inferred mosaic state and location of the estimated

heterozygote BAF (mu1, mu2) with formulas similar to [18]:

D = mu1 - mu2

Proportion of cells with a deletion = 2D / (1 + D)

Proportion of cells with a duplication = 2D / (1 - D)

Proportion of cells with copy number neutral loss of heterozygosity = D
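These formulas can be checked with a short sketch. For a deletion present in a fraction p of cells, the two heterozygous bands lie at 1/(2 - p) and (1 - p)/(2 - p), so D = p/(2 - p) and p = 2D/(1 + D); the duplication and CNLOH cases follow in the same way. The function name and state labels are hypothetical, and D is taken as the absolute band separation.

```python
def mosaic_proportion(mu1, mu2, state):
    """Proportion of abnormal cells from the two heterozygous BAF band means."""
    d = abs(mu1 - mu2)
    if state == "loss":
        return 2 * d / (1 + d)
    if state == "gain":
        return 2 * d / (1 - d)
    if state == "CNLOH":
        return d
    raise ValueError(f"unknown state: {state}")
```

For instance, heterozygous bands at 0.6 and 0.4 correspond to a duplication in 50% of cells, but to copy-neutral LOH in only 20% of cells, which is why the state must be inferred before the proportion is computed.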

4.6. Example of Mosaic duplication, Deletion, and CNLOH

Figure 8a shows the LRR and BAF plot for a normal sample. Figures 8b-g show LRR and BAF plots for six representative examples of the different types of mosaic chromosomal rearrangement found in this project. Each plot shows the signal intensity Log R ratio (LRR; black dots, scale on the left) and the B allele frequency (BAF; red dots, scale on the right) along the entire chromosome carrying the rearrangement in the selected sample.

Figure 8a Example of one subject with normal copy number for chromosome 13. Each dot in the figure represents one SNP. Red dots represent B allele frequency (BAF, scale on the right), while black dots show Log R ratio values (LRR, scale on the left). The three red bands represent BAF values for the AA (bottom band), AB (middle band), and BB (top band) genotypes across the entire chromosome 13. The single black band in the middle (overlapping the red AB BAF band) represents LRR values (around 0) along the entire chromosome 13.

Figure 8b Interstitial mosaic duplication on the p arm of chromosome 16, characterized by an increased Log R ratio (mean LRR within the segment (blue line) > 0) and abnormal heterozygous BAF. The vertical gray lines indicate the breakpoint(s) of the event segment. A non-mosaic trisomy would have a wider BAF split, with bands at 1/3 (AAB) and 2/3 (ABB), and a larger elevation of LRR.

Figure 8c Mosaic duplication of the entire chromosome 8, characterized by an increased Log R ratio (mean LRR within the segment > 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8c is lower than in Figure 8b because the split between the intermediate heterozygous BAF bands is narrower and the increase in LRR is smaller.

Figure 8d Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire q arm of chromosome 1, characterized by an unchanged Log R ratio (mean LRR within the segment close to 0) and abnormal heterozygous BAF. The p arm is in the normal state. A non-mosaic CNLOH would have only two BAF bands (AA and BB) and LRR close to 0.

Figure 8e Mosaic copy-neutral loss of heterozygosity (CNLOH) of the entire chromosome 14, characterized by an unchanged Log R ratio (mean LRR within the segment close to 0) and abnormal heterozygous BAF. The degree of mosaicism in Figure 8e is greater than in Figure 8d because the split between the intermediate BAF bands is wider.

Figure 8f Two small interstitial mosaic heterozygous deletions on the p arm of chromosome 2, characterized by a decreased Log R ratio (mean LRR within the segment < 0) and abnormal heterozygous BAF. A non-mosaic heterozygous deletion would have no intermediate BAF bands and a larger decrease in LRR.

Figure 8g Large mosaic heterozygous deletion on the q arm of chromosome 9, characterized by a decreased Log R ratio (mean LRR within the segment < 0) and abnormal heterozygous BAF. The mosaic deletion in Figure 8g is present in a smaller proportion of cells than those in Figure 8f because the split between the intermediate BAF bands is narrower and the decrease in LRR is smaller.

5. Data Analysis

The analysis started by loading the sample intensity files (two files per sample, one each for the red and green channels) into Illumina’s GenomeStudio software. The intensity data were normalized using Illumina’s five-step self-normalization procedure (described in Section 3, Illumina SNP Genotyping and Normalization), which draws on information contained in the array itself to convert raw X and Y (allele A and allele B) signal intensities to normalized values. The called genotypes, genotype call quality scores, raw and normalized probe intensities, LRR, and BAF for each assay were exported from GenomeStudio in its “Genotype Final Report” (GFR) format. The array data in the GFR file were then converted into a high-performance binary file format (GDAT) using the GLU software package (http://code.google.com/p/glu-genetics/), which was developed at the Cancer Genomics Research Laboratory (CGR).

A GC/CpG model file (GCM file) was generated from a copy of the UCSC hg18 reference genome and the Illumina binary manifest file “Human610-Quadv1_B.bpm”. A four-step custom software pipeline operating on the GDAT file (described in Section 4.2, Re-estimate Log R Ratio and B Allele Frequency) was then run. The information in the GCM file was used for GC/CpG correction. LRR and BAF were re-estimated from the quantile-normalized and GC/CpG-corrected values and written directly into the GDAT file as a new data table. All of these procedures were implemented using the GLU software package. The renormalized LRR and BAF values from qualifying assays (completion rate >= 90%) were then analyzed using two custom software pipelines, built on the GADA and BAF Segmentation packages, to detect whole-chromosome and large segmental events. Only events greater than 0.5 Mb in size were considered, to minimize false discoveries (described in Section 4.3, Segmentation Methods).

We applied the GADA method with the following parameter settings: SBL sparseness hyperparameter, which controls the total number of breakpoints discovered, aAlpha = 0.85; critical value of the backward elimination algorithm for the statistical score associated with a breakpoint, T statistic = 10; minimum number of SNPs that each CNV segment must contain, MinSegLen = 200. We applied the BAF Segmentation method with the following parameter settings: threshold in mBAF for calling regions of mosaic events based on segmented mBAF values, ai_threshold = 0.56 (default); minimum number of SNPs that a segmented region must contain to be called a mosaic event, ai_size = 45; threshold in mBAF for removing putatively non-informative SNPs, informative_threshold = 0.97 (default); threshold for the triplet filtering used to improve removal of putatively non-informative homozygous SNPs, triplet_threshold = 0.8 (default). CBS was used to identify the breakpoints of genomic regions, with a significance level for accepting change-points of alpha = 0.001.

For each sample, adjacent events were merged if their event types were identical and the distance between the segments was less than 1 Mb. After merging, events shorter than 0.5 Mb were excluded. All remaining events were then plotted.
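The merge-and-filter rule can be illustrated with a short sketch. The Event type, the function name, and the example coordinates are hypothetical; the sketch also assumes that positions are given in base pairs, that only events on the same chromosome are merged, and that "adjacent" means consecutive after sorting by start position.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    chrom: str
    start: int  # start position in bp
    end: int    # end position in bp
    kind: str   # "gain", "loss", or "cnloh"


def merge_and_filter(events: List[Event],
                     max_gap: int = 1_000_000,
                     min_len: int = 500_000) -> List[Event]:
    """Merge adjacent same-type events closer than max_gap, then drop short events."""
    merged: List[Event] = []
    for ev in sorted(events, key=lambda e: (e.chrom, e.start)):
        last = merged[-1] if merged else None
        if (last is not None and last.chrom == ev.chrom
                and last.kind == ev.kind and ev.start - last.end < max_gap):
            # Extend the previous event instead of starting a new one.
            merged[-1] = last._replace(end=max(last.end, ev.end))
        else:
            merged.append(ev)
    # Exclude events shorter than min_len after merging.
    return [ev for ev in merged if ev.end - ev.start >= min_len]


calls = [
    Event("1", 1_000_000, 1_300_000, "gain"),
    Event("1", 1_800_000, 2_100_000, "gain"),   # 0.5 Mb from the previous gain
    Event("1", 5_000_000, 5_200_000, "loss"),   # only 0.2 Mb long
    Event("2", 0, 600_000, "cnloh"),
]
kept = merge_and_filter(calls)
print(kept)
```

In this example the two gains are merged into one 1.1 Mb event, the short loss is dropped, and the CNLOH event is kept unchanged.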

False-positive calls due to noisy assay data were excluded from the analysis on the basis of a manual review of each plot. Non-mosaic copy-number variants and loss of heterozygosity due to hemizygous deletion (deletion of one copy), identity by descent (IBD), or uniparental disomy (UPD) were excluded in the same way because they are not mosaic events. The segment boundaries were manually corrected for some of the events. Each detected event was classified as mosaic duplication (gain), mosaic deletion (loss), or mosaic copy-neutral loss of heterozygosity. The proportion of abnormal cells was then estimated (described in Section 4.5, Proportion of Mosaicism). The magnitude of the BAF difference for single-copy duplication events is one-third of that for copy-neutral LOH or deletion events, which reduces the sensitivity for calling duplication events. For mosaic duplication events, only those with a proportion of abnormal cells <= 0.9 were kept, because above 0.9 it is difficult to distinguish reliably between mosaic and non-mosaic duplication.

To view the characteristics of the mosaic events, the events were plotted by proportion of abnormal cells and LRR using Microsoft Office Excel (Figure 9). Two circular genomic plots, one for bladder cancer cases and one for controls, each with three tracks of mosaic events for autosomes 1 to 22, were generated using the Circos software (http://circos.ca/software/) (Figure 10). Plots of the frequency of mosaic events by age and cancer status, for all subjects and for males only, were generated using Microsoft Office Excel (Figure 12). Logistic regression models were fit using the SAS software package to determine the relationship between individuals having mosaic event(s) and their age at DNA collection, gender, and cancer diagnosis.

6. Results and Discussion

6.1. Characteristics of Mosaic Events

We observed 193 mosaic segments larger than 0.5 Mb on the autosomes of 163 individuals, for an overall frequency of individuals with mosaicism of 5%. Of these events, 118 were from bladder cancer individuals (61.14%) and 75 were from cancer-free controls (38.86%). Mosaic autosomal abnormalities were more common in bladder cancer individuals (98/1673 = 5.86%) than in cancer-free persons (65/1577 = 4.12%). The chromosome with the most events was chromosome 17 for bladder cancer individuals (6.74% of all events) and chromosomes 2 and 6 for control individuals (4.15% each). Combining cases and controls, the chromosomes with the most events were chromosomes 2, 10, and 17 (8.29% each) (Table 5), which may imply instability of these three chromosomes. The most frequent type of event was mosaic duplication (55.96%), whereas mosaic deletion and mosaic CNLOH constituted 12.44% and 31.61% of mosaic events, respectively (Table 6). CNLOH segments were the largest and mosaic duplication segments the smallest: median lengths were 0.82 Mb for mosaic duplications, 2.32 Mb for mosaic deletions, and 21.05 Mb for mosaic CNLOHs. The abnormal cell proportions ranged from 20.88% to 89.86% for mosaic duplication, from 25.45% to 96.64% for mosaic deletion, and from 3.84% to 95% for mosaic CNLOH (Figure 9).

Table 5 Frequency of Mosaic Chromosomal Events by Chromosome and Case-Control Status.

            Mosaic Chromosome Count           Mosaic Chromosome Frequency (%)
Chromosome  Bladder Cancer  Control  Total    Bladder Cancer  Control  Total
1           5               4        9        2.59            2.07     4.66
2           8               8        16       4.15            4.15     8.29
3           3               2        5        1.55            1.04     2.59
4           4               4        8        2.07            2.07     4.15
5           3               2        5        1.55            1.04     2.59
6           1               8        9        0.52            4.15     4.66
7           7               1        8        3.63            0.52     4.15
8           7               4        11       3.63            2.07     5.70
9           7               3        10       3.63            1.55     5.18
10          10              6        16       5.18            3.11     8.29
11          2               2        4        1.04            1.04     2.07
12          4               3        7        2.07            1.55     3.63
13          8               4        12       4.15            2.07     6.22
14          6               3        9        3.11            1.55     4.66
15          5               4        9        2.59            2.07     4.66
16          9               3        12       4.66            1.55     6.22
17          13              3        16       6.74            1.55     8.29
18          3               3        6        1.55            1.55     3.11
19          3               1        4        1.55            0.52     2.07
20          3               1        4        1.55            0.52     2.07
21          5               2        7        2.59            1.04     3.63
22          2               4        6        1.04            2.07     3.11
Total       118             75       193      61.14           38.86    100

Table 6 Frequency of Mosaic Chromosomal Events by Event Type and Location.

                    Mosaic chromosome count        Mosaic chromosome frequency (%)
Event location      Gain   Loss   CN LOH  Total    Gain    Loss    CN LOH  Total
Whole chromosome    4      0      5       9        2.07    0.00    2.59    4.66
Telomeric p         7      1      10      18       3.63    0.52    5.18    9.33
Telomeric q         10     2      17      29       5.18    1.04    8.81    15.03
Interstitial        87     21     29      137      45.08   10.88   15.03   70.98
Telomeric (p + q)   17     3      27      47       8.81    1.55    13.99   24.35
Total               108    24     61      193      55.96   12.44   31.61   100

Figure 9 Characteristics of mosaic events. Mosaic events plotted by proportion of abnormal cells (P) and LRR for 193 events in 163 individuals. Each dot represents the P and LRR values of one event: blue for mosaic duplication, green for mosaic CNLOH, and red for mosaic deletion.

Of the mosaic chromosomal events detected by the GADA and BAF Segmentation methods, 4.66% spanned an entire chromosome. These comprised 4 whole-chromosome mosaic trisomy events, on chromosomes 8, 12, 18, and 21, of which 3 were carried by one subject, and 5 whole-chromosome mosaic CNLOH events, on chromosomes 6, 9, 18, and 19 (2 events), carried by 4 different subjects. No whole-chromosome mosaic deletion event was detected (Figure 10). We found that 9.33% of mosaic chromosomal events began at the p-arm telomere and 15.03% ended at the q-arm telomere. Most mosaic chromosomal events were interstitial (70.98%), spanning no telomere. The majority of telomeric events (p + q) were mosaic copy-neutral LOH (27 / 47 = 57.45%), followed by mosaic duplication (17 / 47 = 36.17%). The majority of interstitial events were mosaic duplication (87 / 137 = 63.5%), followed by mosaic CNLOH (29 / 137 = 21%) (Table 6).

Sixteen individuals (9 bladder cancer cases and 7 controls) had mosaic events on at least two chromosomes. Among control individuals, the greatest number of mosaic chromosomal events observed in a single subject was 4. This subject was from the ATBC cohort study, and the events covered the whole of chromosomes 12, 18, and 19 and the entire q arm of chromosome 21; all of them were mosaic duplications (Figure 11). Among bladder cancer individuals, the greatest number of mosaic chromosomal events observed in a single subject was 11. This subject was from the PLCO cohort study, and the events were located on 9 different chromosomes, including 2 events each on chromosomes 12 and 13; all of them were mosaic copy-neutral LOH with a very high degree of mosaicism.

Figure 10 Circular plots displaying the genomic location of mosaic events. The outer rings are autosomes 1 to 22. The yellow track shows mosaic copy-neutral LOH events, the blue track mosaic duplication events, and the red track mosaic deletion events. The left plot shows events detected in bladder cancer subjects; the right plot shows events detected in cancer-free controls.

Figure 11 Mosaic duplication across the whole of chromosomes 12, 18, and 19 and the entire q arm of chromosome 21 in a control individual. The events are characterized by an increased Log R ratio (mean LRR within the segment > 0) and abnormal heterozygous BAF.

6.2. Mosaic Chromosomal Abnormalities and Age at DNA Collection

We examined the effect of increasing age on the frequency of mosaic events across the three cohort studies, which predominantly included individuals over the age of 60. Age was missing for 27 individuals. Among the remaining 3,212 subjects, the frequency of control individuals with mosaic events increased with age, from 3.21% for those under 60 to 4.53% for those aged 66-70 and 5.96% for those aged 76-89 (P = 0.11). The frequency of bladder cancer individuals with mosaic events was almost constant with age: 6.95% for those under 60, 7.24% for those aged 66-70, and 6.14% for those aged 76-89 (P = 0.32). The frequency of mosaic events was higher in bladder cancer individuals than in control individuals in the first four age groups, but was very similar in the two groups among those aged 76-89 (Figure 12 Top).

The frequency of male control individuals with mosaic events increased with age, from 3.39% for those under 60 to 4.41% for those aged 66-70 and 7.51% for those aged 76-89 (P = 0.035). The frequency of male bladder cancer individuals with mosaic events was almost constant with age: 7.75% for those under 60, 6.81% for those aged 66-70, and 6.81% for those aged 76-89 (P = 0.26). The frequency of mosaic events was higher in male bladder cancer individuals than in male control individuals in the first four age groups, but lower in male bladder cancer individuals than in male control individuals among those aged 76-89 (Figure 12 Bottom).

For females, there were no mosaic events among those under age 60 and very few events in the other four age categories, so reliable summaries cannot be provided for each age category (Table 7).

Figure 12 Frequency of mosaic events by age and cancer status for all individuals (Top) and male only (Bottom).

Table 7 Mosaic Event Counts by Five Age Categories and Cancer Status (Yes = with mosaic event; No = without).

        Male + Female                     Male                              Female
        Bladder Cancer   Control          Bladder Cancer   Control          Bladder Cancer   Control
Age     Yes     No       Yes     No       Yes     No       Yes     No       Yes     No       Yes     No
<=60    21      302      8       249      21      271      8       236      0       31       0       13
61-65   17      310      10      322      14      263      7       283      3       47       3       39
66-70   27      373      16      353      23      321      13      295      4       52       3       58
71-75   19      341      17      336      17      283      12      253      2       58       5       83
76-89   14      228      14      235      13      191      13      173      1       37       1       62
Total   98      1554     65      1495     88      1329     53      1240     10      225      12      255

6.3. Mosaic Chromosomal Abnormalities and Gender

We also examined the effect of gender on the frequency of mosaic events across the three cohort studies, separately for bladder cancer cases and controls. Among bladder cancer cases, mosaic events were more frequent in males (6.13%) than in females (4.22%). Among controls, the frequency was almost the same in males (4.08%) and females (4.48%). Logistic regression models of mosaic status on gender, adjusting for age, were fit separately for (1) controls and (2) cases. For controls, OR = 0.994 with P-value = 0.986; for cases, OR = 1.470 with P-value = 0.260. A logistic regression model of mosaic status on gender, adjusting for age and cancer status, was also fit to all subjects and gave OR = 1.173 with P-value = 0.498. We therefore did not observe a significant gender effect in any of the three models (Table 8).

Table 8 Frequency of Mosaic Events by Gender and Cancer Status.

                 Mosaic event frequency (%)   Adjusted Logistic Model
                 Male      Female             OR       95% CI           P-value
Bladder Cancer   6.13      4.22               1.470    (0.752-2.872)    0.260
Control          4.08      4.48               0.994    (0.519-1.904)    0.986
Overall          5.16      4.36               1.173    (0.739-1.862)    0.498

6.4. Mosaic Chromosomal Abnormalities and Cancer Risk

6.4.1. Mosaic Chromosomal Abnormalities and Cancer Risk for All Subjects

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer, several logistic regression models were fit with mosaic status as the response variable and the following predictors: (1) cancer status; (2) cancer status + gender; (3) cancer status + age; (4) cancer status + age + gender. Models (2) to (4) test the partial independence of cancer status and mosaic status, controlling for gender and/or age. In each model, cancer status = 1 for bladder cancer and 0 for control, gender = 1 for male and 0 for female, and age is a continuous variable.

There is modest evidence of a positive relationship between mosaic events and bladder cancer: OR = 1.44 (95% CI = 1.04–1.98; P = 0.027) for model (1); OR = 1.43 (95% CI = 1.04–1.97; P = 0.029) for model (2); OR = 1.45 (95% CI = 1.05–2.00; P = 0.025) for model (3); and OR = 1.44 (95% CI = 1.05–1.99; P = 0.026) for model (4). All four models give very similar P-values for cancer status. There is no significant evidence of a gender effect or an age effect in the models that include gender and/or age as predictors (Table 9.1).
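For a model with a single binary predictor, the logistic regression odds ratio equals the cross-product ratio of the 2x2 table, so the estimate for model (1) can be checked by hand. The sketch below assumes the counts given in the text (98 of 1,673 cases and 65 of 1,566 controls with a mosaic event); the function name is hypothetical, and the interval is the usual Wald interval on the log odds ratio.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.959964):
    """Odds ratio and 95% Wald CI for a 2x2 table.

    a, b: cases with / without a mosaic event
    c, d: controls with / without a mosaic event
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Counts assumed from the text: 98/1,673 cases and 65/1,566 controls.
cases_with, cases_total = 98, 1673
controls_with, controls_total = 65, 1566

or_hat, ci_low, ci_high = odds_ratio_wald(
    cases_with, cases_total - cases_with,
    controls_with, controls_total - controls_with,
)
print(round(or_hat, 2), round(ci_low, 2), round(ci_high, 2))
```

With these counts the sketch gives values that round to the odds ratio and interval reported for model (1). The adjusted models (2) to (4) require a full logistic regression fit and cannot be reproduced from the 2x2 table alone.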

All four main-effect models fit the data adequately: P = 0.298 from Pearson’s goodness-of-fit test for model (2), and P = 0.808 and 0.575 from the Hosmer-Lemeshow goodness-of-fit test for models (3) and (4). Because age is a continuous variable, models (3) and (4) have a very large number of unique profiles (Table 9.2). In this situation the data are too sparse for the Pearson and deviance goodness-of-fit tests, and the Hosmer-Lemeshow test, which is designed for a large number of predictor settings, is more suitable. All four models have very similar AIC values. The difference in -2 Log L between model (1) and each of the other three models is less than 3.84, the 0.05 critical value of the chi-square distribution with 1 degree of freedom, so none of the more complex models significantly improves on the simplest model, model (1). However, we are interested not only in whether there is evidence of a cancer status effect but also in the age and gender effects on mosaic status, so model (4) (Table 9.3) gives more meaningful and interpretable results for our scientific and statistical questions.

To test for equality of the odds ratios between cancer status and mosaic status across ages, we added an interaction term, giving cancer_status + age + gender + cancer_status*age as predictors. There is borderline evidence that the odds ratio between bladder cancer and mosaic events differs with age (P = 0.054). It is therefore reasonable to model the log(OR) between the cancer status levels at a given age x as a linear function of x. This model fits the data very well and has the smallest AIC value among all tested models (Tables 9.2 and 9.4).
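Under the interaction model, the log odds ratio for cancer status at age x is the STATUS_Code coefficient plus x times the STATUS_Code*AGE_DNA coefficient. The sketch below evaluates this using the estimates reported in Table 9.4; the function and constant names are hypothetical, and the point estimates are used without their standard errors, so the result is illustrative only.

```python
import math

# Estimates taken from Table 9.4 (interaction model, all subjects).
B_STATUS = 3.0518        # STATUS_Code
B_STATUS_AGE = -0.0395   # STATUS_Code*AGE_DNA


def odds_ratio_at_age(age):
    """Case-vs-control odds ratio for a mosaic event at a given age."""
    return math.exp(B_STATUS + B_STATUS_AGE * age)


# Age at which the fitted log odds ratio crosses zero (OR = 1).
crossover_age = -B_STATUS / B_STATUS_AGE

for age in (55, 65, 75, 85):
    print(age, round(odds_ratio_at_age(age), 2))
print("crossover age:", round(crossover_age, 1))
```

The fitted odds ratio is above 1 at younger ages and falls below 1 in the late seventies, which agrees with the pattern in Figure 12, where the case and control frequencies converge in the oldest age group.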

We also examined whether DNA source and study were significant predictors of mosaic status, using a univariate analysis for each nominal variable. Notably, DNA source (58.6% from blood, 41.4% from buccal samples) was not a significant predictor (P-value = 0.578). Study (ATBC, PLCO, CPS) was also not a significant predictor (P-value = 0.377).

Table 9.1 Summary Logistic Regression Models on All Subjects.

Setting: presence of mosaic event = 1; case vs control; male vs female.

                                                  STATUS                               GENDER       AGE          STATUS*AGE
Model                                             OR     95% Wald CI    Pr > ChiSq     Pr > ChiSq   Pr > ChiSq   Pr > ChiSq
mosaic_event=status+error                         1.44   (1.04-1.98)    0.027
mosaic_event=status+gender+error                  1.43   (1.04-1.97)    0.029          0.503
mosaic_event=status+age+error                     1.45   (1.05-2.00)    0.025                       0.931
mosaic_event=status+gender+age+error              1.44   (1.05-1.99)    0.026          0.498        0.98
mosaic_event=status+age+gender+status*age+error                         0.031          0.408        0.095        0.054

Table 9.2 Summary Goodness-of-Fit Results for Logistic Regression Models on All Subjects.

Setting: presence of mosaic event = 1; case vs control; male vs female.

                                                                         Pearson GOF   HL GOF
Model                                             -2 Log L    AIC        Pr > ChiSq    Pr > ChiSq
mosaic_event=status+error                         1287.16     1291.19    .             .
mosaic_event=status+gender+error                  1286.72     1292.72    0.298
mosaic_event=status+age+error                     1284.12     1290.12                  0.808
mosaic_event=status+gender+age+error              1283.64     1291.64                  0.575
mosaic_event=status+age+gender+status*age+error   1279.96     1289.96                  0.958
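The -2 Log L values in Table 9.2 also allow a likelihood-ratio check of the interaction term, by comparing the main-effect model (4) with the interaction model, which has one additional parameter. The sketch below is an illustration under the assumption that both models were fit to the same subjects; the function name is hypothetical, and the chi-square tail probability for 1 degree of freedom is computed with the complementary error function from the Python standard library.

```python
import math


def lrt_one_df(neg2logl_reduced, neg2logl_full):
    """Likelihood-ratio test for one extra parameter (chi-square, 1 df)."""
    stat = neg2logl_reduced - neg2logl_full
    if stat < 0:
        raise ValueError("the full model should not fit worse than the reduced model")
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# -2 Log L from Table 9.2: model (4) and the model with status*age.
lrt_stat, lrt_p = lrt_one_df(1283.64, 1279.96)
print(round(lrt_stat, 2), round(lrt_p, 3))
```

The statistic is just below the 3.84 critical value, and the resulting P-value is close to the Wald P-value reported for the STATUS*AGE term.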

Table 9.3 SAS Output From Logistic Regression Main Effect Model on All Subjects.

Analysis of Maximum Likelihood Estimates

Parameter     DF   Estimate    Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1    -3.2544     0.6667           23.8264           <.0001
AGE_DNA       1    -0.00022    0.00891          0.0006            0.9802
SEX_Code      1    0.1598      0.2357           0.4597            0.4978
STATUS_Code   1    0.367       0.1649           4.957             0.026

Table 9.4 SAS Output From Logistic Regression Model With Interaction Term on All Subjects.

Parameter             DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept             1    -5.3793    1.299            17.1479           <.0001
STATUS_Code           1    3.0518     1.4109           4.6788            0.0305
AGE_DNA               1    0.0303     0.0181           2.7921            0.0947
SEX_Code              1    0.1957     0.2366           0.6844            0.4081
STATUS_Code*AGE_DNA   1    -0.0395    0.0205           3.7226            0.0537

To reduce the number of unique profiles when age is in the model, we divided age into five age groups: <=60, 61-65, 66-70, 71-75, and >=76. We then fit a logistic regression of mosaic status on cancer status, gender, and age group to test the partial independence of mosaic status and cancer status, controlling for gender and the five age groups. Again there is modest evidence of a positive association between bladder cancer and mosaic status (P-value = 0.025), which is only slightly smaller than the P-value obtained with age as a continuous variable (P-value = 0.026). There was no significant difference between any of the last four age groups and the first age group (Table 9.5).

Table 9.5 SAS Output From Logistic Regression Model with Age Split to Five Groups on All Subjects.

Type 3 Analysis of Effects

Effect        DF   Wald Chi-Square   Pr > ChiSq
STATUS_Code   1    5.0302            0.0249
SEX_Code      1    0.55              0.4583
AGE_GROUP     4    2.108             0.7159

Analysis of Maximum Likelihood Estimates

Parameter          DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept          1    -3.3269    0.3066           117.7724          <.0001
STATUS_Code        1    0.3686     0.1643           5.0302            0.0249
SEX_Code           1    0.1753     0.2364           0.55              0.4583
AGE_GROUP 61--65   1    -0.1693    0.2744           0.3808            0.5372
AGE_GROUP 66--70   1    0.1418     0.2476           0.3282            0.5667
AGE_GROUP 71--75   1    0.0507     0.2576           0.0388            0.8439
AGE_GROUP >=76     1    0.182      0.2739           0.4416            0.5063

Odds Ratio Estimates

Effect                     Point Estimate   95% Wald Confidence Limits
STATUS_Code                1.446            1.048    1.995
SEX_Code                   1.192            0.75     1.894
AGE_GROUP 61--65 vs <=60   0.844            0.493    1.446
AGE_GROUP 66--70 vs <=60   1.152            0.709    1.872
AGE_GROUP 71--75 vs <=60   1.052            0.635    1.743
AGE_GROUP >=76 vs <=60     1.2              0.701    2.052

6.4.2. Mosaic Chromosomal Abnormalities and Cancer Risk for Men Only

Bladder cancer is the fourth most common cancer diagnosed in men. Men are about 3 to 4 times more likely to get

bladder cancer during their lifetime than women. Overall, the chance men will develop this cancer during their life

is about 1 in 26. For women, the chance is about 1 in 90 [19].

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in males only, two logistic regression models were fit with mosaic status as the response variable and (1) cancer status or (2) cancer status + age as predictors; model (2) tests the partial independence of mosaic status and cancer status, controlling for age. In each model, cancer status = 1 for bladder cancer and 0 for control, and age is a continuous variable.

There is modest evidence of a positive relationship between mosaic events and bladder cancer risk in males: OR = 1.53 (95% CI = 1.08–2.17; P = 0.017) for model (1) and OR = 1.55 (95% CI = 1.09–2.20; P = 0.014) for model (2) (Table 10.1). There is no significant evidence of an age effect in model (2) (Table 10.2).

To test for equality of the odds ratios between mosaic events and bladder cancer across ages in males, we added an interaction term, giving cancer status + age + cancer status*age as predictors. There is significant evidence that the odds ratio between mosaic events and bladder cancer differs with age (P = 0.017). This model fits the data very well and has the smallest AIC value among all tested models (Table 10.3).

Table 10.1 Summary Logistic Regression Models on Male Subjects.

Setting: presence of mosaic event = 1; case vs control.

                                           STATUS                               AGE          STATUS*AGE   HL GOF
Model                                      OR     95% Wald CI    Pr > ChiSq     Pr > ChiSq   Pr > ChiSq   Pr > ChiSq
mosaic_event=status+error                  1.53   (1.08-2.17)    0.017                                    .
mosaic_event=status+age+error              1.55   (1.09-2.20)    0.014          0.998                     0.547
mosaic_event=status+age+status*age+error                         0.008          0.035        0.017        0.906

Table 10.2 SAS Output From Logistic Regression Main Effect Model on Male Subjects.

Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1    -3.1538    0.6499           23.5469           <.0001
STATUS_Code   1    0.4378     0.1788           5.992             0.0144
AGE_DNA       1    0.000018   0.0094           0                 0.9985

Table 10.3 SAS Output From Logistic Regression Model With Interaction Term on Male Subjects.

Parameter             DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept             1    -6.0032    1.3769           19.008            <.0001
STATUS_Code           1    4.0163     1.5229           6.9552            0.0084
AGE_DNA               1    0.0416     0.0197           4.453             0.0348
STATUS_Code*AGE_DNA   1    -0.0527    0.0221           5.7159            0.0168
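To make the interaction in Table 10.3 easier to read, the fitted coefficients can be converted into predicted probabilities of carrying a mosaic event. The sketch below is illustrative only: the function and dictionary names are hypothetical, and it uses the point estimates from Table 10.3 without their standard errors.

```python
import math

# Point estimates from Table 10.3 (male subjects, interaction model).
COEF = {
    "intercept": -6.0032,
    "status": 4.0163,       # STATUS_Code (1 = bladder cancer, 0 = control)
    "age": 0.0416,          # AGE_DNA
    "status_age": -0.0527,  # STATUS_Code*AGE_DNA
}


def predicted_probability(age, status):
    """Fitted probability of a mosaic event for a male of a given age and status."""
    logit = (COEF["intercept"]
             + COEF["status"] * status
             + COEF["age"] * age
             + COEF["status_age"] * status * age)
    return 1.0 / (1.0 + math.exp(-logit))


for age in (60, 70, 80):
    p_control = predicted_probability(age, 0)
    p_case = predicted_probability(age, 1)
    print(age, round(p_control, 3), round(p_case, 3))
```

Under these estimates the fitted probability rises with age in male controls and declines slightly in male cases, so that the two groups cross in the late seventies.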

6.4.3. Mosaic Chromosomal Abnormalities and Cancer Risk for Female Only

To investigate the relationship between mosaic chromosomal abnormalities and bladder cancer in females only, two logistic regression models were fit with mosaic status as the response variable and (1) cancer status or (2) cancer status + age as predictors; model (2) tests the partial independence of mosaic status and cancer status, controlling for age. In each model, cancer status = 1 for bladder cancer and 0 for control, and age is a continuous variable.

There is no significant evidence of an association between mosaic events and bladder cancer risk in females: OR = 0.94 (95% CI = 0.40–2.22; P = 0.887) for model (1) and OR = 0.93 (95% CI = 0.38–2.23; P = 0.865) for model (2) (Table 11.1). There is no significant evidence of an age effect in model (2) (Table 11.2).

To test for equality of the odds ratios between mosaic events and cancer status across ages in females, we added an interaction term, giving cancer status + age + cancer status*age as predictors. There is no significant evidence that the odds ratio between mosaic events and bladder cancer differs with age (P = 0.27) (Table 11.3).

Table 11.1 Summary Logistic Regression Models on Female Subjects.

Setting: presence of mosaic event = 1; case vs control.

                                           STATUS                               AGE          STATUS*AGE   HL GOF
Model                                      OR     95% Wald CI    Pr > ChiSq     Pr > ChiSq   Pr > ChiSq   Pr > ChiSq
mosaic_event=status+error                  0.94   (0.40-2.22)    0.887
mosaic_event=status+age+error              0.93   (0.38-2.23)    0.865          0.839                     0.318
mosaic_event=status+age+status*age+error                         0.264          0.309        0.27         0.064

Table 11.2 SAS Output From Logistic Regression Main Effect Model on Female Subjects.

Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1    -2.6583    1.9801           1.8023            0.1794
STATUS_Code   1    -0.0766    0.449            0.0291            0.8645
AGE_DNA       1    -0.0056    0.0277           0.0412            0.8391

Table 11.3 SAS Output From Logistic Regression Model With Interaction Term on Female Subjects.

Parameter             DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept             1    0.319      3.2881           0.0094            0.9227
STATUS_Code           1    -4.8062    4.3008           1.2489            0.2638
AGE_DNA               1    -0.0482    0.0473           1.0368            0.3086
STATUS_Code*AGE_DNA   1    0.0683     0.062            1.2167            0.27

7. Conclusion and Further Studies

In this project, DNA from 3,239 individuals (1,673 bladder cancer cases and 1,566 cancer-free controls) was examined for evidence of autosomal mosaicism using Illumina genome-wide SNP array data generated for a bladder cancer genome-wide association analysis. We observed 193 mosaic duplication, mosaic deletion, and mosaic copy-neutral loss of heterozygosity (CNLOH) events larger than 0.5 Mb in the autosomes of 163 study subjects (5%), with abnormal cell proportions between 3.84% and 96.64%. Mosaic autosomal abnormalities were statistically significantly and positively associated with bladder cancer in males (OR = 1.55; P = 0.014) but not in females. The frequency of mosaicism increased with age in male control subjects, from 3.39% in individuals under age 60 to 7.51% in those between 76 and 89 years old (P = 0.035). Mosaic autosomal abnormalities were more common in bladder cancer individuals (5.86%) than in cancer-free persons (4.12%). Among bladder cancer individuals, mosaic events were more frequent in males (6.13%) than in females (4.22%), whereas among cancer-free persons the frequencies were similar (4.08% in males and 4.48% in females). The most frequent class of autosomal abnormality detected was mosaic duplication, representing 55.96% of mosaic events. The autosome with the most mosaic events was chromosome 17 for bladder cancer individuals (6.74%) and chromosomes 2 and 6 for control individuals (4.15% each). Combining cases and controls, the chromosomes with the most mosaic events were chromosomes 2, 10, and 17 (8.29% each).

Mosaicism in older cancer-free male individuals suggests that age-related genomic instability could be due to

increased rates of somatic mutation or diminished capacity for genomic maintenance, such as with telomere

attrition, leading to proliferation of somatically altered cell populations [20].

This project could be extended in several directions if additional information becomes available. For all subjects with mosaic events, it would be very interesting to assess the character and behavior of mosaic events over time and to determine whether an individual’s number of mosaic events or proportion of mosaicism changes over time. In this project, we calculated the proportion of mosaicism for each mosaic event but did not go further. With additional data points collected relative to the time of diagnosis in cases, we could investigate the hypothesis of a positive association between the proportion of mosaicism and the severity of the cancer (early to later stage).

Very recently, we incidentally identified one subject who had participated in two studies (Non-Hodgkin’s Lymphoma (NHL) and ovarian cancer). We had genotype data from two of her DNA samples, drawn at ages 59 (NHL) and 63 (NHL + ovarian cancer). We performed the mosaic chromosomal abnormality analysis on both samples and found mosaic events on multiple chromosomes. At age 59, mosaic events were observed on chromosomes 3, 8, 10, 13, 20, and X (Figure 13 left). At age 63, mosaic events were observed on chromosomes 3, 4, 8, 9, 13, 20, and X (Figure 13 right). All of the mosaic events seen at age 59 were also observed at age 63, with an increased proportion of mosaicism, except the event on chromosome 10, which was no longer observed at age 63. The disappearance of this event may be a result of cancer treatment. Two new events were detected at age 63, on chromosomes 4 and 9, which may be due to later-stage NHL or may be specific to the ovarian cancer.


Figure 13 Mosaic events observed on multiple chromosomes for a female subject with Non-Hodgkin’s Lymphoma and ovarian cancer. The left panels are for the DNA sample drawn at age 59 (NHL). The right panels are for the DNA sample drawn at age 63 (NHL + ovarian cancer).


It may also be possible to determine the developmental origin (germline or somatic) of observed mosaic events if blood, tumor tissue, and normal tissue data are available. A germline mutation is one that can be passed on to offspring. Examples of germline mutations linked to cancer are those that occur in cancer susceptibility genes, increasing a person's risk for the disease. Somatic mutations are not passed on to the next generation. By distinguishing the origin of a mutation, we may be able to discover cancer susceptibility genes such as the well-known BRCA1 and BRCA2 genes for breast cancer. We recently ran the mosaic analysis on several TCGA (The Cancer Genome Atlas, http://cancergenome.nih.gov/) SNP data sets. Figure 14 shows a possible germline deletion at chr7: 54500000 – 55200000 for subject 1087, who had glioblastoma multiforme (GBM), with blood and tumor samples taken at the same time. The same mutation was present in both the blood and tumor samples, which means it is not a tumor-specific mutation. We also observed some mutations that were present in the blood sample but not in the tumor sample, which implies that the mutation in blood is a somatic rather than a germline mutation (Figure 15).
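The reasoning above, which compares events between DNA sources from the same subject, can be sketched as an interval-overlap check. This is a minimal illustration and not the procedure used in this project: the `Event` record, the 50% reciprocal-overlap threshold, the tumor call, and the chromosome 3 coordinates in the example are all assumptions chosen for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    chrom: str
    start: int
    end: int
    kind: str  # "gain", "loss", or "cnloh"


def reciprocal_overlap(a, b):
    """Shared length divided by the length of the longer event."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return shared / max(a.end - a.start, b.end - b.start)


def classify_blood_event(blood, tumor_events, min_overlap=0.5):
    """Label a blood event by whether the tumor carries a matching event."""
    for tumor in tumor_events:
        same_kind = tumor.kind == blood.kind
        if same_kind and reciprocal_overlap(blood, tumor) >= min_overlap:
            return "shared with tumor (possible germline)"
    return "blood only (likely somatic)"


# Deletion coordinates from the text; the tumor call is hypothetical.
blood_del = Event("7", 54_500_000, 55_200_000, "loss")
tumor_calls = [Event("7", 54_500_000, 55_200_000, "loss")]
# Hypothetical whole-chromosome CNLOH seen only in blood.
blood_cnloh = Event("3", 0, 198_000_000, "cnloh")

print(classify_blood_event(blood_del, tumor_calls))
print(classify_blood_event(blood_cnloh, tumor_calls))
```

Presence in both DNA sources is consistent with a germline origin but does not prove it, which is why the first label is hedged as "possible".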

Figure 14 Deletion at chromosome 7 (54500000 – 55200000) detected from the blood sample (top) and the primary solid tumor sample (bottom) for one GBM subject from TCGA data. The deletion in the blood sample is very likely a germline mutation because the same event was present in both DNA sources.


Figure 15 Mosaic CNLOH across the whole of chromosome 3 detected from the blood sample (top) for ovarian cancer subject 1877 from TCGA data. The bottom panel shows the mutation detected from the primary solid tumor sample of the same subject. The mutation in the blood sample is a somatic mutation because the mutation types differ markedly between the two DNA sources.


8. References

1. Youssoufian, H. & Pyeritz, R.E. Mechanisms and consequences of somatic mosaicism in humans. Nat. Rev. Genet. 3, 748–758 (2002).
2. Notini, A.J., Craig, J.M. & White, S.J. Copy number variation and mosaicism. Cytogenet. Genome Res. 123, 270–277 (2008).
3. Menten, B. et al. Emerging patterns of cryptic chromosomal imbalance in patients with idiopathic mental retardation and multiple congenital anomalies: a new series of 140 patients and review of published reports. J. Med. Genet. 43, 625–633 (2006).
4. Lu, X.Y. et al. Genomic imbalances in neonates with birth defects: high detection rates by using chromosomal microarray analysis. Pediatrics 122, 1310–1318 (2008).
5. Conlin, L.K. et al. Mechanisms of mosaicism, chimerism and uniparental disomy identified by single nucleotide polymorphism array analysis. Hum. Mol. Genet. 19, 1263–1275 (2010).
6. Vorsanova, S. et al. Evidence for High Frequency of Chromosomal Mosaicism in Spontaneous Abortions Revealed by Interphase FISH Analysis. J. Histochem. Cytochem. 53, 375–380 (2005).
7. Solomon, D.A. et al. Mutational inactivation of STAG2 causes aneuploidy in human cancer. Science 333, 1039–1043 (2011).
8. Jacobs, K.B., Yeager, M., Zhou, W. et al. Detectable clonal mosaicism and its relationship to aging and cancer. Nat. Genet. 44, 651–658 (2012).
9. Yu, K. et al. Population Substructure and Control Selection in Genome-Wide Association Studies. PLoS ONE (2008).
10. http://dnatech.genomecenter.ucdavis.edu/documents/illumina_gt_normalization.pdf
11. Staaf, J. et al. Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics 9, 409 (2008).
12. Diskin, S.J. et al. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 36, e126 (2008).
13. Peiffer, D.A. et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res. 16, 1136–1148 (2006).
14. Pique-Regi, R., Caceres, A. & Gonzalez, J.R. R-Gada: a fast and flexible pipeline for copy number analysis in association studies. BMC Bioinformatics 11, 380 (2010).
15. Staaf, J. et al. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol. 9, R136 (2008).
16. Olshen, A.B. et al. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).
17. Laurie, C.C. et al. Detectable clonal mosaicism from birth to old age and its relationship to cancer. Nat. Genet. 44, 642–650 (2012).
18. Rodríguez-Santiago, B. et al. Mosaic uniparental disomies and aneuploidies as large structural variants of the human genome. Am. J. Hum. Genet. 87, 129–138 (2010).
19. http://www.cancer.org/cancer/bladdercancer/detailedguide/bladder-cancer-key-statistics
20. Sahin, E. & DePinho, R.A. Linking functional decline of telomeres, mitochondria and stem cells during ageing. Nature 464, 520–528 (2010).


9. Appendix A: SAS Code

/* The analysis was done on 7 data sets:

(1) All subjects.

(2) Male subjects.

(3) Female subjects.

(4) Bladder cancer subjects.

(5) Control subjects.

(6) Male bladder cancer subjects.

(7) Male control subjects. */

/* libname Bladder 'T:\DCEG\Home\zhouw\CNV_Analysis\Bladder\GADA+BAF\association'; */

libname Bladder 'C:\Users\wyzhou22\Desktop\TAMU\STAT685\association';

proc format;
value agefmt 1='<=60' 2='61--65' 3='66--70' 4='71--75' 5='>=76';
run;

data Bladder.gada_baf_master_all;

set Bladder.gada_baf_master_1_3_1;

if AGE_DNA=. then AGE_GROUP=.;

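else if AGE_DNA <= 60 then AGE_GROUP=1;
/* groups 2-4 follow the agefmt ranges defined above */
else if AGE_DNA <= 65 then AGE_GROUP=2;
else if AGE_DNA <= 70 then AGE_GROUP=3;
else if AGE_DNA <= 75 then AGE_GROUP=4;
else if AGE_DNA >= 76 then AGE_GROUP=5;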

format AGE_GROUP agefmt.;

run;

proc print data=Bladder.gada_baf_master_all;

run;

/* (1) All Subjects */

/* SEX_Code: P-value=0.450 */

proc logistic desc data=Bladder.gada_baf_master_all;

model new_MOSAIC_EVENT_Y=SEX_Code /lackfit aggregate scale=none;

title 'Gender Main Effect Model on All Subjects';

run;

/* AGE_DNA: P-value=0.745 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=AGE_DNA /lackfit aggregate scale=none;

title 'Age Main Effect Model on All Subjects';

run;

/* AGE_DNA to 5 groups: overall P-value=0.922 */
proc logistic desc data=Bladder.gada_baf_master_all;
class AGE_GROUP(ref="<=60")/param=ref;

model new_MOSAIC_EVENT_Y=AGE_GROUP/lackfit aggregate scale=none;

title 'Age_group Main Effect Model on All Subjects';

run;

/* DNA_source: P-value=0.578 */
proc logistic desc data=Bladder.gada_baf_master_all;
class DNA_source;

model new_MOSAIC_EVENT_Y=DNA_source /lackfit aggregate scale=none;

title 'DNA_source Main Effect Model on All Subjects';

run;

/* STUDY_ID: P-value=0.377 */
proc logistic desc data=Bladder.gada_baf_master_all;
class STUDY_ID;

model new_MOSAIC_EVENT_Y=STUDY_ID /lackfit aggregate scale=none;

title 'Study Cohort Main Effect Model on All Subjects';

run;

/* STATUS_Code: P-value=0.027 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code /lackfit aggregate scale=none;

title 'Cancer Status Main Effect Model on All Subjects';

run;

/* STATUS_Code: P-value=0.029; SEX_Code: P-value=0.503 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code SEX_Code /lackfit aggregate scale=none;

title 'Cancer Status and Gender Main Effects Model on All Subjects';

run;

/* STATUS_Code: P-value=0.025; AGE_DNA: P-value=0.931 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA /lackfit aggregate scale=none;

title 'Cancer Status and Age Main Effects Model on All Subjects';

run;

/* STATUS_Code: P-value=0.026; SEX_Code: P-value=0.498; AGE_DNA: P-value=0.980 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA SEX_Code /lackfit aggregate scale=none;

title 'Cancer Status and Gender and Age Main Effects Model on All Subjects';

run;

/* STATUS_Code: P-value=0.036; AGE_DNA: P-value=0.11; STATUS_Code*AGE_DNA: P-value=0.062 */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA STATUS_Code*AGE_DNA/lackfit aggregate scale=none;

title 'Cancer Status and Age and 2-WAY Interaction Model on All Subjects';

run;

/* STATUS_Code*AGE_DNA: P-value=0.054, best fitted model */
proc logistic desc data=Bladder.gada_baf_master_all;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA SEX_Code STATUS_Code*AGE_DNA/lackfit aggregate scale=none;
title 'Cancer Status and Age and Gender and 2-WAY Interaction Model on All Subjects';
run;

/* AGE_DNA to 5 groups: STATUS_Code: P-value=0.025; AGE_GROUP: P-value=0.716; SEX_Code: P-value=0.456 */
proc logistic desc data=Bladder.gada_baf_master_all;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=STATUS_Code SEX_Code AGE_GROUP/lackfit aggregate scale=none;
title 'Cancer Status and Gender and Age_group Main Effects Model on All Subjects';

run;

/* AGE_DNA to 5 groups: STATUS_Code: P-value=0.0658; AGE_GROUP: P-value=0.434; SEX_Code: P-value=0.392; STATUS_Code*AGE_GROUP: P-value=0.596 */
proc logistic desc data=Bladder.gada_baf_master_all;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=STATUS_Code SEX_Code AGE_GROUP STATUS_Code*AGE_GROUP/lackfit aggregate scale=none;
title 'STATUS and Gender and AGE_GROUP and 2-WAY Interaction Model on All Subjects';

run;

/* (2) Male Only*/

data Bladder.gada_baf_master_male;

set Bladder.gada_baf_master_all;

if GENDER="MALE";

run;

proc print data=Bladder.gada_baf_master_male;

run;


proc logistic des data=Bladder.gada_baf_master_male;

model new_MOSAIC_EVENT_Y=STATUS_Code/lackfit aggregate scale=none;

title 'Cancer Status Main Effect Model on Male Subjects';

run;



proc logistic des data=Bladder.gada_baf_master_male;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA/lackfit aggregate scale=none;

title 'Cancer Status and Age Main Effects Model on Male Subjects';

run;


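/* ... P-value=0.0168 */
proc logistic des data=Bladder.gada_baf_master_male;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA STATUS_Code*AGE_DNA/lackfit aggregate scale=none;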
title 'Cancer Status and Age and 2-WAY Interaction Model on Male Subjects';

run;


/* STATUS_Code: P-value=0.052; AGE_GROUP: P-value=0.179; STATUS_Code*AGE_GROUP: P-value=0.495 */
proc logistic des data=Bladder.gada_baf_master_male;
class AGE_GROUP(ref="<=60")/param=ref;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_GROUP STATUS_Code*AGE_GROUP/lackfit aggregate scale=none;
title 'Cancer Status and Age_group and 2-WAY Interaction Model on Male Subjects';

run;

/* (3) Female Only*/

data Bladder.gada_baf_master_female;
/* assumed: same source data set as in (2) */
set Bladder.gada_baf_master_all;
if GENDER="FEMALE";

run;

proc print data=Bladder.gada_baf_master_female;

run;


proc logistic des data=Bladder.gada_baf_master_female;

model new_MOSAIC_EVENT_Y=STATUS_Code/lackfit aggregate scale=none;

title 'Cancer Status Main Effect Model on Female Subjects';

run;



proc logistic des data=Bladder.gada_baf_master_female;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA/lackfit aggregate scale=none;

title 'Cancer Status and Age Main Effects Model on Female Subjects';

run;


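/* ... P-value=0.27 */
proc logistic des data=Bladder.gada_baf_master_female;
model new_MOSAIC_EVENT_Y=STATUS_Code AGE_DNA STATUS_Code*AGE_DNA/lackfit aggregate scale=none;
title 'Cancer Status and Age and 2-WAY Interaction Model on Female Subjects';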

run;

/* (4) Bladder Cancer only */

data Bladder.gada_baf_master_case;
/* assumed: same source data set as in (2) */
set Bladder.gada_baf_master_all;
if STATUS="CASE";

run;

proc print data=Bladder.gada_baf_master_case;

run;

/* AGE_DNA: P-value=0.32 */


proc logistic des data=Bladder.gada_baf_master_case;

model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;

title 'Age Main Effect Model on Cancer Subjects';

run;

/* AGE_DNA: P-value=0.351; SEX_Code: P-value=0.260 */

proc logistic des data=Bladder.gada_baf_master_case;

model new_MOSAIC_EVENT_Y=AGE_DNA SEX_Code/lackfit aggregate scale=none;

title 'Age and Gender Main Effects Model on Cancer Subjects';

run;

/* (5) Control only */

data Bladder.gada_baf_master_control;
/* assumed: same source data set as in (2) */
set Bladder.gada_baf_master_all;
if STATUS="CONTROL";

run;

proc print data=Bladder.gada_baf_master_control;

run;


proc logistic des data=Bladder.gada_baf_master_control;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Control Subjects';

run;

/* AGE_DNA: P-value=0.119; SEX_Code: P-value=0.986 (table 7 control) */
proc logistic des data=Bladder.gada_baf_master_control;
model new_MOSAIC_EVENT_Y=AGE_DNA SEX_Code /lackfit aggregate scale=none;

title 'Age and Gender Main Effects Model on Control Subjects';

run;

/* AGE_DNA: P-value=0.309; SEX_Code: P-value=0.0761; AGE_DNA*SEX_Code: P-value=0.0799 */
proc logistic des data=Bladder.gada_baf_master_control;
model new_MOSAIC_EVENT_Y=AGE_DNA SEX_Code AGE_DNA*SEX_Code/lackfit aggregate scale=none;
title 'Age and Gender and 2-Way Interaction Model on Control Subjects';

run;

/* (6) Male Bladder Cancer Only */

data Bladder.gada_baf_master_male_case;

set Bladder.gada_baf_master_1_3_1;

if GENDER="MALE" and STATUS="CASE";

run;

proc print data=Bladder.gada_baf_master_male_case;

run;



proc logistic des data=Bladder.gada_baf_master_male_case;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Male Cancer Subjects';

run;

/* (7) Male Control Only */

data Bladder.gada_baf_master_male_control;
/* assumed: same source data set as in (6) */
set Bladder.gada_baf_master_1_3_1;
if STATUS="CONTROL" and GENDER="MALE";

run;

proc print data=Bladder.gada_baf_master_male_control;

run;




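proc logistic des data=Bladder.gada_baf_master_male_control;
model new_MOSAIC_EVENT_Y=AGE_DNA/lackfit aggregate scale=none;
title 'Age Main Effect Model on Male Control Subjects';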

run;
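
For readers who do not use SAS, the age grouping defined by the `agefmt` format at the top of this appendix can be sketched in Python. This is an illustrative equivalent only, not part of the analysis code: the function name `age_group` is hypothetical, ages are assumed to be whole years, and the cut points for groups 2 to 4 are taken from the format labels.

```python
AGE_LABELS = {1: "<=60", 2: "61--65", 3: "66--70", 4: "71--75", 5: ">=76"}


def age_group(age_dna):
    """Return the age-group code (1-5), or None when the age is missing."""
    if age_dna is None:
        return None
    # Upper bound (inclusive) of groups 1-4; anything older falls in group 5.
    for code, upper in ((1, 60), (2, 65), (3, 70), (4, 75)):
        if age_dna <= upper:
            return code
    return 5


for age in (58, 61, 70, 75, 76, None):
    code = age_group(age)
    print(age, AGE_LABELS.get(code, "missing"))
```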


Appendix B: Normalization Pipeline

## (1) Create GC model for Human610-Quadv1_B.bpm
qrun -V -l walltime=24:00:00 glu cnv.make_gcmodel -r $GENOME/hg19.fa \
    $MANIFEST/Human610-Quadv1_B.bpm \
    -o $GCMODELS/HumanOmni25M-8v1-1_B.gcm

## (2) Create GDAT file from Illumina’s genome file report file (gfr)
qrun -V -l walltime=24:0:0 glu -v convert.from_gfr $MANIFEST/Human610-Quadv1_B.bpm \
    $GFR/ATBC_Hap610_vi_working_FinalReport.txt.bz2 \
    -o $GDAT/ATBC_OmniEx_vi_working.gdat &
qrun -V -l walltime=24:0:0 glu -v convert.from_gfr $MANIFEST/Human610-Quadv1_B.bpm \
    $GFR/PLCO_Hap610_vi_working_FinalReport.txt.bz2 \
    -o $GDAT/PLCO_OmniEx_vi_working.gdat &
qrun -V -l walltime=24:0:0 glu -v convert.from_gfr $MANIFEST/Human610-Quadv1_B.bpm \
    $GFR/CPS_Hap610_vi_working_FinalReport.txt.bz2 \
    -o $GDAT/CPS_OmniEx_vi_working.gdat &

## (3) Normalization and re-estimate LRR and BAF
glu cnv.normalize --gcmodeldir=/CGF/Resources/Data/Infinium/GCmodels \
    $GDAT/ATBC_OmniEx_vi_working.gdat &
glu cnv.normalize --gcmodeldir=/CGF/Resources/Data/Infinium/GCmodels \
    $GDAT/PLCO_OmniEx_vi_working.gdat &
glu cnv.normalize --gcmodeldir=/CGF/Resources/Data/Infinium/GCmodels \
    $GDAT/CPS_OmniEx_vi_working.gdat &


Appendix C: GADA Segmentation Pipeline

## (1) run.sh
export GLU_PYTHON_VERSION="2.7"
export PYTHONPATH="/home/zhouw/src"

BASE=/home/zhouw/CNV_Analysis/Bladder/GADA
INCLUDE="--includes=$BASE/include_list/include.lst"
EXCLUDE="--excludes=$DEFS/redacted_samples.lst"
BATCHES=$BASE/work

mkdir -p $BATCHES 2> /dev/null
rm -Rf $BATCHES

# Generate the sample lists used for each batch (20 samples per batch)
python2.7 setup_batches.py $GDAT/index.db gdat.lst --outdir=$BATCHES --size=20 $EXCLUDE $INCLUDE

# Submit batches
for i in $BATCHES/*/*; do
    if [ \! -f $i/segments.txt ]; then
        echo $i
        qrun --nowait -l walltime=72:0:0 sh run_batch_GADA.sh $i
        #sleep 3
    fi
done

## (2) run.sh calls run_batch_GADA.sh to run GADA for each batch
#!/bin/bash
BPATH=$1
BATCH=`dirname $BPATH`
GDAT=$GDAT/`basename $BATCH`.gdat
BATCH=`basename $BPATH`
BASE=`pwd`

cd $BPATH
mkdir rawData SBL 2> /dev/null

# Generate files that can be used by GADA-R
glu convert.to_gada $GDAT --includesamples=sample.lst -o rawData/.txt

# Call GADA-R package
R --vanilla < $BASE/GADAKBJ.R

# Clean the data folders
rm -Rf rawData SBL

## (3) run_batch_GADA.sh calls GADAKBJ.R to run GADA segmentation
library(gadaKBJ)

# Select the working directory (it must contain a folder called rawData with the samples)
# in our case ...
setwd(".")

# Import data
data <- setupParGADA.B.deviation(NumCols=6, GenoCol=6, BAFcol=5, log2ratioCol=4, dev.type="all")

# Segmentation procedure and calling
parSBL(data, estim.sigma2=TRUE, aAlpha=0.85)
parBE.B.deviation(data, T=10, MinSegLen=100)

# Write segments to file
exportSegments2File(data, file="segments.txt")


Appendix D: BAF Segmentation Pipeline

## (1) config.sh
# GLU source tree
source /home/zhouw/src

# Tool location
export BSEG=/home/zhouw/BAFsegmentation-1.2.0

# Batch size
export SIZE=20

# GDAT files
export GDATS="$GDAT/ATBC_Hap610_v1_working.gdat $GDAT/PLCO_Hap610_v1_working.gdat $GDAT/CPS_Hap610_v1_working.gdat"
export INCLUDE=/home/zhouw/CNV_Analysis/Bladder/BAF/include_list/include.lst
export EXCLUDE=/home/zhouw/CNV_Analysis/Bladder/BAF/redacted_samples.lst

## (2) run.sh calls run_batch_BAF.sh to run BAF Segmentation for each batch
source config.sh

# Generate the batch sample list files
# Combine multiple excludes into one master list file
SAMLIST=/home/zhouw/CNV_Analysis/Bladder/BAF/sample_list

for GDAT in $GDATS; do
    DIR=`basename $GDAT | cut -f1 -d.`
    echo $DIR
    WD=/home/zhouw/CNV_Analysis/Bladder/BAF/$DIR
    SAMPLE=$WD/SAMPLE
    mkdir -p $WD/SAMPLE 2>/dev/null

    # Generate the sample lists used for each batch (20 samples per batch)
    python2.7 split.py --directory $WD/SAMPLE --size $SIZE --excludes $EXCLUDE --includes \
        $INCLUDE $SAMLIST/$DIR.lst

    # Submit batches
    for i in $WD/SAMPLE/*; do
        BATCH=`basename $i`
        mkdir $WD/$BATCH 2>/dev/null
        ./run_batch_BAF.sh $WD $BATCH $GDAT &
    done
done


## (3) run_batch_BAF.sh calls run_BAF.sh to run BAF Segmentation
source config.sh

export INPUT=$DIR/$BATCH/extracted
export PLOTS=$DIR/$BATCH/plots
export OUTPUT=$DIR/$BATCH/segmented
mkdir $INPUT $PLOTS $OUTPUT 2>/dev/null

if [ \! -s $OUTPUT/*.txt ]; then
    # Generate the per sample input file
    $RGLU_S convert.to_bseg $GDAT --includesamples $SAMPLE/$BATCH \
        -o $INPUT/.txt > $WD/convert.log 2>&1 &
    wait

    # Generate sample_names.txt table based on files in extracted folder
    echo "Assay Filename IGV_index" | tr ' ' "\t" > $INPUT/sample_names.txt
    index=1
    (for j in $INPUT/*.txt; do
        SAM=`basename $j | cut -f1 -d.`
        if [ "$SAM" != "sample_names" ]; then
            echo "$SAM $SAM.txt $index" | tr ' ' '\t' >> $INPUT/sample_names.txt
            index=`expr $index + 1`
        fi
    done) &
    wait

    # Submit batch
    qrun -l walltime=24:0:0 ./run_BAF.sh $INPUT $OUTPUT $PLOTS &
fi

# Take a quick nap
sleep 1
wait

## clean the data folders
rm -Rf $INPUT/* &

## (4) run_BAF.sh to run BAF Segmentation
source ./config.sh

INPUT=$1
OUTPUT=$2
PLOTS=$3

cd $BSEG
perl BAF_segment_samples.pl \
    --input_directory=$INPUT \
    --output_directory=$OUTPUT \
    --plot_directory=$PLOTS \
    --do_plot=FALSE --ai_size=45
cd -
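
Both segmentation pipelines split the sample list into batches of 20 before submitting jobs. The helper scripts that do this (`setup_batches.py` and `split.py`) are not reproduced in these appendices, so the following is only a sketch of the batching idea under assumed behavior: the function name `make_batches`, the include/exclude semantics, and the sample IDs are hypothetical.

```python
def make_batches(samples, size=20, includes=None, excludes=None):
    """Filter sample IDs, then split them into consecutive batches."""
    excludes = set(excludes or ())
    keep = [
        s for s in samples
        # Keep a sample if it is on the include list (when one is given)
        # and is not on the exclude list.
        if (includes is None or s in includes) and s not in excludes
    ]
    return [keep[i:i + size] for i in range(0, len(keep), size)]


samples = [f"S{i:03d}" for i in range(1, 46)]  # 45 made-up sample IDs
batches = make_batches(samples, size=20, excludes={"S003"})
print([len(b) for b in batches])
```

Each resulting batch would correspond to one working directory and one submitted job in the shell scripts above.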
