

International Journal of Scientific Research and Innovative Technology ISSN: 2313-3759 Vol. 2 No. 7; July 2015


Systematic Analysis for Identification of Genes Impacting Cancers

Arpita Singhal

Stanford University

Saint Francis High School

ABSTRACT Currently, vast amounts of molecular information involving genomic characterizations exist for

various types of cancers. However, the integration of the various forms of biological data, necessary for a

better understanding of the key processes underlying cancer, remains challenging. This project uses

microarray-based comparative genomic hybridization (aCGH) data to study genomic alterations in various tumor

samples, with the statistical procedures in R. To find the hidden copy number states of each chromosome to

characterize genomic alterations, this project utilizes Hidden Markov Models on datasets from cancer

patients. The efficacy of the homogeneous and heterogeneous Hidden Markov Models is evaluated against the

known truth of simulated data by looking at the true positive rates and false discovery rates for breakpoint

detection. This project mainly determines the number and types of copy number variations present in the

chromosomes of the tumor datasets, obtained from The Cancer Genome Atlas portal. Recurrent

chromosomal aberrations at particular genome locations may indicate the presence of tumor suppressor

genes or oncogenes. After recognizing the chromosomes with high copy number changes, genes causing these

high copy number variations are identified. The association between chromosomal location and cancer

phenotype provides a more reliable and informative cancer genome characterization that can lead to useful

insights into cancer biology for further disease classification, prognosis, and personalized treatment.

INTRODUCTION

A central issue in cancer biology is the identification of specific chromosomal regions that are

involved in cancer progression and other biological processes. Unbalanced chromosomal abnormalities, which

result in gains and losses of chromosomal segments, often cause human genetic disorders, including

cancer. Driven by an accumulation of genetic and epigenetic changes, tumors exhibit altered levels of gene

expression and the disruption of normal cell growth and survival. A variety of cancers exhibit gains in proto-

oncogenes and losses in tumor suppressor genes; thus, growth-limiting functions and self-repair processes of

cancerous regions are often seriously harmed. The genomic alterations, observed in tumors, reflect underlying

failures in the maintenance of genetic stability.

Copy Number Variations (CNVs)

Copy Number Variations collectively describe deletions, insertions, duplications, and other complex

variants present in the human genome. Redon et al. (2006) defined a CNV as a DNA segment of one kilobase

or larger that is present at a variable copy number in comparison to a reference genome. A CNV can be simple

in structure, such as a duplication, or it may involve complex gains or losses of homologous sequences at

multiple sites in the genome. Chromosomal copy numbers are defined to be 2 for normal cells, 1 or 0 for

single and double deletions, and 3 or higher for single copy gains or higher level amplifications. Figure 1

shows the various forms of chromosome changes. Cancer progression is usually a result of copy number

variations, which may represent the over-expression of proto-oncogenes or down-regulation of tumor

suppressor genes in cancer genomes. Structural variations, such as CNVs, influence the expression of different

phenotypic traits and are found to impact various diseases and affect the development of tumors. DNA copy

number variations are used in cancer research to search for novel genes involved in cancers through the

analysis of genes located in specific regions. Therefore, it is of considerable importance to identify as precisely

as possible the chromosomal regions with abnormal copy numbers.


Array CGH

Through the use of microarray based comparative genomic hybridization, the regions of genes with

altered copy numbers can be identified. This technique characterizes the relationship between target sequences

on an unknown test genome and reference genome. Array CGH has been developed to identify CNV

expression within cancerous regions. As an indispensable tool to understand disease mechanisms, aCGH

detects and maps changes in the copy number of DNA sequences and can be used to analyze tumor genomes

and chromosomal aberrations. The log-ratio values, obtained from the aCGH data, are used as the emissions in

the Hidden Markov Model, in order to find the hidden states representing the copy numbers of the

chromosomes.

This technique uses a test DNA sample, such as tumor genomic DNA, and a reference DNA sample,

such as normal genomic DNA, that are both labeled with different fluorescent dyes. The DNA samples are

then combined with unlabeled Cot-1 DNA, a reagent used to block repetitive DNA sequences and prevent

nonspecific hybridization. The two samples are hybridized together onto a microarray, and a microarray

scanner is used to measure the fluorescent signals and capture digital images. The fluorescence intensity

signals from labeled DNA, hybridized on target probes, are processed and normalized. The difference between

the intensity signals of each probe from the test and reference genomes is expressed as a log ratio and can be

analyzed to detect genomic alterations and aberrations.

In the ideal case, the log ratio is equal to 0, demonstrating that no copy number change has occurred in that

region of the genome; a higher or lower log ratio implies a change in copy number. The log ratio

therefore determines the copy number variation. It changes only with the test intensity,

because the reference copy number stays constant at 2, representing the normal diploid state of the reference

sample. When the tumor sample has no copy of a particular region of the chromosome, the value

log2(0/2) tends to negative infinity, indicating that the region has experienced a homozygous

deletion. When the copy number is equal to 1, the value log2(1/2) = -1 is observed, indicating that a

heterozygous deletion has occurred. When the tumor copy number is equal to 3, the log2 ratio is log2(3/2) =

0.585, implying that a single-copy gain (heterozygous duplication) has taken place. Lastly, when the tumor

copy number is equal to 4, the log2 ratio is log2(4/2) = 1, implying that a two-copy gain (homozygous

duplication) has occurred.
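
The arithmetic in this paragraph can be checked with a few lines of Python. This is only an illustrative sketch: the function name log2_ratio and the constant REFERENCE_COPIES are hypothetical and are not part of the R pipeline used in this project.

```python
import math

REFERENCE_COPIES = 2  # normal diploid reference, as stated in the text


def log2_ratio(tumor_copies, reference_copies=REFERENCE_COPIES):
    """Return log2(tumor / reference); zero tumor copies gives negative infinity."""
    if tumor_copies == 0:
        return float("-inf")
    return math.log2(tumor_copies / reference_copies)


# Ideal (noise-free) log ratios for copy numbers 0 through 4.
for copies in range(5):
    print(copies, log2_ratio(copies))
```

Real aCGH log ratios are noisy and rarely land exactly on these ideal values, which is why a statistical model is needed to recover the copy number.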

The array CGH data are further analyzed with appropriate statistical methods. A log ratio greater than 0

represents a higher number of target sequences in the test genome when compared to the reference genome;

conversely, a log ratio less than 0 indicates a lower number of target sequences in the test genome.

However, the complexity of eukaryotic genomes often causes the total signal of a microarray hybridization to

be diluted and makes aCGH data noisy and inappropriate in determining the accurate copy number of a region.

Thus, methods that can accurately use aCGH data must be implemented. The analysis of aCGH data can help

determine the location of DNA copy number aberrations within the tumor genome for improved cancer

diagnosis, drug development, and molecular therapy. A representation of the micro-array based comparative

genomic hybridization is shown in Figure 2.

With more array CGH data sets emerging, more efficient algorithms that detect regions of gains and

losses are necessary to provide an accurate estimate of error for the detection. The research conducted for this

study uses an algorithm to categorize the chromosomes based on the types of copy number aberrations to

accurately identify genes relevant to tumor progression. The objectives of this project are to (1) analyze cancer

genomic data in order to predict the hidden number states for each chromosomal region and (2) use the hidden

number states of each region to accurately identify proto-oncogenes and tumor suppressor genes.


General approach used in this project

The approach used in this project can be divided into the following steps, which are discussed in detail

later.

1. Upload data from Data Portal

2. Normalization of Data

3. Segmentation of data

4. Applying Hidden Markov model

5. Results

a. Comparing the Efficacy of the Hidden Markov Models for True Positive Rate (TPR) and False

Discovery Rate (FDR)

b. Detection of Genes through Analysis of Gains and Losses

Previous Approaches for Array CGH Data Analysis

With more array CGH data sets emerging, more efficient algorithms that detect regions of gains or

losses and provide an accurate estimate of error for the detection are necessary. Previously, researchers have

devised means for analyzing the array CGH data sets. Wang et al. (2004) used the method of Clustering Along

Chromosomes to detect the signal regions by depicting the spatial structure within genomic alterations. Olshen

et al. (2004) utilized circular binary segmentation to segment a chromosome into connecting regions and

illustrate a parametric model of the data with its use of a permutation reference distribution. However, these

methods do not take into account the various biological covariates, including the distance between clones, that

impact segmentation of the array CGH data. The research conducted for this study uses an algorithm to

categorize the chromosomes based on the types of copy number aberrations to accurately identify genes

relevant to tumor progression.

The Cancer Genome Atlas (TCGA)

The array CGH data used for this project was obtained from the TCGA data portal. This platform

allows access to data sets, and it provides various types of data, including clinical information, genomic

characterization data, and high level sequence analysis of the tumor genomes.

METHODS

Data

The data used for this project was obtained from the TCGA platform. GBM Level 1 Array CGH data,

from the Agilent Human Genome CGH Microarray 244A platform processed at the Harvard Medical School

Center, was downloaded from the TCGA data portal. Level 1 data represents raw signals per probe for each

participant’s tumor sample. All data sets were processed using the R packages Bioconductor (Gentleman,

2004), limma (Smyth and Speed, 2003), and snapCGH (Smith, 2009).

Data Normalization during Pre-Processing

Raw array CGH data often has many experimental and biological factors that make it difficult to

identify the true copy number for a genomic clone. Biological factors include the purity and ploidy of a

sample. In order to correct this issue, background correction and normalization techniques were performed on

each array. With normalization, the ploidy of the reference sample no longer played a role. The arrays were

normalized using the normalizeWithinArrays() function within the limma package. This function normalized

the expression log ratios for two-color spotted microarray experiments, so that the log ratios averaged to zero

within each array. The backgroundCorrect() function, also within the limma package, was used to correct the

background of the microarray expression intensities by subtracting the average signal intensity of the area

between spots.
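
The idea behind these two steps can be sketched in Python. This is a simplified illustration under stated assumptions, not the limma implementation: the function name normalized_log_ratios, its arguments, and the floor value are hypothetical, and median centering stands in for whichever normalization method was applied in the actual analysis.

```python
import math
from statistics import median


def normalized_log_ratios(test_fg, test_bg, ref_fg, ref_bg, floor=1.0):
    """Background-subtract both channels, then center the log2 ratios on zero."""
    ratios = []
    for tf, tb, rf, rb in zip(test_fg, test_bg, ref_fg, ref_bg):
        test = max(tf - tb, floor)  # floor avoids taking the log of zero or a negative value
        ref = max(rf - rb, floor)
        ratios.append(math.log2(test / ref))
    center = median(ratios)         # assumed centering statistic for this sketch
    return [r - center for r in ratios]


# Three hypothetical probes: foreground and background intensities per channel.
print(normalized_log_ratios([300, 500, 900], [100, 100, 100],
                            [200, 200, 200], [100, 100, 100]))
```

After centering, a probe with an unchanged copy number is expected to have a log ratio near zero regardless of overall dye or scanner differences between the two channels.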


Segmentation of Data

Each array CGH dataset was processed using the processCGH() function from the snapCGH package. This

function took the normalized MAList, which contained the log ratios and was created by the

normalization and background correction. It then ordered and filtered the clones based on their mapping

information. The datasets were then segmented. Using segmentation models, specific

segments were identified and the within-segment variance of the log ratio values was minimized.
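
One simple way to picture what minimizing the segment variance means is to search for the single breakpoint that makes both resulting segments as homogeneous as possible. The Python sketch below is a toy illustration only; the helper names within_ss and best_breakpoint are hypothetical, and this is not the procedure implemented in snapCGH.

```python
def within_ss(values):
    """Sum of squared deviations of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_breakpoint(log_ratios):
    """Return the index k that minimizes the total within-segment variation
    of the two segments log_ratios[:k] and log_ratios[k:]."""
    best_k, best_cost = None, float("inf")
    for k in range(1, len(log_ratios)):
        cost = within_ss(log_ratios[:k]) + within_ss(log_ratios[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Hypothetical ordered log ratios: a normal region followed by a gained region.
print(best_breakpoint([0.0, 0.1, -0.1, 1.0, 0.9, 1.1]))
```

A full segmentation method repeats or generalizes this search to find several breakpoints per chromosome and decides how many are supported by the data.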

Hidden Markov Models (HMMs)

Hidden Markov Models are a formal foundation for making probabilistic models of sequence labeling

problems (Eddy, 2004). An HMM consists of a finite set of states, each with an emission probability

distribution and specific transition probabilities to the other states. At each state, a residue is produced from the

state’s emission probability distribution. Then, the next state is chosen based on the state’s transition

probability distribution. The model thus generates two sets of information: the underlying state path, which is

created while transitioning from state to state and is hidden, and the observed sequence, which is the residue

emitted from each state in the state path. Because HMMs can effectively uncover the relationship between the

underlying states and the observed emissions, they are useful in analyzing array CGH data. The log ratios

obtained from the array CGH data are the emissions, and the underlying states are the copy number values of

each region on the chromosome and correspond to the emissions, based on specific probabilities.
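
To make the relationship between emissions and hidden states concrete, the following Python sketch decodes the most probable copy number path from a short series of log ratios with the Viterbi algorithm. Every parameter here (the three states, the emission means, the noise level SD, and the self-transition probability STAY) is an assumed value chosen for illustration; snapCGH estimates such quantities from the data rather than fixing them in advance.

```python
import math

# Hypothetical parameters chosen for illustration only.
STATES = [1, 2, 3]                     # hidden copy numbers
MEANS = {1: -1.0, 2: 0.0, 3: 0.585}    # ideal log2(copy / 2) for each state
SD = 0.2                               # assumed emission noise
STAY = 0.9                             # assumed probability of keeping the same state


def log_gauss(x, mean, sd=SD):
    """Log density of a normal distribution, used as the emission model."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_trans(prev, cur):
    """Log transition probability; identical at every position (homogeneous)."""
    p = STAY if prev == cur else (1 - STAY) / (len(STATES) - 1)
    return math.log(p)


def viterbi(log_ratios):
    """Return the most probable hidden copy number path for the log ratios."""
    # Start from a uniform distribution over the states.
    score = {s: -math.log(len(STATES)) + log_gauss(log_ratios[0], MEANS[s])
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        pointers, new_score = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            pointers[cur] = prev
            new_score[cur] = score[prev] + log_trans(prev, cur) + log_gauss(x, MEANS[cur])
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi([0.05, -0.1, 0.0, -0.9, -1.1, -1.0, 0.6, 0.5]))
```

The observed log ratios are the emissions, and the returned list is the hidden state path: one inferred copy number per probe.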

Two types of HMMs exist to identify the underlying states of the array CGH data, representing the

copy number aberrations: the homogeneous model and the heterogeneous model, which both have their own

distinct advantages. The former option, the homogeneous HMM, estimates the number of hidden states via

model selection and performs an analysis for each chromosome. It regards the underlying states as segments

of a common mean that represent the copy number values of each region. The homogeneous HMM assumes

that the transition probability matrix is the same at every clone and thus does not consider the distance

between clones. To fit the unsupervised homogeneous HMM for each dataset, methods in the Bioconductor

package snapCGH were used; the function runHomHMM() was used to discover the hidden copy numbers for

each chromosome from the patient datasets for GBM. On the other hand, the heterogeneous HMM utilizes

transition probabilities that are dependent on the distance between clones; furthermore, the probability of

remaining in the same hidden state is a decreasing function of the distance between one probe and the probe

before it. When the distance between two clones is very large, the state of a probe is not affected by the state

of the previous clone. The function runBioHMM() was used for the heterogeneous HMM. This project uses

both the homogeneous and heterogeneous HMMs to identify which one assesses the copy number variations

more accurately using simulated data and the corresponding True Positive Rates and False Discovery Rates.
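
The difference between the two models can be illustrated with a distance-dependent probability of staying in the same state. The exponential form, the function name stay_probability, and all default values below are assumptions made for this sketch; they are not the formula used by runBioHMM().

```python
import math


def stay_probability(distance_kb, base_stay=0.95, n_states=3, scale_kb=1000.0):
    """Probability of keeping the same hidden state across a gap of distance_kb."""
    memory = math.exp(-distance_kb / scale_kb)  # 1 for adjacent clones, near 0 far apart
    uniform = 1.0 / n_states                    # no memory: every state equally likely
    return memory * base_stay + (1.0 - memory) * uniform


for d in (0, 100, 1000, 10000):
    print(d, round(stay_probability(d), 3))
```

A homogeneous HMM corresponds to using the same value at every clone, whatever the gap. In the heterogeneous sketch the value decreases with distance until the previous clone carries no information about the next one.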

RESULTS

First, the efficacy of the homogeneous HMM was compared to that of the heterogeneous HMM. A

three-step algorithm was then used to identify the altered chromosomal regions in the cancer data. The three

steps of the algorithm consist of the data pre-processing and segmentation of data, the identification of the

hidden copy number states of the cancer data using the HMMs, and the quantification of the specific gains or

losses to detect the genes in the regions of interest. The three-step algorithm is applied on array CGH GBM

datasets for five different patients.

True Positive Rate and False Discovery Rate

The efficacy of the homogeneous and heterogeneous HMMs is evaluated against the known truth of

the simulated data by looking at the true positive and false discovery rates for breakpoint detection, as seen in

Figure 3. The data was simulated using the simulateData() method in the snapCGH package. This function

simulates aCGH data and was used to create 10 arrays to account for variation in copy number

data. The compareSegmentations() method was used to create a matrix consisting of the true positive rates


and the false discovery rates for each HMM; this function evaluates the performance of a segmentation

method against the known truth of the simulated data. The boxplot() function was used to generate a plot of the

rates to effectively compare the two HMMs. The true positive rates and the false discovery rates of both the

homogeneous and heterogeneous HMMs demonstrated that the heterogeneous HMM was more successful in

identifying the copy number values accurately.
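
The two rates used in this comparison can be written out explicitly. In the Python sketch below, the true positive rate is the fraction of true breakpoints that were recovered, and the false discovery rate is the fraction of called breakpoints that match no true breakpoint. The function name breakpoint_rates and the offset tolerance are assumptions of this sketch and do not reproduce the compareSegmentations() implementation.

```python
def breakpoint_rates(true_bps, called_bps, offset=0):
    """Return (TPR, FDR) for breakpoint detection.

    A called breakpoint counts as correct when it lies within `offset`
    probes of some true breakpoint.
    """
    def near(pos, others):
        return any(abs(pos - o) <= offset for o in others)

    found = sum(1 for t in true_bps if near(t, called_bps))            # true breakpoints recovered
    false_calls = sum(1 for c in called_bps if not near(c, true_bps))  # calls with no true match
    tpr = found / len(true_bps) if true_bps else 0.0
    fdr = false_calls / len(called_bps) if called_bps else 0.0
    return tpr, fdr


# Hypothetical probe indices of true and called breakpoints.
print(breakpoint_rates([10, 50, 90], [10, 52, 70], offset=2))
```

A better segmentation method has a higher true positive rate and a lower false discovery rate on the same simulated arrays.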

Normalization, Background Correction, and Segmentation

The first step, the data pre-processing, helped eliminate any background errors within the data using

normalization and background correction methods. These methods allowed for the next steps to become less

likely to experience error. Segmentation was carried out using the snapCGH package in R which first splits

each dataset into various segments based on the variation of copy number. Then the unsupervised HMM was

used to find the copy number states of each chromosome. After the segmentation, the smoothed log ratios for

each patient’s data were plotted, as shown in Figure 4. Each plot represents the dataset from a different

GBM patient and shows that patient’s log ratios plotted against genomic position in kilobases. The different colors

represent the twenty-four distinct chromosomes in the human genome (22 autosomes, X, and Y). These log ratios were used as the

emissions in the HMM, necessary for determining the copy number states of each patient.

Use of the Hidden Markov Models

Both the homogeneous HMM and heterogeneous HMM were used to identify the copy number states

of each chromosome for every patient; however, only results from the heterogeneous HMM are shown

because of its higher efficacy rates. Figure 5 displays the plots of the hidden states of each patient that were

found for each chromosome. The plot for Patient 1 shows copy number gains in autosomes

5 and 14 and sex chromosome Y, which is shown as chromosome 24; copy number losses

are observed in chromosomes 4 and 21. The plot of the states for Patient 2 demonstrates gains

in chromosomes 2, 4, 5, 7, 8, 9, 12, 14, 20, 21, and 23; losses are seen in autosomes

1, 16, and 18 and sex chromosome Y. The states of Patient 3 show few copy number changes:

autosomes 10 and 15 and sex chromosome Y have an increased copy number, and chromosome 1

has a decrease in copy number. On the other hand, the states of Patient 4 show greater copy number variance.

Autosomes 3, 12, 14, 15, 16, 17, and 22 and sex chromosomes X and Y all demonstrate a greater

copy number, and this patient has no losses in genetic data. Lastly, Patient 5 also has several copy number

gains in autosomes 2, 6, 7, 9, 12, 13, 14, 15, 20, and 22 and sex chromosome Y.

While it is important to consider the fact that each individual’s genome consists of several mutations

and some copy number variations, whole chromosomal aberrations are quite often indications of disease. The

identification of the copy number states of each chromosome in the genomes of cancer patients is useful for

identifying common chromosomes that may impact the progression of the Glioblastoma Multiforme tumor. If

observed in several tumors, genes can be identified as oncogenes or tumor suppressor genes through the

analysis of the specific chromosomal position. In addition, the individual variance in copy number of each

chromosome for each patient allows for personalized treatment.

Identification of Specific Gains or Losses

The third and final step was conducted by comparing the log ratio plots of the five patient samples, as

seen in Figure 4, and identifying the common regions with similar gains or losses and mapping those regions

to specific genes. While some datasets displayed a more drastic change in the log ratios as compared to the

other datasets, a majority of the datasets exhibited an elevated copy number at chromosomes 12 and Y and a

decreased copy number at chromosome 1. The chromosome numbers are identified through the heterogeneous

HMM analysis on single chromosomes. The variance in copy number among the different patients can be

attributed to the diversity of genomic data from individual to individual. While each patient’s genome may

represent common gains and losses, there are several external conditions that influence the expression of


regions of the genome, including the patient’s age and medical history. The gains or losses of certain

chromosomal regions were identified using the plots of the copy numbers that rely on the log ratios.
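
Finding regions shared by a majority of patients amounts to counting, for each chromosome, how many patients carry the alteration. The Python sketch below shows that counting step on made-up toy data; the function name recurrent, the threshold, and the example calls are hypothetical and do not reproduce the patient results reported here.

```python
from collections import Counter


def recurrent(calls_per_patient, min_fraction=0.5):
    """Return the chromosomes altered in more than min_fraction of the patients."""
    # set() ensures each patient contributes at most once per chromosome.
    counts = Counter(chrom for calls in calls_per_patient for chrom in set(calls))
    n = len(calls_per_patient)
    return sorted(c for c, k in counts.items() if k / n > min_fraction)


# Made-up gain calls for five hypothetical patients (chromosome labels as strings).
toy_gains = [["7", "9"], ["9", "X"], ["9"], ["7", "X"], ["X"]]
print(recurrent(toy_gains))
```

The same function can be applied separately to gains and to losses, and the recurrent regions can then be mapped to the genes they contain.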

DISCUSSION

This project aims to design an algorithm that can identify the copy number states for each

chromosome. Remarkably, the method yields interesting data for analysis. This project applies the methods on

Glioblastoma multiforme array CGH data to determine the copy number states for each chromosome. It also

efficiently matches the corresponding copy number gain or loss to a certain region of interest that may be

involved in the progression of the tumor. The results from this project can be used for improved and

personalized treatment by identifying genes that are up-regulated or under-expressed. Each data set obtained

from a different patient, while being affected by the same disease, has some differing log ratios and copy

numbers. The variance in copy number among patients may be due to factors, including environmental and

hereditary ones, that affect the log ratios and, thus, the copy number variations. For further research,

patient medical history, age, and other medical factors can be included in the study in order to more accurately

study chromosomal aberrations that are involved in GBM.

Some similar regions of interest were identified amongst the GBM patients. Most of the datasets

contain duplications at Chromosome 12. Using the GeneName data, the original names of the genes

contributing to the elevated copy number were found. Chromosome 12 contains the genes PDE3A and

ST8SIA1. PDE3A, or Phosphodiesterase 3A, plays a critical role in many cellular processes by regulating the

amplitude and duration of the intracellular cyclic nucleotide signals. ST8SIA1, or ST8 Alpha-N-Acetyl-

Neuraminide Alpha-2,8-Sialyltransferase 1, is important for cell adhesion and growth of malignant cells. The

dysregulation of these genes may contribute to the progression of cancer as these genes are important in

maintaining cell processes and seem to affect the growth of malignant cells. In chromosome 1, genes

AMY2A and KIFAP2 were under-expressed; this decrease in expression may have caused the cells to stop

functioning normally and thus encouraged tumor growth. Additionally, heterozygous and homozygous

duplications were seen near Chromosome 19, which contains the genes MLL4 and PSENEN. MLL4, or

Myeloid lymphoid or mixed-lineage leukemia 4, is most commonly seen in leukemia; however, it is often

amplified in tumor cell lines and may be involved in the formation of the GBM tumor. Also, some patients

had an increased copy number at chromosome 7, which represents the amplification in the Epidermal Growth

Factor Receptor (EGFR) gene that causes cells to grow and divide. EGFR is a highly prominent oncogene

present in various types of cancer, including GBM and Lung Cancers. In addition to the genes identified

across all samples, genes specific to certain patients can be used for more personalized treatment.

CONCLUSION

This project has successfully utilized array CGH data to discover various genes that may impact the

formation and progression of the GBM tumor in patients. The copy number phenotype discovered for each

cancer patient is associated with a known biological marker that may be associated with the progression of the

cancer, either by its overexpression or underexpression. If the gene is over-expressed, it is most likely an

oncogene that causes cells to grow and divide, as observed in cancers. When the gene is under-expressed, the

gene may be a cause of the tumor development because it is probably an important cell cycle gene that

suppresses the formation of tumors in cells. The resulting copy number phenotype, determined with the HMM

used in this project, is associated with biological markers that may be previously unassociated with the cancer

phenotype. This association will help provide the most reliable and informative genome characterization of

cancer and the development of more specialized disease classification, prognosis, and personalized treatment

for the cancer patient.

Since this algorithm has been used on Level 1 data, this project has successfully demonstrated the

analysis of the raw data by normalization, segmentation, and implementation of the HMMs to identify cancer

biomarkers for the development of a better and more personalized form of treatment for patients affected with


GBM. For further research, the algorithm used in this project can be used on more GBM datasets to more

successfully find the biological markers that may cause the formation of the brain tumor within the cancer

patients. Additionally, the algorithm used in this project can be utilized on other cancer types for a similar

analysis of cancer biomarkers. While incorporating this algorithm, other medical factors can be taken into

account to eliminate any interference in the study of the copy number variations. Further research can be

conducted that will standardize the data to incorporate factors, including the age and previous medical

conditions of the patient.

ACKNOWLEDGEMENTS

I am grateful to Professor Susan Holmes from the Statistics Department at Stanford University for her

valuable time, help, and guidance provided while I was conducting this project and taking the BioStatistics

course; Professor Trevor Martin for his help during the BioStatistics course; and Julia Fukuyama for her

advice on how to approach certain issues while using R. Also, my Physics-Honors teacher, Mrs. Segal,

provided me with valuable advice while conducting my project. In addition, I am very grateful to Dr. Sean

Davis, Staff Scientist at the Center for Cancer Research at the National Cancer Institute, for his valuable time

and feedback provided while conducting this project. Also, I am thankful to my parents for their continuous

support.

ANNOTATED BIBLIOGRAPHY

Eddy, Sean R. “What Is a Hidden Markov Model?” Nature.com. Nature Publishing Group, 2004. Web. 5 Oct.

2014.

This research article discusses the definition of a Hidden Markov Model. The author defines a Hidden

Markov Model as a formal foundation for making probabilistic models of sequences by considering transition

probabilities. His definition is central to this project, which uses Hidden Markov

Models to find the underlying states from the given emissions. Additionally, the author uses an example based

on genetic sequences. Through this example, he notes that the sequence, in terms of A, C, T, and G,

represents the observed emissions, and the underlying state path is hidden and must be inferred through the

use of a Hidden Markov Model, which contains transition probabilities.
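
As an illustrative aside, the decoding problem described above (recovering the hidden state path from observed A, C, G, and T emissions) can be sketched with the Viterbi algorithm. The sketch below is written in Python rather than R purely for illustration and is not the code used in this project. The two state names (AT_rich, GC_rich) and all probabilities are hypothetical values chosen for the example, not parameters taken from Eddy's article or from the tumor data.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for a sequence of emissions.

    Assumes every probability used is strictly positive (log(0) is undefined).
    """
    if not observations:
        return []
    # Work in log space so long sequences do not underflow.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the score of reaching s.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            current[s] = (prev[best] + math.log(trans_p[best][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = best
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: one state favors A/T, the other favors G/C.
states = ["AT_rich", "GC_rich"]
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

path = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path)
```

With these assumed parameters, the decoded path assigns the first four bases to the AT-rich state and the last four to the GC-rich state. In the copy number setting, the emissions would instead be log ratios and the hidden states would be copy number states, which requires continuous emission distributions rather than the discrete table used here.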

The author of this research article presents his research in a highly credible fashion since he first

defines the Hidden Markov Model and then provides examples supporting his definition. In addition, he

makes use of several sources from credible authors; for example, he cites Rabiner, who wrote a tutorial on

Hidden Markov Models. Dr. Sean R. Eddy works at Howard Hughes Medical Institute and the Department of

Genetics at Washington University School of Medicine. He has authored research papers that have used

Hidden Markov Models. Thus, he is a credible source as he has the knowledge necessary for defining and

demonstrating what a Hidden Markov Model is.

Olshen, A.B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. “Circular binary segmentation for the

analysis of array-based DNA copy number data.” Biostatistics (2004) 5 (4): 557-572,

doi:10.1093/biostatistics/kxh008.

The research paper, “Circular binary segmentation for the analysis of array-based DNA copy number

data,” discusses another approach for analyzing array CGH data. The authors use the

circular binary segmentation method to translate noisy intensity measurements into regions of equal copy

number. They applied this method to breast cancer test data, as well as to simulated data with known copy

number alterations, to test the efficacy of their new method. They thereby introduced another method

for analyzing array CGH data that detects regions of gains and losses based on the segments found by

their method.

The authors of this research paper present the research in a highly efficient and credible way, as they

demonstrate a new method and apply it to both simulated data and test data. Their method is one

approach for analyzing array CGH data to obtain the regions of copy number gain and loss. Dr.

Venkatraman is from the Department of Epidemiology and Biostatistics at the Memorial Sloan-Kettering

Cancer Center; his position gives him the credibility for conducting this research. The other two authors,

Robert Lucito and Michael Wigler, also have significant experience in the cancer field, as they conduct cancer

research at the Cold Spring Harbor Laboratory in New York.

Wang, P., Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. “A Method for Calling Gains and Losses in

Array CGH Data.” Biostatistics 6.1 (2004): 45-58. Web.

This research paper focuses on the development of a new method for detecting gains and losses in

array CGH data. The authors use clustering to identify crucial regions. They developed a new

algorithm, Cluster along Chromosomes (CLAC), to detect specific regions. CLAC builds hierarchical

clustering-style trees along each chromosome arm or chromosome and then selects the interesting clusters by

controlling the False Discovery Rate. They applied the method to a lung cancer microarray CGH data set.

Their clustering algorithm is iterative: it starts from clusters containing one gene each and

repeatedly merges two adjacent clusters until one large cluster is formed.

The authors of this research paper all work in different departments at Stanford University and thus

represent an interdisciplinary approach to this paper. The main author, Dr. Wang, works in the Statistics

Department and thus is extremely knowledgeable in this field. Their research provides a valuable insight into

another way of analyzing array CGH data, and underscores the necessity of analyzing array CGH data to find

the regions that have demonstrated gains or losses for better disease treatment in the future.

WORKS CITED

Albertson, D.G. and Daniel Pinkel, “Genomic microarrays in Human Genetic Disease and cancer.” Hum. Mol.

Genet. (2003) 12 (suppl 2): R145-R152, August 5, 2003, doi:10.1093/hmg/ddg261

Eddy, Sean R. “What Is a Hidden Markov Model?” Nature.com. Nature Publishing Group, 2004. Web. 5 Oct.

2014.

Gentleman, R.C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit,

Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang

Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J.

Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean Y. H. Yang, and Jianhua

Zhang. Bioconductor: Open software development for computational biology and bioinformatics.

Genome Biology, 5:R80, 2004.

Marioni, J.C., N.P. Thorne, S. Tavare, F. Radyanyi. BioHMM: A heterogeneous Hidden Markov Model for

segmenting array CGH data. Bioinformatics. 2006; 22:1144-1146.

Olshen, A.B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. “Circular binary segmentation for the

analysis of array-based DNA copy number data.” Biostatistics (2004) 5 (4): 557-572,

doi:10.1093/biostatistics/kxh008.

Rabiner, L.R., “A Tutorial on Hidden Markov Models and Selected Applications in Speech

Recognition.” Proceedings of the IEEE, Volume 77, February 1989, 257-285.

Smith, M.L., John C. Marioni, Steven McKinney, Thomas Hardcastle and Natalie P. Thorne

(2009). “snapCGH: Segmentation, normalisation and processing of aCGH data.” R package version

1.34.0.

Redon, Richard, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel Andrews, Heike

Fiegler, Michael H. Shapero, Andrew R. Carson, Wenwei Chen, Eun Kyung Cho, Stephanie Dallaire,

Jennifer L. Freeman, Juan R. González, Mònica Gratacòs, Jing Huang, Dimitrios Kalaitzopoulos,

Daisuke Komura, Jeffrey R. Macdonald, Christian R. Marshall, Rui Mei, Lyndal Montgomery,

Kunihiro Nishimura, Kohji Okamura, Fan Shen, Martin J. Somerville, Joelle Tchinda, Armand

Valsesia, Cara Woodwark, Fengtang Yang, Junjun Zhang, Tatiana Zerjal, Jane Zhang, Lluis Armengol,

Donald F. Conrad, Xavier Estivill, Chris Tyler-Smith, Nigel P. Carter, Hiroyuki Aburatani, Charles

Lee, Keith W. Jones, Stephen W. Scherer, and Matthew E. Hurles. “Global Variation in Copy Number

in the Human Genome.” Nature 444.7118 (2006): 444-54. Web.

Smyth, G.K. Limma: linear models for microarray data. In: ‘Bioinformatics and Computational Biology

Solutions using R and Bioconductor’. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds),

Springer, New York, pages 397-420, 2005. Web.

Wang, P., Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. “A Method for Calling Gains and Losses in

Array CGH Data.” Biostatistics 6.1 (2004): 45-58. Web.

Zhang, N. “DNA Copy Number Profiling in Normal and Tumor Genomes.” Frontiers in Computational and

Systems Biology. Vol. 15. London: Springer, 2010. 259-81. Web.

FIGURES

Figure 1: Forms of chromosome changes.

Figure 2: Schematic representation of array CGH.

Figure 3: Boxplots comparing the efficacy of the Hidden Markov Models

Figure 4: Log ratios for five patients, which were used as the emissions in the Hidden

Markov Models.

Figure 5: The states are identified with the Heterogeneous Hidden Markov Model for the five

patients, and they range from 0 to 5 for the chromosomes, depending on the patient.