International Journal of Scientific Research and Innovative Technology ISSN: 2313-3759 Vol. 2 No. 7; July 2015
Systematic Analysis for Identification of Genes Impacting Cancers
Arpita Singhal
Stanford University
Saint Francis High School
ABSTRACT
Currently, vast amounts of molecular information involving genomic characterizations exist for
various types of cancers. However, the integration of the various forms of biological data, necessary for a
better understanding of the key processes underlying cancer, remains challenging. This project uses
microarray-based comparative genomic hybridization (aCGH) data to study genomic alterations in various tumor
samples, using statistical procedures in R. To find the hidden copy number states of each chromosome and
characterize genomic alterations, this project utilizes Hidden Markov Models on datasets from cancer
patients. The efficacy of the homogeneous and heterogeneous Hidden Markov Models is evaluated against the
known truth of simulated data by looking at the true positive rates and false discovery rates for breakpoint
detection. This project mainly determines the number and types of copy number variations present in the
chromosomes of the tumor datasets, obtained from The Cancer Genome Atlas portal. Recurrent
chromosomal aberrations at particular genome locations may indicate the presence of tumor suppressor
genes or oncogenes. After recognizing the chromosomes with high copy number changes, genes causing these
high copy number variations are identified. The association between chromosomal location and cancer
phenotype provides a more reliable and informative cancer genome characterization that can lead to useful
insights into cancer biology for further disease classification, prognosis, and personalized treatment.
INTRODUCTION
A central issue in cancer biology is the identification of specific chromosomal regions that are
involved in cancer progression and other biological processes. Unbalanced chromosomal abnormalities, which
result in gains and losses of chromosomal segments, underlie several human genetic disorders, including
cancer. Driven by an accumulation of genetic and epigenetic changes, tumors exhibit altered levels of gene
expression and the disruption of normal cell growth and survival. A variety of cancers exhibit gains in
proto-oncogenes and losses in tumor suppressor genes; thus, growth-limiting functions and self-repair processes of
cancerous regions are often seriously harmed. The genomic alterations observed in tumors reflect underlying
failures in the maintenance of genetic stability.
Copy Number Variations (CNVs)
Copy Number Variations collectively describe deletions, insertions, duplications, and other complex
variants present in the human genome. Redon et al. (2006) defined a CNV as a DNA segment of one kilobase
or larger that is present at a variable copy number in comparison to a reference genome. A CNV can be simple
in structure, such as a duplication, or it may involve complex gains or losses of homologous sequences at
multiple sites in the genome. Chromosomal copy numbers are defined to be 2 for normal cells, 1 or 0 for
single and double deletions, and 3 or higher for single copy gains or higher level amplifications. Figure 1
shows the various forms of chromosome changes. Cancer progression is often associated with copy number
variations, which may lead to the over-expression of proto-oncogenes or down-regulation of tumor
suppressor genes in cancer genomes. Structural variations, such as CNVs, influence the expression of different
phenotypic traits and are found to impact various diseases and affect the development of tumors. DNA copy
number variations are used in cancer research to search for novel genes involved in cancers through the
analysis of genes located in specific regions. It is therefore of considerable importance to identify as precisely
as possible the chromosomal regions with abnormal copy numbers.
Array CGH
Through the use of microarray-based comparative genomic hybridization, genomic regions with
altered copy numbers can be identified. This technique characterizes the relationship between target sequences
in an unknown test genome and a reference genome. Array CGH has been developed to identify CNVs
within cancer genomes. As an indispensable tool to understand disease mechanisms, aCGH
detects and maps changes in the copy number of DNA sequences and can be used to analyze tumor genomes
and chromosomal aberrations. The log-ratio values, obtained from the aCGH data, are used as the emissions in
the Hidden Markov Model, in order to find the hidden states representing the copy numbers of the
chromosomes.
This technique uses a test DNA sample, such as tumor genomic DNA, and a reference DNA sample,
such as normal genomic DNA, that are both labeled with different fluorescent dyes. The DNA samples are
then combined with unlabeled Cot-1 DNA, a reagent used to block repetitive DNA sequences and prevent
nonspecific hybridization. The two samples are hybridized together onto a microarray, and a microarray
scanner is used to measure the fluorescent signals and capture digital images. The fluorescence intensity
signals from labeled DNA, hybridized on target probes, are processed and normalized. The difference between
the intensity signals of each probe from the test and reference genomes is expressed as a log ratio and can be
analyzed to detect genomic alterations and aberrations.
In the ideal case, the log ratio is equal to 0, demonstrating that no copy number change has occurred in that
region of the genome; a higher or lower log ratio implies a change in copy number. In this idealized
calculation, the log ratio varies only with the tumor copy number, while the reference copy number stays
constant at 2, representing the normal diploid state of the reference sample. When the tumor sample has no
copy of a particular region, the value log2(0/2) tends to negative infinity, indicating that the region has
experienced a homozygous deletion. When the tumor copy number is 1, log2(1/2) is equal to -1, indicating that a
heterozygous deletion has occurred. When the tumor copy number is 3, log2(3/2) is approximately 0.585,
implying that a heterozygous duplication has taken place. Lastly, when the tumor copy number is 4, log2(4/2)
is equal to 1, implying that a homozygous duplication has occurred.
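The idealized values above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a pure tumor sample and a diploid reference; the function name expected_log2_ratio is introduced here for illustration and is not part of the R workflow used in this project.

```python
import math

REFERENCE_COPIES = 2  # normal diploid reference sample


def expected_log2_ratio(tumor_copies, reference_copies=REFERENCE_COPIES):
    """Idealized log2 ratio for a pure tumor sample with the given copy number."""
    if tumor_copies == 0:
        # log2(0/2) diverges to negative infinity (homozygous deletion)
        return float("-inf")
    return math.log2(tumor_copies / reference_copies)


# Print the idealized log ratio for copy numbers 0 through 4
for copies in range(5):
    print(copies, f"{expected_log2_ratio(copies):.3f}")
```

Real aCGH log ratios deviate from these values because of noise and normal-cell contamination, which is why a statistical model is needed to assign copy number states.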
The array CGH data is further analyzed with appropriate statistical methods. A log ratio greater than 0
represents a higher number of target sequences in the test genome when compared to the reference genome;
conversely, a log ratio less than 0 indicates a lower number of target sequences in the test genome.
However, the complexity of eukaryotic genomes often causes the total signal of a microarray hybridization to
be diluted and makes aCGH data noisy, so raw log ratios alone are unreliable for determining the accurate copy number of a region.
Thus, methods that can accurately use aCGH data must be implemented. The analysis of aCGH data can help
determine the location of DNA copy number aberrations within the tumor genome for improved cancer
diagnosis, drug development, and molecular therapy. A representation of the micro-array based comparative
genomic hybridization is shown in Figure 2.
With more array CGH data sets emerging, more efficient algorithms that detect regions of gains and
losses are necessary to provide an accurate estimate of error for the detection. The research conducted for this
study uses an algorithm to categorize the chromosomes based on the types of copy number aberrations to
accurately identify genes relevant to tumor progression. The objectives of this project are to (1) analyze cancer
genomic data in order to predict the hidden copy number states for each chromosomal region and (2) use the hidden
copy number states of each region to accurately identify proto-oncogenes and tumor suppressor genes.
General approach used in this project
The approach used in this project can be divided into the following six steps, which are discussed in detail
later.
1. Upload data from Data Portal
2. Normalization of Data
3. Segmentation of data
4. Applying Hidden Markov model
5. Results
   a. Comparing the Efficacy of the Hidden Markov Models for True Positive Rate (TPR) and False Discovery Rate (FDR)
   b. Detection of Genes through Analysis of Gains and Losses
Previous Approaches for Array CGH Data Analysis
With more array CGH data sets emerging, more efficient algorithms that detect regions of gains or
losses and provide an accurate estimate of error for the detection are necessary. Previously, researchers have
devised means for analyzing the array CGH data sets. Wang et al. (2004) used the method of Clustering Along
Chromosomes to detect the signal regions by depicting the spatial structure within genomic alterations. Olshen
et al. (2004) utilized circular binary segmentation to segment a chromosome into connecting regions and
illustrate a parametric model of the data with its use of a permutation reference distribution. However, these
methods do not take into account the various biological covariates, including the distance between clones, that
impact segmentation of the array CGH data. The research conducted for this study uses an algorithm to
categorize the chromosomes based on the types of copy number aberrations to accurately identify genes
relevant to tumor progression.
The Cancer Genome Atlas (TCGA)
The array CGH data used for this project was obtained from the TCGA data portal. This platform
allows access to data sets, and it provides various types of data, including clinical information, genomic
characterization data, and high level sequence analysis of the tumor genomes.
METHODS
Data
The data used for this project was obtained from the TCGA platform. GBM Level 1 Array CGH data,
from the Agilent Human Genome CGH Microarray 244A platform processed at the Harvard Medical School
Center, was downloaded from the TCGA data portal. Level 1 data represents raw signals per probe for each
participant’s tumor sample. All data sets were processed in R using packages from the Bioconductor project
(Gentleman, 2004), including limma (Smyth and Speed, 2003) and snapCGH (Smith, 2009).
Data Normalization during Pre-Processing
Raw array CGH data often has many experimental and biological factors that make it difficult to
identify the true copy number for a genomic clone. Biological factors include the purity and ploidy of a
sample. In order to correct this issue, background correction and normalization techniques were performed on
each array. With normalization, the ploidy of the reference sample no longer played a role. The arrays were
normalized using the normalizeWithinArrays() function within the limma package. This function normalized
the expression log ratios for two-color spotted microarray experiments, so that the log ratios averaged to zero
within each array. The backgroundCorrect() function, also within the limma package, was used to correct the
background of the microarray expression intensities by subtracting the average signal intensity of the area
between spots.
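The two pre-processing ideas described above, background subtraction followed by centering the log ratios on zero, can be sketched in Python as follows. This is an illustration only: it substitutes simple median centering for the normalization methods offered by limma, and the function names and intensity values are hypothetical.

```python
import math
from statistics import median


def log_ratios(test_fg, test_bg, ref_fg, ref_bg, floor=1.0):
    """Subtract the background from each channel, then return log2(test/reference)."""
    ratios = []
    for tf, tb, rf, rb in zip(test_fg, test_bg, ref_fg, ref_bg):
        test = max(tf - tb, floor)  # floor avoids taking the log of zero or a negative value
        ref = max(rf - rb, floor)
        ratios.append(math.log2(test / ref))
    return ratios


def center_on_median(ratios):
    """Shift the log ratios so that the array median is zero (assumed simplification)."""
    m = median(ratios)
    return [r - m for r in ratios]


# Hypothetical foreground and background intensities for five probes
test_fg = [1100, 2100, 600, 1100, 1100]
test_bg = [100, 100, 100, 100, 100]
ref_fg = [600, 600, 600, 600, 600]
ref_bg = [100, 100, 100, 100, 100]

raw = log_ratios(test_fg, test_bg, ref_fg, ref_bg)
centered = center_on_median(raw)
print(raw)
print(centered)
```

Centering on the median rather than the mean is a design choice made for this sketch, since a median is less sensitive to a few strongly amplified or deleted probes.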
Segmentation of Data
Each array CGH was processed using the processCGH() function from the snapCGH package. This
function used the normalized MAList, which contained the log expression ratios and was created by the
normalization and background correction. It then ordered and filtered the clones based on the mapping
information of the log ratios. Thus, the datasets were segmented. Using segmentation models, specific
segments were identified and the segment variance of log ratio values was minimized.
Hidden Markov Models (HMMs)
Hidden Markov Models are a formal foundation for making probabilistic models of sequence labeling
problems (Eddy, 2004). An HMM consists of a finite set of states, with each state having an emission probability
distribution and specific transition probabilities to other states. At each state, a residue is produced from the
state’s emission probability distribution. Then, the next state is chosen based on the state’s transition
probability distribution. The model thus generates two sets of information: the underlying state path, which is
created while transitioning from state to state and is hidden, and the observed sequence, which is the residue
emitted from each state in the state path. Because HMMs can effectively uncover the relationship between the
underlying states and the observed emissions, they are useful in analyzing array CGH data. The log ratios
obtained from the array CGH data are the emissions, and the underlying states are the copy number values of
each region on the chromosome and correspond to the emissions, based on specific probabilities.
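To make the relationship between emissions and hidden states concrete, the following Python sketch decodes the most likely copy number path for a short series of log ratios using the Viterbi algorithm. It is not the snapCGH implementation: the three states, the Gaussian emission model, the standard deviation, and the probability of staying in a state are all assumptions chosen for illustration.

```python
import math

STATES = [1, 2, 3]                    # copy numbers: loss, normal, gain
MEANS = {1: -1.0, 2: 0.0, 3: 0.585}   # idealized log2(copies / 2) per state
SIGMA = 0.2                           # assumed emission standard deviation
STAY = 0.9                            # assumed probability of keeping the same state


def log_emission(x, state):
    """Log density of a Gaussian emission centered on the state's mean."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Homogeneous transitions: the same probabilities at every probe."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Return the most probable hidden state path for the observed log ratios."""
    score = {s: math.log(1 / len(STATES)) + log_emission(observations[0], s)
             for s in STATES}
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(x, cur))
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


observed = [0.05, -0.02, 0.10, 0.60, 0.55, 0.62, -0.95, -1.05, 0.0]
print(viterbi(observed))
```

The observed values play the role of the aCGH log ratios, and the returned path plays the role of the hidden copy number states along the chromosome.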
Two types of HMMs exist to identify the underlying states of the array CGH data, representing the
copy number aberrations: the homogeneous model and the heterogeneous model, which both have their own
distinct advantages. The former option, the homogeneous HMM, estimates the number of hidden states via
model selection and performs an analysis for each chromosome. It regards the underlying states as segments
of a common mean that represent the copy number values of each region. The homogeneous HMM assumes
that the transition probability matrix is the same at each position and thus does not consider the distance
between clones. To fit the unsupervised homogeneous HMM for each dataset, methods in the Bioconductor
package snapCGH were used; the function runHomHMM() was used to discover the hidden copy numbers for
each chromosome from the patient datasets for GBM. On the other hand, the heterogeneous HMM utilizes
transition probabilities that are dependent on the distance between clones; furthermore, the probability of
remaining in the same hidden state is a decreasing function of the distance between one probe and the probe
before it. When the distance between two clones is very large, the state of a probe is not affected by the state
of the previous clone. The function runBioHMM() was used for the heterogeneous HMM. This project uses
both the homogeneous and heterogeneous HMMs to identify which one assesses the copy number variations
more accurately using simulated data and the corresponding True Positive Rates and False Discovery Rates.
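The distance dependence of the heterogeneous model can be illustrated with a small Python sketch. The exponential form, the parameter values, and the function names below are assumptions made for illustration; the source does not state the functional form used by runBioHMM().

```python
import math


def stay_probability(distance_kb, base_stay=0.99, scale_kb=1000.0, n_states=3):
    """Probability of keeping the same state, decreasing with inter-clone distance.

    Assumed form: decays exponentially from base_stay (adjacent clones) toward
    1 / n_states (clones so far apart that the previous state carries no information).
    """
    floor = 1.0 / n_states
    return floor + (base_stay - floor) * math.exp(-distance_kb / scale_kb)


def transition_probabilities(distance_kb, n_states=3):
    """Return (P(stay), P(switch to one specific other state)) for a given distance."""
    stay = stay_probability(distance_kb, n_states=n_states)
    switch = (1.0 - stay) / (n_states - 1)
    return stay, switch


for d in (0, 100, 1000, 10000):
    stay, switch = transition_probabilities(d)
    print(d, round(stay, 4), round(switch, 4))
```

Under this assumed form, the stay and switch probabilities converge to the same value at very large distances, so the state of a probe no longer depends on the state of the previous clone, whereas a homogeneous model would use one fixed pair of values for every probe.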
RESULTS
First, the efficacy of the homogeneous HMM was compared to that of the heterogeneous HMM. A
three-step algorithm was then used to identify the altered chromosomal regions in the cancer data. The three
steps of the algorithm consist of the data pre-processing and segmentation of data, the identification of the
hidden copy number states of the cancer data using the HMMs, and the quantification of the specific gains or
losses to detect the genes in the regions of interest. The three-step algorithm is applied on array CGH GBM
datasets for five different patients.
True Positive Rate and False Discovery Rate
The efficacy of the homogeneous and heterogeneous HMMs is evaluated against the known truth of
the simulated data by looking at the true positive and false discovery rates for breakpoint detection, as seen in
Figure 3. The data was simulated using the simulateData() function in the snapCGH package, which
was used to create 10 simulated aCGH arrays to account for variation in copy number
data. The compareSegmentations() function was used to create a matrix consisting of the true positive rates
and the false discovery rates for each HMM; this function evaluates the performance of the segmentation
method to the known truth of the simulated data. The boxplot() function was used to generate a plot of the
rates to effectively compare the two HMMs. The true positive rates and the false discovery rates of both the
homogeneous and heterogeneous HMMs demonstrated that the heterogeneous HMM was more successful in
identifying the copy number values accurately.
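The two evaluation measures can be computed from a true and a called state sequence as in the Python sketch below. It assumes that a breakpoint is any position where the state changes, that the true positive rate is the fraction of true breakpoints that were detected, and that the false discovery rate is the fraction of called breakpoints that are not true; the exact definitions used by compareSegmentations() may differ, and the tolerance parameter is hypothetical.

```python
def breakpoints(states):
    """Positions where the state differs from the state at the previous probe."""
    return {i for i in range(1, len(states)) if states[i] != states[i - 1]}


def tpr_fdr(true_states, called_states, tolerance=0):
    """Return (true positive rate, false discovery rate) for breakpoint detection."""
    true_bp = breakpoints(true_states)
    called_bp = breakpoints(called_states)

    def near(position, others):
        return any(abs(position - other) <= tolerance for other in others)

    detected = sum(1 for b in true_bp if near(b, called_bp))
    false_calls = sum(1 for b in called_bp if not near(b, true_bp))
    tpr = detected / len(true_bp) if true_bp else 0.0
    fdr = false_calls / len(called_bp) if called_bp else 0.0
    return tpr, fdr


# Hypothetical simulated truth and the states called by a segmentation method
truth = [2, 2, 2, 3, 3, 3, 1, 1, 2, 2]
called = [2, 2, 2, 3, 3, 3, 3, 1, 2, 1]

print(tpr_fdr(truth, called))               # exact position matches only
print(tpr_fdr(truth, called, tolerance=1))  # allow a one-probe offset
```

A method that calls many breakpoints can reach a high true positive rate at the cost of a high false discovery rate, which is why the two rates are compared together.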
Normalization, Background Correction, and Segmentation
The first step, the data pre-processing, helped eliminate any background errors within the data using
normalization and background correction methods. These methods allowed for the next steps to become less
likely to experience error. Segmentation was carried out using the snapCGH package in R which first splits
each dataset into various segments based on the variation of copy number. Then the unsupervised HMM was
used to find the copy number states of each chromosome. After the segmentation, the smoothed log ratios for
each patient’s data were plotted, as shown in Figure 4. Each panel represents the dataset from a different
GBM patient and shows that patient’s log ratios plotted against genomic position in kilobases. The different colors
represent the twenty-four distinct human chromosomes (22 autosomes, X, and Y). These log ratios were used as the
emissions in the HMM, necessary for determining the copy number states of each patient.
Use of the Hidden Markov Models
Both the homogeneous HMM and heterogeneous HMM were used to identify the copy number states
of each chromosome for every patient; however, only results from the heterogeneous HMM are shown
because of its higher efficacy rates. Figure 5 displays the plots of the hidden states of each patient that were
found for each chromosome. The plot for Patient 1 shows copy number gains in
autosomes 5 and 14 and sex chromosome Y, which is shown as chromosome 24; copy number
losses are observed in chromosomes 4 and 21. The plot of the states for Patient 2 shows
gains in chromosomes 2, 4, 5, 7, 8, 9, 12, 14, 20, 21, and 23 (the X chromosome); losses are seen in
autosomes 1, 16, and 18 and sex chromosome Y. The states of Patient 3 show few copy number changes:
autosomes 10 and 15 and sex chromosome Y have an increased copy number, and chromosome 1
has a decrease in copy number. On the other hand, the states of Patient 4 show greater copy number variance.
Autosomes 3, 12, 14, 15, 16, 17, and 22 and sex chromosomes X and Y all demonstrate a greater
copy number, and this patient has no copy number losses. Lastly, Patient 5 also has several copy number
gains in autosomes 2, 6, 7, 9, 12, 13, 14, 15, 20, and 22 and sex chromosome Y.
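The per-patient calls listed above can be tallied to find chromosomes that recur across patients. The Python sketch below encodes the gains and losses exactly as reported in the text (X as 23, Y as 24); the recurrence thresholds and the function name are assumptions for illustration, and no losses are listed for Patient 5 in the text.

```python
from collections import Counter

# Chromosome-level calls as reported in the text (23 = X, 24 = Y)
reported_gains = {
    "Patient 1": [5, 14, 24],
    "Patient 2": [2, 4, 5, 7, 8, 9, 12, 14, 20, 21, 23],
    "Patient 3": [10, 15, 24],
    "Patient 4": [3, 12, 14, 15, 16, 17, 22, 23, 24],
    "Patient 5": [2, 6, 7, 9, 12, 13, 14, 15, 20, 22, 24],
}
reported_losses = {
    "Patient 1": [4, 21],
    "Patient 2": [1, 16, 18, 24],
    "Patient 3": [1],
    "Patient 4": [],
    "Patient 5": [],  # no losses are listed for this patient in the text
}


def recurrent(calls_by_patient, min_patients):
    """Chromosomes called in at least min_patients patients, in ascending order."""
    counts = Counter(c for calls in calls_by_patient.values() for c in set(calls))
    return sorted(c for c, n in counts.items() if n >= min_patients)


print(recurrent(reported_gains, min_patients=3))
print(recurrent(reported_losses, min_patients=2))
```

With these lists, chromosomes 12 and Y are among the gains shared by at least three patients, and chromosome 1 is the only loss shared by two patients.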
While it is important to consider the fact that each individual’s genome consists of several mutations
and some copy number variations, whole chromosomal aberrations are quite often indications of disease. The
identification of the copy number states of each chromosome in the genomes of cancer patients is useful for
identifying common chromosomes that may impact the progression of the Glioblastoma Multiforme tumor. If
observed in several tumors, genes can be identified as oncogenes or tumor suppressor genes through the
analysis of the specific chromosomal position. In addition, the individual variance in copy number of each
chromosome for each patient allows for personalized treatment.
Identification of Specific Gains or Losses
The third and final step was conducted by comparing the log ratio plots of the five patient samples, as
seen in Figure 4, and identifying the common regions with similar gains or losses and mapping those regions
to specific genes. While some datasets displayed a more drastic change in the log ratios as compared to the
other datasets, a majority of the datasets exhibited an elevated copy number at chromosomes 12 and Y and a
decreased copy number at chromosome 1. The chromosome numbers are identified through the heterogeneous
HMM analysis on single chromosomes. The variance in copy number among the different patients can be
attributed to the diversity of genomic data from individual to individual. While each patient’s genome may
represent common gains and losses, there are several external conditions that influence the expression of
regions of the genome, including the patient’s age and medical history. The gains or losses of certain
chromosomal regions were identified using the plots of the copy numbers that rely on the log ratios.
DISCUSSION
This project aims to design an algorithm that can identify the copy number states for each
chromosome. Remarkably, the method yields interesting data for analysis. This project applies the methods on
Glioblastoma multiforme array CGH data to figure out the copy number states for each chromosome. It also
efficiently matches the corresponding copy number gain or loss to a certain region of interest that may be
involved in the progression of the tumor. The results from this project can be used for improved and
personalized treatment by identifying genes that are up-regulated or under-expressed. Each data set obtained
from a different patient, while being affected by the same disease, has some differing log ratios and copy
numbers. The variance in copy number among patients may be due to factors, including environmental and
hereditary influences, that impact the log ratios and, thus, the copy number variations. For further research,
patient medical history, age, and other medical factors can be included in the study in order to more accurately
study chromosomal aberrations that are involved in GBM.
Some similar regions of interest were identified amongst the GBM patients. Most of the datasets
contain duplications at Chromosome 12. Using the GeneName data, the original names of the genes
contributing to the elevated copy number were found. Chromosome 12 contains the genes PDE3A and
ST8SIA1. PDE3A, or Phosphodiesterase 3A, plays a critical role in many cellular processes by regulating the
amplitude and duration of the intracellular cyclic nucleotide signals. ST8SIA1, or ST8 Alpha-N-Acetyl-Neuraminide
Alpha-2,8-Sialyltransferase 1, is important for cell adhesion and growth of malignant cells. The
dysregulation of these genes may contribute to the progression of cancer as these genes are important in
maintaining cell processes and seem to affect the growth of malignant cells. In chromosome 1, genes
AMY2A and KIFAP2 were under-expressed; this decrease in expression may have caused the cells to stop
functioning normally and thus encouraged tumor growth. Additionally, heterozygous and homozygous
duplications were seen near Chromosome 19, which contains the genes MLL4 and PSENEN. MLL4, or
Myeloid lymphoid or mixed-lineage leukemia 4, is most commonly seen in leukemia; however, it is often
amplified in tumor cell lines and may be involved in the formation of the GBM tumor. Also, some patients
had an increased copy number at chromosome 7, which represents the amplification in the Epidermal Growth
Factor Receptor (EGFR) gene that causes cells to grow and divide. EGFR is a highly prominent oncogene
present in various types of cancer, including GBM and Lung Cancers. In addition to the genes identified
across all samples, genes specific to certain patients can be used for more personalized treatment.
CONCLUSION
This project has successfully utilized array CGH data to discover various genes that may impact the
formation and progression of the GBM tumor in patients. The copy number phenotype discovered for each
cancer patient is associated with a known biological marker that may be associated with the progression of the
cancer, either by its overexpression or underexpression. If the gene is over-expressed, it is most likely an
oncogene that causes cells to grow and divide, as observed in cancers. When the gene is under-expressed, the
gene may be a cause of the tumor development because it is probably an important cell cycle gene that
suppresses the formation of tumors in cells. The resulting copy number phenotype, determined with the HMM
used in this project, is associated with biological markers that may be previously unassociated with the cancer
phenotype. This association will help provide the most reliable and informative genome characterization of
cancer and the development of more specialized disease classification, prognosis, and personalized treatment
for the cancer patient.
Since this algorithm has been used on Level 1 data, this project has successfully demonstrated the
analysis of the raw data by normalization, segmentation, and implementation of the HMMs to identify cancer
biomarkers for the development of a better and more personalized form of treatment for patients affected with
GBM. For further research, the algorithm used in this project can be used on more GBM datasets to more
successfully find the biological markers that may cause the formation of the brain tumor within the cancer
patients. Additionally, the algorithm used in this project can be utilized on other cancer types for a similar
analysis of cancer biomarkers. While incorporating this algorithm, other medical factors can be taken into
account to eliminate any interference in the study of the copy number variations. Further research can be
conducted that will standardize the data to incorporate factors, including the age and previous medical
conditions of the patient.
ACKNOWLEDGEMENTS
I am grateful to Professor Susan Holmes from the Statistics Department at Stanford University for her
valuable time, help, and guidance provided while I was conducting this project and taking the BioStatistics
course; Professor Trevor Martin for his help during the BioStatistics course; and Julia Fukuyama for her
advice on how to approach certain issues while using R. Also, my Physics-Honors teacher, Mrs. Segal,
provided me with valuable advice while conducting my project. In addition, I am very grateful to Dr. Sean
Davis, Staff Scientist at the Center for Cancer Research at the National Cancer Institute, for his valuable time
and feedback provided while conducting this project. Also, I am thankful to my parents for their continuous
support.
ANNOTATED BIBLIOGRAPHY
Eddy, Sean R. “What Is a Hidden Markov Model.” Nature.com. Nature Publishing Group, 2004. Web. 5 Oct.
2014.
This research article discusses the definition of a Hidden Markov Model. The author defines a Hidden
Markov Model as a formal foundation for making probabilistic models of sequences by considering transition
probabilities. His definition really encompasses the significance of this project, which uses Hidden Markov
Models to find the underlying states from the given emissions. Additionally, the author uses examples based
on genetic sequences. Through this example, he notes that the sequence, in terms of A, C, T, and G,
represents the observed emissions, and the underlying state path is hidden and must be discovered through the
use of the Hidden Markov Models, which contain transition probabilities.
The author of this research article presents his research in a highly credible fashion since he first
defines the Hidden Markov Model and then provides examples supporting his definition. In addition, he
makes use of several sources from credible authors; for example, he cited Rabiner who conducted a tutorial on
Hidden Markov Models. Dr. Sean R. Eddy works at Howard Hughes Medical Institute and the Department of
Genetics at Washington University School of Medicine. He has authored research papers that have used
Hidden Markov Models. Thus, he is a credible source as he has the knowledge necessary for defining and
demonstrating what a Hidden Markov Model is.
Olshen, A.B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. “Circular binary segmentation for the
analysis of array-based DNA copy number data.” Biostat (2004) 5 (4): 557-572,
doi:10.1093/biostatistics/kxh008.
The research paper, “Circular binary segmentation for the analysis of array-based DNA copy number
data,” discusses another approach for analyzing array CGH data. They have utilized array CGH data and
circular binary segmentation method to translate noisy intensity measurements into regions of equal copy
number. They have applied this method on test breast cancer data, as well as simulated data with known copy
number alterations to test the efficacy of their new method. They have effectively discovered another method
for analyzing array CGH data to detect regions of gains and losses based on the segments that they found with
their method.
The authors of this research paper present their research in a clear and credible way, as they
demonstrate a new method and apply it to both simulated data and test data. Their method is one
approach for analyzing array CGH data to obtain the regions of gain and loss. Dr.
Venkatraman is from the Department of Epidemiology and Biostatistics at the Memorial Sloan-Kettering
Cancer Center; his position gives him the credibility to conduct this research. The other two authors,
Robert Lucito and Michael Wigler, also have significant experience in the cancer field, as they conduct cancer
research at the Cold Spring Harbor Laboratory in New York.
Wang, P., Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. “A Method for Calling Gains and Losses in
Array CGH Data.” Biostatistics 6.1 (2004): 45-58. Web.
This research paper focuses on the development of a new method for detecting gains and losses in
array CGH data. The authors use clustering to identify crucial regions. They have developed a new
algorithm, Clustering along Chromosomes (CLAC), to detect specific regions. CLAC builds hierarchical
clustering-style trees along each chromosome arm or chromosome and then selects the interesting clusters by
controlling the False Discovery Rate. They have applied the method to a lung cancer microarray CGH data set.
Their clustering algorithm is iterative: it starts with one gene in each cluster and then repeatedly
merges two adjacent clusters until a single large cluster is formed.
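The iterative merging described above can be sketched in Python as follows. This is a simplified illustration, not the CLAC implementation: the function name `cluster_along_chromosome` is hypothetical, the choice to merge the adjacent pair with the smallest absolute difference in mean log ratio is an assumption made for the sketch, and the False Discovery Rate step that selects interesting clusters is not shown.

```python
def cluster_along_chromosome(log_ratios):
    """Build a merge history by repeatedly joining adjacent clusters.

    Each cluster is stored as (start, end, sum) over the half-open
    index range [start, end). Returns the list of merges in order,
    each as a pair of (start, end) ranges.
    """
    # Start with one probe (gene) per cluster.
    clusters = [(i, i + 1, value) for i, value in enumerate(log_ratios)]
    merges = []

    def mean(cluster):
        return cluster[2] / (cluster[1] - cluster[0])

    while len(clusters) > 1:
        # Assumed similarity rule: merge the adjacent pair whose means are closest.
        k = min(
            range(len(clusters) - 1),
            key=lambda t: abs(mean(clusters[t]) - mean(clusters[t + 1])),
        )
        left, right = clusters[k], clusters[k + 1]
        merges.append((left[:2], right[:2]))
        clusters[k:k + 2] = [(left[0], right[1], left[2] + right[2])]
    return merges


# Toy chromosome: two probes near 0 followed by two probes near 1.
history = cluster_along_chromosome([0.0, 0.1, 1.0, 1.2])
```

In this toy example the two low probes and the two high probes are each joined first, and the last merge joins the two resulting clusters, so the top split of the tree separates the unchanged region from the gained region.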
The authors of this research paper all work in different departments at Stanford University and thus
represent an interdisciplinary approach to this paper. The main author, Dr. Wang, works in the Statistics
Department and thus is extremely knowledgeable in this field. Their research provides a valuable insight into
another way of analyzing array CGH data, and underscores the necessity of analyzing array CGH data to find
the regions that have demonstrated gains or losses for better disease treatment in the future.
WORKS CITED
Albertson, D.G. and Daniel Pinkel. “Genomic microarrays in Human Genetic Disease and cancer.” Hum. Mol.
Genet. (2003) 12 (suppl 2): R145-R152, August 5, 2003, doi:10.1093/hmg/ddg261
Eddy, Sean R. “What Is a Hidden Markov Model?” Nature.com. Nature Publishing Group, 2004. Web. 5 Oct.
2014.
Gentleman, R.C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit,
Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang
Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J.
Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean Y. H. Yang, and Jianhua
Zhang. Bioconductor: Open software development for computational biology and bioinformatics.
Genome Biology, 5:R80, 2004.
Marioni, J.C., N.P. Thorne, S. Tavare, F. Radyanyi. BioHMM: A heterogeneous Hidden Markov Model for
Segmenting array CGH data. Bioinformatics. 2006; 22:1144-1146.
Olshen, A.B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. “Circular binary segmentation for the
analysis of array-based DNA copy number data.” Biostatistics (2004) 5 (4): 557-572,
doi:10.1093/biostatistics/kxh008.
Rabiner, L.R. “A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition.” Proceedings of the IEEE, Volume 77, February 1989, 257-285.
Smith, M.L., John C. Marioni, Steven McKinney, Thomas Hardcastle, and Natalie P. Thorne
(2009). “snapCGH: Segmentation, normalisation and processing of aCGH data.” R package version
1.34.0.
Redon, Richard, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel Andrews, Heike
Fiegler, Michael H. Shapero, Andrew R. Carson, Wenwei Chen, Eun Kyung Cho, Stephanie Dallaire,
Jennifer L. Freeman, Juan R. González, Mònica Gratacòs, Jing Huang, Dimitrios Kalaitzopoulos,
Daisuke Komura, Jeffrey R. Macdonald, Christian R. Marshall, Rui Mei, Lyndal Montgomery,
Kunihiro Nishimura, Kohji Okamura, Fan Shen, Martin J. Somerville, Joelle Tchinda, Armand
Valsesia, Cara Woodwark, Fengtang Yang, Junjun Zhang, Tatiana Zerjal, Jane Zhang, Lluis Armengol,
Donald F. Conrad, Xavier Estivill, Chris Tyler-Smith, Nigel P. Carter, Hiroyuki Aburatani, Charles
Lee, Keith W. Jones, Stephen W. Scherer, and Matthew E. Hurles. "Global Variation in Copy Number
in the Human Genome." Nature 444.7118 (2006): 444-54. Web.
Smyth, G.K. Limma: linear models for microarray data. In: ‘Bioinformatics and Computational Biology
Solutions using R and Bioconductor’. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds),
Springer, New York, pages 397-420, 2005. Web.
Wang, P., Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. “A Method for Calling Gains and Losses in
Array CGH Data.” Biostatistics 6.1 (2004): 45-58. Web.
Zhang, N. “DNA Copy Number Profiling in Normal and Tumor Genomes.” Frontiers in Computational and
Systems Biology. Vol. 15. London: Springer, 2010. 259-81. Web.
FIGURES
Figure 1: Forms of chromosome changes.
Figure 2: Schematic representation of array CGH.
Figure 3: Boxplots comparing the efficacy of the Hidden Markov Models
Figure 4: Log ratios for five patients, which were used as the emissions in the Hidden
Markov Models.
Figure 5: The states identified with the heterogeneous Hidden Markov Model for the five
patients. The states range from 0 to 5 across the chromosomes, depending on the patient.