in silico analysis of microarray data for breast cancer
7/29/2019 IN SILICO ANALYSIS OF MICROARRAY DATA FOR BREAST CANCER
INTRODUCTION
Cancer is a class of diseases or disorders characterized by uncontrolled division of cells and
the ability of these to spread, either by direct growth into adjacent tissue through invasion, or
by implantation into distant sites by metastasis (where cancer cells are transported through
the bloodstream or lymphatic system). Cancer may affect people at all ages, but risk tends to
increase with age. It is one of the principal causes of death in developed countries.
There are many types of cancer. Severity of symptoms depends on the site and character of
the malignancy and whether there is metastasis. A definitive diagnosis usually requires the
histological examination of tissue by a pathologist. This tissue is obtained by biopsy or
surgery. Most cancers can be treated and some cured, depending on the specific type,
location, and stage. Once diagnosed, cancer is usually treated with a combination of surgery,
chemotherapy and radiotherapy. As research develops, treatments are becoming more
specific for the type of cancer pathology. Drugs that target specific cancers already exist for
several types of cancer. If untreated, cancers may eventually cause illness and death, though
this is not always the case.
The unregulated growth that characterizes cancer is caused by damage to DNA, resulting in
mutations to genes that encode proteins controlling cell division. Many mutation events
may be required to transform a normal cell into a malignant cell. These mutations can be
caused by radiation, chemicals or physical agents that cause cancer, which are called
carcinogens, or by certain viruses that can insert their DNA into the human genome.
Mutations also occur spontaneously, and those arising within the germ line may be passed
down from one generation to the next. However, some carcinogens also appear to
work through non-mutagenic pathways that affect the level of transcription of certain genes
without causing genetic mutation. Many forms of cancer are associated with exposure to
environmental factors such as tobacco smoke, radiation, alcohol, and certain viruses. Some
risk factors can be avoided or reduced.
Breast cancer is a cancer originating from breast tissue, most commonly from the inner lining
of milk ducts or the lobules that supply the ducts with milk. The primary risk factors for
breast cancer include sex, age, lack of childbearing or breastfeeding, high hormone levels and
race. Breast cancer is classified by stage (TNM: tumor, nodes, metastasis),
grade, receptor status and the presence or absence of genes as determined by DNA testing.
Breast cancer, like other cancers, occurs because of an interaction between the environment
and a defective gene. Normal cells divide as many times as needed and stop. They attach to
other cells and stay in place in tissues. Cells become cancerous when mutations destroy their
ability to stop dividing, to attach to other cells and to stay where they belong. When cells
divide, their DNA is normally copied with many mistakes. Error-correcting proteins fix those
mistakes. The mutations known to cause cancer, such as those in p53, BRCA1 and BRCA2, occur in
the error-correcting mechanisms. These mutations are either inherited or acquired after birth.
Presumably, they allow the other mutations, which allow uncontrolled division, lack of
attachment, and metastasis to distant organs. Normal cells will commit cell suicide
(apoptosis) when they are no longer needed. Until then, they are protected from cell suicide
by several protein clusters and pathways. One of the protective pathways is the PI3K/AKT
pathway; another is the RAS/MEK/ERK pathway. Sometimes the genes along these
protective pathways are mutated in a way that turns them permanently "on", rendering the
cell incapable of committing suicide when it is no longer needed. This is one of the steps that
cause cancer in combination with other mutations. Normally, the PTEN protein turns off the
PI3K/AKT pathway when the cell is ready for cell suicide. In some breast cancers, the gene
for the PTEN protein is mutated, so the PI3K/AKT pathway is stuck in the "on" position, and
the cancer cell does not commit suicide.
Mutations that can lead to breast cancer have been experimentally linked to estrogen
exposure.
Failure of immune surveillance, the removal of malignant cells throughout one's life by the
immune system, may also contribute. Abnormal growth factor signalling in the interaction
between stromal cells and epithelial cells can facilitate malignant cell growth. The extensive use of DNA
microarray technology in the characterization of the cell transcriptome is leading to an ever
increasing amount of microarray data from cancer studies. Although similar questions for the
same type of cancer are addressed in different studies, a comparative analysis of their results
is hampered by the use of heterogeneous microarray platforms and analysis methods.
Gene expression profiling by DNA microarrays has become an important tool for studying
the transcriptome of cancer cells, and has been successfully used in many studies of tumour
classification and of identification of marker genes associated with cancer. With an
increasing number of microarray data sets becoming available, the comparison of studies with
similar research goals, for example to identify genes differentially expressed in normal versus
tumour tissue, has gained high importance. In general, the evaluation of multiple data sets
promises to yield more reliable and more valid results since these results are based on a
larger number of samples and the effects of individual study-specific biases are weakened.
However, the comparison of results from different microarray studies is hampered by the fact
that different studies use different protocols, microarray platforms and analysis techniques.
The question whether the results of gene expression measurements obtained by different
platforms can be compared has been addressed in several studies. It has been found that
results derived from the measurements, such as lists of tumour subtype marker genes or measures
of intra-study correlation of gene expression patterns, can be compared and thus
inter-validated between different platforms. However, the measures of gene expression themselves
could not be directly compared between different platforms. Some studies propose methods
for meta-analysis of microarray data with the goal to identify significantly differentially
expressed genes across studies by using statistical techniques that avoid the direct
comparison of gene expression values.
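One way to combine evidence across studies without comparing raw expression values is to work with per-study p-values. The sketch below uses Fisher's combined probability test as an illustration only; the text does not say which technique the cited studies used, and the gene names and p-values are made up.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values for one gene with Fisher's method."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic X = -2 * sum(ln p) follows a chi-square distribution
    # with 2k degrees of freedom under the null hypothesis; half_x = X / 2.
    half_x = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function for an even number of
    # degrees of freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

# Hypothetical per-study p-values for two made-up genes.
study_p = {"GENE_A": [0.04, 0.03, 0.20], "GENE_B": [0.60, 0.45, 0.80]}
combined = {gene: fisher_combined_p(ps) for gene, ps in study_p.items()}
for gene, p in combined.items():
    print(gene, round(p, 4))
```

Because only p-values enter the calculation, each study can keep its own platform, normalization and test statistic. The method does assume that the studies are independent.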
The goal of this study is to investigate the benefit of performing supervised classification
analyses across disparate sources of microarray data. Methods of supervised classification
analysis render it possible to automatically build classifiers that distinguish among specimens
on the basis of predefined class label information (phenotypes), and in many cancer research
studies the application of these methods has shown promising results of improved tumor
diagnosis and prognosis.
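As a minimal illustration of supervised classification from predefined class labels, the sketch below builds a nearest-centroid classifier. The choice of classifier, the function names and the expression values are assumptions made for the example, not the method used in this study.

```python
def train_centroids(samples, labels):
    """Return the mean expression profile (centroid) of each class."""
    sums, counts = {}, {}
    for profile, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], profile)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def predict(centroids, profile):
    """Assign a profile to the class whose centroid is nearest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], profile))

# Made-up log-expression values for three genes (columns) in four specimens.
train_x = [[2.1, 0.3, 1.0], [1.9, 0.5, 1.2], [0.4, 2.2, 1.1], [0.6, 1.8, 0.9]]
train_y = ["tumor", "tumor", "normal", "normal"]
model = train_centroids(train_x, train_y)
print(predict(model, [2.0, 0.4, 1.1]))
```

In practice the classifier would be evaluated on specimens that were not used for training, for instance by cross-validation.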
Objectives:
1. To identify and retrieve microarray data for breast tumor cells.
2. To cluster tumorous breast cells based on the expression profile of the
genes.
3. To apply clustering algorithms such as Self-Organizing Maps (SOM) and K-means.
4. To compare and analyze the expression patterns on the basis of similarity.
REVIEW OF LITERATURE
Worldwide, breast cancer comprises 22.9% of all non-skin cancer incidence among women,
making it the most common cause of cancer death. In 2008, breast cancer caused 458,503
deaths worldwide. The World Cancer Research Fund estimated that 38% of breast cancer
cases in the US are preventable through reducing alcohol intake, increasing physical activity
levels and maintaining a healthy weight. Smoking tobacco also increases the risk of breast
cancer.
Breast cancer, like other cancers, occurs because of an interaction between the environment
and a defective gene. Normal cells divide as many times as needed and stop. They attach to
other cells and stay in place in tissues. Cells become cancerous when mutations destroy their
ability to stop dividing, to attach to other cells and to stay where they belong. When cells
divide, their DNA is normally copied with many mistakes. Error-correcting proteins fix those
mistakes. The mutations known to cause cancer, such as those in p53, BRCA1 and BRCA2, occur in
the error-correcting mechanisms. These mutations are either inherited or acquired after birth.
Presumably, they allow the other mutations, which allow uncontrolled division, lack of
attachment, and metastasis to distant organs. Normal cells will commit cell suicide
(apoptosis) when they are no longer needed. Until then, they are protected from cell suicide
by several protein clusters and pathways. One of the protective pathways is the PI3K/AKT
pathway; another is the RAS/MEK/ERK pathway. Sometimes the genes along these
protective pathways are mutated in a way that turns them permanently "on", rendering the
cell incapable of committing suicide when it is no longer needed. This is one of the steps that
causes cancer in combination with other mutations. Normally, the PTEN protein turns off the
PI3K/AKT pathway when the cell is ready for cell suicide. In some breast cancers, the gene
for the PTEN protein is mutated, so the PI3K/AKT pathway is stuck in the "on" position, and
the cancer cell does not commit suicide. Mutations that can lead to breast cancer have been
experimentally linked to estrogen exposure. Failure of immune surveillance, the removal of
malignant cells throughout one's life by the immune system, may also contribute. Abnormal growth factor
signalling in the interaction between stromal cells and epithelial cells can facilitate malignant
cell growth. In the United States, 10 to 20 percent of patients with breast cancer and patients
with ovarian cancer have a first- or second-degree relative with one of these diseases.
Mutations in either of two major susceptibility genes, breast cancer susceptibility gene 1
(BRCA1) and breast cancer susceptibility gene 2 (BRCA2), confer a lifetime risk of breast
cancer of between 60 and 85 percent and a lifetime risk of ovarian cancer of between 15 and
40 percent. However, mutations in these genes account for only 2 to 3 percent of all breast
cancers.
Breast cancer cell lines provide a useful starting point for the discovery and functional
analysis of genes involved in breast cancer. Thirty-eight established breast cancer cell lines were
studied to determine recurrent genetic alterations and the extent to which these cell lines
resemble uncultured tumors. DNA copy number profiles were generated by comparative
genomic hybridization (CGH) for most of the publicly available breast cancer cell lines
(Kallioniemi et al., 2000).
Immunotherapy approaches to fight cancer are based on the principle of mounting an
immune response against a self-antigen expressed by the tumor cells. In order to reduce
potential autoimmunity side-effects, the antigens used should be as tumor-specific as
possible. A complementary approach to experimental tumor antigen discovery is to screen
the human genome in silico, particularly the databases of "Expressed Sequence Tags"
(ESTs), in search of tumor-specific and tumor-associated antigens. The public databases
currently provide a massive amount of ESTs from several hundred cDNA tissue libraries,
including tumor tissues of various types. We describe a novel method of EST database
screening that allows new potential tumor-associated genes to be efficiently selected. The
resulting list of candidates is enriched in known genes, described as being expressed in tumor
cells (Vinals et al., 2001).
The ways in which molecular biology has affected the clinical management of breast cancer
are very challenging. One technique can be used to compare the expression patterns
of thousands of genes between different cell types. Cancer researchers use a tumor's gene
expression profile to distinguish it from other tumor types. Only 5% of patients with a
good profile developed metastasis by 5 years after initial therapy and 15% developed
metastasis by 10 years after therapy. In multivariate analysis, the gene expression
signature was the strongest predictor of metastasis-free survival. This gene expression
profile may be used in tailoring therapy for breast cancer and it can greatly reduce the
number of patients who would receive unnecessary adjuvant systemic treatment (Kristine,
2002).
Sequence analysis of individual targets is an important step in annotation and validation.
Investigation of the human breast cancer BCA3 gene was carried out with LION Target Engine and
with other bioinformatics tools. LION Target Engine confirmed that the BCA3 gene is located
at 11p15.4. A significant number of new orthologs were also identified, and these were the basis
for a high quality protein secondary structure prediction. Sequence conservation from
multiple sequence alignment (MSA), splice variant identification (SVI), secondary structure
prediction and predicted phosphorylation sites suggested that the removal of interaction sites
through alternative splicing might play a modulatory role in BCA3 (Canaves and Leon,
2003).
Genome wide monitoring of gene expression using DNA microarrays represents one of the
latest breakthroughs in experimental biology. As microarray analysis emerges from its
infancy, there is widespread hope that microarrays will significantly impact on our ability to
explore the genetic changes associated with cancer etiology and development and ultimately
lead to the discovery of new biomarkers for disease diagnosis (Macgregor, 2003).
Breast cancer is a heterogeneous disease whose evolution is difficult to predict by using
classic histoclinical prognostic factors. Prognostic classification can benefit from molecular
analysis such as large scale expression profiling. Hierarchical clustering (HCL) identified
relevant clusters of co-expressed proteins and clusters of tumors. Protein expression profiling
may be a clinically useful approach to assess breast cancer heterogeneity (Bertucci et al.,
2004).
Inflammatory breast cancer (IBC) is a rare but aggressive form of breast cancer with a five
year survival limited to 40%. Diagnosis, based on clinical/pathological criteria, may be
difficult. Gene expression profiling has the potential to contribute to a better understanding of
IBC and to provide new diagnostic and predictive factors for IBC as well as potential
therapeutic targets (Jones et al., 2004).
Correlating risk factors with genomic data promises to enable individualized treatment for
patients, but requires interpretation of complex, multivariate patterns in gene
expression data, as well as assessment of their ability to improve clinical predictions. DNA
microarray data from samples of primary breast tumors were analyzed using non-linear
statistical analysis to assess multiple patterns of interactions of groups of genes that have
predictive value for the individual patient with respect to lymph node metastasis and cancer
recurrence. Aggregate patterns of gene expression (metagenes) were identified with about
90% accuracy (Potamias et al., 2004).
Publicly available human genomic sequence data provide an unprecedented opportunity for
researchers to decode the functionality of the human genome. The Cancer Genome Anatomy Project
(CGAP) and Gene Expression Omnibus (GEO) are two bioinformatics infrastructures for
studying functional genomics. The feasibility of using Internet-available
bioinformatics databases to discover human breast cancer-related genes was explored. Several
tools including Gene Finder, Virtual Northern and SAGE digital gene expression displayer
(DGED) were used to analyze differential gene expression between benign and malignant
breast tissue. A pilot study was performed using both expressed sequence tags (EST) and
SAGE to analyze the expression of a panel of known genes including the high abundance genes
actin and G3PDH, BRCA1 and p53, and two breast cancer related genes, Her2/neu and
MUC1. The analysis produced 53 differentially expressed genes under screening
criteria of greater than 5-fold difference and p < 0.01. Combined multiple high throughput
analysis was an effective data mining strategy in cancer gene identification (Shen et al.,
2005).
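The screening criteria quoted above (more than 5-fold difference and p < 0.01) can be sketched as a simple filter. The gene names, mean values and p-values below are hypothetical, and the rule of skipping genes with non-positive means is an assumption of this sketch, not something stated by Shen et al.

```python
def screen_genes(stats, min_fold=5.0, max_p=0.01):
    """Keep genes that change by more than min_fold in either direction with p < max_p."""
    selected = []
    for gene, (mean_malignant, mean_benign, p_value) in stats.items():
        if mean_malignant <= 0 or mean_benign <= 0:
            continue  # assumption: fold change is undefined without positive means
        ratio = mean_malignant / mean_benign
        fold = max(ratio, 1.0 / ratio)  # symmetric for up- and down-regulation
        if fold > min_fold and p_value < max_p:
            selected.append(gene)
    return selected

# Hypothetical summary values: (mean in malignant, mean in benign, p-value).
example = {
    "GENE_1": (120.0, 10.0, 0.001),  # 12-fold up, significant
    "GENE_2": (8.0, 64.0, 0.004),    # 8-fold down, significant
    "GENE_3": (30.0, 10.0, 0.0005),  # only 3-fold
    "GENE_4": (90.0, 9.0, 0.20),     # 10-fold, but not significant
}
print(screen_genes(example))
```

Requiring both conditions removes genes with large but statistically unreliable changes as well as genes with reliable but small changes.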
Characteristic patterns of gene expression have emerged, reflecting molecular differences
among different types of cancer. The application of microarray based technologies in the
investigations of molecular markers has evolved from descriptive biological definitions of
the tumor landscape to more functional and mechanistic studies with correlations to clinically
important traits within the field of breast cancer management. Array based molecular
profiling represents a technological breakthrough in how biological samples, for instance
tumor specimens, can be analyzed (Ingrid et al., 2006).
The identification of changes in gene expression in multifactorial diseases such as
breast cancer is a major goal of DNA microarray experiments. Considering gene
behaviour across a wide variety of experiments can improve the statistical reliability of
identifying genes with moderate changes between samples. The approach was evaluated via
the reidentification of breast cancer specific genes. It successfully prioritized
several genes associated with breast tumor for which the expression difference between normal
and breast cancer cells was marginal and would have been difficult to recognize using
conventional analysis methods (Yang et al., 2006).
Genome-wide expression microarray studies have revealed that the biological and clinical
heterogeneity of breast cancer can be partly explained by information embedded within a
complex but ordered transcriptional architecture. Comprising this architecture are gene
expression networks, or signatures, reflecting biochemical and behavioral properties of
tumors that might be harnessed to improve disease subtyping, patient prognosis and
prediction of therapeutic response. Emerging 'hypothesis-driven' strategies that incorporate
knowledge of pathways and other biological phenomena in the signature discovery process
are linking prognosis and therapy prediction with transcriptional readouts of tumorigenic
mechanisms that better inform therapeutic options (Miller and Edison, 2007).
Gene expression measurements from breast cancer (BRCA) tumors are established clinical
predictive tools to identify tumor subtypes, identify patients showing poor/good prognosis
and identify patients likely to have disease recurrence. Expression data were examined from 9 published
microarray studies of estrogen receptor negative (ER-) BRCA tumor cases in the
Oncomine database. Some of the genes identified distinguish key transcripts previously seen
in array studies, while others are newly defined. Many of the genes identified as
overexpressed in ER- tumors were previously identified as expression markers for neoplastic
transformation in multiple human cancers (David et al., 2008).
Germline mutations of the highly penetrant BRCA1 and BRCA2 genes have been associated with
hereditary breast cancer risk, while polymorphic variants of the two genes still have an
unknown role in breast pathogenesis. The aim of our study was to characterize polymorphic
variants of the BRCA1 and BRCA2 genes in familial breast cancer. 110 patients affected by
familial breast and/or ovarian cancer were consecutively enrolled according to family
history and BRCA mutation risk. All of them were screened for BRCA1 and BRCA2
pathogenic mutations, SNPs and intronic variants. In silico analyses were also
performed using different computational methods to identify genetic variations that can
alter the expression and function of the two genes. BRCA1 was mutated in 14% and BRCA2
in 3% of cases, while 80% of patients presented at least one polymorphism. A neural network
splicing prediction model identified one BRCA1 and one BRCA2 intronic variant able
to determine alternative splicing. Furthermore, Q356R BRCA1 and N289H BRCA2 appear
to show a possible harmful role, also due to their location in functional regions of the two
genes. However, in silico data are not always consistent with biological evidence. In
conclusion, the SNP profile provides a basis for DNA-based cancer risk classification and helps
to define the gene alterations that could influence the biochemical activity of the protein or could
modify drug sensitivity (Tommasi et al., 2008).
Cancer patients make antibodies to tumor derived proteins that are potential biomarkers for
early detection. To detect antibodies to tumor antigens in patient sera, novel high-density
custom protein microarrays (NAPPA) were adapted. They were probed with sera from
patients with early stage breast cancer and healthy women. Custom in situ protein
microarrays can be used to detect serum tumor antigen-specific antibodies and enable the
rapid, simultaneous detection of immunogenic tumor antigens from patient sera. These
autoantibodies were being evaluated as potential biomarkers for the early diagnosis of breast
cancer (Anderson et al., 2009).
Detection of circulating tumor cells (CTC) may provide diagnostic and prognostic
information in breast cancer patients. Deregulation of miRNAs is frequent in tumor
progression. A phase I preclinical study was performed by means of computational tools for
miRNA profiling including MIRGATOR, MIRBASE, SMIRNAdb, Gene HUB-GEPIS and
MICRORNA.ORG. The computational tools identified a set of miRNAs. The bioinformatics approach is
a useful high-throughput method to select associated miRNAs. The selected miRNAs should
be further evaluated for CTC detection (Calvo et al., 2009).
Estrogen receptor positive breast cancer can be prevented by Tamoxifen. Microarray
technology has identified genomic classifiers for breast cancer prognosis and prediction.
Meta-analysis of microarray data was conducted to identify individual genes associated with
estrogen receptor (ER) status of breast cancer (Holko et al., 2009).
Breast carcinoma is one of the most common causes of cancer related death. Serial Analysis
of Gene Expression (SAGE) is a comprehensive profiling method that allows for global,
unbiased and quantitative characterization of transcriptomes. Four normal breast tissues,
eleven primary breast tumors and three breast metastatic tissues were retrieved and the data
were analyzed by Correspondence Analysis, Hierarchical Clustering, Support Tree and
Significance Analysis of Microarray. Comprehensive analysis from whole transcriptomes of
primary breast cancer tissues compared with normal breast tissue and breast cancer
metastatic tissues revealed that clinically disparate outcomes could be linked to a relatively
small number of transcripts (Margossian et al., 2009).
The ability to compare genome-wide expression profiles in human tissue samples has the
potential to add an invaluable molecular pathology aspect to the detection and evaluation of
multiple diseases. Applications include initial diagnosis, evaluation of disease subtype,
monitoring of response to therapy and the prediction of disease recurrence. The derivation of
molecular signatures that can predict tumor recurrence in breast cancer has been a
particularly intense area of investigation and a number of studies have shown that molecular
signatures can outperform currently used clinicopathologic factors in predicting relapse in this
disease. However, many of these predictive models have been derived using relatively simple
computational algorithms and whether these models are at a stage of development worthy of
large-cohort clinical trial validation is currently a subject of debate. In this review, we focus
on the derivation of optimal molecular signatures from high-dimensional data and discuss
some of the expected future developments in the field (Goodison et al., 2010).
Breast tumors consist of several different tissue components. Despite the heterogeneity, most
gene expression analysis has traditionally been performed without prior microdissection of
the tissue sample. Thus the gene expression profiles obtained reflect the mRNA contribution
from the various tissue components. The differentially expressed genes were validated in
independent gene expression data from a set of laser capture microdissected invasive ductal
carcinomas (Hyat et al., 2010).
The receptor tyrosine kinase HER2 is an oncogene amplified in invasive breast cancer and its
overexpression in mammary epithelial cell lines is a strong determinant of a tumorigenic
phenotype. Accordingly, HER2-overexpressing mammary tumors are commonly indicative
of poor prognosis in patients. Several quantitative proteomic studies have employed two-
dimensional gel electrophoresis in combination with MS/MS, which provides only limited
information about the molecular mechanisms underlying HER2/neu signaling. In the present
study, we used a SILAC-based approach to compare the proteomic profile of normal breast
epithelial cells with that of Her2/neu-overexpressing mammary epithelial cells, isolated from
primary mammary tumors arising in mouse mammary tumor virus-Her2/neu transgenic mice.
We identified 23 proteins with relevant annotated functions in breast cancer, showing a
substantial differential expression. This included overexpression of creatine kinase,
retinol-binding protein 1, thymosin β4 and tumor protein D52, which correlated with the tumorigenic
phenotype of Her2-overexpressing cells. The differential expression pattern of two genes,
gelsolin and retinol binding protein 1, was further validated in normal and tumor tissues.
Finally, an in silico analysis of published cancer microarray data sets revealed a 23-gene
signature, which can be used to predict the probability of metastasis-free survival in breast
cancer patients (Pandey et al., 2010).
Genome wide DNA methylation changes in a cell line model of breast cancer metastasis have
been identified. The complex epigenetic changes that were observed, along with concurrent
karyotype analysis, have led to the hypothesis that complex genomic alterations in cancer cells
are superimposed over promoter specific methylation events that are responsible for gene
specific changes observed in breast cancer metastasis (Wendy et al., 2010).
Aberrant miRNA activity has been reported in many diseases, and often numerous miRNAs
are concurrently deregulated. There was coincident loss of expression of 6 miRNAs with
metastatic potential in breast cancer. A computational method, miR-AT!, was used to
investigate combinatorial activity among this group of miRNAs. A number of genes
previously implicated in cancer metastasis are among the predicted combinatorial targets,
including TGFB1, ARPC3 and RANKL (Sultana et al., 2011).
2.1 Cluster analysis of microarray information
Microarray experiments generate mountains of data, which have to be stored and analyzed.
It is difficult for humans to handle very large numeric data sets, so a general concept
is to try to reduce the dimensionality of the data. A number of clustering techniques have
been used to group genes based on their expression patterns, and the user has to choose the
appropriate method for each task. The basic concept in clustering is to identify and
group together similarly expressed genes and then to correlate the observations to
biology. The idea is that co-regulated and functionally related genes are grouped into
clusters. Clustering provides the framework for this analysis; the hard part is to analyze the
biological processes and consequences. Nevertheless, clustering can be a very useful tool. The
other side of the coin is the visualization of the information. Before clustering, a number
of requirements have to be met: the data has to be of good quality, the chip design should be
correct, the chip should contain interesting genes, sample preparation has to be
flawless, and the data has to be properly treated (outliers, normalization etc.). Clustering does not
improve the quality of the data. If poor-quality data is used, the outcome is useless as well.
2.2 Principles of clustering
Clustering organizes the data into a small number of (relatively) homogeneous groups.
Usually the expression values are normalized first. At this stage, it is of interest to look
at the changes in expression patterns, not to follow the actual numeric changes. Thus, the
methods are used to find similar expression motifs irrespective of the expression level.
Therefore, both low and high expression level genes can end up in the same cluster if the
expression profiles are correlated by shape. The majority of clustering methods have been
available for a long time. During the last few years, some new methods dedicated to
microarray analysis have also been developed. The methods can be described and classified in
different ways. The most commonly used methods are described here. Clustering methods
can be grouped as supervised and unsupervised. Supervised methods assign some predefined
classes to a data set, whereas in unsupervised methods no prior assumptions are applied.
Hierarchical clustering, K-means, self-organizing maps (SOMs), and principal component
analysis (PCA) have been commonly used. There are also other methods, such as
multidimensional scaling (MDS), minimum description length (MDL), gene shaving (GS),
decision trees, and support vector machines (SVMs).
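The idea that profiles are grouped by shape rather than by expression level can be illustrated with a small sketch. It assumes a correlation-based distance (1 minus the Pearson correlation), which is one common choice and not necessarily the measure used in this study; the three gene profiles are made-up values for illustration only.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def euclidean(a, b):
    """Euclidean distance, which is sensitive to the expression level."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles over five experiments.
low_gene = [0.1, 0.2, 0.1, 0.3, 0.2]
high_gene = [10 * v for v in low_gene]    # same shape, ten times the level
other_gene = [0.2, 0.1, 0.3, 0.1, 0.2]    # similar level, different shape

# Correlation distance: close to 0 for profiles with the same shape.
shape_distance_same = 1 - pearson(low_gene, high_gene)
shape_distance_diff = 1 - pearson(low_gene, other_gene)

print(shape_distance_same, shape_distance_diff)
print(euclidean(low_gene, high_gene), euclidean(low_gene, other_gene))
```

With a correlation distance the low and high expression genes are neighbours, whereas the Euclidean distance would place them far apart.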
2.3 Hierarchical clustering
Hierarchical clustering is a statistical method for finding relatively homogeneous clusters.
The hierarchical clustering algorithm either iteratively joins the two closest clusters starting
from single clusters (agglomerative, bottom-up approach) or iteratively partitions clusters
starting from the complete set (divisive, top-down approach). After each step, a new distance
matrix between the newly formed clusters and the other clusters is recalculated. If there are N
cases, N-1 clustering steps are needed. There are several methods of hierarchical cluster
analysis, including:
Single linkage (minimum method, nearest neighbor)
Complete linkage (maximum method, furthest neighbor)
Average linkage (UPGMA)
For a set of N genes to be clustered, and an NxN distance (or similarity) matrix, the
hierarchical clustering is performed as follows:
1. Assign each gene to a cluster of its own.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all genes are clustered.
Step 3 can be performed in different ways depending on the chosen approach. In single
linkage clustering, the distance between one cluster and another is considered to be equal to
the shortest distance from any member of one cluster to any member of the other cluster. In
complete linkage clustering, the distance between one cluster and another cluster is
considered to be equal to the longest distance from any member of one cluster to any member
of the other cluster. In average linkage, the distance between one cluster and another cluster
is considered to be equal to the average distance from any member of one cluster to any
member of the other cluster. The hierarchical clustering can be represented as a tree, or a
dendrogram. Branch lengths represent the degree of similarity between the genes. The
method does not provide clusters as such. Conceptually, different clusters and sizes of
clusters can be obtained by moving along the trunk or branches of the tree and deciding at
which level to cut the tree. Hierarchical clustering is often applied in the
analysis of patient samples to organize the data based on the cases. Indeed, for patient
samples the hierarchical clustering method is most often the best option. Usually, in addition
to patient-based organization, genes are also clustered by applying two-way clustering. On
one axis are the samples (patients) and on the other axis the genes.
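The four steps and the three linkage rules can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and a brute-force search for the closest pair; the function names and the toy profiles are hypothetical and it is not the implementation used by any particular package.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters, each given as a list of gene indices."""
    pairwise = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":      # shortest member-to-member distance
        return min(pairwise)
    if linkage == "complete":    # longest member-to-member distance
        return max(pairwise)
    if linkage == "average":     # average member-to-member distance (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(profiles, linkage="average", n_clusters=1):
    """Merge the closest clusters until n_clusters remain.

    Stopping early at n_clusters > 1 corresponds to cutting the tree.
    Returns the remaining clusters and the list of merges performed.
    """
    clusters = [[i] for i in range(len(profiles))]   # step 1
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):               # step 2: closest pair
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], profiles, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Step 3 is implicit: distances to the new cluster are recomputed
        # from its members on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Toy data: two obvious groups of two genes each.
toy_profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
full_tree, all_merges = agglomerate(toy_profiles, "average", n_clusters=1)
two_groups, _ = agglomerate(toy_profiles, "average", n_clusters=2)
print(len(all_merges), sorted(sorted(c) for c in two_groups))
```

For N genes the full tree takes N-1 merges, and the choice of `n_clusters` plays the role of the level at which the dendrogram is cut.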
2.4 Self-organizing map
Kohonen's self-organizing map (SOM) is a neural net that uses unsupervised learning, for
which no prior knowledge of classes is required. SOMs are usually used to visualize and
interpret large high-dimensional data sets. In SOM, every input is connected to every output
via connections with variable weights. Also, the output nodes are highly interconnected.
SOM tries to learn to map similar input vectors (gene expression profiles) to similar regions
of the output array of nodes. The method maps the multidimensional distances of the feature
space to two-dimensional distances in the output map. The SOM algorithm is iterative. The
map attempts to represent all the available observations with optimal accuracy using a
restricted set of models. At the same time, the models become ordered on the grid so that
similar models are close to each other and dissimilar models far from each other. So the order
and organization of the nodes (tentative clusters) contain more information than just the
actual partition of genes to clusters. In SOMs, the number of clusters has to be
predetermined. The value is given by the dimensions of the two-dimensional grid or array.
One has to experiment with the actual number of clusters. Most programs facilitate the
analysis of all the genes within a cluster or clusters. One should look for array dimensions
that leave a minimum number of poorly fitting genes in the clusters. The actual number of
clusters is also difficult to assign, but the square root of the number of genes is a good initial
estimate. However, it depends on the data set.
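The iterative SOM procedure can be sketched as follows. This is a minimal illustration, assuming a rectangular grid, a Gaussian neighbourhood and linearly decaying learning rate and radius; these are common textbook choices, not necessarily the settings of any specific program, and all names and values are hypothetical.

```python
import math
import random

def best_matching_unit(nodes, profile):
    """Index of the node whose weight vector is closest to the profile."""
    return min(
        range(len(nodes)),
        key=lambda k: sum((w - x) ** 2 for w, x in zip(nodes[k], profile)),
    )

def train_som(profiles, rows=3, cols=3, epochs=50, seed=0):
    """Train a rows x cols map; the grid size fixes the number of clusters."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    nodes = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows * cols)]
    total_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        for profile in rng.sample(profiles, len(profiles)):   # shuffled pass
            frac = step / total_steps
            rate = 0.5 * (1 - frac)                           # decaying learning rate
            radius = 0.5 + (max(rows, cols) / 2) * (1 - frac) # shrinking neighbourhood
            bmu_row, bmu_col = divmod(best_matching_unit(nodes, profile), cols)
            for k, weights in enumerate(nodes):
                row, col = divmod(k, cols)
                grid_d2 = (row - bmu_row) ** 2 + (col - bmu_col) ** 2
                influence = math.exp(-grid_d2 / (2 * radius ** 2))
                # Pull the node (and its grid neighbours) towards the profile.
                for d in range(dim):
                    weights[d] += rate * influence * (profile[d] - weights[d])
            step += 1
    return nodes

# Toy log-ratio profiles over five experiments (illustrative values).
toy_profiles = [
    [0.9, 0.8, 1.0, 0.7, 0.9],
    [0.8, 0.9, 0.9, 0.8, 1.0],
    [-0.9, -0.8, -1.0, -0.7, -0.9],
    [-0.8, -1.0, -0.9, -0.9, -0.8],
    [0.9, 0.8, 1.0, 0.7, 0.9],
]
som_nodes = train_som(toy_profiles)
assignments = [best_matching_unit(som_nodes, p) for p in toy_profiles]
print(assignments)
```

Each gene is assigned to its best matching node, so a 3 x 3 grid yields at most nine clusters, and neighbouring nodes on the grid hold similar model profiles.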
2.5 K-means clustering
K-means is a least-squares partitioning method for which the number of groups, K, has to be
provided. The algorithm computes cluster centroids and uses them as new cluster seeds, and
assigns each object to the nearest seed. However, it is also possible to estimate K from the
data, taking the approach of a mixture density estimation problem.
1. The genes are arbitrarily divided into K groups. The reference vector, i.e. the location of
each centroid, is calculated.
2. Each gene is examined and assigned to the cluster whose centroid is at the minimum
distance.
3. The centroid positions are recalculated.
4. Steps 2 and 3 are repeated until the assignment of genes to the K clusters no longer
changes.
During the course of iterations, the program tries to minimize the sum, over all groups, of the
squared within-group residuals, which are the distances of the objects to the respective group
centroids. Convergence is reached when the objective function (i.e. the residual sum of
squares) cannot be lowered any more. The obtained groups are geometrically as compact as
possible around their respective centroids. K-means partitioning is a so-called NP-hard
problem (there is no known algorithm that would be able to solve the problem in polynomial
time), thus there is no guarantee that the absolute minimum of the objective function has
been reached. Therefore, it is good practice to repeat the analysis several times using
randomly selected initial group centroids, and check whether these analyses produce
comparable results.
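The iteration and the advice to repeat the analysis from several random starts can be sketched as below. This is a minimal illustration assuming squared Euclidean distance and initial centroids drawn from the data; the function names, the number of restarts and the toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(profiles, k, rng, max_iter=100):
    """One K-means run from randomly selected initial centroids."""
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every gene to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in profiles
        ]
        if new_labels == labels:      # converged: assignments stopped changing
            break
        labels = new_labels
        # Recalculate the centroid positions.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Objective function: residual sum of squares around the centroids.
    rss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(profiles, labels))
    return labels, centroids, rss

def kmeans(profiles, k, restarts=10, seed=0):
    """Repeat the run several times and keep the lowest residual sum of squares."""
    rng = random.Random(seed)
    runs = [kmeans_once(profiles, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])

toy_profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, rss = kmeans(toy_profiles, k=2)
print(labels, rss)
```

Comparing the residual sum of squares across restarts is a simple way to check whether different starting centroids lead to comparable partitions.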
MATERIALS AND METHODS
Progress in microarray data generation for breast cancer now provides considerable
resources that may be used for the in silico analysis of its gene expression and to define genes that
are important in the development of breast cancer. This information can further be used to
identify and validate novel drug targets.
A further advance is the availability of computational tools that can be used for the analysis
of microarray data. Numerous databases are now available which contain microarray data
for various kinds of cancers. Most of these are accessible over the Internet through
convenient Web browser interfaces. Many also permit downloading of microarray data.
The whole work for the present study can be summarized in the following steps:
1. Organism's disease: Breast cancer
2. Operating systems: Windows and Linux
3. Tools and software: available online/offline, e.g. Genesis, Bioinformatics Research
Tools, etc.
4. Research hub resources: SMD, PMC etc.
5. Retrieval of microarray data: prepare datasets using different databases.
6. Analysis of microarray data by using different clustering techniques, for example
hierarchical clustering, K-means clustering, self-organizing map, principal component
analysis etc.
To identify the genes responsible for the occurrence of breast cancer with the
help of microarray technology, the gene expression data from wet labs is collected and
maintained in online databases to be further worked upon with the help of bioinformatics
tools and techniques. The objectives can be met by applying tools that help to demarcate the
genes of interest and by studying their expression levels. Such microarray-based
approaches are enumerated below.
METHOD
Algorithm for Clustering Analysis:
1. Raw data from SMD were collected and downloaded.
2. Five experiments were taken and their log transformation ratios were arranged in a
separate Excel file for 5000 genes, with all the parameters such as clone ID, cluster ID,
accession number etc. kept the same.
3. The data sets were uploaded in the GENESIS software.
4. K-means clustering and Self Organizing Maps were generated.
5. Expression profile images were generated.
6. Clusters of gene expression were compared based on their similarity.
7. The number of genes that were co-expressed was calculated manually.
3.1 Software for clustering and visualization
Several software packages are available for clustering and visualization of gene expression patterns.
Many programs are freely available on the Internet. Both the Gene Spring and Kensington
packages available at CSC contain several options for clustering. Despite the initial time
required for learning the use of these programs, they provide some benefits, mainly by
allowing several analyses to be done within a single package. These programs are not as such
any better than freely available programs. They might, however, have a more user-friendly
interface.
Table 3.1 Software for cluster analysis and visualization.
Program (version)                      Author (year)                    Platform         Description
Genesis                                Alexander Sturn (2000)           Windows, Linux   Clustering, visualization
Cluster (Version 2.11)                 Michael Eisen (1998)             Windows          Performs hierarchical clustering, self organizing maps
SAM (Version 2.0)                      Rob Tibshirani (2005)            Excel Add-in     Significance Analysis of Microarrays: supervised learning
ScanAlyze (Version 2.5.3)              Michael Eisen (2002)             Windows          Processes fluorescent images of microarrays
TreeView (Version 1.6)                 Michael Eisen (2002)             Windows          Graphical results of analysis from Cluster
Expression Profiler (Version 1.0)      EBI (2007)                       Web              Analysis and clustering of gene expression data
GeneCluster (Version 2.0)              Whitehead Institute/MIT (2004)   Java             Self-organizing maps
J-Express (Version 7.8.1, 7.8.2, 7.8.3) MolMine (2011)                  Java             Clustering and visualization
SMD (Stanford Microarray Database)
1. The SMD (http://www.smd.Stanford.EDU/) project stores raw and normalized data
from microarray experiments, as well as their corresponding image files.
2. In addition, SMD provides interfaces for data retrieval, analysis and visualization. SMD
is funded by the National Cancer Institute at the US National Institutes of Health, the
National Science Foundation, and the Howard Hughes Medical Institute. The database is a
joint project of the Departments of Biochemistry
and Genetics at the School of Medicine, Stanford University.
3. Already normalized data for breast cancer cell lines was therefore obtained from SMD for
gene expression analysis. The details of the data are given below.
3.3 Normalization
1. Normalization refers to computational data transformations intended to remove certain
systematic biases from microarray data, such as dye effects, intensity dependence and spatial
or print-tip effects. (In this context, it doesn't necessarily have anything to do with the normal
or Gaussian distribution.)
2. The normalization constant is calculated by the database (Stanford Microarray
Database). The first step is to select good spots on which to base the normalization. The
threshold value is initially set to > 0.65. If fewer than 10% of the spots in the print pass this
criterion, the program will use > 0.60. If fewer than 10% of the spots in the print pass the 0.60
threshold, the program will use > 0.55. All spots that pass the 0.55 threshold are used in the
normalization calculation, regardless of how many there are. If more than 10% of the spots
pass any threshold, the program uses those passing spots in the calculation and does not try a
lower threshold value.
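The fallback through the three thresholds can be sketched as follows. The text does not say which per-spot quantity is thresholded, so the `scores` list below is a hypothetical quality score per spot, and the function name is also an assumption. The sketch treats exactly 10% as passing, since the text only specifies the fallback for fewer than 10%.

```python
def select_good_spots(scores, thresholds=(0.65, 0.60, 0.55), min_fraction=0.10):
    """Return (threshold used, indices of spots used for normalization).

    `scores` is a hypothetical per-spot quality score; the actual quantity
    used by the database is not stated in the text.
    """
    for threshold in thresholds[:-1]:
        passing = [i for i, s in enumerate(scores) if s > threshold]
        # Enough spots pass: use them and do not try a lower threshold.
        if len(passing) >= min_fraction * len(scores):
            return threshold, passing
    # Last threshold: use every passing spot, however few there are.
    last = thresholds[-1]
    return last, [i for i, s in enumerate(scores) if s > last]

# 20 spots: only one passes 0.65 (5%), four pass 0.60 (20%).
example_scores = [0.70, 0.62, 0.62, 0.62] + [0.50] * 16
used_threshold, good_spots = select_good_spots(example_scores)
print(used_threshold, good_spots)

# 20 spots: only one passes any threshold, so the 0.55 level is used.
sparse_scores = [0.90] + [0.30] * 19
sparse_threshold, sparse_spots = select_good_spots(sparse_scores)
print(sparse_threshold, sparse_spots)
```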
3.1. Normalized microarray data analysis
Total intensity normalization relies on the assumption that most genes do not respond to
experimental conditions and so the average log ratio on the array should be zero. A single,
global, multiplicative adjustment is performed so that the average log ratio is zero for well
measured spots. All spots are normalized using the same constant, regardless of whether they
were used in the calculation. In the database, this is performed by computing normalized
values for all intensities by dividing the raw value by the normalization constant (Fig. 3.1).
Fig.3.1: Normalization of experiments by log transformation ratios.
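The global adjustment can be sketched on intensity ratios. This is a minimal illustration assuming log base 2 and assuming the constant is applied to the ratio of each spot; the function names and the example ratios are hypothetical.

```python
import math

def normalization_constant(ratios, good_spots):
    """Constant that makes the mean log2 ratio of the good spots zero."""
    mean_log = sum(math.log2(ratios[i]) for i in good_spots) / len(good_spots)
    return 2 ** mean_log

def normalize(ratios, good_spots):
    """Divide every spot by the same constant, including spots not in good_spots."""
    constant = normalization_constant(ratios, good_spots)
    return [r / constant for r in ratios]

raw_ratios = [2.0, 4.0, 8.0, 1.0]      # illustrative raw ratios
well_measured = [0, 1, 2]              # indices of well measured spots
normalized = normalize(raw_ratios, well_measured)
mean_log_after = sum(math.log2(normalized[i]) for i in well_measured) / len(well_measured)
print(normalized, mean_log_after)
```

Dividing by a single constant is a multiplicative adjustment on the ratio scale, which is the same as subtracting a constant on the log scale.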
3.3 Centralization
1. Centralization is the process of moving a distribution so that it is centered over the expected mean.
2. For the log-transformed intensity ratios, an intensity dependent centralization might help to
correct the dye bias.
3.2 Centroidal K-means clustering
Here each gene is represented by a series of ratio values that are relative to the expression
level of that gene in the reference sample. Since the reference sample is usually independent
of the experiment, the analysis should preferably be independent of the gene expression
observed in the reference sample, and that is exactly what is achieved by mean and/or median
centering. After applying this procedure, the values of each gene reflect the variation from
some property of the series of observed values, such as the mean or median (Fig. 3.2).
Fig.3.2: Centroidal K-means clustering
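Mean or median centering of a gene can be sketched as below. The function name and the example values are hypothetical; the sketch simply subtracts the chosen centre from each observed value of one gene.

```python
import statistics

def center_gene(values, method="median"):
    """Subtract the mean or median of a gene's values from each value."""
    if method == "median":
        centre = statistics.median(values)
    elif method == "mean":
        centre = statistics.mean(values)
    else:
        raise ValueError("method must be 'mean' or 'median'")
    return [v - centre for v in values]

log_ratios = [1.0, 2.0, 6.0]           # one gene over three experiments
mean_centered = center_gene(log_ratios, "mean")
median_centered = center_gene(log_ratios, "median")
print(mean_centered, median_centered)
```

After centering, the values describe how each experiment deviates from the gene's own typical level, so the level in the reference sample no longer matters.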
3.4 Genesis
1. Genesis is a versatile and transparent software suite for large-scale gene expression cluster
analysis. (Release: Genesis 1.7.6, Date: 2010-09-20, Genesis Server: Release 1.1.0)
2. The software enables data import and visualization, data normalization, and clustering
using: (1) Hierarchical Clustering, (2) k-means, (3) Self Organizing Maps, (4) Principal
Component Analysis and (5) Support Vector Machines.
3. An important and valuable feature of this software is the ability to calculate and compare clustering
results from different algorithmic approaches. For example, one can begin with Hierarchical
Clustering to get a first impression of the number of patterns hidden in the dataset and then
use this information to adjust the parameters for k-means and SOM clustering.
4. PCA can be used to visualize these clusters in 3D space and get an impression of cluster
size, integrity, and distribution, and to retrieve the most significant patterns in a study. It can
also reveal some information about the number of clusters in the dataset, provided that data
clouds of genes in the principal component space representing a cluster can be distinguished.
3.3 Clustering and visualization procedures are executed with this software (Fig. 3.3).
Fig.3.3 GENESIS Software
3.5 Visualization
1. Scatter plots are very useful for the initial analysis and comparison of data sets. It is
customary to present the clustering results by grouping genes of clusters next to each other.
2. In SOM-based analysis, the order of the clusters also contains information about the
relationships between clusters.
3. For the manual analysis of the goodness of clustering, one has to look at the expression
patterns within the generated clusters.
4. Genes within a cluster should follow the average expression pattern of the cluster. Because
every gene has to be assigned to a cluster, genes with unique expression patterns do not fit
well in any group.
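A simple numerical counterpart to this manual inspection is to measure how far each gene lies from the average profile of its cluster. The sketch below assumes a root-mean-square deviation and an arbitrary cutoff; both are illustrative choices, and the profiles are made-up values.

```python
import math

def cluster_mean(profiles):
    """Average expression pattern of a cluster, one value per experiment."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def rms_deviation(profile, mean_profile):
    """Root-mean-square deviation of a gene from the cluster average."""
    return math.sqrt(
        sum((x - m) ** 2 for x, m in zip(profile, mean_profile)) / len(profile)
    )

def poorly_fitting(profiles, cutoff):
    """Indices of genes whose deviation from the cluster average exceeds cutoff."""
    mean_profile = cluster_mean(profiles)
    return [i for i, p in enumerate(profiles) if rms_deviation(p, mean_profile) > cutoff]

# Hypothetical cluster: two genes follow one pattern, the third does not.
cluster_profiles = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 1.0, -2.0]]
outliers = poorly_fitting(cluster_profiles, cutoff=1.5)   # cutoff chosen arbitrarily
print(outliers)
```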
RESULTS AND DISCUSSION
4.1 Expression profile for k-means clustering:
Based on the raw data, the 5000 genes were divided into 9 different clusters by K-means
clustering. The clustered genes (Fig. 4.1) showed distinct gene expression
patterns. Some genes were underexpressed and some genes were overexpressed.
Fig. 4.1: Expression profile for k-means clustering
4.2 Single cluster image for k means clustering:
The cluster image for one particular K-means cluster shows the expression of
genes based on the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN
columns denote the log-transformed ratios for the five different experiments, while all
the other parameters, such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession
Number, remain the same for all five experiments (Fig. 4.2).
Fig. 4.2: Single cluster image for k means clustering
4.3 Expression profile for Self Organizing Map (SOM) clustering:
Based on the raw data, the 5000 genes were divided into 9 different clusters in the Self
Organizing Map. The clustered genes (Fig. 4.3) showed distinct gene expression
patterns. Some genes were underexpressed and some genes were overexpressed.
Fig. 4.3: Expression profile for Self Organizing Map (SOM) clustering.
4.4 Single cluster image for SOM clustering:
The cluster image for one particular SOM cluster shows the expression of
genes based on the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN
columns denote the log-transformed ratios for the five different experiments, while all
the other parameters, such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession
Number, remain the same for all five experiments (Fig. 4.4).
Fig. 4.4: Single cluster image for Self Organizing Map
4.1 Distribution of genes in different clusters by using SOM and K-means clustering
algorithms:
After the genes were clustered by K-means clustering and Self Organizing Maps, the two
clustering algorithms were compared for the distribution of the genes (Table 4.1).
Table 4.1: Distribution of genes in different clusters by using SOM and K-means
clustering algorithms.
Self Organizing Map              K-means
Cluster  Genes  Genes (%)        Cluster  Genes  Genes (%)
S1       936    19               K1       455    9
S2       209    4                K2       766    15
S3       1052   21               K3       537    11
S4       40     1                K4       679    14
S5       49     1                K5       435    9
S6       1099   22               K6       587    12
S7       880    18               K7       358    7
S8       280    6                K8       581    12
S9       455    9                K9       602    12
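The percentage columns of Table 4.1 follow directly from the gene counts. The short sketch below recomputes them from the counts in the table; the variable and function names are illustrative only.

```python
# Gene counts per cluster, taken from Table 4.1.
som_counts = {"S1": 936, "S2": 209, "S3": 1052, "S4": 40, "S5": 49,
              "S6": 1099, "S7": 880, "S8": 280, "S9": 455}
kmeans_counts = {"K1": 455, "K2": 766, "K3": 537, "K4": 679, "K5": 435,
                 "K6": 587, "K7": 358, "K8": 581, "K9": 602}

def percentages(counts):
    """Share of all genes in each cluster, rounded to whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}

som_percent = percentages(som_counts)
kmeans_percent = percentages(kmeans_counts)
print(sum(som_counts.values()), som_percent)
print(sum(kmeans_counts.values()), kmeans_percent)
```

Both methods account for all 5000 genes, but the SOM clusters are much more uneven in size than the K-means clusters.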
The expression of genes in clusters S4 and K5 (Figure 4.5) was found to be similar: the genes were
overexpressed and showed a similar pattern of peaks, which represents the status of gene expression.
Overexpression means that these genes were expressed beyond the usual/expected degree, which can
disrupt the normal growth control processes of cells. Depending on the fluorescence ratios, the overexpressed genes
are assigned positive values. The values may range from 0 to +
Figure 4.5: Comparison of Cluster no. 4 of SOM (S4) and Cluster no. 5 of K-means (K5) based on their
expression profile.
The expression of genes in clusters S7 and K9 (Figure 4.6) was found to be similar, and most of the
genes were underexpressed, that is, they were not expressed to the usual/expected degree,
which can disrupt the normal growth control processes of cells. Depending on the fluorescence
ratios, the underexpressed genes are assigned negative values. The values may range from 0 to -1.
Figure 4.6: Comparison of Cluster no. 7 of SOM (S7) and Cluster no. 9 of K-means (K9) based on their
expression profile.
The expression of genes in clusters S1 and K8 (Figure 4.7) was found to be similar. Some
genes were underexpressed while others were overexpressed, because some genes were
transcribed into mRNA and some were not, leading to disruption of the normal growth
control processes of cells.
Figure 4.7: Comparison of Cluster no. 1 of SOM (S1) and Cluster no. 8 of K-means (K8) based on their
expression profile.
The expression of genes in clusters S8 and K1 (Figure 4.8) was found to be similar. Some
genes were underexpressed while others were overexpressed, because some genes were
transcribed into mRNA and some were not, leading to disruption of the normal growth
control processes of cells.
Figure 4.8: Comparison of Cluster no. 8 of SOM (S8) and Cluster no. 1 of K-means (K1) based on their
expression profile.
STATISTICAL DATA FOR CLUSTER ANALYSIS:
From the above table (Table 4.1) and the expression profile images, it was seen that some of the
genes distributed across the clusters of the two methods were highly co-expressed, and that four
pairs of clusters from K-means clustering and Self Organizing Maps (SOM) had the same expression
profile image: (S1, K8), (S8, K1), (S7, K9) and (S4, K5). As the data come from a breast tumor
cell (one sample), a one-sample t-test was conducted to test the effectiveness of the sample and
to assess the accuracy of the distribution of genes.
Table 4.2. Similar clusters showing overexpressed and underexpressed genes:
SOM cluster (x)   x²             K-means cluster (y)   y²
936 (S1)          876096         455 (K1)              207025
40 (S4)           1600           435 (K5)              189225
880 (S7)          774400         581 (K8)              337561
280 (S8)          78400          602 (K9)              362404
Σx = 2136         Σx² = 1730496  Σy = 2073             Σy² = 1096215

For the SOM clusters (n = 4, hypothesized population mean μ = 520):
$$\bar{x} = \frac{\sum x}{n} = \frac{2136}{4} = 534$$
$$s_x^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{1730496 - 4(534)^2}{4-1} = \frac{1730496 - 1140624}{3} = \frac{589872}{3} = 196624$$
$$s_x = \sqrt{196624} = 443.42$$
$$t_x = \frac{\bar{x} - \mu}{s_x/\sqrt{n}} = \frac{534 - 520}{443.42} \times 2 = 0.0316 \times 2 = 0.063$$

For the K-means clusters (n = 4, μ = 520):
$$\bar{y} = \frac{\sum y}{n} = \frac{2073}{4} = 518.25$$
$$s_y^2 = \frac{\sum y^2 - n\bar{y}^2}{n-1} = \frac{1096215 - 4(518.25)^2}{4-1} = \frac{1096215 - 1074332.25}{3} = \frac{21882.75}{3} = 7294.25$$
$$s_y = \sqrt{7294.25} = 85.4$$
$$t_y = \frac{\bar{y} - \mu}{s_y/\sqrt{n}} = \frac{518.25 - 520}{85.4} \times 2 = -0.0205 \times 2 = -0.041$$

Calculated values: |t_x| = 0.063, |t_y| = 0.041
Table value: with n = 4 there are n - 1 = 3 degrees of freedom, and t3(5%) = 3.182.
Since the calculated values of t are smaller than the table value at the 5% probability level on
three degrees of freedom, it can be concluded that there is no significant difference
between the sample means (534 and 518.25) and the population mean (520). In other words, the
four cluster pairs (K5, S4), (K9, S7), (K8, S1) and (K1, S8) can be regarded as a random sample from
a population whose average cluster size is 520 genes.
From the statistical analysis of the distribution of the genes in the clusters that were
underexpressed and overexpressed, it was observed that both clustering methods give t values
close to zero.
Therefore, the distributions of genes obtained from the two clustering algorithms were taken to be consistent.
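The one-sample t statistics can be recomputed directly from the cluster sizes in Table 4.2. This is a minimal sketch using the hypothesized population mean of 520 given in the text; the function name is illustrative.

```python
import math

def one_sample_t(values, mu):
    """Return (mean, sample variance, t statistic) for a one-sample t-test."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    t = (mean - mu) / (math.sqrt(variance) / math.sqrt(n))
    return mean, variance, t

som_sizes = [936, 40, 880, 280]        # S1, S4, S7, S8
kmeans_sizes = [455, 435, 581, 602]    # K1, K5, K8, K9

som_mean, som_var, som_t = one_sample_t(som_sizes, mu=520)
km_mean, km_var, km_t = one_sample_t(kmeans_sizes, mu=520)
print(som_mean, som_var, round(som_t, 3))
print(km_mean, km_var, round(km_t, 3))
```

The variance here is computed from squared deviations about the mean, which is algebraically the same as the sum-of-squares formula used in the hand calculation.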
4.2. Genes present in clusters of similar expression profile and co- expressed:
From the results of SOM and K-means, it was observed that the expression patterns
of most of the clusters were almost the same and that these clusters also contained almost the same
genes. The list of co-expressed genes (Table 4.3) was selected after the comparative
analysis of clusters showing similar expression profiles.
Table 4.3: Genes which were present in clusters of similar expression profile and co-expressed.
S.No  Gene No.  Gene Name
1     36        Paxillin
2     55        G-protein coupled receptor 19
3     67        Homo sapiens cDNA FLJ40901 fis, clone UTERU2003704
4     90        Complement component 8, alpha polypeptide
5     146       Damage-specific DNA binding protein 1, 127 kDa
6     156       Connective tissue growth factor
7     186       Rhotekin
8     211       Hypothetical protein MGC2747
9     212       Homo sapiens HC6 (HC6), mRNA, complete cds
10    216       Brain acyl-CoA hydrolase
11    231       ESTs, weakly similar to A43932 mucin 2 precursor, intestinal - human (fragments) [Homo sapiens]
12    250       Glycoprotein IX (platelet)
13    253       Protein kinase C, theta
14    256       Potassium voltage-gated channel, shaker-related subfamily, member 1 (episodic ataxia with myokymia)
15    261       Nucleoporin 214 kDa
16    342       Sperm associated antigen 11
17    347       Microtubule-associated protein 1B
18    349       Ribosomal protein L15
19    357       JTV1 gene
20    368       DKFZP566C134 protein
21    384       Spindlin
22    391       EST
23    419       Synaptotagmin VI
24    431       Cofilin 2 (muscle)
25    437       BRCA1 associated RING domain 1
26    453       Integrin, alpha 5 (fibronectin receptor, alpha polypeptide)
27    497       Uncharacterized hematopoietic stem/progenitor cells protein MDS029
28    514       Methyl-CpG binding domain protein
29    542       ESTs
30    582       Steroid sulfatase (microsomal), aryl sulfatase C, isozyme S
31    665       Annexin A7
32    786       ESTs, weakly similar to L1 repeat, Tf subfamily, member 18 [Mus musculus]
33    793       Homo sapiens cDNA FLJ37399 fis, clone BRAMY2027587
34    816       ESTs
35    819       Major Histocompatibility Complex, class I, C
36    827       Zinc finger protein 177
37    862       MAD, mothers against decapentaplegic homolog 3 (Drosophila)
38    892       Homo sapiens, clone IMAGE:3610040, mRNA
39    904       Copin-like protein
40    906       Pleckstrin homology, Sec7 and coiled/coil domains 4
41    912       Sulfide dehydrogenase like (yeast)
42    929       CLLL8 protein
43    941       ESTs
44    972       Cytochrome P450, subfamily XXVIA, polypeptide 1
45    1001      Tripartite motif-containing 5
Clustering of the normalized data was carried out using K-means and Self Organizing Maps,
and in both cases the entire dataset was clustered into nine different clusters based on the
expression profile of genes in breast cancer. The data set was retrieved from the Stanford
Microarray Database (SMD), Department of Pathology, Stanford University. Genesis was used
to normalize the raw data retrieved from SMD and to cluster the genes with the clustering
algorithms. In the case of SOM, clusters S1, S3, S6 and S7 had the maximum number of genes,
and the genes clustered in S6, S3, S1 and S7 were highly co-expressed. Similarly, in the case of
K-means, clusters K2, K4, K6 and K9 had the maximum number of genes, and the genes
clustered in K2, K4 and K9 were highly co-expressed. The expression of genes
in some of the clusters was found to be similar; some genes were underexpressed while others
were overexpressed. The genes that were overexpressed were expressed beyond
the usual/expected degree. Depending on the fluorescence ratios, the overexpressed genes
were assigned positive values. In contrast, the genes that were underexpressed
were not expressed up to the usual/expected degree. The underexpressed genes were assigned
negative values. The value of the log transformation ratio ranges from 0 to + in the case of
overexpressed genes, and in the case of underexpressed genes the value of the log transformation ratio
ranges from 0 to -1. Due to the overexpressed and underexpressed genes in cancer cells,
disruption of the normal growth control processes of cells occurs.
SUMMARY AND CONCLUSION
Breast cancer, like other cancers, occurs because of an interaction between the environment
and a defective gene. Normal cells divide as many times as needed and stop. They attach to
other cells and stay in place in tissues. Cells become cancerous when mutations destroy their
ability to stop dividing, to attach to other cells and to stay where they belong. The mutations
known to cause cancer, such as those in p53, BRCA1 and BRCA2, occur in the error-correcting
mechanisms. Microarray experiments generate mountains of data, which have to be stored and
analyzed. A number of clustering techniques have been used to group genes based on their
expression patterns. The user has to choose the appropriate method for each task. The basic
concept in clustering is to identify and group together similarly expressed genes and
then to correlate the observations to biology. The idea is that co-regulated and
functionally related genes are grouped into clusters. The microarray data analysis for breast
cancer was carried out by using Genesis. The data set was retrieved from SMD. Genesis was
used to normalize the raw data retrieved from SMD and to
cluster the genes with clustering algorithms. The results were generated with the help of
the SOM and K-means techniques. The genes were clustered into 9 different clusters with
both techniques, based on the expression profile of those genes in breast cancer. In the case of
SOM, clusters S1, S3, S6 and S7 had the maximum number of genes. Genes that were clustered
in S6, S3, S1 and S7 were highly co-expressed. Similarly, in the case of K-means, clusters
K2, K4, K6 and K9 had the maximum number of genes. Genes that were clustered in clusters
K2, K4 and K9 were highly co-expressed. The expression of genes in some of the clusters
was found to be similar, and some genes were underexpressed while others were
overexpressed.
CONCLUSION
Thus, it was concluded from the statistical analysis that the clustering results obtained by
the two techniques were the same and approximately accurate. Forty-five genes were identified
that were co-expressed in different clusters. In future work, promoter analysis can be
carried out to analyze the regulatory systems of these forty-five genes. Drug targets can be
identified with the help of this regulatory system analysis. The functions of these forty-five
genes are unknown and can be predicted on the basis of the known genes in the same cluster.
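One way to put a number on how closely two clusterings agree is the Rand index, the fraction of gene pairs that both methods treat the same way (together in both, or apart in both). The sketch below is an illustration only and is not the statistical analysis used in this work; the function name rand_index and the five example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (1.0 = identical grouping)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same genes")
    if len(labels_a) < 2:
        raise ValueError("at least two genes are needed to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]  # pair co-clustered by method A?
        same_b = labels_b[i] == labels_b[j]  # pair co-clustered by method B?
        agree += same_a == same_b
        total += 1
    return agree / total


# Made-up labels for five genes (NOT results from the study).
som_labels = ["S1", "S1", "S3", "S3", "S6"]
kmeans_labels = ["K2", "K2", "K4", "K4", "K4"]
print(rand_index(som_labels, kmeans_labels))  # 0.8
```

Cluster names do not need to match between the two methods, since only the pairing of genes is compared.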
REFERENCES
Anderson, K.S., Sibani, S., Wong, J. and Raphael, J. (2009). Using custom protein
microarrays to identify autoantibody biomarkers for the early detection of breast cancer:
Cancer Res; 69: 223-233
Bertucci, F., Birnbaum, D. and Hassoun, J. (2004). Protein Expression Profiling Identifies
Subclasses of Breast Cancer and Predicts Prognosis, American Association for Cancer
Research.2: 12-29
Calvo, M.B., Antolin, N. S. and Vilamil, M.V. (2009). MicroRNA for circulating tumor
cells detection in breast cancer: In-silico and in-vitro analysis, Tumor Biology,11:280-285
Cannaves, J. M. and Leon, D. A. (2003). In silico study of breast cancer associated gene 3
using LION Target Engine and other tools. Biotechniques 7: 1222-1228
David, S., Larson, G., Glackin,C. and Shove, O. (2008). Meta-analysis of breast cancer
microarray studies in conjunction with conserved cis-elements suggest patterns for
coordinate regulation.BMC Bioinformatics 9: 1186-1471
Goodison, S., Sun, Y. and Urquidi, V. (2010). Derivation of cancer diagnostic and
prognostic signatures from gene expression data.Bioanalysis 5: 855-862
Holko, M., Scholtens, D. and Khan, S. A. (2009). Differential gene expression associated
with estrogen receptor status of breast cancer identified by microarray meta analysis.
American Association for Cancer Research 69: 5472
Hyat, M., Tramm,T., Alsner, J. and Finak,G. (2010). In-silico ascription of gene
expression differences to tumor and stromal cells in a model to study impact on breast cancer
outcome.Public Library of Science 5:1371
Ingrid, A. H., Kristen, M.C. and Sofia, K.G. (2006). Microarrays in breast cancer research
and clinical practice. Endocrine Related Cancer 13: 1017-1031
Jones, C., Mckay, A., Cossu, A. and Davies, S. (2004). Gene expression profiling for
Molecular Characterization of Inflammatory Breast Cancer and Prediction of Response to
Chemotherapy.Cancer Res 64:8558
Kristine, D. N. (2002). Microarrays and Breast Cancer: Predicting Metastatic Disease.
Medscape Hematology-Oncology. Abstract nr 2067
Kallioniemi, A., Veldman, R. and Jiang, Y. (2000). Comparative Genomic Hybridization
Analysis of 38 Breast Cancer cell lines-A Basis for Interpreting Complementary DNA
Microarray data. American Association of Cancer Research, 60:4519
Macgregor, P.F. (2003). Gene expression in cancer-applications of microarrays. Expert
Review of Molecular Diagnostics, 3:185-200
Margossian, A., Diaz, J. and Corvalan, A. (2009). In silico analysis of Breast Cancer
Transcriptome Libraries distinguishes Tumor Subclasses. Cancer Res; 69:1165
Miller, D.L. and Edison, T. L. (2007). Expression genomics in breast cancer research-
microarrays at the crossroads of biology and medicine. Breast Cancer Research, 9:206
Pandey, A., Sukumar, S., Cole, R.N. and Chen, H. (2010). Proteomic characterization of
Her 2/neu over expressing breast cancer cells. Proteomics, John Wiley and Sons Ltd,10
:3800-3810
Potamias, G., Analyti, A. and Tollis, Y. (2004). Breast cancer microarrays and Biomedical
Informatics-The Prognochip Project. 1st International Advanced Research Workshop on In
silico Oncology: Advances and Challenges, Sparta, Greece.
Shen, D., He, J. and Chang, H.R. (2005). In silico identification of breast cancer genes by
combined multiple high throughput analysis. International Journal of Molecular Medicine 15
:205-212
Sultana, Z., Craig, D. B. and Dombkowski, A.A. (2011).In silico analysis of combinatorial
microRNA Activity Reveals Target Genes and Pathways Associated with Breast Cancer
Metastasis. Cancer Inform 17 :13-29
Tommasi, S., Pilato, B., Pinto, R. and Bruno, M. (2008). Molecular and in silico analysis
of BRCA1 and BRCA2 variants.Mutat Res 644 :64-70
Vinals, C., Gauli, S. and Coche, T. (2001). Using in silico transcriptomics to search for
tumor-associated antigens for immunotherapy. Vaccine 19 :2607-2614
Wendy, K., Andrews, J., Pilon, J. and Hodgson, A. (2010). Gene expression Profiling
predicts clinical outcomes of breast cancer. Nature 415:530-536
Yang, Y., Choi, J. and Yoon, S. (2006). Large scale data mining approach for gene-specific
standardization of microarray gene expression data. Oxford Journal, Bioinformatics 22
:2818-2904
WEBSITES
1. http://www.helsinki.fi/biochipcenter
2. http://www.microarrays.btk.utu.fi
3. http://www.med.uio.no/dnr/microarray/
4. http://www.genome.tugraz.at/genesisclient/1.7.2/install.htm
5. http://www.smd.Stanford.EDU/
6. http://www.ncbi.nlm.nih.gov.in
7. http://www.aacr.gov