in silico analysis of microarray data for breast cancer

7/29/2019 IN SILICO ANALYSIS OF MICROARRAY DATA FOR BREAST CANCER

INTRODUCTION

Cancer is a class of diseases or disorders characterized by uncontrolled division of cells and

the ability of these to spread, either by direct growth into adjacent tissue through invasion, or

by implantation into distant sites by metastasis (where cancer cells are transported through

the bloodstream or lymphatic system). Cancer may affect people at all ages, but risk tends to

increase with age. It is one of the principal causes of death in developed countries.

There are many types of cancer. Severity of symptoms depends on the site and character of

the malignancy and whether there is metastasis. A definitive diagnosis usually requires the

histological examination of tissue by a pathologist. This tissue is obtained by biopsy or

surgery. Most cancers can be treated and some cured, depending on the specific type, location, and stage. Once diagnosed, cancer is usually treated with a combination of surgery,

chemotherapy and radiotherapy. As research develops, treatments are becoming more

specific for the type of cancer pathology. Drugs that target specific cancers already exist for

several types of cancer. If untreated, cancers may eventually cause illness and death, though

this is not always the case.

The unregulated growth that characterizes cancer is caused by damage to DNA, resulting in

mutations to genes that encode proteins controlling cell division. Many mutation events may be required to transform a normal cell into a malignant cell. These mutations can be

caused by radiation, chemicals or physical agents that cause cancer, which are called

carcinogens, or by certain viruses that can insert their DNA into the human genome.

Mutations occur spontaneously, and may be passed down from one cell generation to the

next as a result of mutations within germ lines. However, some carcinogens also appear to

work through non-mutagenic pathways that affect the level of transcription of certain genes

without causing genetic mutation. Many forms of cancer are associated with exposure to

environmental factors such as tobacco smoke, radiation, alcohol, and certain viruses. Some

risk factors can be avoided or reduced.

Breast cancer is a cancer originating from breast tissue, most commonly from the inner lining

of milk ducts or the lobules that supply the ducts with milk. The primary risk factors for

breast cancer are sex, age, lack of childbearing or breastfeeding, high hormone levels, race,

etc. The classification of breast cancer includes TNM staging (tumor, nodes, metastasis),

grade, receptor status and the presence or absence of genes as determined by DNA testing.

Breast cancer, like other cancers, occurs because of an interaction between the environment

and a defective gene. Normal cells divide as many times as needed and stop. They attach to

other cells and stay in place in tissues. Cells become cancerous when mutations destroy their

ability to stop dividing, to attach to other cells and to stay where they belong. When cells

divide, their DNA is normally copied with many mistakes. Error-correcting proteins fix those

mistakes. The mutations known to cause cancer, such as p53, BRCA1 and BRCA2, occur in

the error-correcting mechanisms. These mutations are either inherited or acquired after birth.

Presumably, they allow the other mutations, which allow uncontrolled division, lack of

attachment, and metastasis to distant organs. Normal cells will commit cell suicide

(apoptosis) when they are no longer needed. Until then, they are protected from cell suicide

by several protein clusters and pathways. One of the protective pathways is the PI3K/AKT

pathway; another is the RAS/MEK/ERK pathway. Sometimes the genes along these

protective pathways are mutated in a way that turns them permanently "on", rendering the

cell incapable of committing suicide when it is no longer needed. This is one of the steps that

cause cancer in combination with other mutations. Normally, the PTEN protein turns off the

PI3K/AKT pathway when the cell is ready for cell suicide. In some breast cancers, the gene

for the PTEN protein is mutated, so the PI3K/AKT pathway is stuck in the "on" position, and

the cancer cell does not commit suicide.

Mutations that can lead to breast cancer have been experimentally linked to estrogen

exposure.

Failure of immune surveillance, the removal of malignant cells throughout one's life by the

immune system, can also contribute. Abnormal growth factor signalling in the interaction between stromal cells

and epithelial cells can facilitate malignant cell growth. The extensive use of DNA

microarray technology in the characterization of the cell transcriptome is leading to an ever


increasing amount of microarray data from cancer studies. Although similar questions for the

same type of cancer are addressed in different studies, a comparative analysis of their results

is hampered by the use of heterogeneous microarray platforms and analysis methods.

Gene expression profiling by DNA microarrays has become an important tool for studying

the transcriptome of cancer cells, and has been successfully used in many studies of tumour

classification and of identification of marker genes associated with cancer. With an

increasing number of microarray data becoming available, the comparison of studies with

similar research goals, to identify genes being differentially expressed in normal versus

tumour tissue, has gained high importance. In general, the evaluation of multiple data sets

promises to yield more reliable and more valid results since these results are based on a

larger number of samples and the effects of individual study-specific biases are weakened. However, the comparison of results from different microarray studies is hampered by the fact

that different studies use different protocols, microarray platforms and analysis techniques.

The question whether the results of gene expression measurements obtained by different

platforms can be compared has been addressed in several studies. It has been found that

results derived from the measurements like lists of tumour subtype marker genes or measures

of intra-study correlation of gene expression patterns can be compared and thus inter-

validated between different platforms. However, the measures of gene expression themselves

could not be directly compared between different platforms. Some studies propose methods

for meta-analysis of microarray data with the goal to identify significantly differentially

expressed genes across studies by using statistical techniques that avoid the direct

comparison of gene expression values.
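One way to combine studies without comparing raw expression values is to combine the per-study p-values of a gene. The sketch below uses Fisher's combined probability test as an illustration; the source does not say which statistical technique the cited studies used, so the choice of Fisher's method, the function name and the example p-values are all assumptions. The p-values are assumed to be independent and to lie in (0, 1].

```python
import math


def fisher_combined_p(p_values):
    """Combine independent per-study p-values for one gene (Fisher's method)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Test statistic X = -2 * sum(ln p_i); under the null hypothesis it follows
    # a chi-square distribution with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # Closed-form upper tail of a chi-square with an even number (2k) of
    # degrees of freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical p-values for one gene from three studies on different platforms
combined = fisher_combined_p([0.04, 0.10, 0.03])
print(combined)
```

Because only p-values enter the calculation, each study can keep its own platform, normalization and test statistic.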

The goal of this study is to investigate the benefit of performing supervised classification

analyses across disparate sources of microarray data. Methods of supervised classification

analysis render it possible to automatically build classifiers that distinguish among specimens

on the basis of predefined class label information (phenotypes), and in many cancer research

studies the application of these methods has shown promising results of improved tumor

diagnosis and prognosis.
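As a minimal sketch of supervised classification, the example below builds a nearest-centroid classifier from labelled expression vectors. This is one simple classifier chosen for illustration, not necessarily the method used in the studies referred to above; the function names, class labels and expression values are hypothetical.

```python
import math


def train_centroids(samples, labels):
    """Average the expression vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for vector, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(centroids, vector):
    """Assign a specimen to the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Hypothetical training data: rows are specimens, columns are three genes
train = [[8.1, 2.0, 5.5], [7.9, 2.3, 5.1], [3.2, 6.8, 5.3], [2.9, 7.1, 5.6]]
labels = ["tumor", "tumor", "normal", "normal"]
model = train_centroids(train, labels)
print(classify(model, [7.5, 2.5, 5.0]))
```

The predefined class labels drive the training step, which is what distinguishes this from the unsupervised clustering methods discussed later.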


Objectives:

1. To identify and retrieve Microarray data for breast tumor cells.

2. To carry out the clustering of tumorous breast cells based on the expression profile of the

genes.

3. To apply clustering algorithms such as Self-Organizing Maps (SOM) and K-means.

4. To compare and analyze the expression patterns on the basis of similarity.


REVIEW OF LITERATURE

Worldwide, breast cancer comprises 22.9% of all non-skin cancer incidence among women,

making it the most common cause of cancer death. In 2008, breast cancer caused 458,503

deaths worldwide. The World Cancer Research Fund estimated that 38% of breast cancer

cases in the US are preventable through reducing alcohol intake, increasing physical activity

levels and maintaining a healthy weight. Smoking tobacco also increases risk of breast

cancer.




In the United States, 10 to 20 percent of patients with breast cancer and patients

with ovarian cancer have a first- or second-degree relative with one of these diseases.


Mutations in either of two major susceptibility genes, breast cancer susceptibility gene 1

(BRCA1) and breast cancer susceptibility gene 2 (BRCA2), confer a lifetime risk of breast

cancer of between 60 and 85 percent and a lifetime risk of ovarian cancer of between 15 and

40 percent. However, mutations in these genes account for only 2 to 3 percent of all breast

cancers.

Breast cancer cell lines provide a useful starting point for the discovery and functional

analysis of genes involved in breast cancer. Thirty-eight established breast cancer cell lines were

studied to determine recurrent genetic alterations and the extent to which these cell lines

resemble uncultured tumors. DNA copy number profiles were generated by comparative

genomic hybridization (CGH) for most of the publicly available breast cancer cell lines

(Kallioniemi et al., 2000).

Immunotherapy approaches to fight cancer are based on the principle of mounting an

immune response against a self-antigen expressed by the tumor cells. In order to reduce

potential autoimmunity side-effects, the antigens used should be as tumor-specific as

possible. A complementary approach to experimental tumor antigen discovery is to screen

the human genome in silico, particularly the databases of "Expressed Sequence Tags"

(ESTs), in search of tumor-specific and tumor-associated antigens. The public databases

currently provide a massive amount of ESTs from several hundreds of cDNA tissue libraries,

including tumoral tissues from various types. We describe a novel method of EST database

screening that allows new potential tumor-associated genes to be efficiently selected. The

resulting list of candidates is enriched in known genes, described as being expressed in tumor

cells (Vinals et al., 2001).

Molecular biology has affected the clinical management of breast cancer in challenging

ways. One technique can be used to compare the expression patterns

of thousands of genes between different cell types. Cancer researchers use a tumor's gene expression profile to distinguish it from other tumor types. Only 5% of patients with a

good profile developed metastasis by 5 years after initial therapy and 15% developed

metastasis by 10 years after therapy. Using multivariate analysis, the gene expression

signature was the strongest predictor for metastasis-free survival. This gene expression

profile may be used in tailoring therapy for breast cancer and it can greatly reduce the


number of patients who would receive unnecessary adjuvant systemic treatment (Kristine,

2002).

Sequence analysis of individual targets is an important step in annotation and validation.

Investigation of the human breast cancer BCA3 gene was carried out with LION Target Engine and

with other bioinformatics tools. LION Target Engine confirmed that the BCA3 gene is located at

11p15.4. A significant number of new orthologs were also identified and these were the basis

for a high quality protein secondary structure prediction. Sequence conservation from

multiple sequence alignment (MSA), splice variant identification (SVI), secondary structure

prediction and predicted phosphorylation sites suggested that the removal of interaction sites

through alternative splicing might play a modulatory role in BCA3 (Canaves and Leon,

2003).

Genome wide monitoring of gene expression using DNA microarrays represents one of the

latest breakthroughs in experimental biology. As microarray analysis emerges from its

infancy, there is widespread hope that microarrays will significantly impact on our ability to

explore the genetic changes associated with cancer etiology and development and ultimately

lead to the discovery of new biomarkers for disease diagnosis (Macgregor, 2003).

Breast cancer is a heterogeneous disease whose evolution is difficult to predict by using

classic histoclinical prognostic factors. Prognostic classification can benefit from molecular

analysis such as large-scale expression profiling. Hierarchical Clustering (HCL) identified

relevant clusters of co expressed proteins and clusters of tumors. Protein expression profiling

may be a clinically useful approach to assess breast cancer heterogeneity (Bertucci et al.,

2004).

Inflammatory breast cancer (IBC) is a rare but aggressive form of breast cancer with a five

year survival limited to 40%. Diagnosis based on clinical and pathological criteria may be

difficult. Gene expression profiling has the potential to contribute to a better understanding of

IBC and to provide new diagnostic and predictive factors for IBC as well as potential

therapeutic targets (Jones et al., 2004).


Correlating risk factors with genomic data promises individualized treatment for

patients but requires interpretation of complex, multivariate patterns in gene

expression data, as well as assessment of their ability to improve clinical predictions. DNA

microarray data from samples of primary breast tumors were analyzed using non-linear

statistical analysis to assess multiple patterns of interaction among groups of genes that have

predictive value for the individual patient with respect to lymph node metastasis and cancer

recurrence. Aggregate patterns of gene expression (metagenes) were identified with about

90% accuracy (Potamias et al., 2004).

Publicly available human genomic sequence data provide an unprecedented opportunity for

researchers to decode the functionality of the human genome. The Cancer Genome Anatomy Project

(CGAP) and Gene Expression Omnibus (GEO) are two bioinformatics infrastructures for

studying functional genomics. The feasibility was explored for incorporating the Internet-

available bioinformatics databases to discover human breast cancer-related genes. Several

tools including Gene Finder, Virtual Northern and SAGE digital gene expression displayer

(DGED) were used to analyze differential gene expression between benign and malignant

breast tissue. A pilot study was performed using both expressed sequence tags (EST) and

SAGE to analyze the expression of a panel of known genes, including the high-abundance genes

actin and G3PDH, BRCA1 and p53, and two breast cancer related genes, Her2/neu and

MUC1. The screening produced 53 differentially expressed genes under the

criteria of a greater than 5-fold difference and p < 0.01. Combined multiple high-throughput

analysis was an effective data mining strategy in cancer gene identification (Shen et al.,

2005).
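The screening criteria quoted above (more than 5-fold difference and p < 0.01) can be sketched as a simple filter. The record layout, gene names and values below are hypothetical, and treating the fold difference symmetrically (up or down) is an assumption; the source does not state how the fold difference or the p-values were computed.

```python
def select_differential(records, min_fold=5.0, max_p=0.01):
    """Keep genes with a fold difference above min_fold (either direction) and p below max_p.

    Each record is (gene_name, benign_level, malignant_level, p_value).
    """
    selected = []
    for name, benign, malignant, p_value in records:
        if benign <= 0 or malignant <= 0:
            continue  # fold difference is undefined without positive levels
        ratio = malignant / benign
        fold = ratio if ratio >= 1.0 else 1.0 / ratio
        if fold > min_fold and p_value < max_p:
            selected.append((name, fold, p_value))
    return selected


# Hypothetical expression levels in benign and malignant tissue
records = [
    ("GENE_A", 10.0, 80.0, 0.001),   # 8-fold up, significant
    ("GENE_B", 50.0, 60.0, 0.0001),  # significant but small change
    ("GENE_C", 90.0, 10.0, 0.005),   # 9-fold down, significant
    ("GENE_D", 5.0, 60.0, 0.2),      # large change but not significant
]
hits = select_differential(records)
print([name for name, _, _ in hits])
```

Requiring both conditions removes genes with large but unreliable changes as well as genes with reliable but small changes.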

Characteristic patterns of gene expression have emerged, reflecting molecular differences

among different types of cancer. The application of microarray based technologies in the

investigations of molecular markers has evolved from descriptive biological definitions of

the tumor landscape to more functional and mechanistic studies with correlations to clinically

important traits within the field of breast cancer management. Array based molecular

profiling represents a technological breakthrough in how biological samples, for instance

tumor specimens, can be analyzed (Ingrid et al., 2006).


The identification of changes in gene expression in multifactorial diseases such as

breast cancer is a major goal of DNA microarray experiments. Considering the behaviour of genes

in a wide variety of experiments can improve the statistical reliability of

identifying genes with moderate changes between samples. The approach was evaluated via

the re-identification of breast cancer specific genes. It successfully prioritized

several genes associated with breast tumors for which the expression difference between normal

and breast cancer cells was marginal and would have been difficult to recognize using

conventional analysis methods (Yang et al., 2006).

Genome-wide expression microarray studies have revealed that the biological and clinical

heterogeneity of breast cancer can be partly explained by information embedded within a

complex but ordered transcriptional architecture. Comprising this architecture are gene

expression networks, or signatures, reflecting biochemical and behavioral properties of

tumors that might be harnessed to improve disease subtyping, patient prognosis and

prediction of therapeutic response. Emerging 'hypothesis-driven' strategies that incorporate

knowledge of pathways and other biological phenomena in the signature discovery process

are linking prognosis and therapy prediction with transcriptional readouts of tumorigenic

mechanisms that better inform therapeutic options (Miller and Edison, 2007).

Gene expression measurements from breast cancer (BRCA) tumors are established clinical

predictive tools to identify tumor subtypes, identify patients showing poor/good prognosis

and identify patients likely to have disease recurrence. Experimental data were drawn from 9 published

microarray studies examining estrogen receptor negative (ER-) BRCA tumor cases in the

Oncomine database. Some of the genes identified are key transcripts previously seen

in array studies, while others are newly defined. Many of the genes identified as

overexpressed in ER- tumors were previously identified as expression markers for neoplastic

transformation in multiple human cancers (David et al., 2008).


Germ line mutations of the highly penetrant BRCA1 and BRCA2 genes have been associated with

hereditary breast cancer risk, while polymorphic variants of the two genes still have an

unknown role in breast pathogenesis. The aim of our study was to characterize BRCA1 and

BRCA2 genes polymorphic variants in familial breast cancer. 110 patients affected by

familial breast and/or ovarian cancer have been consecutively enrolled according to family

history and BRCA mutation risk. All of them have been screened for BRCA1 and BRCA2

pathogenetic mutations, SNPs and intronic variants. In silico analyses have been also

performed using different computational methods to individualize genetic variations that can

alter the two genes expression and function. BRCA1 resulted mutated in 14% while BRCA2

in 3% of cases, while 80% of patients presented at least one polymorphism. A neural network

splicing prediction model identified one BRCA1 and one BRCA2 intronic variant able to determine alternative splicing. Furthermore, Q356R BRCA1 and N289H BRCA2 appear

to show a possible harmful role also due to their location in functional regions of the two

genes. However, in silico data are not always consistent with biological evidences. In

conclusion, the SNP profile provides a basis for DNA-based cancer risk classification and helps

to define the gene alterations that could influence the biochemical activity of the protein or could

modify drug sensitivity (Tommasi et al., 2008).

Cancer patients make antibodies to tumor derived proteins that are potential biomarkers for

early detection. To detect antibodies to tumor antigens in patient sera, novel high-density

custom protein microarrays (NAPPA) were adapted. They were probed with sera from

patients with early stage breast cancer and healthy women. Custom in situ protein

microarrays can be used to detect serum tumor antigen-specific antibodies and enable the

rapid, simultaneous detection of immunogenic tumor antigens from patient sera. These

autoantibodies were being evaluated as potential biomarkers for the early diagnosis of breast

cancer (Anderson et al., 2009).

Detection of circulating tumor cells (CTC) may provide diagnostic and prognostic

information in breast cancer patients. Deregulation of miRNAs is frequent in tumor

progression. A Phase I preclinical study was performed by means of computational tools for

miRNA profiling, including MIRGATOR, MIRBASE, SMIRNAdb, Gene HUB-GEPIS and

MICRORNA.ORG. The computational tools identified a set of miRNAs, and the bioinformatics approach is

a useful high-throughput method to select associated miRNAs. The selected miRNAs should

be further evaluated for CTC detection (Calvo et al., 2009).

Estrogen receptor positive breast cancer can be prevented by Tamoxifen. Microarray

technology has identified genomic classifiers for breast cancer prognosis and prediction.

Meta-analysis of microarray data was conducted to identify individual genes associated with

estrogen receptor (ER) status of breast cancer (Holko et al., 2009).

Breast carcinoma is one of the most common causes of cancer related death. Serial Analysis

of Gene Expression (SAGE) is a comprehensive profiling method that allows for global,

unbiased and quantitative characterization of transcriptomes. Four normal breast tissues,

eleven primary breast tumors and three breast metastatic tissues were retrieved and the data

were analyzed by Correspondence Analysis, Hierarchical Clustering, Support Tree and

Significance Analysis of Microarray. Comprehensive analysis from whole transcriptomes of

primary breast cancer tissues compared with normal breast tissue and breast cancer

metastatic tissues revealed that clinically disparate outcomes could be linked to a relatively

small number of transcripts (Margossian et al., 2009).

The ability to compare genome-wide expression profiles in human tissue samples has the

potential to add an invaluable molecular pathology aspect to the detection and evaluation of

multiple diseases. Applications include initial diagnosis, evaluation of disease subtype,

monitoring of response to therapy and the prediction of disease recurrence. The derivation of

molecular signatures that can predict tumor recurrence in breast cancer has been a

particularly intense area of investigation and a number of studies have shown that molecular

signatures can outperform currently used clinicopathologic factors in predicting relapse in this

disease. However, many of these predictive models have been derived using relatively simple

computational algorithms and whether these models are at a stage of development worthy of

large-cohort clinical trial validation is currently a subject of debate. In this review, we focus on the derivation of optimal molecular signatures from high-dimensional data and discuss

some of the expected future developments in the field (Goodison et al., 2010).


Breast tumors consist of several different tissue components. Despite the heterogeneity, most

gene expression analysis has traditionally been performed without prior microdissection of

the tissue sample. Thus the gene expression profiles obtained reflect the mRNA contribution

from the various tissue components. The differentially expressed genes were validated in

independent gene expression data from a set of laser capture microdissected invasive ductal

carcinomas (Hyat et al., 2010).

The receptor tyrosine kinase HER2 is an oncogene amplified in invasive breast cancer and its

overexpression in mammary epithelial cell lines is a strong determinant of a tumorigenic

phenotype. Accordingly, HER2-overexpressing mammary tumors are commonly indicative

of poor prognosis in patients. Several quantitative proteomic studies have employed two-

dimensional gel electrophoresis in combination with MS/MS, which provides only limited

information about the molecular mechanisms underlying HER2/neu signaling. In the present

study, we used a SILAC-based approach to compare the proteomic profile of normal breast

epithelial cells with that of Her2/neu-overexpressing mammary epithelial cells, isolated from

primary mammary tumors arising in mouse mammary tumor virus-Her2/neu transgenic mice.

We identified 23 proteins with relevant annotated functions in breast cancer, showing a

substantial differential expression. This included overexpression of creatine kinase, retinol-binding

protein 1, thymosin 4 and tumor protein D52, which correlated with the tumorigenic

phenotype of Her2-overexpressing cells. The differential expression pattern of two genes,

gelsolin and retinol binding protein 1, was further validated in normal and tumor tissues.

Finally, an in silico analysis of published cancer microarray data sets revealed a 23-gene

signature, which can be used to predict the probability of metastasis-free survival in breast

cancer patients (Pandey et al., 2010).

Genome wide DNA methylation changes in a cell line model of breast cancer metastasis have

been identified. The complex epigenetic changes that were observed, along with concurrent

karyotype analysis, led to the hypothesis that complex genomic alterations in cancer cells

are superimposed over promoter specific methylation events that are responsible for gene

specific changes observed in breast cancer metastasis (Wendy et al., 2010).


Aberrant miRNA activity has been reported in many diseases, and often numerous miRNAs

are concurrently deregulated. There was coincident loss of expression of 6 miRNAs with

metastatic potential in breast cancer. A computational method, miR-AT! was used to

investigate combinatorial activity among this group of miRNAs. A number of genes

previously implicated in cancer metastasis are among the predicted combinatorial targets,

including TGFB1, ARPC3 and RANKL (Sultana et al., 2011).

2.1 Cluster analysis of microarray information

Microarray experiments generate mountains of data, which has to be stored and analyzed.

For humans it is difficult to handle very large numeric data sets. Therefore a general concept is to try to reduce the dimensionality of the data. A number of clustering techniques have

been used to group genes based on their expression patterns. The user has to choose the

appropriate method for each task. The basic concepts in clustering are to try to identify and

group together similarly expressed genes and then try to correlate the observations to

biology. The idea is that co-regulated and functionally related genes are grouped into

clusters. Clustering provides the framework for this analysis. The hard part is to analyze the

biological processes and consequences. However, clustering can be a very useful tool. The

other side of the coin is the visualization of the information. Before clustering a large number

of requirements have to be met. The data has to be of good quality, the chip design should be

correct, one should have interesting genes contained in the chip, sample preparation has to be

flawless, data has to be properly treated (outliers, normalization etc). Clustering does not

improve the quality of the data. If poor-quality data is used, the outcome is useless as well.

2.2 Principles of clustering

Clustering organizes the data into a small number of (relatively) homogeneous groups. Usually normalization of the expression values is used. At this stage, it is of interest to look

at the changes in expression patterns, not to follow the actual numeric changes. Thus, the

methods are used to find similar expression motifs irrespective of the expression level.

Therefore, both low and high expression level genes can end up in the same cluster if the

expression profiles are correlated by shape. Most clustering methods have been

available for a long time. During the last few years, some new methods dedicated to

microarray analysis have been developed, too. The methods can be described and classified in

different ways. The most commonly used methods are described here. Clustering methods

can be grouped as supervised and unsupervised. Supervised methods assign some predefined

classes to a data set, whereas in unsupervised methods no prior assumptions are applied.

Hierarchical clustering, K-means, self-organizing maps (SOMs), and principal component

analysis (PCA) have been commonly used. There are also other methods, such as

multidimensional scaling (MDS), minimum description length (MDL), gene shaving (GS),

decision trees, and support vector machines (SVMs).
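The statement that genes with low and high expression levels can share a cluster when their profiles have the same shape can be illustrated with a correlation-based distance. The following is a minimal sketch, not the measure used by any particular package; the function name `pearson_distance` and the example profiles are assumptions made for illustration only.

```python
import math

def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Made-up profiles over five experiments.
low_gene = [0.1, 0.2, 0.1, 0.4, 0.3]
high_gene = [x * 10 + 5 for x in low_gene]   # same shape, much higher level
opposite_gene = [-x for x in low_gene]       # mirrored shape

print(pearson_distance(low_gene, high_gene))      # close to 0
print(pearson_distance(low_gene, opposite_gene))  # close to 2
```

With such a distance the absolute expression level does not matter, only the pattern of change across experiments.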

2.3 Hierarchical clustering

Hierarchical clustering is a statistical method for finding relatively homogeneous clusters.

The hierarchical clustering algorithm either iteratively joins the two closest clusters starting

from single clusters (agglomerative, bottom-up approach) or iteratively partitions clusters

starting from the complete set (divisive, top-down approach). After each step, a new distance

matrix between the newly formed clusters and the other clusters is recalculated. If there are N

cases, N-1 clustering steps are needed. There are several methods of hierarchical cluster

analysis including

Single linkage (minimum method, nearest neighbor)

Complete linkage (maximum method, furthest neighbor)

Average linkage (UPGMA).

For a set of N genes to be clustered, and an NxN distance (or similarity) matrix, the

hierarchical clustering is performed as follows:

1. Assign each gene to a cluster of its own.

2. Find the closest pair of clusters and merge them into a single cluster.

3. Compute the distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all genes are clustered.



Step 3 can be performed in different ways depending on the chosen approach. In single

linkage clustering, the distance between one cluster and another is considered to be equal to

the shortest distance from any member of one cluster to any member of the other cluster. In

complete linkage clustering, the distance between one cluster and another cluster is

considered to be equal to the longest distance from any member of one cluster to any member

of the other cluster. In average linkage, the distance between one cluster and another cluster

is considered to be equal to the average distance from any member of one cluster to any

member of the other cluster. The hierarchical clustering can be represented as a tree, or a

dendrogram. Branch lengths represent the degree of similarity between the genes. The

method does not provide clusters as such. Conceptually, different clusters and sizes of

clusters can be obtained by moving along the trunk or branches of the tree and deciding on which level to put forth branches (cut the tree). Hierarchical clustering is often applied in the

analysis of patient samples to organize the data based on the cases. Indeed, for patient

samples the hierarchical clustering method is most often the best option. Usually, in addition

to patient based organization, genes are also clustered by applying two-way clustering. On

one axis are the samples (patients) and on the other axis the genes.
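The four steps listed above, together with the three linkage rules, can be sketched in a few lines of Python. This is an illustrative sketch only, written for readability rather than speed; the function name `agglomerative` and the toy distance matrix are assumptions and are not taken from Genesis or any other package.

```python
def agglomerative(dist, linkage="average"):
    """Cluster N items given an N x N distance matrix (list of lists).

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(dist))]   # step 1: one cluster per gene
    merges = []

    def cluster_distance(a, b):
        pairs = [dist[i][j] for i in a for j in b]
        if linkage == "single":
            return min(pairs)                    # nearest neighbor
        if linkage == "complete":
            return max(pairs)                    # furthest neighbor
        return sum(pairs) / len(pairs)           # average linkage (UPGMA)

    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        # Step 3 is implicit here: distances to the merged cluster are
        # recomputed from its members on the next pass.
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)                  # step 4: repeat until one cluster
    return merges

# Made-up distances between four genes.
toy = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
for rule in ("single", "complete", "average"):
    print(rule, agglomerative(toy, rule))
```

For N = 4 genes the function records N-1 = 3 merges, and only the distance of the final merge differs between the three linkage rules in this toy example.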

2.4 Self-organizing map

Kohonen's self-organizing map (SOM) is a neural net that uses unsupervised learning for

which no prior knowledge of classes is required. SOMs are usually used to visualize and

interpret large high-dimensional data sets. In SOM, every input is connected to every output

via connections with variable weights. Also, the output nodes are highly interconnected.

SOM tries to learn to map similar input vectors (gene expression profiles) to similar regions

of the output array of nodes. The method maps the multidimensional distances of the feature

space to two-dimensional distances in the output map. The SOM algorithm is iterative. The

map attempts to represent all the available observations with optimal accuracy using a restricted set of models. At the same time, the models become ordered on the grid so that

similar models are close to each other and dissimilar models far from each other. So the order

and organization of the nodes (tentative clusters) contain more information than just the

actual partition of genes to clusters. In SOMs, the number of clusters has to be

predetermined. The value is given by the dimensions of the two-dimensional grid or array.



One has to experiment with the actual number of clusters. Most programs facilitate the

analysis of all the genes within a cluster or clusters. One should find such array dimensions

that there is a minimum number of poorly fitting genes in the clusters. The actual number of

clusters is also difficult to assign, but the square root of the number of genes is a good initial

estimate. However, it depends on the data set.
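The iterative SOM procedure and the square-root rule for the grid size can be sketched as follows. This is a simplified illustration under stated assumptions, not Kohonen's full algorithm or the Genesis implementation: the learning-rate and neighbourhood schedules, the function names `train_som` and `best_matching_unit`, and the random toy profiles are all hypothetical choices.

```python
import math
import random

def best_matching_unit(nodes, profile):
    """Return the grid position whose model vector is closest to the profile."""
    return min(
        nodes,
        key=lambda pos: sum((w - x) ** 2 for w, x in zip(nodes[pos], profile)),
    )

def train_som(profiles, rows=3, cols=3, epochs=50, seed=0):
    """Train a rows x cols map; returns {grid position: model vector}."""
    rng = random.Random(seed)
    # One model vector per grid node, initialised from randomly chosen profiles.
    nodes = {
        (r, c): list(rng.choice(profiles))
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                             # learning rate decays
        radius = 0.5 + frac * max(rows, cols) / 2.0   # neighbourhood shrinks
        for profile in rng.sample(profiles, len(profiles)):
            bmu = best_matching_unit(nodes, profile)
            for (r, c), model in nodes.items():
                grid_dist_sq = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                # Pull the winning node and its grid neighbours toward the profile.
                for k, value in enumerate(profile):
                    model[k] += rate * influence * (value - model[k])
    return nodes

# Toy data: 81 made-up profiles over five experiments.
data_rng = random.Random(1)
profiles = [[data_rng.uniform(-1, 1) for _ in range(5)] for _ in range(81)]

# Square root of the number of genes as the initial number of clusters:
# sqrt(81) = 9 clusters, arranged as a 3 x 3 grid.
side = math.isqrt(math.isqrt(len(profiles)))
som = train_som(profiles, rows=side, cols=side)

cluster_sizes = {}
for p in profiles:
    pos = best_matching_unit(som, p)
    cluster_sizes[pos] = cluster_sizes.get(pos, 0) + 1
print(cluster_sizes)
```

Because neighbouring nodes are updated together, nodes that are close on the grid end up with similar model vectors, which is why the order of SOM clusters carries information.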

2.5 K-means clustering

K-means is a least-squares partitioning method for which the number of groups, K, has to be

provided. The algorithm computes cluster centroids and uses them as new cluster seeds, and

assigns each object to the nearest seed. However, it is also possible to estimate K from the

data, taking the approach of a mixture density estimation problem.

1. The genes are arbitrarily divided into K clusters. The reference vector, i.e. the location of

each cluster centroid, is calculated.

2. Each gene is examined and assigned to the cluster whose centroid is at the minimum

distance.

3. The centroid positions are recalculated.

4. Steps 2 and 3 are repeated until the assignment of genes to the K clusters no longer

changes.

During the course of iterations, the program tries to minimize the sum, over all groups, of the

squared within-group residuals, which are the distances of the objects to the respective group

centroids. Convergence is reached when the objective function (i.e. the residual sum-of-squares)

cannot be lowered any more. The obtained groups are geometrically as compact as

possible around their respective centroids. K-means partitioning is a so-called NP-hard

problem (there is no known algorithm that would be able to solve the problem in polynomial

time), thus there is no guarantee that the absolute minimum of the objective function has

been reached. Therefore, it is good practice to repeat the analysis several times using

randomly selected initial group centroids, and check whether these analyses produce

comparable results.
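The K-means steps and the advice to repeat the analysis from several random starting points can be sketched as below. It is a minimal illustration, assuming squared Euclidean distance and random genes as initial centroids; the names `kmeans`, `sq_dist` and `best_of`, and the toy data, are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, rng, max_iter=100):
    """Run K-means once; returns (assignment, centroids, residual sum of squares)."""
    centroids = [list(p) for p in rng.sample(profiles, k)]   # random initial seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each gene to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in profiles
        ]
        if new_assignment == assignment:     # converged: no gene changed cluster
            break
        assignment = new_assignment
        # Recalculate the centroid positions.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:                      # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(profiles, assignment))
    return assignment, centroids, rss

def best_of(profiles, k, restarts=10, seed=0):
    """Repeat K-means from random seeds and keep the lowest objective value."""
    rng = random.Random(seed)
    runs = [kmeans(profiles, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])

# Made-up two-dimensional data with two obvious groups.
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centroids, rss = best_of(toy, k=2)
print(assignment, rss)
```

Keeping the run with the smallest residual sum of squares does not guarantee the absolute minimum, but comparing several runs shows whether the partition is stable.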



MATERIALS AND METHODS

Progress in microarray data generation for breast cancer now provides considerable

resources that may be used for the in silico analysis of gene expression and to define genes that

are important in the development of breast cancer. This information can further be used to

identify and validate novel drug targets.

The next advance is the availability of computational tools that can be used for the analysis

of microarray data. Numerous databases are now available which contain microarray data

for various kinds of cancers. Most of these are accessible over the Internet through

convenient Web browser interfaces. Many also permit downloading of microarray data.

The whole work for the present study can be summarized in the following steps:

1. Organism's disease: Breast cancer

2. Operating systems: Windows and Linux

3. Tools and software: available online/offline, e.g. Genesis, Bioinformatics Research Tools, etc.

4. Research hub resources: SMD, PMC, etc.

5. Retrieval of microarray data: prepare datasets using different databases.

6. Analysis of microarray data by using different clustering techniques, for example

hierarchical clustering, K-means clustering, self-organizing maps, principal component

analysis, etc.

To identify the genes responsible for the occurrence of breast cancer with the

help of microarray technology, gene expression data from wet labs is collected and

maintained in online databases to be further worked upon with the help of bioinformatics

tools and techniques. By applying tools that help to demarcate the genes of interest and

by studying their expression levels, our objectives can be met. Such microarray-based

approaches are enumerated below.



METHOD

Algorithm for Clustering Analysis:

1. Raw data from SMD were collected and downloaded.

2. Five experiments were taken and their log transformation ratios were arranged in a separate Excel file for 5000 genes, with all the parameters such as clone ID, cluster ID, accession number etc. the same.

3. The data sets were uploaded into the GENESIS software.

4. K-means clustering and Self Organizing Maps were generated.

5. Expression profile images were generated.

6. Clusters of gene expression were compared based on their similarity.

7. The number of genes that were co-expressed was calculated manually.
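The last two steps, comparing clusters and counting co-expressed genes, were done manually in this study. As an illustration only, the same count could be obtained with a set intersection. The function name `shared_genes`, the dictionary layout and the gene identifiers below are hypothetical and do not come from the data set.

```python
def shared_genes(som_clusters, kmeans_clusters):
    """Map each (SOM cluster, K-means cluster) pair to the gene IDs they share."""
    overlaps = {}
    for s_name, s_genes in som_clusters.items():
        for k_name, k_genes in kmeans_clusters.items():
            common = set(s_genes) & set(k_genes)
            if common:
                overlaps[(s_name, k_name)] = sorted(common)
    return overlaps

# Placeholder identifiers, used only to show the shape of the data.
som = {"S1": ["g1", "g2", "g3"], "S4": ["g4", "g5"]}
km = {"K8": ["g1", "g2", "g9"], "K5": ["g4", "g5", "g6"]}

for pair, genes in shared_genes(som, km).items():
    print(pair, len(genes), genes)
```

Each cluster pair is reported with the number of genes found in both clusters, which is the quantity counted by hand in the last step.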



3.1 Software for clustering and visualization

Several software packages are available for clustering and visualization of gene expression patterns.

Many programs are freely available on the Internet. Both the GeneSpring and Kensington

packages available at CSC contain several options for clustering. Despite the initial time

required for learning the use of these programs, they provide some benefits, mainly by

allowing several analyses to be done within a single package. These programs are not, as such,

any better than freely available programs. They might, however, have a more user-friendly

interface.

Table 3.1 Software for cluster analysis and visualization.

Program                                  Author                          Platform        Description
Genesis                                  Alexander Sturn (2000)          Windows, Linux  Clustering, visualization
Cluster (Version 2.11)                   Michael Eisen (1998)            Windows         Performs hierarchical clustering, self-organizing maps
SAM (Version 2.0)                        Rob Tibshirani (2005)           Excel Add-in    Significance Analysis of Microarrays: supervised learning
ScanAlyze (Version 2.5.3)                Michael Eisen (2002)            Windows         Processes fluorescent images of microarrays
TreeView (Version 1.6)                   Michael Eisen (2002)            Windows         Graphical results of analysis from Cluster
Expression Profiler (Version 1.0)        EBI (2007)                      Web             Analysis and clustering of gene expression data
GeneCluster (Version 2.0)                Whitehead Institute/MIT (2004)  Java            Self-organizing maps
J-Express (Version 7.8.1, 7.8.2, 7.8.3)  MolMine (2011)                  Java            Clustering and visualization



SMD (Stanford Microarray Database)

1. The SMD (http://www.smd.Stanford.EDU/) project stores raw and normalized data from microarray experiments, as well as their corresponding image files.

2. In addition, SMD provides interfaces for data retrieval, analysis and visualization. SMD is funded by the National Cancer Institute at the US National Institutes of Health, the

National Science Foundation, and the Howard Hughes Medical Institute. The database is a joint project in the Departments of Biochemistry

and Genetics at the School of Medicine, Stanford University.

3. Already normalized data for breast cancer cell lines was therefore obtained for gene expression analysis. The details of the data are given below.

3.3 Normalization

1. Normalization refers to computational data transformations intended to remove certain

systematic biases from microarray data, such as dye effects, intensity dependence and spatial

or print-tip effects. (In this context, it doesn't necessarily have anything to do with the normal

or Gaussian distribution.)

2. The normalization constant is calculated by the database (Stanford Microarray

Database). The first step is to select good spots on which to base the normalization. The

threshold value is initially set to > 0.65. If fewer than 10% of the spots in the print pass this

criterion, the program will use > 0.60. If fewer than 10% of the spots in the print pass the 0.60

threshold, the program will use > 0.55. All spots that pass the 0.55 threshold are used in the

normalization calculation, regardless of how many there are. If more than 10% of the spots

pass any threshold, the program uses those passing spots in the calculation and does not try a

lower threshold value.
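The threshold fallback described above can be written as a short function. This is a sketch of the rule as stated in the text, not SMD's own code; the name `select_good_spots` and the notion of a per-spot quality score are assumptions, and a print in which exactly 10% of spots pass is treated here as sufficient.

```python
def select_good_spots(scores, thresholds=(0.65, 0.60, 0.55), min_fraction=0.10):
    """Return (threshold used, indices of spots passing that threshold).

    scores: one quality score per spot (assumed to be the quantity thresholded).
    """
    for threshold in thresholds:
        passing = [i for i, s in enumerate(scores) if s > threshold]
        is_last = threshold == thresholds[-1]
        # Stop lowering the threshold once at least 10% of spots pass;
        # at the last threshold use whatever passes, however few.
        if is_last or len(passing) >= min_fraction * len(scores):
            return threshold, passing

# Made-up scores for a print of 20 spots.
scores = [0.70, 0.62, 0.62] + [0.50] * 17
print(select_good_spots(scores))

poor_print = [0.50] * 20
print(select_good_spots(poor_print))
```

In the first example only one spot (5%) passes 0.65, so the function falls back to 0.60, where three spots (15%) pass. In the second example no spot passes any threshold and the 0.55 result is returned regardless.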



3.1.Normalized microarray data analysis

Total intensity normalization relies on the assumption that most genes do not respond to

experimental conditions and so the average log ratio on the array should be zero. A single,

global, multiplicative adjustment is performed so that the average log ratio is zero for well

measured spots. All spots are normalized using the same constant, regardless of whether they

were used in the calculation. In the database, this is performed by computing normalized

values for all intensities by dividing the raw value by the normalization constant (Fig. 3.1).

Fig.3.1: Normalization of experiments by log transformation ratios.
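The global adjustment described in this section can be illustrated numerically. The sketch below assumes that the normalization constant is chosen so that the mean log2 ratio of the well-measured spots becomes zero and that it is applied to ratios; the function name `normalize_ratios` and the example values are hypothetical.

```python
import math

def normalize_ratios(ratios, good_spots):
    """Divide every ratio by one constant so good spots average to log ratio 0."""
    mean_log = sum(math.log2(ratios[i]) for i in good_spots) / len(good_spots)
    constant = 2 ** mean_log
    # The same constant is applied to all spots, including those
    # that were not used to calculate it.
    return constant, [r / constant for r in ratios]

# Made-up intensity ratios; the last spot is not a "good" spot.
ratios = [2.0, 4.0, 8.0, 1.0]
constant, normalized = normalize_ratios(ratios, good_spots=[0, 1, 2])
print(constant, normalized)
print(sum(math.log2(normalized[i]) for i in (0, 1, 2)) / 3)
```

After the adjustment the average log ratio of the good spots is zero, which is the assumption behind total intensity normalization.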

3.3 Centralization

1. It is the process of moving a distribution so that it is centered over the expected mean.

2. For the log-transformed intensity ratios, an intensity dependent centralization might help to

correct the dye bias.



3.2 Centroidal K-means clustering

Here each gene is represented by a series of ratio values that are relative to the expression

level of that gene in the reference sample. Since the reference sample is usually independent

from the experiment, the analysis is preferred to be independent from the gene expression

observed in the reference sample, and that is exactly what is achieved by mean and/or median

centering. After applying this procedure the values of each gene reflect the variation from

some property of the series of observed values such as the mean or median (Fig. 3.2).

Fig.3.2: Centroidal K-means clustering
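Mean or median centering of a gene can be shown with a small example. This is a minimal sketch; the function name `center_gene` and the values are assumptions for illustration.

```python
from statistics import mean, median

def center_gene(values, how="mean"):
    """Subtract the gene's own mean (or median) from each of its log ratios."""
    reference = mean(values) if how == "mean" else median(values)
    return [v - reference for v in values]

# Made-up log ratios of one gene across three experiments.
gene = [1.0, 2.0, 6.0]
print(center_gene(gene, "mean"))    # [-2.0, -1.0, 3.0]
print(center_gene(gene, "median"))  # [-1.0, 0.0, 4.0]
```

After centering, each value expresses the deviation from the gene's own typical level, so the result no longer depends on the expression observed in the reference sample.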



3.4 Genesis

1. Genesis is a versatile and transparent software suite for large-scale gene expression cluster

analysis. (Release: Genesis 1.7.6, Date: 2010-09-20, Genesis Server: Release 1.1.0)

2. The software enables data import and visualization, data normalization, and clustering

using: (1) Hierarchical Clustering, (2) k-means, (3) Self Organizing Maps, (4) Principal

Component Analysis and (5) Support Vector Machines.

3. An important and valuable feature of this software is to calculate and compare clustering

results from different algorithmic approaches. For example, one can begin with Hierarchical

Clustering to get a first impression on the number of patterns hidden in the dataset and then

use this information to adjust the parameters for k-means and SOM clustering.

4. PCA can be used to visualize these clusters in 3D space and get an impression on cluster

size, integrity, and distribution, and to retrieve the most significant patterns in a study. It can

also reveal some information about the number of clusters in the dataset, provided that data

clouds of genes in the principal component space representing a cluster can be distinguished.

Clustering and visualization procedures were executed with this software (Fig. 3.3).

Fig.3.3 GENESIS Software



3.5 Visualization

1. Scatter plots are very useful for the initial analysis and comparison of data sets. It is

customary to present the clustering results by grouping genes of clusters next to each other.

2. In SOM based analysis, also the order of the clusters contains information about the

relationships between clusters.

3. For the manual analysis of the goodness of clustering, one has to look at the expression

patterns within the generated clusters.

4. Genes within a cluster should follow the average expression pattern of the cluster. Because

every gene has to be assigned to a cluster, genes with unique expression patterns do not fit

well in any group.



RESULTS AND DISCUSSION

4.1 Expression profile for K-means clustering:

Based on the raw data, the 5000 genes were divided into 9 different clusters by K-means

clustering. The expression patterns of the clustered genes are shown in Fig. 4.1. Some genes

were under expressed and some genes were over expressed.

Fig. 4.1: Expression profile for k-means clustering



4.2 Single cluster image for K-means clustering:

The cluster image for one particular K-means cluster shows the expression of

genes based on the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN

columns denote the log-transformed ratios for five different experiments, while all

the other parameters such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession

Number remain the same for all five experiments (Fig. 4.2).

Fig. 4.2: Single cluster image for K-means clustering



4.3 Expression profile for Self Organizing Map (SOM) clustering:

Based on the raw data, the 5000 genes were divided into 9 different clusters by the Self

Organizing Map. The expression patterns of the clustered genes are shown in Fig. 4.3.

Some genes were under expressed and some genes were over expressed.

Fig. 4.3: Expression profile for Self Organizing Map (SOM) clustering.



4.4 Single cluster image for SOM clustering:

The cluster image for one particular SOM cluster shows the expression of

genes based on the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN

columns denote the log-transformed ratios for five different experiments, while all

the other parameters such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession

Number remain the same for all five experiments (Fig. 4.4).

Fig. 4.4: Single cluster image for Self Organizing Map



4.1 Distribution of genes in different clusters using the SOM and K-means clustering

algorithms:

After the genes were clustered by K-means clustering and Self Organizing Maps, the

distribution of the genes across clusters was compared between the two clustering algorithms (Table 4.1).

Table 4.1: Distribution of genes in different clusters using the SOM and K-means

clustering algorithms.

Self Organizing Map              K-means
Cluster  Genes  Genes (%)        Cluster  Genes  Genes (%)
S1       936    19               K1       455    9
S2       209    4                K2       766    15
S3       1052   21               K3       537    11
S4       40     1                K4       679    14
S5       49     1                K5       435    9
S6       1099   22               K6       587    12
S7       880    18               K7       358    7
S8       280    6                K8       581    12
S9       455    9                K9       602    12
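The percentage columns of Table 4.1 follow directly from the gene counts. As a small check, the calculation can be reproduced as below; the counts are copied from the table, while the function name `percentages` is only an illustrative choice.

```python
som_counts = {"S1": 936, "S2": 209, "S3": 1052, "S4": 40, "S5": 49,
              "S6": 1099, "S7": 880, "S8": 280, "S9": 455}
kmeans_counts = {"K1": 455, "K2": 766, "K3": 537, "K4": 679, "K5": 435,
                 "K6": 587, "K7": 358, "K8": 581, "K9": 602}

def percentages(counts):
    """Return each cluster's share of all genes, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}

print(sum(som_counts.values()), percentages(som_counts))
print(sum(kmeans_counts.values()), percentages(kmeans_counts))
```

Both methods account for all 5000 genes. SOM produces very uneven cluster sizes (40 to 1099 genes), whereas the K-means clusters are of more similar size (358 to 766 genes).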



The expression of genes in clusters S4 and K5 (Figure 4.5) was found to be similar, and the genes were over

expressed, as the two clusters show a similar pattern of peaks. These genes were expressed beyond

the usual/expected degree, leading to disruption of normal

growth control processes of cells. Depending on the fluorescence ratios, the over expressed genes

are assigned positive values. The values may range from 0 to +1.

Figure 4.5: Comparison of cluster no. 4 of SOM (S4) and cluster no. 5 of K-means (K5) based on their

expression profile.

The expression of genes in clusters S7 and K9 (Figure 4.6) was found to be similar, and most of the

genes were under expressed, that is, they were not expressed to the usual/expected degree,

leading to disruption of normal growth control processes of cells. Depending on the fluorescence

ratios, the under expressed genes are assigned negative values. The values may range from 0 to -1.

Figure 4.6: Comparison of cluster no. 7 of SOM (S7) and cluster no. 9 of K-means (K9) based on their

expression profile.



The expression of genes in clusters S1 and K8 (Figure 4.7) was found to be similar. Some

genes were under expressed while others were over expressed, because some genes were

transcribed into mRNA and some were not, leading to disruption of normal growth

control processes of cells.

Figure 4.7: Comparison of cluster no. 1 of SOM (S1) and cluster no. 8 of K-means (K8) based on their

expression profile.

The expression of genes in clusters S8 and K1 (Figure 4.8) was found to be similar. Some

genes were under expressed while others were over expressed, because some genes were

transcribed into mRNA and some were not, leading to disruption of normal growth

control processes of cells.

Figure 4.8: Comparison of cluster no. 8 of SOM (S8) and cluster no. 1 of K-means (K1) based on their

expression profile.



STATISTICAL DATA FOR CLUSTER ANALYSIS:

From Table 4.1 it was seen that some of the genes distributed by the two

clustering methods were highly co-expressed, and that some cluster pairs, namely

(S1, K8), (S8, K1), (S7, K9) and (S4, K5), had the same expression profile image in K-means clustering and Self

Organizing Maps (SOM). Because a single sample (a breast tumor cell) was used, a

one-sample t-test was conducted to test the effectiveness of the sample and to assess the

accuracy of the distribution of genes.

Table 4.2. Similar clusters showing overexpressed and underexpressed genes:

SOM cluster (x)    x^2              K-means cluster (y)    y^2
936 (S1)           876096           455 (K1)               207025
40 (S4)            1600             435 (K5)               189225
880 (S7)           774400           581 (K8)               337561
280 (S8)           78400            602 (K9)               362404
Σx = 2136          Σx^2 = 1730496   Σy = 2073              Σy^2 = 1096215

Mean (n = 4):

$$\bar{x} = \frac{\sum x}{n} = \frac{2136}{4} = 534, \qquad \bar{y} = \frac{\sum y}{n} = \frac{2073}{4} = 518.25$$

Sample variance:

$$s_x^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{1730496 - 4(534)^2}{4-1} = \frac{1730496 - 1140624}{3} = \frac{589872}{3} = 196624$$

$$s_y^2 = \frac{\sum y^2 - n\bar{y}^2}{n-1} = \frac{1096215 - 4(518.25)^2}{4-1} = \frac{1096215 - 1074332.25}{3} = \frac{21882.75}{3} = 7294.25$$

Standard deviation:

$$s_x = \sqrt{196624} = 443.42, \qquad s_y = \sqrt{7294.25} = 85.4$$

Test statistic, with population mean $\mu = 520$:

$$t_x = \frac{\bar{x} - \mu}{s_x/\sqrt{n}} = \frac{534 - 520}{443.42/2} = 0.0316 \times 2 = 0.063$$

$$t_y = \frac{\bar{y} - \mu}{s_y/\sqrt{n}} = \frac{518.25 - 520}{85.4/2} = -0.0205 \times 2 = -0.041$$

Table value for $n - 1 = 3$ degrees of freedom: $t_3(5\%) = 3.182$.

Since the calculated values of |t| (0.063 for SOM and 0.041 for K-means) are smaller than the table value at the 5% probability level on

three degrees of freedom, it can be concluded that there is no significant difference

between the sample means (534 and 518.25) and the population mean (520). In other words, the

four cluster pairs (K5, S4), (K9, S7), (K8, S1) and (K1, S8) can be regarded as a random sample from

a population whose average cluster size is 520 genes.

From the statistical analysis of the distribution of the genes in the under expressed and over expressed clusters, it was observed that the t values for both

clustering methods are close to zero and well below the table value.

Therefore, the distribution of genes produced by the two clustering algorithms was considered consistent and accurate.
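The one-sample t statistic can be recomputed directly from the cluster sizes listed in Table 4.2. The following is a minimal sketch using the same formulas (sample variance with n-1 in the denominator and a population mean of 520 as stated in the text); the function name `one_sample_t` is an assumption made for illustration.

```python
import math

def one_sample_t(values, mu):
    """Return (mean, sample variance, t statistic) for a one-sample t-test."""
    n = len(values)
    mean = sum(values) / n
    variance = (sum(v * v for v in values) - n * mean ** 2) / (n - 1)
    t = (mean - mu) / (math.sqrt(variance) / math.sqrt(n))
    return mean, variance, t

som_sizes = [936, 40, 880, 280]       # S1, S4, S7, S8
kmeans_sizes = [455, 435, 581, 602]   # K1, K5, K8, K9

for label, sizes in (("SOM", som_sizes), ("K-means", kmeans_sizes)):
    mean, variance, t = one_sample_t(sizes, mu=520)
    print(label, mean, variance, round(t, 3))
```

With four values per method the test has n-1 = 3 degrees of freedom, so the statistic is compared with the 5% table value for three degrees of freedom.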

4.2. Genes present in clusters of similar expression profile and co- expressed:

Comparing the results of SOM and K-means, it was observed that the expression patterns

of most of the clusters were almost the same and that these clusters also contained almost the same genes. The list of co-expressed genes (Table 4.3) was selected after the comparative

analysis of clusters showing similar expression profiles.

Table 4.3: Genes which were present in clusters of similar expression profile and co-expressed.

S.No  Gene No.  Gene Name
1     36        Paxillin
2     55        G-protein coupled receptor 19
3     67        Homo sapiens cDNA FLJ40901 fis, clone UTERU2003704
4     90        Complement component 8, alpha polypeptide
5     146       Damage-specific DNA binding protein 1, 127kDa
6     156       Connective tissue growth factor
7     186       Rhotekin
8     211       Hypothetical protein MGC2747
9     212       Homo sapiens HC6 (HC6), mRNA, complete cds
10    216       Brain acyl-CoA hydrolase
11    231       ESTs, weakly similar to A43932 mucin 2 precursor, intestinal - human (fragments) [Homo sapiens]
12    250       Glycoprotein IX (platelet)
13    253       Protein kinase C, theta
14    256       Potassium voltage-gated channel, shaker-related subfamily, member 1 (episodic ataxia with myokymia)
15    261       Nucleoporin 214 kDa
16    342       Sperm associated antigen 11
17    347       Microtubule-associated protein 1B
18    349       Ribosomal protein L15
19    357       JTV1 gene
20    368       DKFZP566C134 protein
21    384       Spindlin
22    391       EST
23    419       Synaptotagmin VI
24    431       Cofilin 2 (muscle)
25    437       BRCA1 associated RING domain 1
26    453       Integrin, alpha 5 (fibronectin receptor, alpha polypeptide)
27    497       Uncharacterized hematopoietic stem/progenitor cells protein MDS029
28    514       Methyl-CpG binding domain protein
29    542       ESTs
30    582       Steroid sulfatase (microsomal), aryl sulfatase C, isozyme S
31    665       Annexin A7
32    786       ESTs, weakly similar to L1 repeat, Tf subfamily, member 18 [Mus musculus]
33    793       Homo sapiens cDNA FLJ37399 fis, clone BRAMY2027587
34    816       ESTs
35    819       Major Histocompatibility Complex, class I, C
36    827       Zinc finger protein 177
37    862       MAD, mothers against decapentaplegic homolog 3 (Drosophila)
38    892       Homo sapiens, clone IMAGE:3610040, mRNA
39    904       Copin-like protein
40    906       Pleckstrin homology, Sec7 and coiled/coil domains 4
41    912       Sulfide dehydrogenase like (yeast)
42    929       CLLL8 protein
43    941       ESTs
44    972       Cytochrome P450, subfamily XXVIA, polypeptide 1
45    1001      Tripartite motif-containing 5

Clustering of the normalized data by K-means and Self Organizing Maps was carried

out, and in both cases the entire dataset was clustered into nine different clusters based on the

expression profile of genes in breast cancer. The data set was retrieved from the Stanford

Microarray Database (SMD), Department of Pathology, Stanford University. Genesis was used

for normalization of the raw data and for clustering of the genes by the clustering algorithms.

In the case of SOM, clusters S1, S3, S6 and S7 had the maximum number of genes. Genes

that were clustered in S6, S3, S1 and S7 were highly co-expressed. Similarly, in the case of

K-means, clusters K2, K4, K6 and K9 had the maximum number of genes. Genes that were

clustered in K2, K4 and K9 were highly co-expressed. The expression of genes

in some of the clusters was found to be similar; some genes were under expressed while others

were over expressed. The over expressed genes were expressed beyond

the usual/expected degree. Depending on the fluorescence ratios, the over expressed genes

were assigned positive values. In contrast, the under expressed genes

were not expressed up to the usual/expected degree and were assigned

negative values. The value of the log transformation ratio ranges from 0 to +1 for over

expressed genes and from 0 to -1 for under expressed genes.

Because of the over expressed and under expressed genes in cancer cells,

disruption of normal growth control processes of cells occurs.



SUMMARY AND CONCLUSION




ability to stop dividing, to attach to other cells and to stay where they belong. The mutations

known to cause cancer, such as p53, BRCA1 and BRCA2, occur in the error-correcting

mechanisms. Microarray experiments generate mountains of data, which has to be stored and

analyzed. A number of clustering techniques have been used to group genes based on their

expression patterns. The user has to choose the appropriate method for each task. The basic

concept in clustering is to try to identify and group together similarly expressed genes and

then try to correlate the observations to biology. The idea is that co-regulated and

functionally related genes are grouped into clusters. The microarray data analysis for breast

cancer was carried out by using Genesis. The data set was retrieved from SMD. Genesis was

used for normalization of the raw data retrieved from SMD and for

clustering of the genes by clustering algorithms. The results were generated with the help of

the SOM and K-means techniques. The genes were clustered into 9 different clusters with

both techniques, based on the expression profile of those genes in breast cancer. In the case of

SOM, clusters S1, S3, S6 and S7 had the maximum number of genes. Genes that were clustered

in S6, S3, S1 and S7 were highly co-expressed. Similarly, in the case of K-means, clusters

K2, K4, K6 and K9 had the maximum number of genes. Genes that were clustered in

K2, K4 and K9 were highly co-expressed. The expression of genes in some of the clusters

was found to be similar; some genes were under expressed while others were over

expressed.



CONCLUSION

Thus, it was concluded from the statistical analysis that the clustering results obtained by the two

techniques were the same and approximately accurate. Forty-five genes were identified

which were co-expressed in different clusters. In future work, promoter analysis can be

carried out to analyze the regulatory systems of these forty-five genes. Drug targets can be

identified with the help of this regulatory system analysis. Functions of these forty-five genes

are unknown and can be predicted on the basis of the known genes of similar clusters.



REFERENCES

Anderson, K.S., Sibani, S., Wong, J. and Raphael, J. (2009). Using custom protein

microarrays to identify autoantibody biomarkers for the early detection of breast cancer:

Cancer Res; 69: 223-233

Bertucci, F., Birnbaum, D. and Hassoun, J. (2004). Protein Expression Profiling Identifies

Subclasses of Breast Cancer and Predicts Prognosis, American Association for Cancer

Research.2: 12-29

Calvo, M.B., Antolin, N. S. and Vilamil, M.V. (2009). MicroRNA for circulating tumor

cells detection in breast cancer: In-silico and in-vitro analysis, Tumor Biology,11:280-285

Cannaves, J. M. and Leon, D. A. (2003). In silico study of breast cancer associated gene 3

using LION Target Engine and other tools. Biotechniques 7: 1222-1228

David, S., Larson, G., Glackin, C. and Shove, O. (2008). Meta-analysis of breast cancer

microarray studies in conjunction with conserved cis-elements suggest patterns for

coordinate regulation. BMC Bioinformatics 9: 1186-1471

Goodison, S., Sun, Y. and Urquidi, V. (2010). Derivation of cancer diagnostic and

prognostic signatures from gene expression data. Bioanalysis 5: 855-862

Holko, M., Scholtens, D. and Khan, S. A. (2009). Differential gene expression associated

with estrogen receptor status of breast cancer identified by microarray meta analysis.

American Association for Cancer Research 69: 5472

Hyat, M., Tramm, T., Alsner, J. and Finak, G. (2010). In-silico ascription of gene

expression differences to tumor and stromal cells in a model to study impact on breast cancer

outcome. Public Library of Science 5:1371



Ingrid, A. H., Kristen, M.C. and Sofia, K.G. (2006). Microarrays in breast cancer research

and clinical practice. Endocrine Related Cancer 13: 1017-1031

Jones, C., Mckay, A., Cossu, A. and Davies, S. (2004). Gene expression profiling for

Molecular Characterization of Inflammatory Breast Cancer and Prediction of Response to

Chemotherapy. Cancer Res 64: 8558

Kristine, D. N. (2002). Microarrays and Breast Cancer: Predicting Metastatic Disease.

Medscape Hematology-Oncology. Abstract nr 2067

Kallioniemi, A., Veldman, R. and Jiang, Y. (2000). Comparative Genomic Hybridization

Analysis of 38 Breast Cancer cell lines-A Basis for Interpreting Complementary DNA

Microarray data. American Association of Cancer Research 60: 4519

Macgregor, P.F. (2003). Gene expression in cancer-applications of microarrays. Expert

Review of Molecular Diagnostics 3: 185-200

Margossian, A., Diaz, J. and Corvalan, A. (2009). In silico analysis of Breast Cancer

Transcriptome Libraries distinguishes Tumor Subclasses. Cancer Res 69: 1165

Miller, D.L. and Edison, T. L. (2007). Expression genomics in breast cancer research-

microarrays at the crossroads of biology and medicine. Breast Cancer Research 9: 206

Pandey, A., Sukumar, S., Cole, R.N. and Chen, H. (2010). Proteomic characterization of

Her 2/neu over expressing breast cancer cells. Proteomics, John Wiley and Sons Ltd, 10: 3800-3810

Potamias, G., Analyti, A. and Tollis, Y. (2004). Breast cancer microarrays and Biomedical

Informatics-The Prognochip Project. 1st International Advanced Research Workshop on

In silico Oncology: Advances and Challenges, Sparta, Greece.



Shen, D., He, J. and Chang, H.R. (2005). In silico identification of breast cancer genes by

combined multiple high throughput analysis. International Journal of Molecular Medicine 15: 205-212

Sultana, Z., Craig, D. B. and Dombkowski, A.A. (2011). In silico analysis of combinatorial

microRNA Activity Reveals Target Genes and Pathways Associated with Breast Cancer

Metastasis. Cancer Inform 17: 13-29

Tommasi, S., Pilato, B., Pinto, R. and Bruno, M. (2008). Molecular and in silico analysis

of BRCA1 and BRCA2 variants. Mutat Res 644: 64-70

Vinals, C., Gauli, S. and Coche, T. (2001). Using in silico transcriptomics to search for

tumor-associated antigens for immunotherapy. Vaccine 19: 2607-2614

Wendy, K., Andrews, J., Pilon, J. and Hodgson, A. (2010). Gene expression Profiling

predicts clinical outcomes of breast cancer. Nature 415: 530-536

Yang, Y., Choi, J. and Yoon, S. (2006). Large scale data mining approach for gene-specific

standardization of microarray gene expression data. Oxford Journal, Bioinformatics 22: 2818-2904



WEBSITES

1. http://www.helsinki.fi/biochipcenter

2. http://www.microarrays.btk.utu.fi

3. http://www.med.uio.no/dnr/microarray/

4. http://www.genome.tugraz.at/genesisclient/1.7.2/install.htm

5. http://www.smd.Stanford.EDU/

6. http://www.ncbi.nlm.nih.gov.in

7. http://www.aacr.gov