in silico analysis of microarray data for breast cancer

Upload: odisha751016

Post on 04-Apr-2018


  • 7/29/2019 IN SILICO ANALYSIS OF MICROARRAY DATA FOR BREAST CANCER


    INTRODUCTION

    Cancer is a class of diseases or disorders characterized by uncontrolled division of cells and

    the ability of these to spread, either by direct growth into adjacent tissue through invasion, or

    by implantation into distant sites by metastasis (where cancer cells are transported through

    the bloodstream or lymphatic system). Cancer may affect people at all ages, but risk tends to

    increase with age. It is one of the principal causes of death in developed countries.

    There are many types of cancer. Severity of symptoms depends on the site and character of

    the malignancy and whether there is metastasis. A definitive diagnosis usually requires the

    histological examination of tissue by a pathologist. This tissue is obtained by biopsy or

    surgery. Most cancers can be treated and some cured, depending on the specific type,

    location, and stage. Once diagnosed, cancer is usually treated with a combination of surgery,

    chemotherapy and radiotherapy. As research develops, treatments are becoming more

    specific for the type of cancer pathology. Drugs that target specific cancers already exist for

    several types of cancer. If untreated, cancers may eventually cause illness and death, though

    this is not always the case.

    The unregulated growth that characterizes cancer is caused by damage to DNA, resulting in

    mutations to genes that encode for proteins controlling cell division. Many mutation events

    may be required to transform a normal cell into a malignant cell. These mutations can be

    caused by radiation, chemicals or physical agents that cause cancer, which are called

    carcinogens, or by certain viruses that can insert their DNA into the human genome.

    Mutations occur spontaneously, and may be passed down from one cell generation to the

    next as a result of mutations within germ lines. However, some carcinogens also appear to

    work through non-mutagenic pathways that affect the level of transcription of certain genes

    without causing genetic mutation. Many forms of cancer are associated with exposure to

    environmental factors such as tobacco smoke, radiation, alcohol, and certain viruses. Some

    risk factors can be avoided or reduced.


    Breast cancer is a cancer originating from breast tissue, most commonly from the inner lining

    of milk ducts or the lobules that supply the ducts with milk. The primary risk factors for

    breast cancer are sex, age, lack of childbearing or breastfeeding, high hormone levels, race,

    etc. The classification of breast cancer includes TNM staging (tumor, node, metastasis),

    grade, receptor status and the presence or absence of genes as determined by DNA testing.

    Breast cancer, like other cancers, occurs because of an interaction between the environment

    and a defective gene. Normal cells divide as many times as needed and stop. They attach to

    other cells and stay in place in tissues. Cells become cancerous when mutations destroy their

    ability to stop dividing, to attach to other cells and to stay where they belong. When cells

    divide, their DNA is normally copied with many mistakes. Error-correcting proteins fix those

    mistakes. The mutations known to cause cancer, such as p53, BRCA1 and BRCA2, occur in

    the error-correcting mechanisms. These mutations are either inherited or acquired after birth.

    Presumably, they allow the other mutations, which allow uncontrolled division, lack of

    attachment, and metastasis to distant organs. Normal cells will commit cell suicide

    (apoptosis) when they are no longer needed. Until then, they are protected from cell suicide

    by several protein clusters and pathways. One of the protective pathways is the PI3K/AKT

    pathway; another is the RAS/MEK/ERK pathway. Sometimes the genes along these

    protective pathways are mutated in a way that turns them permanently "on", rendering the

    cell incapable of committing suicide when it is no longer needed. This is one of the steps that

    cause cancer in combination with other mutations. Normally, the PTEN protein turns off the

    PI3K/AKT pathway when the cell is ready for cell suicide. In some breast cancers, the gene

    for the PTEN protein is mutated, so the PI3K/AKT pathway is stuck in the "on" position, and

    the cancer cell does not commit suicide.

    Mutations that can lead to breast cancer have been experimentally linked to estrogen

    exposure.

    Failure of immune surveillance, the removal of malignant cells throughout one's life by the

    immune system, is another proposed contributor. Abnormal growth factor signalling in the

    interaction between stromal cells and epithelial cells can facilitate malignant cell growth.

    The extensive use of DNA

    microarray technology in the characterization of the cell transcriptome is leading to an ever


    increasing amount of microarray data from cancer studies. Although similar questions for the

    same type of cancer are addressed in different studies, a comparative analysis of their results

    is hampered by the use of heterogeneous microarray platforms and analysis methods.

    Gene expression profiling by DNA microarrays has become an important tool for studying

    the transcriptome of cancer cells, and has been successfully used in many studies of tumour

    classification and of identification of marker genes associated with cancer. With an

    increasing number of microarray data becoming available, the comparison of studies with

    similar research goals, to identify genes being differentially expressed in normal versus

    tumour tissue, has gained high importance. In general, the evaluation of multiple data sets

    promises to yield more reliable and more valid results since these results are based on a

    larger number of samples and the effects of individual study-specific biases are weakened.

    However, the comparison of results from different microarray studies is hampered by the fact

    that different studies use different protocols, microarray platforms and analysis techniques.

    The question whether the results of gene expression measurements obtained by different

    platforms can be compared has been addressed in several studies. It has been found that

    results derived from the measurements like lists of tumour subtype marker genes or measures

    of intra-study correlation of gene expression patterns can be compared and thus inter-

    validated between different platforms. However, the measures of gene expression themselves

    could not be directly compared between different platforms. Some studies propose methods

    for meta-analysis of microarray data with the goal to identify significantly differentially

    expressed genes across studies by using statistical techniques that avoid the direct

    comparison of gene expression values.
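
One way to combine evidence across studies without comparing raw expression values is to combine per-study p-values for each gene. The sketch below uses Fisher's combined probability test as an illustration. It is an assumption made here, not necessarily the technique used in the studies referred to above, and the function name is hypothetical. It assumes the per-study p-values for a gene are independent.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent per-study p-values for one gene (Fisher's method)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    # Test statistic: -2 * sum(ln p_i), chi-square with 2k degrees of freedom
    # under the null hypothesis.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)

    # Closed-form chi-square survival function for an even number (2k) of
    # degrees of freedom: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For example, `fisher_combined_pvalue([0.05, 0.05])` gives a combined p-value smaller than either input, reflecting that two studies agree on the same gene.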

    The goal of this study is to investigate the benefit of performing supervised classification

    analyses across disparate sources of microarray data. Methods of supervised classification

    analysis render it possible to automatically build classifiers that distinguish among specimens

    on the basis of predefined class label information (phenotypes), and in many cancer research

    studies the application of these methods has shown promising results of improved tumor

    diagnosis and prognosis.
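
As a minimal illustration of supervised classification on expression profiles, the sketch below implements a nearest-centroid classifier: each class is summarized by its mean expression vector, and a new specimen receives the label of the closest centroid. This is an illustrative choice, not the method of any study cited here; the function names and the toy values in the usage note are hypothetical.

```python
import math


def train_centroids(samples, labels):
    """Return {label: mean expression vector} from labeled training samples."""
    sums = {}
    counts = {}
    for vector, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }


def predict(centroids, vector):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))
```

With two genes per specimen, training on `[[1.0, 5.0], [1.2, 4.8], [4.0, 1.0], [4.2, 0.8]]` labeled `normal, normal, tumor, tumor` and calling `predict` on `[3.9, 1.2]` assigns the `tumor` label.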


    Objectives:

    1. To identify and retrieve Microarray data for breast tumor cells.

    2. To carry out the clustering of tumorous breast cells based on the expression profile of the

    genes.

    3. To apply clustering algorithms such as Self-Organizing Maps (SOM) and K-means.

    4. To compare and analyze the expression patterns on the basis of similarity.
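
The K-means step named in the objectives can be sketched as follows. This is a plain-Python version of Lloyd's algorithm under stated assumptions: each profile is a list of expression values of equal length, distance is Euclidean, and initial centroids are drawn at random from the data. The function name and parameters are illustrative, not taken from the software used in this study.

```python
import math
import random


def kmeans(profiles, k, iterations=100, seed=0):
    """Cluster expression profiles into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct profiles chosen at random.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    assignments = [None] * len(profiles)

    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(profile, centroids[c]))
            for profile in profiles
        ]
        if new_assignments == assignments:
            break  # converged: no profile changed cluster
        assignments = new_assignments

        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return assignments, centroids
```

Because the result depends on the random starting centroids, it is common to run the procedure with several seeds and compare the resulting partitions.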


    REVIEW OF LITERATURE

    Worldwide, breast cancer comprises 22.9% of all non-skin cancer incidence among women,

    making it the most common cause of cancer death. In 2008, breast cancer caused 458,503

    deaths worldwide. The World Cancer Research Fund estimated that 38% of breast cancer

    cases in the US are preventable through reducing alcohol intake, increasing physical activity

    levels and maintaining a healthy weight. Smoking tobacco also increases risk of breast

    cancer.

    Breast cancer, like other cancers, occurs because of an interaction between the environment

    and a defective gene. Normal cells divide as many times as needed and stop. They attach to

    other cells and stay in place in tissues. Cells become cancerous when mutations destroy their

    ability to stop dividing, to attach to other cells and to stay where they belong. When cells

    divide, their DNA is normally copied with many mistakes. Error-correcting proteins fix those

    mistakes. The mutations known to cause cancer, such as p53, BRCA1 and BRCA2, occur in

    the error-correcting mechanisms. These mutations are either inherited or acquired after birth.

    Presumably, they allow the other mutations, which allow uncontrolled division, lack of

    attachment, and metastasis to distant organs. Normal cells will commit cell suicide

    (apoptosis) when they are no longer needed. Until then, they are protected from cell suicide

    by several protein clusters and pathways. One of the protective pathways is the PI3K/AKT

    pathway; another is the RAS/MEK/ERK pathway. Sometimes the genes along these

    protective pathways are mutated in a way that turns them permanently "on", rendering the

    cell incapable of committing suicide when it is no longer needed. This is one of the steps that

    causes cancer in combination with other mutations. Normally, the PTEN protein turns off the

    PI3K/AKT pathway when the cell is ready for cell suicide. In some breast cancers, the gene

    for the PTEN protein is mutated, so the PI3K/AKT pathway is stuck in the "on" position, and

    the cancer cell does not commit suicide. Mutations that can lead to breast cancer have been

    experimentally linked to estrogen exposure. Failure of immune surveillance, the removal of

    malignant cells throughout one's life by the immune system, is another proposed contributor.

    Abnormal growth factor signalling in the interaction between stromal cells and epithelial cells

    can facilitate malignant cell growth. In the United States, 10 to 20 percent of patients with

    breast cancer and patients

    with ovarian cancer have a first- or second-degree relative with one of these diseases.


    Mutations in either of two major susceptibility genes, breast cancer susceptibility gene 1

    (BRCA1) and breast cancer susceptibility gene 2 (BRCA2), confer a lifetime risk of breast

    cancer of between 60 and 85 percent and a lifetime risk of ovarian cancer of between 15 and

    40 percent. However, mutations in these genes account for only 2 to 3 percent of all breast

    cancers.

    Breast cancer cell lines provide a useful starting point for the discovery and functional

    analysis of genes involved in breast cancer. Thirty-eight established breast cancer cell lines were

    studied to determine recurrent genetic alterations and the extent to which these cell lines

    resemble uncultured tumors. DNA copy number profiles were generated by comparative

    genomic hybridization (CGH) for most of the publicly available breast cancer cell lines

    (Kallioniemi et al., 2000).

    Immunotherapy approaches to fight cancer are based on the principle of mounting an

    immune response against a self-antigen expressed by the tumor cells. In order to reduce

    potential autoimmunity side-effects, the antigens used should be as tumor-specific as

    possible. A complementary approach to experimental tumor antigen discovery is to screen

    the human genome in silico, particularly the databases of "Expressed Sequence Tags"

    (ESTs), in search of tumor-specific and tumor-associated antigens. The public databases

    currently provide a massive amount of ESTs from several hundreds of cDNA tissue libraries,

    including tumoral tissues from various types. We describe a novel method of EST database

    screening that allows new potential tumor-associated genes to be efficiently selected. The

    resulting list of candidates is enriched in known genes, described as being expressed in tumor

    cells (Vinals et al., 2001).

    The ways in which molecular biology has affected the clinical management of breast cancer

    are very challenging. There is a technique which can be used to compare expression patterns

    of thousands of genes between different cell types. Cancer researchers use a tumor's gene

    expression profile to distinguish it from other tumor types. Only 5% of patients with a

    good profile developed metastasis by 5 years after initial therapy and 15% developed

    metastasis by 10 years after therapy. Using multivariate analysis, the gene expression

    signature was the strongest predictor for metastasis-free survival. This gene expression

    profile may be used in tailoring therapy for breast cancer and it can greatly reduce the


    number of patients who would receive unnecessary adjuvant systemic treatment (Kristine,

    2002).

    Sequence analysis of individual targets is an important step in annotation and validation.

    Investigation of the human breast cancer BCA3 gene was carried out with LION Target Engine and

    with other bioinformatics tools. LION Target Engine confirmed that the BCA3 gene is located at

    11p15.4. A significant number of new orthologs were also identified and these were the basis

    for a high quality protein secondary structure prediction. Sequence conservation from

    multiple sequence alignment (MSA), splice variant identification (SVI), secondary structure

    prediction and predicted phosphorylation sites suggested that the removal of interaction sites

    through alternative splicing might play a modulatory role in BCA3 (Canaves and Leon,

    2003).

    Genome wide monitoring of gene expression using DNA microarrays represents one of the

    latest breakthroughs in experimental biology. As microarray analysis emerges from its

    infancy, there is widespread hope that microarrays will significantly impact on our ability to

    explore the genetic changes associated with cancer etiology and development and ultimately

    lead to the discovery of new biomarkers for disease diagnosis (Macgregor, 2003).

    Breast cancer is a heterogeneous disease whose evolution is difficult to predict by using

    classic histoclinical prognostic factors. Prognostic classification can benefit from molecular

    analysis such as large-scale expression profiling. Hierarchical clustering (HCL) identified

    relevant clusters of co-expressed proteins and clusters of tumors. Protein expression profiling

    may be a clinically useful approach to assess breast cancer heterogeneity (Bertucci et al.,

    2004).

    Inflammatory breast cancer (IBC) is a rare but aggressive form of breast cancer with a

    five-year survival limited to 40%. Diagnosis based on clinical/pathological criteria may be

    difficult. Gene expression profiling has the potential to contribute to a better understanding of

    IBC and to provide new diagnostic and predictive factors for IBC as well as potential

    therapeutic targets (Jones et al., 2004).


    Correlation of risk factors with genomic data promises to provide treatment tailored to

    individual patients, but it requires interpretation of complex, multivariate patterns in gene

    expression data, as well as assessment of their ability to improve clinical predictions. DNA

    microarray data from samples of primary breast tumors were analyzed using non-linear

    statistical analysis to assess multiple patterns of interactions of groups of genes that have

    predictive value for the individual patient with respect to lymph node metastasis and cancer

    recurrence. Aggregate patterns of gene expression (metagenes) were identified with about

    90% accuracy (Potamias et al., 2004).

    Publicly available human genomic sequence data provide an unprecedented opportunity for

    researchers to decode the functionality of the human genome. The Cancer Genome Anatomy Project

    (CGAP) and Gene Expression Omnibus (GEO) are two bioinformatics infrastructures for

    studying functional genomics. The feasibility of using Internet-available bioinformatics

    databases to discover human breast cancer-related genes was explored. Several

    tools including Gene Finder, Virtual Northern and SAGE digital gene expression displayer

    (DGED) were used to analyze differential gene expression between benign and malignant

    breast tissue. A pilot study was performed using both expressed sequence tags (EST) and

    SAGE to analyze the expression of a panel of known genes, including the high-abundance genes

    actin and G3PDH, BRCA1 and p53, and two breast cancer-related genes, Her2/neu and

    MUC1. The analysis produced 53 differentially expressed genes under screening criteria of a

    greater than 5-fold difference and p < 0.01. Combined multiple high-throughput

    analysis was an effective data mining strategy in cancer gene identification (Shen et al.,

    2005).
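
The screening rule described above (more than a 5-fold difference and p below 0.01) can be sketched as a simple filter. The sketch assumes that expression levels are positive, non-log values and that the p-values have already been computed by some statistical test. The record layout, function name and gene identifiers in the usage note are hypothetical.

```python
def screen_genes(records, min_fold=5.0, max_p=0.01):
    """Keep genes whose fold difference exceeds min_fold and whose p-value is below max_p.

    Each record is (gene, normal_level, tumor_level, p_value).
    """
    selected = []
    for gene, normal_level, tumor_level, p_value in records:
        if normal_level <= 0 or tumor_level <= 0:
            continue  # fold difference is undefined for non-positive levels
        ratio = tumor_level / normal_level
        # Treat up- and down-regulation symmetrically.
        fold = max(ratio, 1.0 / ratio)
        if fold > min_fold and p_value < max_p:
            selected.append((gene, ratio, p_value))
    return selected
```

Given the records `("GENE_A", 10.0, 80.0, 0.001)`, `("GENE_B", 50.0, 60.0, 0.0005)`, `("GENE_C", 90.0, 10.0, 0.004)` and `("GENE_D", 5.0, 40.0, 0.2)`, only `GENE_A` (8-fold up) and `GENE_C` (9-fold down) pass both thresholds.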

    Characteristic patterns of gene expression have emerged, reflecting molecular differences

    among different types of cancer. The application of microarray based technologies in the

    investigations of molecular markers has evolved from descriptive biological definitions of

    the tumor landscape to more functional and mechanistic studies with correlations to clinically

    important traits within the field of breast cancer management. Array based molecular

    profiling represents a technological breakthrough in how biological samples, for instance

    tumor specimens, can be analyzed (Ingrid et al., 2006).


    Identifying changes in gene expression in multifactorial diseases such as

    breast cancer is a major goal of DNA microarray experiments. Considering a gene's

    behaviour across a wide variety of experiments can improve the statistical reliability of

    identifying genes with moderate changes between samples. The approach was evaluated via

    the re-identification of breast cancer-specific genes. It successfully prioritized

    several genes associated with breast tumors for which the expression difference between normal

    and breast cancer cells was marginal and would have been difficult to recognize using

    conventional analysis methods (Yang et al., 2006).

    Genome-wide expression microarray studies have revealed that the biological and clinical

    heterogeneity of breast cancer can be partly explained by information embedded within a

    complex but ordered transcriptional architecture. Comprising this architecture are gene

    expression networks, or signatures, reflecting biochemical and behavioral properties of

    tumors that might be harnessed to improve disease subtyping, patient prognosis and

    prediction of therapeutic response. Emerging 'hypothesis-driven' strategies that incorporate

    knowledge of pathways and other biological phenomena in the signature discovery process

    are linking prognosis and therapy prediction with transcriptional readouts of tumorigenic

    mechanisms that better inform therapeutic options (Miller and Edison, 2007).

    Gene expression measurements from breast cancer (BRCA) tumors are established clinical

    predictive tools to identify tumor subtypes, identify patients showing poor/good prognosis

    and identify patients likely to have disease recurrence. Experimental data were analyzed from 9

    published microarray studies examining estrogen receptor-negative (ER-) BRCA tumor cases from the

    Oncomine database. Some of the genes identified are key transcripts previously seen

    in array studies, while others are newly defined. Many of the genes identified as

    overexpressed in ER- tumors were previously identified as expression markers for neoplastic

    transformation in multiple human cancers (David et al., 2008).


    Germline mutations of the highly penetrant BRCA1 and BRCA2 genes have been associated with

    hereditary breast cancer risk, while polymorphic variants of the two genes still have an

    unknown role in breast pathogenesis. The aim of our study was to characterize BRCA1 and

    BRCA2 genes polymorphic variants in familial breast cancer. 110 patients affected by

    familial breast and/or ovarian cancer have been consecutively enrolled according to family

    history and BRCA mutation risk. All of them have been screened for BRCA1 and BRCA2

    pathogenetic mutations, SNPs and intronic variants. In silico analyses were also

    performed using different computational methods to identify genetic variations that can

    alter the expression and function of the two genes. BRCA1 was mutated in 14% and BRCA2

    in 3% of cases, while 80% of patients presented at least one polymorphism. A neural network

    splicing prediction model identified one BRCA1 and one BRCA2 intronic variant able

    to cause alternative splicing. Furthermore, Q356R BRCA1 and N289H BRCA2 appear

    to show a possible harmful role, also due to their location in functional regions of the two

    genes. However, in silico data are not always consistent with biological evidence. In

    conclusion, the SNP profile provides a basis for DNA-based cancer risk classification and helps

    to define the gene alterations that could influence the biochemical activity of the protein or could

    modify drug sensitivity (Tommasi et al., 2008).

    Cancer patients make antibodies to tumor derived proteins that are potential biomarkers for

    early detection. To detect antibodies to tumor antigens in patient sera, novel high-density

    custom protein microarrays (NAPPA) were adapted. They were probed with sera from

    patients with early stage breast cancer and healthy women. Custom in situ protein

    microarrays can be used to detect serum tumor antigen-specific antibodies and enable the

    rapid, simultaneous detection of immunogenic tumor antigens from patient sera. These

    autoantibodies were being evaluated as potential biomarkers for the early diagnosis of breast

    cancer (Anderson et al., 2009).

    Detection of circulating tumor cells (CTC) may provide diagnostic and prognostic

    information in breast cancer patients. Deregulation of miRNAs is frequent in tumor

    progression. A Phase I preclinical study was performed by means of computational tools for

    miRNA profiling, including MIRGATOR, MIRBASE, SMIRNAdb, Gene HUB-GEPIS and

    MICRORNA.ORG. The computational tools identified a set of miRNAs; the bioinformatics approach is

    a useful high-throughput method to select associated miRNAs. The selected miRNAs should

    be further evaluated for CTC detection (Calvo et al., 2009).

    Estrogen receptor positive breast cancer can be prevented by Tamoxifen. Microarray

    technology has identified genomic classifiers for breast cancer prognosis and prediction.

    Meta-analysis of microarray data was conducted to identify individual genes associated with

    estrogen receptor (ER) status of breast cancer (Holko et al., 2009).

    Breast carcinoma is one of the most common causes of cancer related death. Serial Analysis

    of Gene Expression (SAGE) is a comprehensive profiling method that allows for global,

    unbiased and quantitative characterization of transcriptomes. Four normal breast tissues,

    eleven primary breast tumors and three breast metastatic tissues were retrieved and the data

    were analyzed by Correspondence Analysis, Hierarchical Clustering, Support Tree and

    Significance Analysis of Microarray. Comprehensive analysis from whole transcriptomes of

    primary breast cancer tissues compared with normal breast tissue and breast cancer

    metastatic tissues revealed that clinically disparate outcomes could be linked to a relatively

    small number of transcripts (Margossian et al., 2009).

    The ability to compare genome-wide expression profiles in human tissue samples has the

    potential to add an invaluable molecular pathology aspect to the detection and evaluation of

    multiple diseases. Applications include initial diagnosis, evaluation of disease subtype,

    monitoring of response to therapy and the prediction of disease recurrence. The derivation of

    molecular signatures that can predict tumor recurrence in breast cancer has been a

    particularly intense area of investigation and a number of studies have shown that molecular

    signatures can outperform currently used clinicopathologic factors in predicting relapse in this

    disease. However, many of these predictive models have been derived using relatively simple

    computational algorithms and whether these models are at a stage of development worthy of

    large-cohort clinical trial validation is currently a subject of debate. In this review, we focus

    on the derivation of optimal molecular signatures from high-dimensional data and discuss

    some of the expected future developments in the field (Goodison et al., 2010).


    Breast tumors consist of several different tissue components. Despite the heterogeneity, most

    gene expression analysis has traditionally been performed without prior microdissection of

    the tissue sample. Thus the gene expression profiles obtained reflect the mRNA contribution

    from the various tissue components. The differentially expressed genes were validated in

    independent gene expression data from a set of laser capture microdissected invasive ductal

    carcinomas (Hyat et al., 2010).

    The receptor tyrosine kinase HER2 is an oncogene amplified in invasive breast cancer and its

    overexpression in mammary epithelial cell lines is a strong determinant of a tumorigenic

    phenotype. Accordingly, HER2-overexpressing mammary tumors are commonly indicative

    of poor prognosis in patients. Several quantitative proteomic studies have employed two-

    dimensional gel electrophoresis in combination with MS/MS, which provides only limited

    information about the molecular mechanisms underlying HER2/neu signaling. In the present

    study, we used a SILAC-based approach to compare the proteomic profile of normal breast

    epithelial cells with that of Her2/neu-overexpressing mammary epithelial cells, isolated from

    primary mammary tumors arising in mouse mammary tumor virus-Her2/neu transgenic mice.

    We identified 23 proteins with relevant annotated functions in breast cancer, showing a

    substantial differential expression. This included overexpression of creatine kinase, retinol-

    binding protein 1, thymosin β4 and tumor protein D52, which correlated with the tumorigenic

    phenotype of Her2-overexpressing cells. The differential expression pattern of two genes,

    gelsolin and retinol binding protein 1, was further validated in normal and tumor tissues.

    Finally, an in silico analysis of published cancer microarray data sets revealed a 23-gene

    signature, which can be used to predict the probability of metastasis-free survival in breast

    cancer patients (Pandey et al., 2010).

    Genome-wide DNA methylation changes in a cell line model of breast cancer metastasis have

    been identified. The complex epigenetic changes that were observed, along with concurrent

    karyotype analysis, led to the hypothesis that complex genomic alterations in cancer cells

    are superimposed over promoter-specific methylation events that are responsible for the gene-

    specific changes observed in breast cancer metastasis (Wendy et al., 2010).


    Aberrant miRNA activity has been reported in many diseases, and often numerous miRNAs are

    concurrently deregulated. There was coincident loss of expression of 6 miRNAs with

    metastatic potential in breast cancer. A computational method, miR-AT!, was used to

    investigate combinatorial activity among this group of miRNAs. A number of genes

    previously implicated in cancer metastasis are among the predicted combinatorial targets,

    including TGFB1, ARPC3 and RANKL (Sultana et al., 2011).

    2.1 Cluster analysis of microarray information

    Microarray experiments generate mountains of data, which has to be stored and analyzed.

    For humans it is difficult to handle very large numeric data sets. Therefore a general concept

    is to try to reduce the dimensionality of the data. A number of clustering techniques have

    been used to group genes based on their expression patterns. The user has to choose the

    appropriate method for each task. The basic concepts in clustering are to try to identify and

    group together similarly expressed genes and then try to correlate the observations to

    biology. The idea is that co-regulated and functionally related genes are grouped into

    clusters. Clustering provides the framework for this analysis. The hard part is to analyze the

    biological processes and consequences. However, clustering can be a very useful tool. The

    other side of the coin is the visualization of the information. Before clustering a large number

    of requirements have to be met. The data has to be of good quality, the chip design should be

    correct, the chip should contain the genes of interest, sample preparation has to be

    flawless, and the data has to be properly treated (outliers, normalization etc.). Clustering does not

    improve the quality of the data. If poor-quality data is used, the outcome is useless as well.

    2.2 Principles of clustering

    Clustering organizes the data into a small number of (relatively) homogeneous groups.

    Usually normalization of the expression values is used. At this stage, it is of interest to look

    at the changes in expression patterns, not to follow the actual numeric changes. Thus, the

    methods are used to find similar expression motifs irrespective of the expression level.

    Therefore, both low and high expression level genes can end up in the same cluster if the

    expression profiles are correlated by shape. The majority of clustering methods has been


    available for a long time. During the last few years, some new methods dedicated to microarray

    analysis have been developed, too. The methods can be described and classified in

    different ways. The most commonly used methods are described here. Clustering methods

    can be grouped as supervised and unsupervised. Supervised methods assign some predefined

    classes to a data set, whereas in unsupervised methods no prior assumptions are applied.

    Hierarchical clustering, K-means, self-organizing maps (SOMs), and principal component

    analysis (PCA) have been commonly used. There are also other methods, such as

    multidimensional scaling (MDS), minimum description length (MDL), gene shaving (GS),

    decision trees, and support vector machines (SVMs).

    2.3 Hierarchical clustering

    Hierarchical clustering is a statistical method for finding relatively homogeneous clusters.

    The hierarchical clustering algorithm either iteratively joins the two closest clusters starting

    from single clusters (agglomerative, bottom-up approach) or iteratively partitions clusters

    starting from the complete set (divisive, top-down approach). After each step, the distance

    matrix between the newly formed clusters and the other clusters is recalculated. If there are N

    cases, N-1 clustering steps are needed. There are several methods of hierarchical cluster

    analysis, including:

    Single linkage (minimum method, nearest neighbor)

    Complete linkage (maximum method, furthest neighbor)

    Average linkage (UPGMA).

    For a set of N genes to be clustered, and an NxN distance (or similarity) matrix, the

    hierarchical clustering is performed as follows:

    1. Assign each gene to a cluster of its own.

    2. Find the closest pair of clusters and merge them into a single cluster.

    3. Compute the distances (similarities) between the new cluster and each of the old clusters.

    4. Repeat steps 2 and 3 until all genes are clustered.


    Step 3 can be performed in different ways depending on the chosen approach. In single

    linkage clustering, the distance between one cluster and another is considered to be equal to

    the shortest distance from any member of one cluster to any member of the other cluster. In

    complete linkage clustering, the distance between one cluster and another cluster is

    considered to be equal to the longest distance from any member of one cluster to any member

    of the other cluster. In average linkage, the distance between one cluster and another cluster

    is considered to be equal to the average of the distances between all pairs of members, one

    taken from each cluster. The hierarchical clustering can be represented as a tree, or a

    dendrogram. Branch lengths represent the degree of similarity between the genes. The

    method does not provide clusters as such. Conceptually, different clusters and sizes of

    clusters can be obtained by moving along the trunk or branches of the tree and deciding at

    which level to cut the tree. Hierarchical clustering is often applied in the

    analysis of patient samples to organize the data based on the cases. Indeed, for patient

    samples the hierarchical clustering method is most often the best option. Usually, in addition

    to patient-based organization, genes are also clustered by applying two-way clustering. On

    one axis are the samples (patients) and on the other axis the genes.
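
    The agglomerative procedure and the three linkage rules described above can be sketched as follows. This is a minimal illustration, not the implementation used by Genesis: it assumes Euclidean distance between expression profiles, recomputes cluster distances from the members instead of updating a stored distance matrix, and uses a made-up toy data set. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters, given as lists of gene indices."""
    pair_distances = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":      # shortest member-to-member distance
        return min(pair_distances)
    if linkage == "complete":    # longest member-to-member distance
        return max(pair_distances)
    if linkage == "average":     # mean over all member pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(profiles, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(profiles))]   # step 1: one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):               # step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], profiles, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]           # step 3 is implicit: distances
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Toy expression profiles (illustrative values only, not data from this study).
toy_profiles = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.3, 2.0], [5.0, 5.0]]
merges = agglomerate(toy_profiles, linkage="single")
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

    The merge history is the information a dendrogram draws: N genes give N-1 merges, and cutting the tree at a chosen distance corresponds to stopping the merge list at that distance.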

    2.4 Self-organizing map

    Kohonen's self-organizing map (SOM) is a neural net that uses unsupervised learning for

    which no prior knowledge of classes is required. SOMs are usually used to visualize and

    interpret large high-dimensional data sets. In SOM, every input is connected to every output

    via connections with variable weights. Also, the output nodes are highly interconnected.

    SOM tries to learn to map similar input vectors (gene expression profiles) to similar regions

    of the output array of nodes. The method maps the multidimensional distances of the feature

    space to two-dimensional distances in the output map. The SOM algorithm is iterative. The

    map attempts to represent all the available observations with optimal accuracy using a

    restricted set of models. At the same time, the models become ordered on the grid so that

    similar models are close to each other and dissimilar models far from each other. So the order

    and organization of the nodes (tentative clusters) contain more information than just the

    actual partition of genes to clusters. In SOMs, the number of clusters has to be

    predetermined. The value is given by the dimensions of the two-dimensional grid or array.


    One has to experiment with the actual number of clusters. Most programs facilitate the

    analysis of all the genes within a cluster or clusters. One should find such array dimensions

    that there is a minimum number of poorly fitting genes in the clusters. The actual number of

    clusters is also difficult to assign, but the square root of the number of genes is a good initial

    estimate. However, it depends on the data set.
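
    A minimal sketch of the iterative SOM training described above is given here. It is only an illustration under stated assumptions: a rectangular grid, a Gaussian neighbourhood, and linearly decaying learning rate and radius are choices made for this example, and the toy profiles are random numbers. None of the function names or parameter values are taken from Genesis.

```python
import math
import random


def best_matching_unit(weights, profile):
    """Grid position of the node whose model vector is closest to the profile."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], profile)),
    )


def train_som(profiles, rows, cols, epochs=50, seed=0):
    """Train a rows x cols map; returns {grid position: model vector}."""
    rng = random.Random(seed)
    # Each node starts from a randomly chosen input profile (an assumption).
    weights = {
        (r, c): list(rng.choice(profiles)) for r in range(rows) for c in range(cols)
    }
    total_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = list(range(len(profiles)))
        rng.shuffle(order)
        for idx in order:
            profile = profiles[idx]
            progress = step / total_steps
            learning_rate = 0.5 * (1.0 - progress)                       # decays to 0
            radius = 0.5 + (max(rows, cols) / 2.0) * (1.0 - progress)    # shrinks
            bmu = best_matching_unit(weights, profile)
            for node, model in weights.items():
                grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
                for k, x in enumerate(profile):
                    # Pull the model toward the profile; grid neighbours move less.
                    model[k] += learning_rate * influence * (x - model[k])
            step += 1
    return weights


def assign_to_nodes(weights, profiles):
    """Partition gene indices by their best matching node."""
    clusters = {node: [] for node in weights}
    for i, profile in enumerate(profiles):
        clusters[best_matching_unit(weights, profile)].append(i)
    return clusters


# Random toy profiles (illustrative only): 60 genes measured in 5 experiments.
toy_rng = random.Random(1)
toy_profiles = [[toy_rng.uniform(-1, 1) for _ in range(5)] for _ in range(60)]
som_weights = train_som(toy_profiles, rows=3, cols=3)
som_clusters = assign_to_nodes(som_weights, toy_profiles)
print({node: len(members) for node, members in som_clusters.items()})
```

    The grid dimensions fix the number of clusters in advance: a 3 x 3 grid gives nine nodes, and neighbouring nodes end up holding similar model vectors because each update also moves the neighbours of the best matching node.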

    2.5 K-means clustering

    K-means is a least-squares partitioning method for which the number of groups, K, has to be

    provided. The algorithm computes cluster centroids and uses them as new cluster seeds, and

    assigns each object to the nearest seed. However, it is also possible to estimate K from the

    data, taking the approach of a mixture density estimation problem.

    1. The genes are arbitrarily divided into K groups. The reference vector, i.e. the location of

    each centroid, is calculated.

    2. Each gene is examined and assigned to one of the clusters depending on the minimum

    distance.

    3. The centroid positions are recalculated.

    4. Steps 2 and 3 are repeated until the assignment of genes to the required number of

    clusters no longer changes.

    During the course of iterations, the program tries to minimize the sum, over all groups, of the

    squared within-group residuals, which are the distances of the objects to the respective group

    centroids. Convergence is reached when the objective function (i.e. the residual sum-of-

    squares) cannot be lowered any more. The obtained groups are geometrically as compact as

    possible around their respective centroids. K-means partitioning is a so-called NP-hard

    problem (there is no known algorithm that would be able to solve the problem in polynomial

    time), thus there is no guarantee that the absolute minimum of the objective function has

    been reached. Therefore, it is good practice to repeat the analysis several times using

    randomly selected initial group centroids, and check whether these analyses produce

    comparable results.
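
    The iteration and the advice to repeat the analysis from several random starts can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data; it assumes squared Euclidean distance and draws the initial centroids from the input profiles, which is one common choice and not necessarily what Genesis does.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, seed, max_iter=100):
    """One K-means run; returns (labels, centroids, residual sum of squares)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]   # random initial seeds
    labels = None
    for _ in range(max_iter):
        # Assign every gene to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:               # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    rss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(profiles, labels))
    return labels, centroids, rss


def best_of_restarts(profiles, k, n_restarts=10):
    """Repeat K-means from different random starts and keep the lowest RSS."""
    runs = (kmeans(profiles, k, seed) for seed in range(n_restarts))
    return min(runs, key=lambda run: run[2])


# Toy profiles (illustrative only): two well-separated groups of three genes.
toy_profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, rss = best_of_restarts(toy_profiles, k=2)
print(labels, round(rss, 4))
```

    Comparing the residual sum of squares across restarts is a simple way to check whether different starting centroids lead to comparable partitions.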


    MATERIALS AND METHODS

    Progress in microarray data generation for breast cancer now provides considerable

    resources that may be used for the in silico analysis of gene expression and to define genes that

    are important in the development of breast cancer. This information can further be used to

    identify and validate novel drug targets.

    The next advance is the availability of computational tools that can be used for the analysis

    of microarray data. Numerous databases are now available which contain microarray data

    for various kinds of cancers. Most of these are accessible over the Internet through

    convenient Web browser interfaces. Many also permit downloading of microarray data.

    The whole work for the present study can be summarized in the following steps:

    1. Organism/disease: Breast cancer

    2. Operating systems: Windows and Linux

    3. Tools and software: available online/offline, e.g. Genesis, Bioinformatics Research

    Tools, etc.

    4. Research hub resources: SMD, PMC etc.

    5. Retrieval of microarray data: datasets prepared using different databases.

    6. Analysis of microarray data using different clustering techniques, for example

    hierarchical clustering, K-means clustering, self-organizing maps, principal component

    analysis etc.

    To identify the genes responsible for the occurrence of breast cancer with the

    help of microarray technology, gene expression data from wet labs is collected and

    maintained in online databases, to be worked upon further with bioinformatics

    tools and techniques. By applying tools that help to demarcate the genes of interest and

    by studying their expression levels, our objectives can be met. Such microarray-based

    approaches are enumerated below.


    METHOD

    Algorithm for Clustering Analysis:

    1. Raw data from SMD were collected and downloaded.

    2. Five experiments were taken and their log transformation ratios were arranged in a

    separate Excel file for 5000 genes, with all other parameters such as clone ID, cluster ID,

    accession number etc. kept the same.

    3. The data sets were uploaded into the GENESIS software.

    4. K-means clustering and Self Organizing Maps were generated.

    5. Expression profile images were produced.

    6. Clusters of gene expression were compared based on their similarity.

    7. The number of genes that were co-expressed was calculated manually.


    3.1 Software for clustering and visualization

    Several software packages are available for clustering and visualization of gene expression patterns.

    Many programs are freely available on the Internet. Both the GeneSpring and Kensington

    packages available at CSC contain several options for clustering. Despite the initial time

    required for learning the use of these programs, they provide some benefits, mainly due to

    allowing several analyses to be done within a single package. These programs are not as such

    any better than freely available programs. They, however, might have a more user-friendly

    interface.

    Table 3.1 Software for cluster analysis and visualization.

    Program                              Author                           Platform        Description

    Genesis                              Alexander Sturn (2000)           Windows, Linux  Clustering, visualization

    Cluster (Version 2.11)               Michael Eisen (1998)             Windows         Performs hierarchical clustering, self organizing maps

    SAM (Version 2.0)                    Rob Tibshirani (2005)            Excel Add-in    Significance Analysis of Microarrays: supervised learning

    ScanAlyze (Version 2.5.3)            Michael Eisen (2002)             Windows         Processes fluorescent images of microarrays

    TreeView (Version 1.6)               Michael Eisen (2002)             Windows         Graphical results of analysis from Cluster

    Expression Profiler (Version 1.0)    EBI (2007)                       Web             Analysis and clustering of gene expression data

    GeneCluster (Version 2.0)            Whitehead Institute/MIT (2004)   Java            Self-organizing maps

    J-Express (Version 7.8.1, 7.8.2, 7.8.3)  MolMine (2011)               Java            Clustering and visualization


    SMD (Stanford Microarray Database)

    1. The SMD (http://www.smd.Stanford.EDU/) project stores raw and normalized data

    from microarray experiments, as well as their corresponding image files.

    2. In addition, SMD provides interfaces for data retrieval, analysis and visualization. SMD

    is funded by the National Cancer Institute at the US National Institutes of Health, the

    National Science Foundation, and the Howard Hughes Medical Institute.

    The database is a joint project in the Departments of Biochemistry

    and Genetics at the School of Medicine, Stanford University.

    3. Already normalized data for breast cancer cell lines was therefore obtained for gene

    expression analysis. The details of the data are given below.

    3.3 Normalization

    1. Normalization refers to computational data transformations intended to remove certain

    systematic biases from microarray data, such as dye effects, intensity dependence and spatial

    or print-tip effects. (In this context, it doesn't necessarily have anything to do with the normal

    or Gaussian distribution.)

    2. The normalization constant is calculated by the database (Stanford Microarray

    Database). The first step is to select good spots on which to base the normalization. The

    threshold value is initially set to > 0.65. If fewer than 10% of the spots in the print pass this

    criterion, the program will use > 0.60. If fewer than 10% of the spots in the print pass the 0.60

    threshold, the program will use > 0.55. All spots that pass the 0.55 threshold are used in the

    normalization calculation, regardless of how many there are. If more than 10% of the spots

    pass any threshold, the program uses those passing spots in the calculation and does not try a

    lower threshold value.
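
    The fallback rule for choosing the spots used in normalization can be sketched as below. The text does not say which per-spot quality measure the thresholds apply to, so `scores` is a hypothetical list of such values, and treating exactly 10% as sufficient is an assumption made for this example.

```python
def select_normalization_spots(scores, thresholds=(0.65, 0.60, 0.55), min_fraction=0.10):
    """Return (threshold used, indices of spots that pass it).

    The thresholds are tried from strictest to most lenient. The first one
    passed by at least `min_fraction` of all spots is used; the last
    threshold is used regardless of how many spots pass it.
    """
    for threshold in thresholds[:-1]:
        passing = [i for i, s in enumerate(scores) if s > threshold]
        if len(passing) >= min_fraction * len(scores):
            return threshold, passing
    last = thresholds[-1]
    return last, [i for i, s in enumerate(scores) if s > last]


# Hypothetical quality scores for a 20-spot print (illustrative values only).
scores = [0.70, 0.62, 0.62] + [0.10] * 17
used_threshold, good_spots = select_normalization_spots(scores)
print(used_threshold, good_spots)
```

    In this toy print only one spot in twenty (5%) exceeds 0.65, so the rule falls back to 0.60, where three spots (15%) pass.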


    3.1.Normalized microarray data analysis

    Total intensity normalization relies on the assumption that most genes do not respond to

    experimental conditions and so the average log ratio on the array should be zero. A single,

    global, multiplicative adjustment is performed so that the average log ratio is zero for well

    measured spots. All spots are normalized using the same constant, regardless of whether they

    were used in the calculation. In the database, this is performed by computing normalized

    values for all intensities by dividing the raw value by normalization constant. (Fig.3.1)

    Fig.3.1: Normalization of experiments by log transformation ratios.
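
    A minimal sketch of the global adjustment described above, under stated assumptions: the raw values are taken to be two-channel intensity ratios, logarithms are base 2, and the constant is chosen so that the mean log ratio of the well-measured spots becomes zero. The function names and toy ratios are hypothetical and do not reproduce the SMD code.

```python
import math


def normalization_constant(ratios, good_spots):
    """Constant that makes the mean log2 ratio of the good spots equal to zero."""
    mean_log = sum(math.log2(ratios[i]) for i in good_spots) / len(good_spots)
    return 2.0 ** mean_log


def normalize(ratios, good_spots):
    """Divide every raw ratio by the same constant, whether or not the spot
    was used to compute it."""
    constant = normalization_constant(ratios, good_spots)
    return [r / constant for r in ratios]


# Toy raw ratios (illustrative only); the last spot is not well measured.
raw_ratios = [2.0, 4.0, 8.0, 1.0]
good = [0, 1, 2]
normalized = normalize(raw_ratios, good)
mean_log_after = sum(math.log2(normalized[i]) for i in good) / len(good)
print(normalized, mean_log_after)
```

    Because the adjustment is multiplicative, it shifts all log ratios by the same amount and leaves the differences between genes unchanged.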

    3.3 Centralization

    1. It is the process of moving a distribution so that it is centered over the expected mean.

    2. For the log-transformed intensity ratios, an intensity dependent centralization might help to

    correct the dye bias.


    3.2 Centroidal K-means clustering

    Here each gene is represented by a series of ratio values that are relative to the expression

    level of that gene in the reference sample. Since the reference sample is usually independent

    from the experiment, the analysis is preferred to be independent from the gene expression

    observed in the reference sample, and that is exactly what is achieved by mean and/or median

    centering. After applying this procedure the values of each gene reflect the variation from

    some property of the series of observed values such as the mean or median (Fig.3.2)

    Fig.3.2: Centroidal K-means clustering
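
    Mean or median centering of one gene across the experiments can be sketched as follows. This is an illustration with a hypothetical function name and toy values; it assumes the input is the gene's series of log ratios.

```python
from statistics import mean, median


def center_gene(log_ratios, how="mean"):
    """Subtract the gene's own mean or median from each of its log ratios."""
    if how == "mean":
        centre = mean(log_ratios)
    elif how == "median":
        centre = median(log_ratios)
    else:
        raise ValueError(f"unknown centering method: {how}")
    return [value - centre for value in log_ratios]


# Toy log ratios for one gene across four experiments (illustrative only).
gene = [1.0, 2.0, 3.0, 6.0]
mean_centered = center_gene(gene, "mean")
median_centered = center_gene(gene, "median")
print(mean_centered, median_centered)
```

    After centering, each value expresses how far the gene deviates from its own typical level, so the result no longer depends on the expression level in the reference sample.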


    3.4 Genesis

    1. Genesis is a versatile and transparent software suite for large-scale gene expression cluster

    analysis. (Release: Genesis 1.7.6, Date: 2010-09-20, Genesis Server: Release 1.1.0)

    2. The software enables data import and visualization, data normalization, and clustering

    using: (1) Hierarchical Clustering, (2) k-means, (3) Self Organizing Maps, (4) Principal

    Component Analysis and (5) Support Vector Machines.

    3. An important and valuable feature of this software is to calculate and compare clustering

    results from different algorithmic approaches. For example, one can begin with Hierarchical

    Clustering to get a first impression on the number of patterns hidden in the dataset and then

    use this information to adjust the parameters for k-means and SOM clustering.

    4. PCA can be used to visualize these clusters in 3D space and get an impression on cluster

    size, integrity, and distribution, and to retrieve the most significant patterns in a study. It can

    also reveal some information about the number of clusters in the dataset, provided that data

    clouds of genes in the principal component space representing a cluster can be distinguished.

    3.3 Clustering and visualization procedures are executed with this software (Fig.3.3).

    Fig.3.3 GENESIS Software


    3.5 Visualization

    1. Scatter plots are very useful for the initial analysis and comparison of data sets. It is

    customary to present the clustering results by grouping genes of clusters next to each other.

    2. In SOM based analysis, also the order of the clusters contains information about the

    relationships between clusters.

    3. For the manual analysis of the goodness of clustering, one has to look at the expression

    patterns within the generated clusters.

    4. Genes within a cluster should follow the average expression pattern of the cluster. Because

    every gene has to be assigned to a cluster, genes with unique expression patterns do not fit

    well in any group.


    RESULTS AND DISCUSSION

    4.1 Expression profile for k-means clustering:

    Based on the raw data, the 5000 genes were divided into 9 different clusters by K-means

    clustering. The distributed genes (Fig. 4.1) showed distinct expression

    patterns. Some genes were under expressed and some genes were over expressed.

    Fig. 4.1: Expression profile for k-means clustering


    4.2 Single cluster image for K-means clustering:

    The cluster image for a particular K-means cluster shows the expression of

    genes based upon the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN

    columns denote the log-transformed intensity ratios for the five different experiments, and all

    the other parameters such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession

    Number remain the same for all five experiments (Fig. 4.2).

    Fig.4.2:Single cluster image for k means clustering


    4.3 Expression profile for Self Organizing Map (SOM) clustering:

    Based on the raw data, the 5000 genes were divided into 9 different clusters by the Self

    Organizing Map. The distributed genes (Fig. 4.3) showed distinct expression

    patterns. Some genes were under expressed and some genes were over expressed.

    Fig. 4.3: Expression profile for Self Organizing Map (SOM) clustering.


    4.4 Single cluster image for Self Organizing Map (SOM) clustering:

    The cluster image for a particular SOM cluster shows the expression of

    genes based upon the log transformation of their intensity ratios. The five LOG_RAT2N_MEAN

    columns denote the log-transformed intensity ratios for the five different experiments, and all

    the other parameters such as Clone ID, Gene Symbol, Gene Name, Cluster ID and Accession

    Number remain the same for all five experiments (Fig. 4.4).

    Fig.4.4:Single Cluster Image for Self Organizing Map


    4.1 Distribution of genes in different clusters using the SOM and K-means clustering

    algorithms:

    After the genes were clustered by K-means clustering and Self Organizing Maps, the

    distribution of the genes was compared between the two clustering algorithms (Table 4.1).

    Table 4.1: Distribution of genes in different clusters using the SOM and K-means

    clustering algorithms.

    Self Organizing Map K-means

    Cluster Genes Genes(%) Cluster Genes Genes (%)

    S1 936 19 K1 455 9

    S2 209 4 K2 766 15

    S3 1052 21 K3 537 11

    S4 40 1 K4 679 14

    S5 49 1 K5 435 9

    S6 1099 22 K6 587 12

    S7 880 18 K7 358 7

    S8 280 6 K8 581 12

    S9 455 9 K9 602 12
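
    The percentage columns of Table 4.1 follow directly from the gene counts. The short check below recomputes them from the counts in the table, assuming the percentages were rounded to the nearest whole number; the variable and function names are only illustrative.

```python
# Gene counts per cluster, copied from Table 4.1.
som_counts = {"S1": 936, "S2": 209, "S3": 1052, "S4": 40, "S5": 49,
              "S6": 1099, "S7": 880, "S8": 280, "S9": 455}
kmeans_counts = {"K1": 455, "K2": 766, "K3": 537, "K4": 679, "K5": 435,
                 "K6": 587, "K7": 358, "K8": 581, "K9": 602}


def percentages(counts):
    """Share of all genes in each cluster, rounded to whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


som_pct = percentages(som_counts)
kmeans_pct = percentages(kmeans_counts)
print(sum(som_counts.values()), som_pct)
print(sum(kmeans_counts.values()), kmeans_pct)
```

    Both methods account for all 5000 genes. Because each percentage is rounded separately, the rounded values need not add up to exactly 100.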


    The expression of genes in clusters S4 and K5 (Figure 4.5) was found to be similar and over

    expressed, as they have a similar pattern of peaks, which represents the status of gene expression:

    these genes were expressed beyond the usual/expected degree, leading to disruption of normal

    growth control processes of cells. Depending on the fluorescence ratios, the over expressed genes

    are assigned positive values. The values may range from 0 to +

    K5 S4

    Figure 4.5: Comparison of Cluster no.4 of SOM & Cluster no.5 of K-mean based on their

    expression profile.

    The expression of genes in clusters S7 and K9 (Figure 4.6) was found to be similar, and most of the

    genes were under expressed: these genes were not expressed to the usual/expected degree,

    leading to disruption of normal growth control processes of cells. Depending on the fluorescence

    ratios, the under expressed genes are assigned negative values. The values may range from 0 to -1.

    K9 S7

    Figure 4.6: Comparison of Cluster no.7 of SOM & Cluster no.9 of K-mean based on their

    expression profile.


    The expression of genes in clusters S1 and K8 (Figure 4.7) was found to be similar: some

    genes were under expressed while others were over expressed, because some genes were

    transcribed into mRNA and some were not, leading to disruption of normal growth

    control processes of cells.

    K8 S1

    Figure 4.7: Comparison of Cluster no.1 of SOM & Cluster no.8 of K-mean based on their

    expression profile.

    The expression of genes in clusters S8 and K1 (Figure 4.8) was found to be similar: some

    genes were under expressed while others were over expressed, because some genes were

    transcribed into mRNA and some were not, leading to disruption of normal growth

    control processes of cells.

    K1 S8

    Figure.4.8: Comparison of Cluster no.8 of SOM & Cluster no.1 of K-mean based on their

    expression profile.


    STATISTICAL DATA FOR CLUSTER ANALYSIS:

    From the above table (Table 4.1), it was seen that some of the genes distributed in the two

    sets of clusters were highly co-expressed, and that some cluster pairs had the same expression profile image:

    (S1,K8), (S8,K1), (S7,K9) and (S4,K5) from the K-means clustering and Self

    Organizing Maps (SOM). As we are using a breast tumor cell (one sample), we

    conducted a one-sample t-test to test the effectiveness of the sample and to assess the

    accuracy of the distribution of genes.

    Table 4.2. Similar clusters showing overexpressed and underexpressed genes:

    SOM Cluster (x)   x²        K-Means Cluster (y)   y²

    936 (S1)          876096    455 (K1)              207025

    40 (S4)           1600      435 (K5)              189225

    880 (S7)          774400    581 (K8)              337561

    280 (S8)          78400     602 (K9)              362404

    Σx = 2136         Σx² = 1730496    Σy = 2073      Σy² = 1096215

    For the SOM clusters (n = 4):

    Mean, x̄ = Σx/n = 2136/4 = 534

    Sample variance, s² = (Σx² − n·x̄²)/(n − 1) = (1730496 − 4 × 534²)/(4 − 1) = (1730496 − 1140624)/3 = 589872/3 = 196624

    s = √196624 = 443.42

    t = (x̄ − μ)/(s/√n) = (534 − 520)/(443.42/2) = 0.0315 × 2 = 0.063

    For the K-means clusters (n = 4):

    Mean, ȳ = Σy/n = 2073/4 = 518.25

    Sample variance, s² = (Σy² − n·ȳ²)/(n − 1) = (1096215 − 4 × 518.25²)/(4 − 1) = (1096215 − 1074332.25)/3 = 21882.75/3 = 7294.25

    s = √7294.25 = 85.4

    t = (ȳ − μ)/(s/√n) = (518.25 − 520)/(85.4/2) = −0.0205 × 2 = −0.041

    Calculated values: |t| = 0.063 (SOM) and |t| = 0.041 (K-means)

    Table value: t3(5%) = 3.182

    Since the calculated values of t are smaller than the table value at the 5% probability level with

    n − 1 = 3 degrees of freedom, it can be concluded that there is no significant difference

    between the sample means (534 and 518.25) and the population mean (520). In other words, the

    four clusters of each method, paired as (K5,S4), (K9,S7), (K8,S1) and (K1,S8), can be regarded as a random

    sample from a population whose average cluster size is 520 genes.

    From the statistical analysis of the distribution of the genes in the clusters that were

    under expressed and over expressed, it was observed that both clustering methods give

    small t values of similar magnitude.

    Therefore, the gene distributions produced by the two clustering algorithms were comparable.
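
    The one-sample t statistics can be recomputed from the cluster sizes listed in Table 4.2. The sketch below assumes the population mean of 520 genes stated in the text and uses the sample standard deviation with n − 1 in the denominator; for a one-sample test the degrees of freedom are n − 1. The function name is illustrative.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, population_mean):
    """t = (sample mean - population mean) / (s / sqrt(n)), with s the sample
    standard deviation (n - 1 in the denominator)."""
    n = len(values)
    return (mean(values) - population_mean) / (stdev(values) / math.sqrt(n))


# Cluster sizes from Table 4.2.
som_sizes = [936, 40, 880, 280]       # S1, S4, S7, S8
kmeans_sizes = [455, 435, 581, 602]   # K1, K5, K8, K9

MU = 520  # population mean assumed in the text
t_som = one_sample_t(som_sizes, MU)
t_kmeans = one_sample_t(kmeans_sizes, MU)
print(round(t_som, 3), round(t_kmeans, 3))
```

    The resulting statistics are compared with the two-sided 5% critical value of the t distribution for n − 1 degrees of freedom, taken from a standard table.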

    4.2. Genes present in clusters of similar expression profile and co-expressed:

    By observing the results of SOM and K-means, it was observed that the expression patterns

    of most of the clusters were almost the same and that these clusters also contained almost the same

    genes. The list (Table 4.3) of co-expressed genes was selected after the comparative

    analysis of clusters showing similar expression profiles.

    Table 4.3: Genes which were present in clusters of similar expression profile and co-expressed.

    S.No Gene No. Gene Name

    1 36 Paxillin

    2 55 G-protein coupled receptor 19

    3 67 Homo sapiens cDNA FLI40901 fis, clone UTERU2003704

    4 90 Complement component 8,alpha polypeptide

    5 146 Damage-specific DNA binding protein1,127kDa

    6 156 Connective tissue growth factor


    7 186 Rhotekin

    8 211 Hypothetical protein MGC2747

    9 212 Homo sapiens HC6 (HC6), mRNA, complete cds

    10 216 Brain acyl-coA hydrolase

    11 231 ESTs, weakly similar to A43932 mucin 2 precursor, intestinal-

    human(fragments) [Homo sapiens]

    12 250 Glycoprotein IX (platelet)

    13 253 Protein kinase C,theta

    14 256 Potassium voltage-gated channel, shaker-related subfamily

    ,member1(episodic ataxia with myokymia)

    15 261 Nucleoporin 214 kDa

    16 342 Sperm associated antigen 11

    17 347 Microtubule-associated protein 1B

    18 349 Ribosomal protein L15

    19 357 JTV1 gene

    20 368 DKFZP566C134 protein

    21 384 Spindlin

    22 391 EST

    23 419 Synaptotagmin VI

    24 431 Cofilin 2(muscle)

    25 437 BRCA1 associated RING domain 1

    26 453 Integrin, alpha 5(fibronectin receptor, alpha polypeptide)

    27 497 Uncharacterized hematopoietic stem/progenitor cells protein

    MDS029)

    28 514 Methyl-CpG binding domain protein

    29 542 ESTs

    30 582 Steroid sulfatase (microsomal),aryl sulfatase C,isozyme S

    31 665 Annexin A7

    32 786 ESTs, weakly similar to L1 repeat, Tf subfamily, member 18 [Mus musculus]

    33 793 Homo sapiens cDNA FL1 37399 fis, clone BRAMY 2027587

    34 816 ESTs

    35 819 Major Histocompatibility Complex, class I,C

    36 827 Zinc finger protein 177

    37 862 MAD, mothers against decapentaplegic homolog 3(Drosophila)

    38 892 Homo sapiens, clone IMAGE:3610040, mRNA

    39 904 Copin-like protein

    40 906 Pleckstrin homology,Sec 7 and coiled/coil domains 4

    41 912 Sulfide dehydrogenase like(yeast)

    42 929 CLLL8 protein

    43 941 ESTs

    44 972 Cytochrome P450,subfamily XXVIA, polypeptide 1

    45 1001 Tripartite motif-containing 5
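
    The manual comparison of clusters with similar expression profiles could also be done programmatically by intersecting the gene lists of the paired clusters. The sketch below only illustrates the idea: the cluster memberships are invented placeholders (g1, g2, ...) and are not the memberships found in this study, and the function name is hypothetical.

```python
def shared_genes(som_clusters, kmeans_clusters, pairs):
    """For each (SOM cluster, K-means cluster) pair, list the genes in both."""
    shared = {}
    for som_name, kmeans_name in pairs:
        common = set(som_clusters[som_name]) & set(kmeans_clusters[kmeans_name])
        shared[(som_name, kmeans_name)] = sorted(common)
    return shared


# Hypothetical memberships, for illustration only.
som_example = {"S4": ["g1", "g2", "g3"], "S7": ["g4", "g5", "g6", "g7"]}
kmeans_example = {"K5": ["g2", "g3", "g8"], "K9": ["g5", "g7", "g9"]}
example_pairs = [("S4", "K5"), ("S7", "K9")]

co_expressed = shared_genes(som_example, kmeans_example, example_pairs)
for pair, genes in co_expressed.items():
    print(pair, genes)
```

    Genes that fall in both clusters of a pair are the candidates for the co-expressed list.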

    Clustering of the normalized data was carried out using K-means and Self-Organizing Maps
    (SOM). In both cases the entire dataset was grouped into nine clusters based on the
    expression profiles of the genes in breast cancer. The data set was retrieved from the
    Stanford Microarray Database (SMD), Department of Pathology, Stanford University. Genesis
    was used both to normalize the raw data and to cluster the genes. With SOM, clusters S1,
    S3, S6 and S7 contained the largest numbers of genes, and the genes in these four clusters
    were highly co-expressed. Similarly, with K-means, clusters K2, K4, K6 and K9 contained the
    largest numbers of genes, and the genes in clusters K2, K4 and K9 were highly co-expressed.
    The expression of genes in some of the clusters was found to be similar; some genes were
    under-expressed while others were over-expressed. Over-expressed genes were expressed above
    the usual/expected degree and, depending on the fluorescence ratios, were assigned positive
    values. In contrast, under-expressed genes were expressed below the expected degree and
    were assigned negative values. The value of the log-transformed ratio ranges from 0 to +
    for over-expressed genes and from 0 to -1 for under-expressed genes. Over- and
    under-expression of genes in cancer cells disrupts the normal growth-control processes of
    the cells.
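    The sign convention described above can be illustrated with a minimal sketch. It
    assumes a two-colour array in which each spot has a sample and a reference fluorescence
    intensity, and it assumes a base-2 logarithm; the text does not state the base or the
    channel assignment, and the function names and intensities below are hypothetical.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """Return log2(sample / reference) for one spot on a two-colour array."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(sample_intensity / reference_intensity)


def expression_call(log_ratio):
    """Label a spot by the sign of its log ratio."""
    if log_ratio > 0:
        return "over-expressed"
    if log_ratio < 0:
        return "under-expressed"
    return "unchanged"


# Hypothetical intensities for illustration; not values from the SMD data set.
spots = {
    "gene_a": (800.0, 200.0),
    "gene_b": (150.0, 600.0),
    "gene_c": (300.0, 300.0),
}

for name, (sample, reference) in spots.items():
    value = log2_ratio(sample, reference)
    print(name, round(value, 2), expression_call(value))
```

    Taking the logarithm makes a four-fold increase and a four-fold decrease symmetric
    around zero, which is why over-expressed genes receive positive values and
    under-expressed genes negative ones.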

    SUMMARY AND CONCLUSION

    Breast cancer, like other cancers, arises from an interaction between the environment
    and a defective gene. Normal cells divide as many times as needed and then stop; they
    attach to other cells and stay in place in tissues. Cells become cancerous when mutations
    destroy their ability to stop dividing, to attach to other cells and to stay where they
    belong. Mutations known to cause cancer, such as those in p53, BRCA1 and BRCA2, occur in
    genes of the error-correcting mechanisms. Microarray experiments generate large volumes of
    data, which have to be stored and analyzed. A number of clustering techniques have been
    used to group genes based on their expression patterns, and the user has to choose the
    appropriate method for each task. The basic concept in clustering is to identify and group
    together similarly expressed genes and then to correlate the observations with biology; the
    idea is that co-regulated and functionally related genes are grouped into clusters. The
    microarray data analysis for breast cancer was carried out using Genesis. The data set was
    retrieved from SMD, and Genesis was used both to normalize the raw data and to cluster the
    genes. The results were generated with the SOM and K-means techniques. With both
    techniques the genes were grouped into nine clusters based on their expression profiles in
    breast cancer. With SOM, clusters S1, S3, S6 and S7 contained the largest numbers of
    genes, and the genes in these four clusters were highly co-expressed. Similarly, with
    K-means, clusters K2, K4, K6 and K9 contained the largest numbers of genes, and the genes
    in clusters K2, K4 and K9 were highly co-expressed. The expression of genes in some of the
    clusters was found to be similar; some genes were under-expressed while others were
    over-expressed.
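    The K-means grouping summarized above can be sketched in a few lines. This is a
    minimal illustration, not the Genesis implementation: it assumes Euclidean distance and
    randomly chosen initial centroids, neither of which is stated in the text. The function
    names and the toy profiles are hypothetical, and the toy run uses k = 2 rather than the
    nine clusters used in the study.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Column-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def kmeans(profiles, k, iterations=100, seed=0):
    """Cluster profiles into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct profiles picked at random.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = [None] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = mean_profile(members)
    return labels, centroids


# Toy log-ratio profiles (rows are genes, columns are arrays); values are made up.
toy_profiles = [
    [1.0, 1.2, 0.9], [1.1, 1.0, 1.0], [0.9, 1.1, 1.2],        # over-expressed
    [-1.0, -1.2, -0.9], [-1.1, -1.0, -1.0], [-0.9, -1.1, -1.2],  # under-expressed
]
toy_labels, toy_centroids = kmeans(toy_profiles, k=2)
print(toy_labels)
```

    Genes that end up with the same label have similar profiles across the arrays, which
    is the sense in which the genes of one cluster are described as co-expressed.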

    CONCLUSION

    Thus, it was concluded from the statistical analysis that the clustering results obtained
    by the two techniques were the same and approximately accurate. Forty-five genes were
    identified that were co-expressed in different clusters. In future work, promoter analysis
    can be carried out to analyze the regulatory systems of these forty-five genes, and drug
    targets can be identified with the help of this regulatory system analysis. The functions
    of these forty-five genes are unknown and can be predicted on the basis of the known genes
    in the same cluster.
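    One way to quantify how closely two clusterings agree is the Rand index, sketched
    below. The text does not say which statistic was used for the comparison, so this is
    an illustrative choice rather than the method of the study; the function name and the
    label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two genes
    together, or both put them apart. Cluster names need not match.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same genes")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two genes are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Made-up cluster labels for six genes under two methods.
som_labels = ["S1", "S1", "S3", "S3", "S6", "S6"]
kmeans_labels = ["K2", "K2", "K4", "K4", "K9", "K9"]
print(rand_index(som_labels, kmeans_labels))
```

    A value of 1 means the two methods partition the genes identically even though the
    cluster names differ; lower values indicate that some gene pairs are grouped together by
    one method and separated by the other.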

    REFERENCES

    Anderson, K.S., Sibani, S., Wong, J. and Raphael, J. (2009). Using custom protein
    microarrays to identify autoantibody biomarkers for the early detection of breast cancer.
    Cancer Res 69: 223-233

    Bertucci, F., Birnbaum, D. and Hassoun, J. (2004). Protein Expression Profiling Identifies
    Subclasses of Breast Cancer and Predicts Prognosis. American Association for Cancer
    Research 2: 12-29

    Calvo, M.B., Antolin, N. S. and Vilamil, M.V. (2009). MicroRNA for circulating tumor
    cells detection in breast cancer: In-silico and in-vitro analysis. Tumor Biology 11: 280-285

    Cannaves, J. M. and Leon, D. A. (2003). In silico study of breast cancer associated gene 3
    using LION Target Engine and other tools. Biotechniques 7: 1222-1228

    David, S., Larson, G., Glackin, C. and Shove, O. (2008). Meta-analysis of breast cancer
    microarray studies in conjunction with conserved cis-elements suggest patterns for
    coordinate regulation. BMC Bioinformatics 9: 1186-1471

    Goodison, S., Sun, Y. and Urquidi, V. (2010). Derivation of cancer diagnostic and
    prognostic signatures from gene expression data. Bioanalysis 5: 855-862

    Holko, M., Scholtens, D. and Khan, S. A. (2009). Differential gene expression associated
    with estrogen receptor status of breast cancer identified by microarray meta analysis.
    American Association for Cancer Research 69: 5472

    Hyat, M., Tramm, T., Alsner, J. and Finak, G. (2010). In-silico ascription of gene
    expression differences to tumor and stromal cells in a model to study impact on breast cancer
    outcome. Public Library of Science 5: 1371

    Ingrid, A. H., Kristen, M.C. and Sofia, K.G. (2006). Microarrays in breast cancer research
    and clinical practice. Endocrine Related Cancer 13: 1017-1031

    Jones, C., Mckay, A., Cossu, A. and Davies, S. (2004). Gene expression profiling for
    Molecular Characterization of Inflammatory Breast Cancer and Prediction of Response to
    Chemotherapy. Cancer Res 64: 8558

    Kristine, D. N. (2002). Microarrays and Breast Cancer: Predicting Metastatic Disease.
    Medscape Hematology-Oncology. Abstract nr 2067

    Kallioniemi, A., Veldman, R. and Jiang, Y. (2000). Comparative Genomic Hybridization
    Analysis of 38 Breast Cancer cell lines-A Basis for Interpreting Complementary DNA
    Microarray data. American Association of Cancer Research 60: 4519

    Macgregor, P.F. (2003). Gene expression in cancer-applications of microarrays. Expert
    Review of Molecular Diagnostics 3: 185-200

    Margossian, A., Diaz, J. and Corvalan, A. (2009). In silico analysis of Breast Cancer
    Transcriptome Libraries distinguishes Tumor Subclasses. Cancer Res 69: 1165

    Miller, D.L. and Edison, T. L. (2007). Expression genomics in breast cancer research-
    microarrays at the crossroads of biology and medicine. Breast Cancer Research 9: 206

    Pandey, A., Sukumar, S., Cole, R.N. and Chen, H. (2010). Proteomic characterization of
    Her 2/neu over expressing breast cancer cells. Proteomics, John Wiley and Sons Ltd, 10:
    3800-3810

    Potamias, G., Analyti, A. and Tollis, Y. (2004). Breast cancer microarrays and Biomedical
    Informatics-The Prognochip Project. 1st International Advanced Research Workshop on In
    silico Oncology: Advances and Challenges, Sparta, Greece.

    Shen, D., He, J. and Chang, H.R. (2005). In silico identification of breast cancer genes by
    combined multiple high throughput analysis. International Journal of Molecular Medicine 15:
    205-212

    Sultana, Z., Craig, D. B. and Dombkowski, A.A. (2011). In silico analysis of combinatorial
    microRNA Activity Reveals Target Genes and Pathways Associated with Breast Cancer
    Metastasis. Cancer Inform 17: 13-29

    Tommasi, S., Pilato, B., Pinto, R. and Bruno, M. (2008). Molecular and in silico analysis
    of BRCA1 and BRCA2 variants. Mutat Res 644: 64-70

    Vinals, C., Gauli, S. and Coche, T. (2001). Using in silico transcriptomics to search for
    tumor-associated antigens for immunotherapy. Vaccine 19: 2607-2614

    Wendy, K., Andrews, J., Pilon, J. and Hodgson, A. (2010). Gene expression Profiling
    predicts clinical outcomes of breast cancer. Nature 415: 530-536

    Yang, Y., Choi, J. and Yoon, S. (2006). Large scale data mining approach for gene-specific
    standardization of microarray gene expression data. Oxford Journal, Bioinformatics 22:
    2818-2904

    WEBSITES

    1. http://www.helsinki.fi/biochipcenter

    2. http://www.microarrays.btk.utu.fi

    3. http://www.med.uio.no/dnr/microarray/

    4. http://www.genome.tugraz.at/genesisclient/1.7.2/install.htm

    5. http://www.smd.Stanford.EDU/

    6. http://www.ncbi.nlm.nih.gov.in

    7. http://www.aacr.gov
