Chapter-1
INTRODUCTION
1.1 INTRODUCTION
In the post-genomic era the field of Bioinformatics is growing at a
phenomenal rate and the task in the early years of the millennium is to demonstrate
how in-silico simulations [1] facilitate experiments in the laboratories and how this
knowledge can be applied in curing human diseases. Achieving this goal requires
translating the genomic data into biological knowledge by studying the function of all
known proteins. Knowledge of the structure alone is not sufficient for deciphering the
functional mechanisms, which often depend on the flexibility of protein structures:
the native state of a protein deviates to some extent from the average coordinates
reported as the experimental structure. The flow of genetic information from sequence
to structure to function lies at the core of all medical and biological sciences. It
builds on the Central Dogma of Molecular Biology [2]: the processes of
transcription and translation convert a DNA sequence into a sequence of amino acids.
These amino acids fold to form protein structures, which perform specific functions
that together define the phenotype of the organism. Both computational and
experimental approaches play a critical role in identifying the function of
proteins. Experimental observation of the conformational motion of biomolecules is
becoming possible; NMR spectroscopy can be used to determine both the dynamics
and structure of proteins. Computational approaches expand the structural knowledge
when applied to potentially large number of families of related proteins and thus help
fill the gap between the number of known protein sequences and the known
structures.
Bioinformatics offers numerous approaches for the prediction of structure and
function of proteins on the basis of sequence and structural similarities [3]. The
protein sequence to structure and function relationship is well established and reveals
that the structural details at atomic level help to understand the molecular function of
proteins. The protein structure is conserved and can accommodate up to 80% of
sequence variation. X-ray crystallography and NMR are the experimental methods to
determine protein structure. They provide 3D structure data for a relatively small
number of proteins due to time-consuming preprocessing requirements such as
purification, crystallization of proteins etc. NMR is limited by the size of the protein
molecule. Predicting a protein's structure from its amino acid sequence is a problem
of great scientific importance. As the major genome projects progress, and as protein
sequencing methods continue to out-pace structure determination methods, it has
become a problem of great practical importance. While this problem is not yet solved,
we have a partial solution. If two protein sequences are at least 30% identical, they
tend to have similar structures. Sensitive methods such as Hidden Markov Models
[4](HMMs) can successfully predict structural similarity when the sequence similarity
is less pronounced. The power of this partial solution should not be underestimated.
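The 30% identity rule of thumb stated above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned (equal length, with "-" marking gaps) and that gapped columns are excluded from the count, which is only one of several conventions in use. The function names and the default threshold parameter are illustrative and are not taken from any particular tool.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        if residue_a == residue_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def likely_similar_structure(aligned_a: str, aligned_b: str,
                             threshold: float = 30.0) -> bool:
    """Apply the rule of thumb: identity of at least `threshold` percent."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, `percent_identity("ACDE", "ACDF")` compares four columns and finds three matches. Pairs that fall below the threshold are the cases in which more sensitive methods such as HMMs become relevant.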
Genome sequencing projects generated vast amounts of data, and this
led to the development of computational techniques for the analysis of DNA and
protein sequence data. These focused primarily on the flow of information from
sequence to structure and had much success in predicting function directly from
sequence. The advantages of applying computational solutions to problems in this
area have led to the evolution of the field of Bioinformatics as a cornerstone of
virtually all biological and medical sciences. Biomolecular interactions underlie all
biochemical pathways that together constitute the process of life. This dissertation
contributes to this emerging discipline by addressing specific problems associated
with the study of protein structures and their functional mechanisms. We present
computational models for the comparison of protein structures, the classification of
proteins into structural families and the analysis of ligand interactions with proteins.
There is a need to design drugs for diseases caused by
proteins whose structures are unknown. Here we predict such proteins by
classifying them using known templates [5][6]. Every protein has characteristic
sequence patterns, which are matched using a data mining algorithm; the proteins
are then classified on the basis of their sequences.
1.2 CANCER
Cancer [7][8] is a generic term for a large group of diseases that can affect any
part of the body. It is one of the most studied yet unsolved non-communicable
human diseases. It is an idiopathic disease [9], and doctors and
scientists are constantly trying to develop new, effective drugs for its treatment. There is
no other disease which parallels cancer in diversity of its origin, nature and
treatments. It is deregulated multiplication of cells with the consequence of an
abnormal increase of the cell number in particular organs or tissues. Other terms used
are malignant tumours and neoplasms [10]. One defining feature of cancer is the rapid
creation of abnormal cells that grow beyond their usual boundaries and which can
then invade adjoining parts of the body and spread to other organs. This process is
referred to as metastasis [11][12]. Metastases are the major cause of death from
cancer [13]. Our bodies are made up of billions of cells that grow, divide, and then die
in a predictable manner. Cancer occurs when something goes wrong with the cell
cycle causing uncontrolled cell division and growth. Cancer cells lump together and
form a mass of extra tissue, also known as a tumor, which continues to grow. As it
grows, it may damage and invade nearby tissue. If a cancerous tumor outgrows its
birthplace, called the primary cancer site, and moves on to another place, called the
secondary cancer site, it is said to have metastasized.
Initial stages of the developing cancer are generally confined to the organ of
origin whereas advanced cancers grow beyond the tissue of origin. Advanced cancers
invade the surrounding tissues that are initially connected to the primary cancer. Then
these are distributed via the hematopoietic and lymphatic systems throughout the
body where they can colonize in distant tissues and cause metastasis. The
development of cancers is thought to result from the mutation or damage of the
cellular genome, either due to environmental influences or caused by random
endogenous mechanisms. In the attempt to establish how, why and when cancer
occurs, a plethora of genetic pathways and regulatory circuits have been discovered
that are necessary to maintain general cellular functions such as proliferation,
migration and differentiation. Undesirable changes or alterations in this fine-tuned
network of cascades and interactions, due to endogenous failure or to exogenous
challenges by environmental factors, may disable any member of such regulatory
pathways. This could induce the death of the affected cell, may immediately provide
it with a growth advantage within a particular tissue or may mark it for cancerous
development. Natural mechanisms normally rectify such environmentally
induced abnormalities; apoptosis, or programmed cell death, is the best example of a
mechanism that can control proliferation in cancer.
1.2.1 Global Burden of Cancer
Cancer is a leading cause of death worldwide. The disease accounted for 7.4
million deaths, or around 13% of all deaths worldwide, in 2004. The main types of
cancer contributing to overall cancer mortality each year are:
Lung (1.3 million deaths/year)
Stomach (803 000 deaths/year)
Colorectal (639 000 deaths/year)
Liver (610 000 deaths/year)
Breast (519 000 deaths/year)
More than 70% of all cancer deaths occurred in low- and middle-income
countries. Deaths from cancer worldwide are projected to continue rising, with an
estimated 12 million deaths in 2030.
The most frequent types of cancer worldwide (in order of the number of global
deaths) are:
Among men - lung, stomach, liver, colorectal, esophagus and prostate
Among women - breast, lung, stomach, colorectal and cervical.
There are over 100 different types of cancer, and each is classified by the type
of cell that is initially affected. Tumors can grow and interfere with the digestive,
nervous, and circulatory systems and they can release hormones that alter body
function. Tumors that stay in one spot and demonstrate limited growth are generally
considered to be benign.
More dangerous, or malignant, tumors form when two things occur:
1. A cancerous cell manages to move throughout the body using the blood or
lymph systems, destroying healthy tissue in a process called invasion.
2. That cell manages to divide and grow, making new blood vessels to feed itself
in a process called angiogenesis.
When a tumour successfully spreads to other parts of the body and grows,
invading and destroying other healthy tissues, it is said to have metastasized.
1.2.2 Causes of Cancer
Cancer is ultimately the result of cells that uncontrollably grow and do not die.
Normal cells in the body follow an orderly path of growth, division, and death.
Cancer results from anomalies in programmed cell death, or apoptosis [14]. Unlike
regular cells, cancer cells do not undergo programmed death and instead continue
to grow and divide. This leads to a mass of abnormal cells that grows out of control.
Although most cancers develop in and spread from an organ, blood cancers like leukemia
do not. They affect the blood and the organs that form blood and then invade nearby
tissues.
1.2.3 Symptoms of Cancer
1.2.3.1 A broad spectrum of non-specific cancer symptoms may include:
Persistent Fatigue: Fatigue is one of the most commonly experienced cancer
symptoms. It is usually more common when the cancer is advanced, but still occurs in
the early stages of some cancers. Fatigue is a symptom of both malignant and
non-malignant conditions. Anemia is a condition that is associated with many types of
cancer, especially types affecting the bowel.
Unintentional Weight Loss: Losing 10 pounds or more unintentionally
warrants a visit to the doctor. This type of weight loss can occur with or without loss
of appetite. Weight loss can be a symptom of cancer, but it is also a symptom of many
other illnesses.
Pain: Pain is not an early symptom of cancer, except in some cancer types like those
that spread to the bone. Pain generally occurs when cancer spreads and begins to
affect other organs and nerves. Lower back pain is a cancer symptom that is
associated with ovarian cancer and colon cancer. Shoulder pain can also be a
symptom of lung cancer. Pain in the form of headaches can be associated with brain
tumors. Stomach pains can be related to types of cancer, like stomach cancer,
pancreatic cancer, and many others.
Fever: A fever is a very non-specific symptom of many mild to severe conditions,
including cancer. Fevers are commonly associated with types of cancer like leukemia
and lymphoma that affect the blood.
Bowel Changes: Constipation, diarrhea, blood in the stools, gas, thinner stools are
most commonly associated with colon cancer, but are also related to other cancer
types.
Chronic Cough: In relation to cancer, a chronic cough with blood or mucus can be a
symptom of lung cancer.
1.2.4 General Classification of Cancer
There are five broad groups that are used to classify cancer [15][16].
Carcinomas are characterized by cells that cover internal and external parts of
the body such as lung, breast, and colon cancer.
Sarcomas are characterized by cells that are located in bone, cartilage, fat,
connective tissue, muscle, and other supportive tissues.
Lymphomas are cancers that begin in the lymph nodes and immune system
tissues.
Leukaemias are cancers that begin in the bone marrow and often accumulate
in the bloodstream.
Adenomas are cancers that arise in the thyroid, the pituitary gland, the adrenal
gland, and other glandular tissues.
1.2.5 Diagnosis of Cancer
Early detection of cancer can greatly improve the odds of successful treatment
and survival. Imaging techniques such as X-rays, CT scans, MRI scans, PET
scans, and ultrasound scans are used regularly to detect where a tumor
is located and what organs may be affected by it [17] [18].
Extracting cancer cells and looking at them under a microscope is the only
absolute way to diagnose cancer. This procedure is called a biopsy. Analysis
of proteins, polysaccharides, etc. released by cancerous cells, performed through
blood tests, can also serve as an index. For example, in prostate cancer, prostate
cells release a higher level of a chemical called PSA (prostate-specific antigen)
into the bloodstream.
1.2.6 Staging
The most common cancer staging method is called the TNM system. T (1-4)
indicates the size and direct extent of the primary tumor, N (0-3) indicates the degree
to which the cancer has spread to nearby lymph nodes, and M (0-1) indicates whether
the cancer has metastasized to other organs in the body. TNM descriptions then lead
to a simpler categorization of stages, from 0 to 4, where lower numbers indicate that
the cancer is less severe [19]. While most Stage 1 tumors are curable, most Stage 4 tumors
are inoperable or untreatable.
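The T, N and M ranges quoted above can be captured in a small data structure. The sketch below is illustrative only: it validates the three components against the ranges given in the text and does not attempt the mapping from TNM descriptors to stages 0 to 4, which depends on the cancer type and is not specified here. The class name `TNMStage` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TNMStage:
    """TNM descriptor restricted to the ranges quoted in the text."""
    t: int  # size and direct extent of the primary tumor, 1-4
    n: int  # degree of spread to nearby lymph nodes, 0-3
    m: int  # distant metastasis: 0 (absent) or 1 (present)

    def __post_init__(self) -> None:
        limits = {"t": (1, 4), "n": (0, 3), "m": (0, 1)}
        for name, (low, high) in limits.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(
                    f"{name.upper()} must be in {low}-{high}, got {value}"
                )

    @property
    def has_metastasized(self) -> bool:
        return self.m == 1

    def label(self) -> str:
        """Conventional compact notation, for example T2N1M0."""
        return f"T{self.t}N{self.n}M{self.m}"
```

A descriptor such as `TNMStage(2, 1, 0)` is accepted, while `TNMStage(5, 0, 0)` raises a `ValueError` because T lies outside the quoted range.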
1.2.7 Treatment
Cancer treatment depends on the type of cancer, the stage of the cancer, age,
health status, and additional personal characteristics [20]. There is no single treatment
for cancer, and patients often receive a combination of therapies and palliative care.
Treatments usually fall into one of the following categories: surgery, radiation,
chemotherapy, immunotherapy, hormone therapy, or gene therapy.
1.2.7.1 Surgery: Surgery is the oldest known treatment for cancer [21]. If a cancer
has not metastasized, it is possible to completely cure a patient by surgically removing
the cancer from the body. This is often seen in the removal of the prostate or a breast
or testicle. After the disease has spread, however, it is nearly impossible to remove all
of the cancer cells.
1.2.7.2 Radiation: Radiation treatment, also known as radiotherapy, destroys cancer
by focusing high-energy rays on the cancer cells. Radiotherapy [22][23] utilizes high-
energy gamma-rays that are emitted from metals such as radium or high-energy x-rays
that are created in a special machine. Early radiation treatments caused severe side-
effects because the energy beams would damage normal, healthy tissue. But the
technologies have improved so that beams can be more accurately targeted.
Radiotherapy is used as a stand-alone treatment to shrink a tumor or destroy cancer
cells (including those associated with leukemia and lymphoma), and it is also used in
combination with other cancer treatments.
1.2.7.3 Chemotherapy: Chemotherapy utilizes chemicals that interfere with the cell
division process - damaging proteins or DNA - so that cancer cells will commit
suicide. These treatments target any rapidly dividing cells (not necessarily just cancer
cells), but normal cells usually can recover from any chemical-induced damage while
cancer cells cannot. Chemotherapy [24] is generally used to treat cancer that has
spread or metastasized because the medicines travel throughout the entire body. It is a
necessary treatment for some forms of leukemia and lymphoma. Chemotherapy
treatment occurs in cycles so the body has time to heal between doses. However, there
are still common side effects such as hair loss, nausea, fatigue, and vomiting.
Combination therapies often include multiple types of chemotherapy or chemotherapy
combined with other treatment options.
1.2.7.4 Immunotherapy: Immunotherapy [25] aims to get the body's immune system
to fight the tumor. Local immunotherapy injects a treatment into an affected area, for
example, to cause inflammation that causes a tumor to shrink. Systemic
immunotherapy treats the whole body by administering an agent such as the protein
interferon alpha that can shrink tumors. These therapies are relatively
young, but researchers have had success with treatments that introduce antibodies to
the body that inhibit the growth of breast cancer cells. Bone marrow transplantation or
hematopoietic stem cell transplantation can also be considered immunotherapy
because the donor's immune cells will often attack the tumor or cancer cells that are
present in the host.
1.2.7.5 Hormone therapy: Several cancers have been linked to some types of
hormones, most notably breast and prostate cancer. Hormone therapy is designed to
alter hormone production in the body so that cancer cells stop growing or are killed
completely. Breast cancer hormone therapies often focus on reducing estrogen levels
(a common drug for this is tamoxifen) and prostate cancer hormone therapies often
focus on reducing testosterone levels. In addition, some leukemia and lymphoma
cases can be treated with the hormone cortisone.
1.2.7.6 Gene therapy: The goal of gene therapy is to replace damaged genes with
ones that work to address a root cause of cancer: damage to DNA. For example,
researchers are trying to replace the damaged gene that signals cells to stop dividing
(the p53 gene) with a copy of a working gene. Other gene-based therapies [26] focus
on further damaging cancer cell DNA to the point where the cell commits suicide.
1.2.8 Prevention
Cancers that are closely linked to certain behaviors are the easiest to prevent
[27][28]. For example, choosing not to smoke tobacco [29] or drink alcohol
significantly lowers the risk of several types of cancer, most notably lung, throat,
mouth, and liver cancer. Skin cancer can be prevented by staying in the shade,
protecting yourself with a hat and shirt when in the sun, and using sunscreen. Diet is
also an important part of cancer prevention [30] since what we eat has been linked to
the disease.
Certain vaccinations have been associated with the prevention of some
cancers. For example, many women receive a vaccination for the human
papillomavirus because of the virus's relationship with cervical cancer [31]. Hepatitis
B vaccines prevent the hepatitis B virus, which can cause liver cancer.
1.2.9 Cancer Cells in Culture
Normal cells pass through a limited number of cell divisions before they
decline in vigour and die. This is called replicative senescence. It may be caused by
their inability to synthesize telomerase. Cancer cells may be immortal; that is, they may
proliferate indefinitely in culture. Cancer cells in culture produce telomerase [32].
When normal cells are placed on a tissue culture dish, they proliferate until the
surface of the dish is covered by a single layer of cells just touching each other. Then
mitosis ceases. This phenomenon is called contact inhibition. Cancer cells show no
contact inhibition. Once the surface of the dish is covered, the cells continue to divide,
piling up into mounds. The cells below are said to be transformed. Radiation, certain
chemicals, and certain viruses are capable of transforming cells.
Figure 1. 1: Image of HeLa cells [33]
Cancer cells almost always have an abnormal karyotype [34] with
o abnormal numbers of chromosomes (polyploidy or aneuploidy)
o chromosomes with abnormal structure:
    translocations
    deletions
    duplications
    inversions
1.3 GLUTATHIONE S-TRANSFERASE
Glutathione S-transferases [35] constitute a large family of enzymes which
catalyze the addition of glutathione to endogenous or xenobiotic, often toxic,
electrophilic chemicals, and form a major group of detoxification enzymes. Eukaryotic
glutathione S-transferases usually mediate the inactivation, degradation or excretion
of a wide range of compounds by formation of the corresponding glutathione
conjugates. Different classes of GSTs have been defined as α, μ, π, θ, σ, ζ and ω in
mammals, φ and τ in plants, δ in insects and β in bacteria [36], based mainly on
sequence analysis.
All eukaryotic species possess multiple cytosolic and membrane-bound GST
isoenzymes, each of which displays distinct catalytic as well as noncatalytic binding
properties: the cytosolic enzymes are encoded by at least five distantly related gene
families [37] designated class alpha, mu, pi, sigma, and theta GST, whereas the
membrane-bound enzymes, microsomal GST and leukotriene C4 synthetase, are
encoded by single genes and both have arisen separately from the soluble GST.
Evidence suggests that the level of expression of GST is a crucial factor in
determining the sensitivity of cells to a broad spectrum of toxic chemicals [38]. The
biochemical functions of GST are described to show how individual isoenzymes
contribute to resistance to carcinogens, antitumor drugs, environmental pollutants,
polycyclic hydrocarbons [39][40] and products of oxidative stress. A description of
the mechanisms of transcriptional and posttranscriptional regulation of GST
isoenzymes is provided to allow identification of factors that may modulate resistance
to specific noxious chemicals.
1.3.1 Enzyme Classification
EC no: 2.5.1.18
Accepted name: glutathione transferase
Other name(s): glutathione S-transferase; glutathione S-alkyltransferase;
glutathione S-aryltransferase; S-(hydroxyalkyl)glutathione lyase; glutathione S-
aralkyltransferase; glutathione S-alkyl transferase; GST
Systematic name: RX: glutathione R-transferase
1.3.2 Reaction Catalyzed
Figure 1. 2: Reaction of RX + glutathione = HX + R-S-glutathione
Comments: A group of enzymes of broad specificity. R may be an aliphatic, aromatic
or heterocyclic group; X may be a sulfate, nitrile or halide group. Also catalyses the
addition of aliphatic epoxides and arene oxides to glutathione, the reduction of polyol
nitrate by glutathione to polyol and nitrile, certain isomerization reactions and
disulfide interchange.
1.3.3 Pathways Involved
Glutathione S-transferase is a very important enzyme that participates in three important
pathways, which are as follows.
Table 1. 1: Pathways involving glutathione S-transferase
S.NO  PATHWAY                                   KEGG LINK
1     Drug Metabolism - Cytochrome P450         00982
2     Glutathione Metabolism                    00480
3     Xenobiotic Metabolism by Cytochrome P450  00980
Figure 1. 3: Drug Metabolism - Cytochrome P-450 [41]
1.3.4 Glutathione Metabolism:
Figure 1.4: Xenobiotic Metabolism by Cytochrome P450
1.3.5 Risk of Cancer with GST
Polymorphisms in glutathione S-transferase genes (GST-M1, GST-T1 and
GST-P1) are associated with susceptibility to prostate cancer among male smokers [42].
Polymorphisms of the glutathione S-transferase genotypes GSTM1, GSTT1, and GSTP1
are associated with the risk of, and survival from, pancreatic cancer [43].
Numerous studies have associated bladder cancer with exposure to
carcinogens present in tobacco smoke and other environmental or occupational
exposures. Approximately 50% of all humans inherit two deleted copies of the
GSTM1 gene, which encodes the carcinogen [44] detoxification
enzyme glutathione S-transferase M1. Recent findings suggest that the GSTM1
gene may modulate the internal dose of environmental carcinogens and thereby affect
the risk of developing bladder cancer.
Regulation of Gene Expression: A region in the 5' flanking sequence of the glutathione
S-transferase (RX:glutathione R-transferase, EC 2.5.1.18) Ya subunit gene [45]
contains a unique xenobiotic-responsive element (XRE). The regulatory region spans
nucleotides -722 to -682 of the 5' flanking sequence and is responsible for part of the
basal level as well as inducible expression of the Ya subunit gene by planar aromatic
compounds such as beta-naphthoflavone (beta-NF) and 3-methylcholanthrene. The
DNA sequence of this region (beta-NF-responsive element) is distinct from the DNA
sequence of the XRE found in the cytochrome P-450 IA1 gene. In addition to the
region containing the beta-NF-responsive element, two other regulatory regions of the
Ya subunit gene were identified.
One region spans nucleotides -867 to -857 and has a DNA sequence which is
identical to the hepatocyte nuclear factor 1 recognition motif found in several liver-
specific genes. The second region spans nucleotides -908 to -899 and contains a DNA
sequence with identity to the XRE found in the cytochrome P-450 IA1 gene. The
XRE sequence also contributes to part of the responsiveness of the Ya subunit gene to
planar aromatic compounds. This suggests that regulation of gene expression by
planar aromatic compounds can be mediated by a DNA sequence that is distinct from
the XRE sequence.
1.4 SEQUENCE ANALYSIS
Early in the days of protein and gene sequence analysis [46][47], it was
discovered that the sequences from related proteins or genes were similar, in the sense
that one could align the sequences so that many corresponding residues match. This
discovery was very important, since strong similarity between two genes is a strong
argument for their homology. Bioinformatics is based on it. The basis for comparison
of proteins and genes using the similarity of their sequences is that the proteins or
genes are related by evolution; they have a common ancestor. Random mutations in
the sequences accumulate over time, so that proteins or genes that have a common
ancestor far back in time are not as similar as proteins or genes that diverged from
each other more recently. Analysis of evolutionary relationships between protein or
gene sequences depends critically on sequence alignments.
There are many features of sequence alignments that give interesting
information. For example, a closer analysis of the alignment can reveal which parts of
the sequences are likely to be important for the function, if the proteins are involved
in similar processes. In parts of the sequence of a protein which are not very critical
for its function, the random mutations can accumulate more easily. In parts of the
sequence that are critical for the function of the protein, hardly any mutations will be
accepted; nearly all changes in such regions will destroy the function.
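The idea that functionally critical positions tolerate few mutations can be made concrete by scoring each column of a multiple alignment. The sketch below is a minimal illustration, assuming the sequences are already aligned to equal length with "-" marking gaps; the score used (fraction of sequences carrying the most common residue in a column) is a deliberately simple choice, and the function name is hypothetical.

```python
from collections import Counter


def column_conservation(alignment: list[str], gap: str = "-") -> list[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("all aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [residue for residue in column if residue != gap]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores
```

Columns that score close to 1.0 are candidates for positions critical to function, while low-scoring columns are those in which mutations have accumulated more freely.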
Sequence alignment has become an essential part of biological science [48].
There are now many different techniques and implementations of methods to perform
alignment of sequences.
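As one concrete instance of such a technique, the following sketch implements global pairwise alignment by dynamic programming (the Needleman-Wunsch algorithm). The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption made for brevity; practical tools score protein residues with substitution matrices such as PAM or BLOSUM.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned against gaps only

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))
```

The table requires time and memory proportional to the product of the two sequence lengths, which is why heuristic methods are preferred for searches against large databases.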
1.4.1 The Evolutionary Basis of Sequence Alignment
One goal of sequence alignment is to enable the researcher to determine
whether two sequences display sufficient similarity such that an inference of
homology is justified [49][50]. Although these two terms are often interchanged in
popular usage, let us distinguish them to avoid confusion in the current discussion.
Similarity is an observable quantity that might be expressed as, say, percent identity
or some other suitable measure. Homology, on the other hand, refers to a conclusion
drawn from these data that two genes share a common evolutionary history [51].
Genes either are or are not homologous—there are no degrees for homology [52] as
there are for similarity.
1.4.2 Goals of Sequence Analysis
1. Identify the genes
2. Determine the function of each gene. One way to hypothesize the function is to find
another gene possibly from another organism whose function is known and to which
the new gene has high sequence similarity. This assumes that sequence similarity
implies functional similarity, which may or may not be true.
3. Identify the proteins involved in the regulation of gene expression.
4. Identify sequence repeats.
5. Identify other functional regions, for example origins of replication (sites at which
DNA polymerase binds and begins replication), pseudogenes (sequences
that look like genes but are not expressed), sequences responsible for the compact
folding of DNA, and sequences responsible for nuclear anchoring of the DNA.
Many of these tasks are computational in nature. Given the incredible rate at
which sequence data is being produced, the integration of computer science,
mathematics, and biology will be integral to analyzing those sequences.
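One of the goals listed above, identifying sequence repeats, gives a simple example of such a computational task. The sketch below reports exact repeats of a fixed length k and their start positions; it is a minimal illustration and makes no attempt to detect approximate or variable-length repeats. The function name is hypothetical.

```python
from collections import defaultdict


def repeated_kmers(sequence: str, k: int) -> dict[str, list[int]]:
    """Map each length-k substring occurring more than once to its start positions."""
    if k <= 0:
        raise ValueError("k must be positive")
    positions: dict[str, list[int]] = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}
```

For the sequence `"ATGCATGC"` with k = 4, the only substring that occurs twice is `ATGC`, at positions 0 and 4.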
1.4.3 Sequence Similarity
In molecular biology, a common question is to ask whether or not two
sequences are related. The most common way to tell whether or not they are related
is to compare them to one another to see if they are similar. If we look at two words
in the English language, we note that two words that are spelled similarly may mean
two completely different things, such as the words pear and tear.
Biological sequences that are similar but not exact provide useful information
to help discover functional, structural, and evolutionary information. Two sequences
in different organisms are homologous if they have been derived from a common
ancestor sequence [53]. Two sequences may or may not be homologous regardless of
their sequence similarity. However, the greater the sequence similarity, the greater
the chance that they share similar function and/or structure.
1.4.3.1 Comparison of sequence analysis techniques
There are different sequence analysis techniques; here my algorithm,
SSHC (Sequence Search Hill Climbing algorithm), is compared with the BLAST and
FASTA techniques.
Table 1. 2: Comparison table of different techniques (five scoring matrices per technique)
BLAST analysis                           FASTA analysis                            SSHC analysis
Matrix   %Identity  %Positives  Score    Matrix   %Identity  %Similarity  Score    Matrix   %Identity  %Positives  Score
PAM 30   66         80          306      BL 50    66.2       86.8         230      PAM 30   73         76.5        268
PAM 70   66         80          304      BL 62    66.2       86.3         276.3    PAM 70   73         76.25       290.15
BL 80    66         80          311      BL 80    66.2       83.1         346.9    BL 80    73         74.65       328.95
BL 62    66         80          300      PAM 120  66.2       86.8         327.4    BL 62    73         76.5        313.7
BL 45    66         80          313      PAM 250  66.2       90.4         153.7    BL 45    73         78.3        233.35
Observing all three approaches, our proposed system performs consistently.
It performs sequence analysis efficiently in various
aspects. SSHC searches for each amino acid individually and without gaps. When
we compare SSHC with other systems such as BLAST and FASTA, it gives the
neighbor sequence from the database. We therefore conclude that our system is better than
the other systems.
Our proposed system, the Sequence Analysis Steepest-Ascent Hill Climbing
technique, follows a strategy that searches for the nearest peak point and, upon
climbing it, checks whether the destination has been reached. If not, it searches for
the next peak point. SSHC continues this process until it finds the best solution.
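The general strategy of steepest-ascent hill climbing can be sketched as follows. This is a generic textbook version and not the SSHC implementation itself, whose scoring function and neighbourhood definition are not given in this section; the names `neighbours`, `score` and `max_steps` are illustrative assumptions.

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")


def steepest_ascent_hill_climb(
    start: State,
    neighbours: Callable[[State], Iterable[State]],
    score: Callable[[State], float],
    max_steps: int = 1000,
) -> tuple[State, float]:
    """Repeatedly move to the best-scoring neighbour until none improves."""
    current = start
    current_score = score(current)
    for _ in range(max_steps):
        candidates = list(neighbours(current))
        if not candidates:
            break
        best = max(candidates, key=score)  # examine all neighbours, keep the best
        best_score = score(best)
        if best_score <= current_score:
            break  # no neighbour is better: a (possibly local) peak
        current, current_score = best, best_score
    return current, current_score


# Toy usage: maximise -(x - 3)^2 over the integers, stepping by one.
peak, value = steepest_ascent_hill_climb(
    start=0,
    neighbours=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
)
```

In the toy run the search climbs from 0 to the single peak at 3. On a landscape with several peaks, this basic form stops at the first peak it reaches, which is why a strategy for moving on to the next peak, as described above, is needed.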
1.4.3.2 Advantages of my algorithm
My algorithm SSHC follows heuristic search.
Heuristic search is an excellent Artificial Intelligence (AI) technique for solving several problems.
This algorithm searches very accurately and efficiently.
SSHC occupies much less memory compared to direct search.
Regular search techniques are used in priority of operations.
SSHC makes use of domain-specific information.
It tells us which is the best solution nearest to us.
SSHC reduces the search space so that the search can be done faster.
SSHC searches for the best fit to the solution.
1.4.4 Review Of Literature
Comparative modeling of protein or Homology Modeling refers to
constructing an atomic-resolution model of the target protein from its amino acid
sequence and an experimental three-dimensional structure of a related homologous
protein [54]. Homology modeling can produce high-quality structural models when
the target and template are closely related, which has inspired the formation of
a structural genomics consortium dedicated to the production of representative
experimental structures for all classes of protein folds. Mainly, it relies on the
identification of one or more known protein structures likely to resemble the structure
of the query sequence and on the production of an alignment that maps residues in the
query sequence to residues in the template sequence. The chief inaccuracies in
homology modeling, which worsen with lower sequence identity, derive from
errors in the initial sequence alignment and from improper template selection. The
sequence alignment and template structure are then used to produce a structural model
of the target. Because protein structures are more conserved than DNA sequences,
detectable levels of sequence similarity usually imply considerable structural
similarity.
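Since the discussion relies on the notion of sequence identity between target and template, a short sketch of how percent identity can be computed from a pairwise alignment may help. The function name, the gap character and the example sequences are assumptions for illustration; several definitions of the denominator are in use, and this sketch counts only columns where both sequences have a residue.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent of aligned (non-gap) columns in which both residues are identical."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned_columns = 0
    identical_columns = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # indel column: no residue pair to compare
        aligned_columns += 1
        if a == b:
            identical_columns += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * identical_columns / aligned_columns


# Hypothetical aligned fragments (not taken from any real protein).
identity = percent_identity("MKT-AYIAK", "MKTQAYLAK")
print(identity)  # 7 identical residues out of 8 aligned columns
```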
Like other methods of structure prediction, current practice in homology
modeling is assessed in a biennial large-scale experiment known as the Critical
Assessment of Techniques for Protein Structure Prediction, or CASP. The quality of
the homology model is dependent on the quality of the sequence alignment and
template structure. The approach can be complicated by the presence of alignment
gaps (commonly called indels) that indicate a structural region present in the target
but not in the template, and by structure gaps in the template that arise from poor
resolution in the experimental procedure (usually X-ray crystallography) used to solve
the structure [55].
Model quality declines with decreasing sequence identity; a typical model has
~1-2 Å root mean square deviation between the matched Cα atoms at 70% sequence
identity but only 2-4 Å agreement at 25% sequence identity. However, the errors are
significantly higher in the loop regions, where the amino acid sequences of the target
and template proteins may be completely different.
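The root mean square deviation quoted above can be made concrete with a small sketch. It assumes that the two sets of Cα coordinates are already matched residue by residue and already superimposed; the optimal superposition step is not shown, and the coordinates used are invented for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ca_rmsd(model: Sequence[Point], reference: Sequence[Point]) -> float:
    """RMSD in the units of the input (e.g. angstroms) over matched atom pairs."""
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        squared_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared_sum / len(model))


# Hypothetical coordinates: each atom is displaced by exactly 1 unit.
model_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference_ca = [(0.0, 0.0, 1.0), (1.0, 0.0, -1.0)]
rmsd = ca_rmsd(model_ca, reference_ca)
print(rmsd)
```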
1.5 STUDY ON GLUTATHIONE S-TRANSFERASE
Glutathione S-transferases (GSTs) [56] are an important class of phase II
detoxifying enzymes, catalyzing the conjugation of glutathione (GSH) to electrophilic
species. Recently, a number of cytosolic GSTs were crystallized. A homology model of
rat Mu class glutathione S-transferase 4-4 has been built using molecular
modeling techniques. This modeling study was based upon the crystal structure
of rat GST 3-3; both are members of the Mu class. The GST 3-3 and GST 4-4 isoenzymes
share a sequence homology of 88%.
The critical role of the glutathione S-transferase (GST) multigene family in
cellular protection in combination with the large interindividual variability in the
expression of these enzymes has prompted an investigation of their importance in
cancer prevention and susceptibility. Previous preclinical and clinical studies in
laboratories have established an association between decreased GST activity and
increased risk of colorectal cancer. Based on the increased incidence of colon
malignancies among patients with ulcerative colitis, GST activity has been examined
in a mouse model of induced colitis. Significant decreases (50% of controls) in the
GST activity of colon tissue were observed during the establishment and progression
of colitis.
These data suggested that the depletion of cellular protection may be an
important event in the carcinogenic progression of ulcerative colitis. The ability of the
dithiolethione oltipraz to induce GST expression within the murine colon has been
demonstrated. Use of chemopreventive regimens to induce phase 2 detoxication
enzyme expression represents a promising strategy for the prevention of cancer.
Clinical studies revealed that the GST activity of blood lymphocytes from individuals
with either a personal or family history of colorectal cancer or a personal history of
colon polyps was decreased significantly when compared to that of healthy controls.
Phase 1 clinical evaluation of oltipraz has demonstrated its ability to induce GST
activity as well as the level of transcripts encoding γ-glutamylcysteine synthetase (γ-
GCS) and DT-diaphorase in the colon mucosa of individuals at increased risk for
colorectal cancer. The observed correlation between the post treatment response in
blood lymphocytes and colon mucosa suggested that blood lymphocytes may be used
in future trials as a surrogate biomarker of the responsiveness of colon tissue to
chemopreventive regimens.
A modeling study of Wuchereria bancrofti glutathione S-transferase [57] has
also been done for comparison with Brugia malayi. In this study, sequence analysis
by PSI-BLAST showed that Wb GST and Bm GST have 42% and 41% sequence identity
(the highest homology compared to other known structures) with Sus scrofa GST,
respectively. Therefore, the authors used the structure of Sus scrofa GST (PDB:
2GSR) as the template for building homology models for Wb GST and Bm GST using
MOE (Molecular Operating Environment), an automated molecular modeling tool. In
this study the models were analyzed with WHATIF.
Human Theta class glutathione transferase T1-1 was modeled by a manual
threading approach [58]. This study was based on the coordinates of the related Theta
class T2-2. A low level of sequence identity (about 20%) was found in the C-terminal
extension, which contributes to the active site of this molecule. Manual docking
of known substrates and non-substrates has implicated potential candidates for the
T1-1 catalytic residues involved in the dehalogenation and epoxide-ring opening
activities. This docking study was used to examine glutathione S-transferase T1
activity with different substrates between species.
One more homology modeling study was reported on rat Mu class
glutathione S-transferase; it predicts the difference in stereoselectivity between
GST 3-3 and GST 4-4 for the R- and S-configured carbons of the oxirane moiety by means
of docking studies [59]. The protein homology model, together with the substrate
model, will be useful to further rationalize the substrate selectivity of GST 4-4 and to
identify new potential GST 4-4 substrates. Human class Mu glutathione S-transferases
have also been modeled by homology modeling. That work generated the 3-D structures of
three human GST isozymes of the Mu class, M1b-1b, M2-2 and M3-3, using the rat
GST 3-3 structure as a template. The high percentage of identity among these enzymes
and the lack of insertions and deletions make the system ideally suited to the
technique of homology modeling.
Over 500 studies have examined the association of genetic variants of
glutathione S-transferase with various malignancies yielding inconsistent results [60].
The genotyping was based on PCR assays that identified the GSTM1 and GSTT1 null
(-/-) genotypes but did not distinguish homozygous wild-type +/+ and heterozygous
+/- individuals. Complete GSTM1 and GSTT1 genotyping can be accomplished by
recently developed assays that allow the definition of +/+, +/-, and -/- genotypes by
separate identification of the respective GSTM1 and GSTT1 wild-type and null
alleles.
Application of the new GSTM1 assay to a breast cancer case-control study
revealed that the relative risk of breast cancer for the +/+ genotype compared to the -/-
genotype was 2.83 (95% confidence interval 1.45-5.59; P=0.002), suggesting a
protective effect of the GSTM1 deletion. Regardless of the explanation for the
association between the +/+ genotype and increased breast cancer risk, these results
warrant application of true GSTM1 and GSTT1 genotyping to additional or
previously analyzed groups with breast cancer or other malignancies.
Dietary intake of cruciferous vegetables (Brassica spp.) has been inversely
related to colorectal cancer risk [61], and this has been attributed to their high content
of glucosinolate degradation products such as isothiocyanates (ITCs). These
compounds act as anticarcinogens by inducing phase II conjugating enzymes, in
particular glutathione S-transferases (GSTs). These enzymes also metabolize ITCs,
such that the protective effect of cruciferous vegetables may depend on GST
genotype. The Singapore Chinese Health Study is a prospective investigation among
63 257 middle-aged men and women, who were enrolled between April 1993 and
December 1998.
In this nested case-control analysis, they compared 213 incident cases of
colorectal cancer with 1194 controls. Information on dietary ITC intake from
cruciferous vegetables, collected at recruitment via a semi-quantitative food frequency
questionnaire, was combined with GSTM1, T1 and P1 genotype from peripheral
blood lymphocytes or buccal mucosa. When categorized into high and low intake,
dietary ITC was slightly lower in cases than controls but the difference was not
significant.
There were no overall associations between GSTM1, T1 or P1 genotypes and
colorectal cancer risk. However, among individuals with both GSTM1 and T1 null
genotypes, the authors observed a 57% reduction in risk among high versus low
consumers of ITC (OR 0.43, 95% CI 0.20–0.96), in particular for colon cancer (OR 0.31,
0.12–0.84). These results are compatible with the hypothesis that ITCs from cruciferous
vegetables modify risk of colorectal cancer in individuals with low GST activity.
Further, this gene–diet interaction may be important in studies evaluating the effect of
risk-enhancing compounds in the colorectum.
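The arithmetic linking the reported odds ratios to the quoted percentage can be sketched as follows. The helper names are hypothetical, and reading an odds ratio as a percent change in risk is the usual approximation, assumed here to hold, rather than an exact identity.

```python
def percent_change_from_ratio(odds_ratio: float) -> float:
    """Percent change relative to the reference group (negative means reduction)."""
    return (odds_ratio - 1.0) * 100.0


def interval_excludes_one(lower: float, upper: float) -> bool:
    """True when a confidence interval for a ratio does not contain 1 (no effect)."""
    return lower > 1.0 or upper < 1.0


# Values quoted in the text: OR 0.43 (95% CI 0.20-0.96) for colorectal cancer.
change = percent_change_from_ratio(0.43)
print(round(change))                      # about a 57% reduction
print(interval_excludes_one(0.20, 0.96))  # interval lies entirely below 1
```

The same check applied to the breast cancer result quoted earlier (2.83, 95% confidence interval 1.45-5.59) shows an interval lying entirely above 1.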
A model of chick class-Theta glutathione S-transferase CL1 [62] was constructed
using the multiple sequence alignment algorithm of Feng and Doolittle. GST
purified from 1-day-old chick livers was digested with Achromobacter
proteinase I. The resulting fragments were utilized for sequence analysis. This protein
has 70-73% sequence similarity with mammalian class-Theta glutathione
S-transferases. Based on the known X-ray crystal structures of class-Alpha, -Mu and -Pi
glutathione S-transferases, a model was constructed for the N-terminal 232 residues of
CL1. The final model of the CL1 subunit was obtained by applying the CHARMM
energy-minimization process with the CHARMM/QUANTA program package. The
program ran through 100 cycles of steepest descent minimization.
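Steepest descent minimization, as mentioned for the CL1 model, can be sketched on a toy energy function. This is not the CHARMM procedure: the one-dimensional harmonic "bond" energy, the force constant, the fixed step size and the starting value are assumptions for illustration only.

```python
from typing import Callable, List, Sequence


def steepest_descent(
    gradient: Callable[[Sequence[float]], List[float]],
    start: Sequence[float],
    step: float = 0.1,
    cycles: int = 100,
) -> List[float]:
    """Take `cycles` fixed-size steps against the energy gradient."""
    x = list(start)
    for _ in range(cycles):
        g = gradient(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


EQUILIBRIUM_LENGTH = 1.5  # hypothetical equilibrium bond length
FORCE_CONSTANT = 1.0      # hypothetical force constant


def harmonic_gradient(x: Sequence[float]) -> List[float]:
    """Gradient of E = 0.5 * k * (r - r0)^2 with respect to r."""
    return [FORCE_CONSTANT * (x[0] - EQUILIBRIUM_LENGTH)]


minimized = steepest_descent(harmonic_gradient, [3.0], step=0.1, cycles=100)
print(minimized[0])  # close to the equilibrium length after 100 cycles
```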
Homology modeling was used for model building of a terminal thiamine
biosynthesis enzyme, phosphoryl thymidine kinase (Thi E), using Geno3D, Swiss
Model and Modeller. The best model was selected based on overall stereochemical
quality. The potential ligand binding sites in the model were identified by the CASTp
server. The validated theoretical model of the 3D structure of the Thi E protein of E.
coli O157:H7 was predicted using a thiamine phosphate pyrophosphatase
from Pyrococcus furiosus (PDB ID: 1X13_A) as template.
Dengue is a serious public health problem in tropical and subtropical
countries. It is caused by any of the four serologically distinct dengue viruses, namely
DENV1-4. The three-dimensional structures of the three isoforms of Ae. aegypti
defensins were predicted by comparative modeling [63][64]. Prediction was done with
Modeller 9v1 and the structures were validated through a series of tests.
1.6 HOMOLOGY MODELLING ON ANNEXIN
During the last two decades, the number of proteins with known sequence has
increased rapidly. In contrast, the number of proteins with known structure has
grown much more slowly. This imbalance has critically limited our ability to
understand the molecular mechanisms of proteins and to conduct structure-based drug
design in a timely manner using updated information on newly found sequences. It is
therefore highly desirable to model the 3D structure of proteins with a structural
bioinformatics approach, namely homology modeling. In this study a homology modeling
approach was utilized to develop the 3D structure of Aspergillus fumigatus Af293
(anxC3.1) [65]. An annexin (anxC3.1) was isolated and characterized from the
industrially important filamentous fungus Aspergillus niger. AnxC3.1 is a single-copy
gene encoding a 506 amino acid predicted protein which contains four annexin
repeats. AnxC3.1 expression was found to be unaltered under a variety of conditions
such as increased secretion, altered nitrogen source, heat shock, and decreased
Ca2+ levels, indicating that anxC3.1 is constitutively expressed. This is the first
reported functional characterization of a fungal annexin. It is therefore highly
desirable to develop a model of this protein.
Anticancer drug development using the platform of glutathione (GSH),
glutathione S-transferase (GST) and pathways that maintain thiol homeostasis has
recently produced a number of lead compounds. GSTπ is a prevalent protein in many
solid tumors and is overexpressed in cancers resistant to drugs. It has proved to be a
viable target for pro-drug activation with at least one candidate in late-stage clinical
development.
In addition, GSTπ possesses noncatalytic ligand-binding properties important
in the direct regulation of kinase pathways. This has led to the development and
testing of agents that bind to GSTπ and interfere with protein–protein interactions,
with the phase II clinical testing of one such drug. Attachment of glutathione to
acceptor cysteine residues (glutathionylation) is a post-translational modification that
can alter the structure and function of proteins. Two agents in preclinical development
(PABA/NO, releasing nitric oxide on GST activation, and NOV-002, a
pharmacologically stabilized pharmaceutical form of GSSG) can lead to
glutathionylation of a number of cellular proteins. The biological significance of these
modifications is linked with the mechanism of action of these drugs. In the short term,
glutathione-based systems should continue to provide viable targets and a platform for
the development of novel cancer drugs.
Modeller is a computer program[66] that models three-dimensional structures
of proteins and their assemblies by satisfaction of spatial restraints. Modeller is most
frequently used for homology or comparative protein structure modeling: The user
provides an alignment of a sequence to be modeled with known related structures and
Modeller will automatically calculate a model with all non-hydrogen atoms.
More generally, the input to the program is a restraint on the spatial structure
of the amino acid sequence(s) and ligands to be modeled. The output is a 3D structure
that satisfies these restraints as far as possible. Restraints can in principle be derived
from a number of different sources. These include related protein structures
(comparative modeling), NMR experiments, rules of secondary structure packing
(combinatorial modeling), cross-linking experiments, fluorescence spectroscopy,
image reconstruction in electron microscopy, site-directed mutagenesis, intuition,
residue–residue and atom–atom potentials of mean force, etc.
The restraints can operate on distances, angles, dihedral angles, pairs of
dihedral angles and some other spatial features defined by atoms or pseudo atoms.
Presently, Modeller automatically derives the restraints only from the known related
structures and their alignment with the target sequence. A 3D model is obtained by
optimization of a molecular probability density function.
The molecular probability density function for comparative modeling is optimized with
the variable target function procedure in Cartesian space that employs methods of
conjugate gradients and molecular dynamics with simulated annealing. Modeller can
also perform multiple comparison of protein sequences and/or structures, clustering of
proteins, and searching of sequence databases. The program is used with a scripting
language and does not include any graphics. It is written in standard Fortran 90 and
will run on Unix, Windows, or Mac computers. The core modeling procedure begins
with an alignment of the sequence to be modeled (target) with related known 3D
structures (templates). This alignment is usually the input to the program. The output
is a 3D model for the target sequence containing all main chain and side chain non-
hydrogen atoms. Given an alignment, the model is obtained without any user
intervention. First, many distance and dihedral angle restraints on the target sequence
are calculated from its alignment with template 3D structures.
The form of these restraints was obtained from a statistical analysis of the
relationships between many pairs of homologous structures. This analysis relied on a
database of 105 family alignments that included 416 proteins with known 3D
structure. By scanning the database, tables quantifying various correlations were
obtained, such as the correlations between two equivalent C-Alpha distances, or
between equivalent main chain dihedral angles from two related proteins.
These relationships were expressed as conditional probability density
functions and can be used directly as spatial restraints. For example, probabilities for
different values of the main chain dihedral angles are calculated from the type of a
residue considered, from main chain conformation of an equivalent residue, and from
sequence similarity between the two proteins.
Another example is the probability density function for a certain Cα-Cα distance given
equivalent distances in two related protein structures. An important feature of the
method is that the spatial restraints are obtained empirically, from a database of
protein structure alignments. Next, the spatial restraints and CHARMM energy terms
enforcing proper stereochemistry are combined into an objective function.
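How spatial restraints might be combined into a single objective function can be sketched as follows. This is not Modeller's code: it assumes, for illustration only, that each restraint is a Gaussian probability density on one interatomic distance and that the objective is the sum of negative log densities; stereochemical energy terms are omitted and the restraint values are invented.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
# Hypothetical restraint: (atom index i, atom index j, mean distance, std. deviation)
Restraint = Tuple[int, int, float, float]


def gaussian_restraint_term(distance: float, mean: float, sigma: float) -> float:
    """Negative natural log of a Gaussian pdf evaluated at `distance`."""
    return 0.5 * ((distance - mean) / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))


def objective(coords: Sequence[Point], restraints: List[Restraint]) -> float:
    """Sum of restraint terms; lower values mean the restraints are better satisfied."""
    total = 0.0
    for i, j, mean, sigma in restraints:
        total += gaussian_restraint_term(math.dist(coords[i], coords[j]), mean, sigma)
    return total


restraints = [(0, 1, 3.8, 0.1)]  # invented distance restraint between atoms 0 and 1
satisfied = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
violated = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(objective(satisfied, restraints) < objective(violated, restraints))
```

An optimizer would then vary the coordinates to lower this value, which corresponds to making the model more probable under the restraints.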
Finally, the model is obtained by optimizing the objective function in
Cartesian space. The optimization is carried out by the use of the variable target
function method employing methods of conjugate gradients and molecular dynamics
with simulated annealing. Several slightly different models can be calculated by
varying the initial structure. The variability among these models can be used to
estimate the errors.
MolDock is based on a new heuristic search algorithm that combines
differential evolution with a cavity prediction algorithm[67]. The docking scoring
function of MolDock is an extension of the piecewise linear potential (PLP) including
new hydrogen bonding and electrostatic terms. To further improve docking accuracy,
a re-ranking scoring function is introduced, which identifies the most promising
docking solution from the solutions obtained by the docking algorithm. The docking
accuracy of MolDock has been evaluated by docking flexible ligands to 77 proteins.
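Differential evolution, the search strategy named above, can be sketched in its common rand/1/bin form. This is not MolDock's algorithm: the cavity prediction step is not shown, the toy score stands in for a docking scoring function, and the population size, scaling factor and crossover rate are assumed values.

```python
import random
from typing import Callable, List, Sequence, Tuple


def differential_evolution(
    score: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],
    pop_size: int = 20,
    scale: float = 0.5,
    crossover: float = 0.9,
    generations: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimize `score` within `bounds` using rand/1/bin differential evolution."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [score(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for k in range(dim):
                if rng.random() < crossover or k == forced:
                    value = pop[a][k] + scale * (pop[b][k] - pop[c][k])
                    lo, hi = bounds[k]
                    value = min(max(value, lo), hi)  # keep the trial inside the bounds
                else:
                    value = pop[i][k]
                trial.append(value)
            trial_score = score(trial)
            if trial_score <= scores[i]:  # greedy selection: keep the better solution
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_score(pose: Sequence[float]) -> float:
    """Stand-in for a docking score, with its minimum at (1, 1, 1)."""
    return sum((p - 1.0) ** 2 for p in pose)


best_pose, best_score = differential_evolution(toy_score, [(-5.0, 5.0)] * 3)
print(best_pose, best_score)
```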
One application of molecular docking is to design pharmaceuticals in silico by
optimizing lead candidates targeted against proteins. The lead candidates can be found
using a docking algorithm that tries to identify the optimal binding mode of a small
molecule (ligand) to the active site of a macromolecular target. Thus, the purpose of
drug discovery is to derive drugs that more strongly bind to a given protein target than
the natural substrate. By doing so, the biochemical reaction that the target molecule
catalyzes can be altered or prevented. The scoring function used by MolDock is
derived from the PLP scoring functions originally proposed by Gehlhaar et al. and
later extended by Yang et al. The scoring function used by MolDock further improves
these scoring functions with a new hydrogen bonding term and new charge schemes.
1.7 COMPUTER-AIDED DRUG DESIGN (CADD)
During the last decades, drug discovery, the process that leads to the finding of
new ligands, has turned into a modern science employing both computational and
experimental approaches. The main experimental methods are combinatorial
chemistry and high-throughput screening. Computer-assisted drug design (CADD)
uses computational chemistry to discover, enhance, or study drugs and related
biologically active molecules. The most fundamental goal is to predict whether a
given molecule will bind to a target and, if so, how strongly. The CADD techniques
that have been applied to discover novel therapeutics against inflammatory bowel
disease include protein homology modeling and docking.
Computer-Aided Drug Design [68] (CADD), also known as computer-assisted
molecular design (CAMD), is a specialized discipline that uses computational
methods to simulate drug-receptor interactions. CADD methods are heavily
dependent on bioinformatics tools, applications and databases. In most current
applications of CADD, attempts are made to find a ligand (the putative drug) that will
interact favorably with a receptor that represents the target site. Binding of ligand to
the receptor may include hydrophobic, electrostatic, and hydrogen-bonding
interactions.
In addition, solvation energies of the ligand and receptor site also are
important because partial to complete desolvation must or may occur prior to binding.
This approach to CADD optimizes the fit of a ligand in a receptor site. However,
optimum fit in a target site does not guarantee that the desired activity of the drug will
be enhanced or that undesired side effects will be diminished. Moreover, this
approach does not consider the pharmacokinetics of the drug. CADD is dependent
upon the amount of information that is available about the ligand and receptor.
Ideally, one would have 3-dimensional structural information for the receptor and the
ligand-receptor complex from X-ray diffraction or NMR but in the opposite extreme,
one may have no experimental data to assist in building models of the ligand and
receptor, in which case computational methods must be applied without the
constraints that the experimental data would provide.
Based on the information that is available, there are two main approaches to CADD:
ligand-based and receptor-based molecular design methods. The ligand-based
approach is used when the structure of the receptor site is unknown, while the
receptor-based approach is used when a reliable model of the receptor site is available,
as from X-ray diffraction, NMR, or homology modeling. With the availability of the
receptor site, the problem is to design ligands that will interact favorably at the site,
which is a docking problem.
Earlier, drug design was always a trial-and-error process.
It was not until the 1960s that some understanding began to develop about the
quantitative relationship between the structure of a drug and its biological activity.
The understanding of this quantitative relationship ushered in a new era in
drug discovery in the form of computer-aided drug design.
The conventional way of synthesizing drugs has always been a tedious and
extremely expensive process, consuming several years and huge manpower in order
to come up with a single effective novel drug. However, with the advent of
information technology, it can be done efficiently in silico, thereby saving huge
funds and manpower. The pipeline of drug discovery from idea to market consists of
seven basic steps: disease selection, target selection, lead compound identification,
lead optimization, preclinical trials, clinical trial testing, and pharmacogenomic
optimization.
1.7.1 Homology Modeling using CADD
Homology modeling refers to constructing an atomic-resolution model of the
target protein from its amino acid sequence according to experimental three-
dimensional structure of a related homologous protein. Homology modeling relies on
the identification of one or more known protein structures likely to resemble the
structure of the query sequence, and on the production of an alignment that maps
residues in the query sequence to residues in the template sequence. It has been
shown that protein structures are more conserved than protein sequences amongst
homologues. The homology modeling procedure can be broken down into four
sequential steps: template selection, target-template alignment, model construction,
and model assessment. The homology modeling programs we mainly use are
SWISS-MODEL and MODELLER. We have successfully performed homology modeling to
construct a three-dimensional structure of the glutathione S-transferase
(GST) protein.
1.7.2 Docking using CADD
Docking is the computational simulation of a candidate ligand binding to a receptor.
In the field of molecular modeling, docking is a method which predicts the preferred
orientation of one molecule to a second when bound to each other to form a stable
complex. Knowledge of the preferred orientation in turn may be used to predict the
strength of association or binding affinity between two molecules using, for example,
scoring functions.
The associations between biologically relevant molecules such as proteins,
nucleic acids, carbohydrates, and lipids play a central role in signal transduction.
Furthermore, the relative orientation of the two interacting partners may affect the
type of signal produced (e.g., agonism vs antagonism). Therefore docking is useful
for predicting both the strength and type of signal produced.
Docking is frequently used to predict the binding orientation of small
molecule drug candidates to their protein targets in order to in turn predict the affinity
and activity of the small molecule. Hence docking plays an important role in the
rational design of drugs. Given the biological and pharmaceutical significance of
molecular docking, considerable efforts have been directed towards improving the
methods used to predict docking.
From the above studies it has been found that various sequence analysis and
modeling studies were carried out on variants of GSTs. However, no study has been
reported on the chick (Gallus gallus) GST protein [69]. Hence, in this work various
sequence analysis tools and homology modeling procedures were employed to build the 3D
structure of the chick GST protein. This study will be helpful for further work on the
GST family and for designing drugs for the various cancers which are related to the GST
enzyme. Designing new drugs is a significant part of science. Drug design is the
approach of finding drugs by design, based on their biological targets.
Typically a drug target is a key molecule involved in a particular metabolic or
signaling pathway that is specific to a disease condition or to the infectivity or
survival of a microbial pathogen. Some approaches attempt to stop the functioning of
the pathway in the diseased state by causing a key molecule to stop functioning.
Drugs may be designed that bind to the active region and inhibit this key molecule.
However these drugs would also have to be designed in such a way as not to affect
any other important molecules that may be similar in appearance to the key
molecules. Sequence homologies are often used to identify such risks.
1.8 RESEARCH QUESTIONS
How can the structure of glutathione S-transferases be predicted based on
sequence similarities?
What is the computational approach to highlight the crucial amino acid
residues responsible for functional attributes?
How can the interactions at the binding sites of ligands (Homo sapiens) –
protein (GST) be understood?
How can computational techniques be used to reduce the time spent in
synthesizing compounds, and how would further experimental procedures on
designed compounds lead to effective compounds by using computer-aided
drug design (CADD)?