signal processing of dna and protein sequences
Nitesh Kumar Singh
SIGNAL PROCESSING OF PROTEIN SEQUENCES AND DNA
Signal -
A signal is a flow of information. Mathematically, signals are functions of an independent variable such as time (for example, a speech signal) or position (for example, an image).
Biomedical Signal –
Electrical signals generated in a biological system (human or animal), or originating from a physiological process, due to electrochemical changes accompanied by the conduction of signals. Examples are EEG and ECG.
Signal Processing Methods –
Analog or Continuous Time Signal Processing
Digital or Discrete Time Signal Processing
Advantages of DSP over ASP -
Stable, robust and accurate.
Flexible and easy to upgrade.
Signals are easily stored.
Operations are easy and fast.
Multiplexing, as done in the Integrated Services Digital Network (ISDN).
DSP In Biomedical Signals -
Processing of biomedical signals from the biological as well as the synthetic biological world. The signals are recorded and then processed digitally. Example : EEG, ECG etc.
DSP in medical imaging. Example : CT scanners, ultrasound, endoscopes etc.
Manufacturing healthcare instruments. Example : heart rate meter, Aspect bispectral index (BIS) monitor.
Diagnostic purposes, such as analyzing heartbeat signals to check for abnormalities and, likewise, analyzing protein sequences to study the genomes of living beings.
Biomedical application domain using DSP -
Information gathering : Measurement of phenomena to understand the biological system.
Diagnosis : Detection of the malfunction, abnormality, pathology.
Monitoring : To obtain periodic or continuous information about the biological system.
Therapy and Control : Modify the behavior of the system and ensure the result.
Evaluation : Objective analysis, i.e. proof of performance, quality control, effect of treatment.
Processing of Biomedical Signals -
Transducers
Amplifiers and Filters
Analog to Digital conversion
Filtering to remove artifacts
Detection of events and components
Analysis of events and waves; Feature extraction
Pattern recognition, classification and diagnostic decisions
Computer-aided diagnosis and therapy
[Figure: block diagram with blocks labeled Biomedical signals, Signal data acquisition and Signal processing]
IN THE GENOMICS WORLD
DNA and proteins are represented mathematically as ‘character strings’, in which each character is a letter of an alphabet.
For example, DNA has an alphabet of size 4, with the letters A, T, C and G.
Proteins have an alphabet of size 20, one letter per amino acid.
REVISING SOME BIOLOGICAL FUNDAMENTALS
DNA :
It is made up of many linked smaller components, called nucleotides.
Nucleotides are of 4 types, designated A, G, T and C, and each has a 3’ end and a 5’ end.
The 3’ end of one nucleotide is linked to the 5’ end of the next, and vice versa, by a strong covalent bond.
A sequence is always read in a specific direction, from left to right, 5’ → 3’.
Cont.
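DNA occurs as a pair of strands.
The two strands of a pair are complementary to each other.
The nucleotide chains are held together by hydrogen bonds between paired bases:
A = T (two hydrogen bonds)
C ≡ G (three hydrogen bonds)
The 2 strands of a DNA molecule run in opposite directions (antiparallel).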
CENTRAL DOGMA
Each DNA molecule is made up of 2 types of regions : genes and intergenic spaces.
A gene contains the information for a protein.
Each gene is responsible for the production of a protein.
A gene further has 2 types of sub-regions : introns and exons.
Genes are first transcribed into single-stranded RNA.
Introns are then removed from the RNA by the process of splicing, which leaves the mRNA.
Cont.
After splicing, each mRNA is read in groups of 3 adjacent bases.
Each group of three bases is called a codon. E.g., AGT, AAC, TGC, TAC, etc. (written here in the DNA alphabet; in mRNA, U takes the place of T).
A codon specifies an amino acid, and the chain of amino acids defines a protein.
There are 4 × 4 × 4 = 64 possible codons, but only 20 amino acids.
Several codons can specify the same amino acid (many-to-one).
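The triplet grouping and the codon count can be checked with a few lines of Python. This is only a sketch: the helper name `split_codons` is hypothetical, and the sequence is written in the DNA alphabet, as in the examples above.

```python
from itertools import product

BASES = "ACGT"  # DNA alphabet (mRNA would use U in place of T)

def split_codons(seq):
    """Split a spliced coding sequence into consecutive, non-overlapping triplets."""
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing triplet
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Every possible triplet over a 4-letter alphabet: 4**3 = 64 codons.
ALL_CODONS = ["".join(p) for p in product(BASES, repeat=3)]

codons = split_codons("AGTAACTGCTAC")
print(codons)           # ['AGT', 'AAC', 'TGC', 'TAC']
print(len(ALL_CODONS))  # 64
```

Since 64 codons map onto only 20 amino acids (plus stop signals), the map from codons to amino acids cannot be one-to-one.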
Cont.
The process of converting mRNA to protein is called translation.
Translation is aided by adapter molecules called transfer RNA or tRNA.
DNA SEQUENCES AND DSP
Macromolecular biological sequences, that is, chains of nucleotides or amino acids, are analyzed by treating them as strings of characters; for DNA the characters are “A,” “T,” “C,” and “G.” For DSP of these sequences, the characters are assigned numerical values.
Suppose we assign the number a to character ‘A’, t to character ‘T’, c to character ‘C’, and g to character ‘G’, where a, t, c and g are complex numbers.
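Cont.
If we take ‘ t = a* ’ and ‘ g = c* ’, complementing a base conjugates its numerical value.
Writing x[n], n = 0, ..., N−1, for the numerical sequence of one strand, we can get the complementary DNA sequence (read in its own 5’ → 3’ direction) by :
$\tilde{x}[n] = x^{*}[N-1-n], \quad n = 0, \dots, N-1$
We can also obtain numerical sequences for proteins by assigning numerical values to the amino acids.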
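As a quick check of the conjugate-pair choice t = a*, g = c*: complementing a base then conjugates its value, so the complementary strand read 5’ → 3’ should equal the conjugated, reversed numerical sequence. The sketch below verifies this on a toy string. The specific values in `VALUES` are an illustrative assumption, not taken from the slides; any conjugate pairs would do.

```python
# Illustrative values only: any choice with t = conj(a) and g = conj(c) works.
VALUES = {"A": 1 + 1j, "T": 1 - 1j, "C": -1 - 1j, "G": -1 + 1j}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def to_numeric(seq):
    """Map each base to its assigned complex number."""
    return [VALUES[base] for base in seq]

def reverse_complement(seq):
    """Complementary strand, written in its own 5' -> 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def conjugate_reversed(x):
    """Conjugate every sample and reverse the order: x*[N-1-n]."""
    return [value.conjugate() for value in reversed(x)]

seq = "ATGCCGTA"
x = to_numeric(seq)
same = to_numeric(reverse_complement(seq)) == conjugate_reversed(x)
print(same)  # True
```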
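Indicator Sequence
The indicator sequence of adenine for a DNA sequence is defined as:
$u_A[n] = \begin{cases} 1, & s[n] = \text{A} \\ 0, & \text{otherwise} \end{cases}$
where A denotes adenine
and s[n] is the DNA sequence.
Similarly, we can obtain the indicator sequences u_T[n], u_C[n] and u_G[n] for the remaining 3 bases.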
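A minimal sketch of the indicator sequences: each base gets a 0/1 sequence that marks where it occurs. The function name `indicator_sequences` is hypothetical.

```python
def indicator_sequences(seq):
    """Return one 0/1 list per base: 1 where the base occurs, 0 elsewhere."""
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}

u = indicator_sequences("ATGCA")
print(u["A"])  # [1, 0, 0, 0, 1]
print(u["T"])  # [0, 1, 0, 0, 0]
```

At every position exactly one of the four indicator sequences is 1, so the four sequences always sum to the all-ones sequence.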
Cont.
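The total spectrum of a symbolic sequence is often defined as the sum of the squared moduli of the DFTs of the indicator sequences, that is:
$S[k] = |U_A[k]|^2 + |U_T[k]|^2 + |U_C[k]|^2 + |U_G[k]|^2$
where U_X[k] is the DFT of the indicator sequence of base X.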
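The total spectrum can be sketched with a direct DFT in plain Python, which is fine for short toy sequences (an FFT routine would be the practical choice for real data). The function names and the toy sequence are assumptions for illustration.

```python
import cmath

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a list of numbers."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def total_spectrum(seq):
    """Sum over the four bases of the squared DFT magnitude of each indicator sequence."""
    S = [0.0] * len(seq)
    for base in "ACGT":
        indicator = [1 if s == base else 0 for s in seq]
        for k, U in enumerate(dft(indicator)):
            S[k] += abs(U) ** 2
    return S

seq = "ATG" * 20  # toy sequence with an exact period of 3
N = len(seq)
S = total_spectrum(seq)
peak_k = max(range(1, N // 2 + 1), key=lambda k: S[k])  # skip the DC term k = 0
print(peak_k, N // 3)  # 20 20
```

For this perfectly periodic toy sequence all non-DC energy sits at k = N/3, which is the period-3 line that the later gene-prediction slides rely on.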
Spectral Envelope
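Consider the n × 4 matrix whose columns are the indicator sequences of the four bases,
$u = [\,u_A \;\; u_C \;\; u_G \;\; u_T\,]$
and the vector of real weights,
$w = (a, c, g, t)^{T}$
The sequence z = uw then corresponds to the mapping
A → a, C → c, G → g, T → t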
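The product z = uw can be written out directly. In this sketch the column order A, C, G, T follows the mapping listed above, and the weight values are hypothetical, chosen only so the result is easy to read.

```python
def indicator_matrix(seq, order="ACGT"):
    """n x 4 matrix: row i has a single 1 in the column of the base at position i."""
    return [[1 if s == base else 0 for base in order] for s in seq]

def apply_weights(u, w):
    """Matrix-vector product z = u w, written out with plain lists."""
    return [sum(u_ij * w_j for u_ij, w_j in zip(row, w)) for row in u]

w = [0.0, 1.0, 2.0, 3.0]  # hypothetical real weights (a, c, g, t)
u = indicator_matrix("GATTACA")
z = apply_weights(u, w)
print(z)  # [2.0, 0.0, 3.0, 3.0, 0.0, 1.0, 0.0]
```

Because each row of u contains a single 1, multiplying by w simply picks the weight of the base at that position.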
DNA walk
It is a graphical representation of a DNA sequence, termed a “fractal landscape” or “DNA walk”.
In a random walk model, a walker moves either up ( u(i) = +1) or down ( u(i) = −1) one unit length for each step i of the walk.
In an uncorrelated walk, the direction of each step is independent of the previous steps.
In a correlated random walk, the direction of each step depends on the history (“memory”) of the walker.
Cont.
The DNA walk is defined by the rule that the walker steps up ( u(i) = +1) if a pyrimidine (C or T) occurs at position i along the DNA chain, while the walker steps down ( u(i) = −1) if a purine (A or G) occurs at position i.
The degree of correlation in the base sequence can then be visualized directly by plotting the “net displacement” of the walker, y(l) = u(1) + u(2) + ... + u(l), after l steps.
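The walk rule translates into a running sum. The sketch below returns the net displacement after each step; the function name `dna_walk` is hypothetical.

```python
PYRIMIDINES = {"C", "T"}
PURINES = {"A", "G"}

def dna_walk(seq):
    """Net displacement after each step: +1 for a pyrimidine, -1 for a purine."""
    displacement = []
    position = 0
    for base in seq:
        if base in PYRIMIDINES:
            position += 1
        elif base in PURINES:
            position -= 1
        else:
            raise ValueError(f"unexpected symbol: {base!r}")
        displacement.append(position)
    return displacement

print(dna_walk("CTAGGA"))  # [1, 2, 1, 0, -1, -2]
```

Plotting the returned list against the step index gives the landscape described above.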
Gene Prediction
Characteristics of protein coding DNA regions:
Base sequences in the protein-coding regions of DNA molecules have a period-3 component because of the codon structure involved in the translation of base sequences into amino acids.
E.g., for eukaryotes (organisms whose cells have a nucleus), this periodicity has mostly been observed within the exons and not within the introns.
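One way to look for the period-3 component is to slide a window along the sequence and measure the spectral energy at frequency 1/3 in each window. The sketch below does this on a synthetic sequence; the window length, step, and toy "coding" segment (a repeated triplet between two random stretches) are assumptions for illustration and are far cleaner than real genomic data.

```python
import cmath
import random

def period3_power(window):
    """Sum over the four bases of |U_X[N/3]|^2 for one window (length divisible by 3)."""
    total = 0.0
    for base in "ACGT":
        U = sum(cmath.exp(-2j * cmath.pi * n / 3)
                for n, s in enumerate(window) if s == base)
        total += abs(U) ** 2
    return total

def sliding_period3(seq, window=90, step=30):
    """Return (start, power) pairs for windows slid along the sequence."""
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

rng = random.Random(0)  # fixed seed so the toy example is repeatable

def random_bases(n):
    return "".join(rng.choice("ACGT") for _ in range(n))

# Synthetic test: random flanks around a strictly period-3 segment at 300..599.
toy = random_bases(300) + "GCA" * 100 + random_bases(300)
profile = sliding_period3(toy)
peak_start, peak_power = max(profile, key=lambda item: item[1])
print(peak_start, round(peak_power))
```

Windows that lie inside the periodic segment give a much larger value than windows over the random flanks, which is the contrast exploited for locating exons.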
Cont.
Filtering:
A fragment of the DNA sequence is filtered with an IIR antinotch filter, a narrow band-pass filter centered on the period-3 frequency 2π/3.
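The slides do not give the filter's transfer function, so the following is only a sketch of one common second-order construction: take an allpass section A(z) and form H(z) = (1 − A(z))/2, which has unit gain at the chosen center frequency and a narrow passband when the pole radius R is close to 1. The value R = 0.992 and the function name are assumptions.

```python
import math

def antinotch_filter(x, R=0.992, w0=2 * math.pi / 3):
    """Second-order IIR antinotch: H(z) = b0 (1 - z^-2) / (1 + a1 z^-1 + a2 z^-2).

    Derived from H(z) = (1 - A(z)) / 2 with a second-order allpass A(z);
    the gain is exactly 1 at w0 and falls off quickly away from it.
    """
    a1 = -(1 + R * R) * math.cos(w0)
    a2 = R * R
    b0 = (1 - R * R) / 2
    y = []
    x1 = x2 = y1 = y2 = 0.0  # delayed input and output samples
    for xn in x:
        yn = b0 * (xn - x2) - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

# Indicator sequence of G in a strictly period-3 toy sequence: 1, 0, 0, 1, 0, 0, ...
seq = "GCA" * 1000
u_G = [1.0 if s == "G" else 0.0 for s in seq]
y = antinotch_filter(u_G)
print([round(v, 3) for v in y[-3:]])  # [0.667, -0.333, -0.333]
```

The indicator sequence equals 1/3 + (2/3)cos(2πn/3); the filter removes the constant term and passes the period-3 term, which is why the steady-state output cycles through 2/3, −1/3, −1/3.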
Cont.
DNA Spectrogram:
The appearance of a spectrogram provides significant information about a signal.
DNA spectrograms provide local frequency information for all four bases by displaying the three resulting magnitudes as a superposition of the corresponding three primary colors:
red for x, green for y, blue for z
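Identification of protein coding DNA region:
First, the DFTs of the indicator sequences of the different bases are calculated by the formula
$U_X[k] = \sum_{n=0}^{N-1} u_X[n]\, e^{-j 2\pi k n / N}, \quad X \in \{A, T, C, G\}$
where u_X[n] is the indicator sequence of base X. Writing A, T, C and G for these DFT values at k = N/3, we form:
W = aA + tT + cC + gG.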
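A minimal sketch of this computation: at k = N/3 the DFT kernel reduces to exp(−j2πn/3), so each of A, T, C, G is a sum of that phasor over the positions where the base occurs. The weights used below are an illustrative assumption, not the optimized values referred to in the slides.

```python
import cmath

def dft_at_one_third(seq, base):
    """DFT of the indicator sequence of `base`, evaluated at k = N/3."""
    return sum(cmath.exp(-2j * cmath.pi * n / 3)
               for n, s in enumerate(seq) if s == base)

def period3_statistic(seq, weights):
    """W = a*A + t*T + c*C + g*G, with A, T, C, G the DFT values at k = N/3."""
    return sum(weights[base] * dft_at_one_third(seq, base) for base in "ATCG")

# Hypothetical weights, for illustration only.
weights = {"A": 1 + 1j, "T": 1 - 1j, "C": -1 - 1j, "G": -1 + 1j}

W_periodic = period3_statistic("ATG" * 10, weights)
W_uniform = period3_statistic("A" * 30, weights)
print(abs(W_periodic) > 1.0, abs(W_uniform) < 1e-9)  # True True
```

A sequence with no period-3 structure in any base gives W close to zero, while a codon-like repeat gives a large magnitude whose phase depends on the reading frame.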
Color coding and color map approach
Since the number of primary colors is the same as the number of coding reading frames, a color-coding scheme is applied to Θ, the phase of W. In this,
the value Θ = 0° is assigned to color RED
the value Θ = 120° is assigned to color BLUE
the value Θ = −120° is assigned to color GREEN
Cont.
In-between values are color-coded in a linear manner, where the three axes labeled R, G, and B correspond to the primary colors red, green, and blue.
Cont.
In the color map, the intensity is modulated by the squared magnitude multiplied by 700 and clipped to the interval [0, 1].
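The color coding can be sketched as follows. Two assumptions are made here: "linear" is read as linear interpolation between the two neighboring primary colors, and the factor 700 is applied to |W|² as given, although the right scale depends on how W is normalized, which the slides do not state. The function names are hypothetical.

```python
import cmath
import math

# Primary colors as (R, G, B), placed at 0, 120 and 240 (= -120) degrees.
RED, BLUE, GREEN = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)
SECTORS = [RED, BLUE, GREEN]

def phase_to_rgb(theta_deg):
    """Interpolate linearly between the primaries on either side of the phase."""
    t = theta_deg % 360.0
    if t >= 360.0:  # guard against rounding for tiny negative angles
        t = 0.0
    sector = int(t // 120.0)
    f = (t - 120.0 * sector) / 120.0
    start, end = SECTORS[sector], SECTORS[(sector + 1) % 3]
    return tuple((1.0 - f) * p + f * q for p, q in zip(start, end))

def color_for(W, gain=700.0):
    """Hue from the phase of W, intensity from gain * |W|^2 clipped to [0, 1]."""
    intensity = min(1.0, max(0.0, gain * abs(W) ** 2))
    hue = phase_to_rgb(math.degrees(cmath.phase(W)))
    return tuple(intensity * channel for channel in hue)

print(phase_to_rgb(0.0))     # (1.0, 0.0, 0.0)
print(phase_to_rgb(-120.0))  # (0.0, 1.0, 0.0)
print(phase_to_rgb(60.0))    # (0.5, 0.0, 0.5)
```

Evaluating `color_for` on W for each sliding window position would give one colored pixel per position, so the reading frame shows up as the hue and the strength of the period-3 component as the brightness.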
Disadvantages
The obstacles involved include large amounts of data, the lack of a priori knowledge of the complete genome length, and the difficulty of recognizing nucleotide symbol identity with complete accuracy.
These impediments are typical of those encountered in standard telecommunications problems.
When Fourier transforms are applied after a symbolic-to-numeric mapping, the choice of mapping may either expose or hide some frequency information.
Furthermore, the ordering and arithmetic structure that result from the symbolic-to-numeric mapping may have no biochemical meaning.
Conclusion -
Signal processing-based computational and visual tools are meant to synergistically complement the character-string-domain tools that have been used successfully for many years by computer scientists.
The assignment of optimized, complex numerical values to nucleotides and amino acids provides a new computational framework, which may also result in new techniques for the solution of useful problems in bioinformatics, including sequence alignment, macromolecular structure analysis, and phylogeny.
A field of computer science, bioinformatics, has emerged, focusing on the use of computers for efficiently deriving, storing, and analyzing these character strings to help solve problems in molecular biology.
THANK YOU!!