Pharmacogenomics and Bioinformatics M. Saleet Jafri


Post on 04-Jan-2016

Page 1: Pharmacogenomics and Bioinformatics

Pharmacogenomics and Bioinformatics

M. Saleet Jafri

Page 2: Pharmacogenomics and Bioinformatics

What is pharmacogenomics?

• Pharmacogenomics is the use of genomic and sequence data of hosts and pathogens to identify potential drug targets

• Involves a variety of techniques/disciplines such as sequence analysis, protein structure, genomics, microarray analysis and others

• These fields rely heavily on bioinformatics

• Usually focuses on medical or agricultural applications

Page 3: Pharmacogenomics and Bioinformatics

Human Genome Project

Project goals are to

• identify all the approximately 20,000-25,000 genes in human DNA,

• determine the sequences of the 3 billion chemical base pairs that make up human DNA,

• store this information in databases,

• improve tools for data analysis,

• transfer related technologies to the private sector, and

• address the ethical, legal, and social issues (ELSI) that may arise from the project.

From http://www.ornl.gov/hgmis/

Page 4: Pharmacogenomics and Bioinformatics

Human Genome Project

Progress

- Several types of genome maps have already been completed, and a working draft of the entire human genome sequence was announced in June 2000, with analyses published in February 2001.

- An important feature of this project is the federal government's long-standing dedication to the transfer of technology to the private sector. By licensing technologies to private companies and awarding grants for innovative research, the project is catalyzing the multibillion-dollar U.S. biotechnology industry and fostering the development of new medical applications.

From http://www.ornl.gov/hgmis/

Page 5: Pharmacogenomics and Bioinformatics

Human Genome Project

• Seven organisms were originally chosen for sequencing.
– E. coli
– Yeast
– Fly
– Worm
– Arabidopsis
– Mouse
– Human

• Why were these chosen?

Page 6: Pharmacogenomics and Bioinformatics

Genome Projects

As of January 2005 there were many more sequenced

– 25 non-plant eukaryotes
– 5 plants
– 213 microbes completed
– 21 Archaea
– 274 microbes in progress
– 1431 viruses in progress
– 833 non-virus organisms with at least one nucleotide sequence submitted

• Why were these chosen?

Page 7: Pharmacogenomics and Bioinformatics

Genome Projects

• Chosen by funding agencies

• Four main categories
– Medical applications
– Evolutionary significance
– Environmental impact
– Food production

Page 8: Pharmacogenomics and Bioinformatics

How are genomics used for drug target identification?

• The basic idea is to look for genes unique to the pathogen that are crucial for its survival. This would be the drug target.

• If the pathogen lives in a host, the gene should be present in the pathogen and absent from the host.

• If the pathogen lives in the environment, the gene should be as specific as possible to the pathogen, to avoid harming other organisms that might be beneficial.

Page 9: Pharmacogenomics and Bioinformatics

How can this be done?

• Doing this involves genomics, proteomics and bioinformatics.

• In all of these cases, bioinformatics tools are necessary.

Page 10: Pharmacogenomics and Bioinformatics

Genome Sequencing and Comparison

• As mentioned earlier, many pathogens (viruses, bacteria, and other microorganisms) have been sequenced.

• Once they are sequenced, they are annotated. Annotation is the process by which the functions of the different proteins (genes) are determined.

• In this way, an understanding of the organism's metabolism is gained.

Page 11: Pharmacogenomics and Bioinformatics

Malaria

• Malaria is caused by the genus Plasmodium, with Plasmodium falciparum being the most lethal.

• Its genome has been sequenced

• It is a pathogen that digests proteins for food. It does not contain any amino acid producing genes in its genome, i.e. it does not make its own amino acids.

• Purines are recycled, but there are no genes for purine synthesis.

• Has many ATP-dependent solute transporters and one novel multifunctional transporter.

Page 12: Pharmacogenomics and Bioinformatics

How is annotation done?

• Annotation is the process of predicting the function of genes in a genome.

• First, all the genes have to be found. This is done by finding the open reading frames (ORFs).

• This is done by gene finding or gene prediction software.

Page 13: Pharmacogenomics and Bioinformatics

Gene Prediction

• Analysis by sequence similarity can only reliably identify about 30% of the protein-coding genes in a genome

• 50-80% of new genes identified have a partial, marginal, or unidentified homolog

• Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes

Page 14: Pharmacogenomics and Bioinformatics

Gene Finding

• Process of identifying potential coding regions in an uncharacterized region of the genome

• Still a subject of active research

• There are many different gene finding software packages and no one program is capable of finding everything

Page 15: Pharmacogenomics and Bioinformatics
Page 16: Pharmacogenomics and Bioinformatics

Eukaryotes vs Prokaryotes

• Eukaryotic DNA is wrapped around histones, which might result in repeated patterns (periodicity of 10) for histone binding. The promoter regions might be near these sites so that they remain hidden.

• Prokaryotes have no introns.

• Promoter regions and start sites are more highly conserved in prokaryotes

• Different codon use frequencies

Page 17: Pharmacogenomics and Bioinformatics

Gene finding is species-specific

• Codon usage patterns vary by species

• Functional regions (promoters, splice sites, translation initiation sites, termination signals) vary by species

• Common repeat sequences are species-specific

• Gene finding programs rely on this information to identify coding regions

Page 18: Pharmacogenomics and Bioinformatics

The genetic code

Page 19: Pharmacogenomics and Bioinformatics

Codon usage

Page 20: Pharmacogenomics and Bioinformatics

Identifying ORFs

• Simple first step in gene finding

• Translate genomic sequence in six frames. Identify stop codons in each frame

• Regions without stop codons are called "open reading frames" or ORFs

• Locate and tag all of the likely ORFs in a sequence

• The longest ORF from a Met codon is a good prediction of a protein encoding sequence.

• SOFTWARE: NCBI ORF Finder
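
The ORF identification steps above can be sketched in a few lines of Python. This is only an illustration and not how NCBI ORF Finder is implemented: the function names, the `min_codons` cutoff and the use of the standard stop codons TAA/TAG/TGA are assumptions made for the example. It reads both strands in three frames each, starts an ORF at the first ATG (Met) after the previous stop codon, and closes it at the next in-frame stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumed)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for ATG...stop stretches.

    Returns tuples (strand, frame, start, end, orf_sequence). Coordinates are
    0-based and refer to the strand being read; min_codons is a hypothetical
    length cutoff (codons before the stop).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first Met after the last stop
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs
```

Because each ORF begins at the first ATG following a stop codon, the reported stretch is the longest ORF from a Met codon in that region, which matches the heuristic on the slide.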

Page 21: Pharmacogenomics and Bioinformatics

ORF Finder input

Page 22: Pharmacogenomics and Bioinformatics

ORF finder results

Page 23: Pharmacogenomics and Bioinformatics

Tests of the Predicted ORF

• Check if the third base in the codons tends to be the same one more often than by chance alone.

• Are the codons used in the ORF the same as those used in other genes (need codon usage frequency).

• Compare the amino acid sequence for similarity with other known amino acid sequences.
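
The first two tests can be illustrated with simple counting. The sketch below tabulates codon counts and the base composition at the third codon position of a candidate ORF; the function names are hypothetical, and judging whether the counts differ from chance or from a reference codon usage table would need a statistical comparison that is not shown here.

```python
from collections import Counter


def codon_counts(orf):
    """Count each codon in an ORF, ignoring any trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def third_base_frequencies(orf):
    """Fraction of A, C, G and T at the third position of each codon."""
    counts = Counter(orf[i] for i in range(2, len(orf), 3))
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}
```

A strongly skewed third-position composition, or codon counts that resemble those of known genes from the same species, would support the prediction.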

Page 24: Pharmacogenomics and Bioinformatics

Problems with ORF finding

• A single-character sequencing error can hide a stop codon or insert a false stop codon, preventing accurate identification of ORFs

• Short exons can be overlooked

• Multiple transcripts or ORFs on complementary strand can confuse results

Page 25: Pharmacogenomics and Bioinformatics

Pattern-based gene finding

• ORF finding based on start and stop codon frequency is a pattern-based procedure

• Other pattern-based procedures recognize characteristic sequences associated with known features and genes, such as ribosome binding sites, promoter sites, histone binding sites, etc.

• Statistically based.

Page 26: Pharmacogenomics and Bioinformatics

Content-based gene finding

• Content-based gene finding methods rely on statistical information derived from known sequences to predict unknown genes

• Some evaluative measures include: "coding potential" (based on codon bias), periodicity in the sequence, sequence homogeneity, etc.

Page 27: Pharmacogenomics and Bioinformatics

A standard content-based alignment procedure

• Select a window of DNA sequence from the unknown. The window is usually around 100 base pairs long

• Evaluate the window's potential as a gene, based on a variety of factors

• Move the window over by one base

• Repeat procedure until end of sequence is reached; report continuous high-scoring regions as putative genes
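
The window procedure above might look like the following sketch. The scoring function is a stand-in: GC fraction is used only because it is easy to compute, whereas a real gene finder would score coding potential, periodicity and similar measures. The names `scan_windows` and `gc_fraction` and the threshold value are assumptions for illustration.

```python
def gc_fraction(window):
    """Placeholder score: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


def scan_windows(seq, window=100, threshold=0.5, score=gc_fraction):
    """Slide a window one base at a time and report continuous high-scoring regions.

    Returns (start, end) pairs, 0-based with end exclusive, each covering a run
    of consecutive windows whose score is at least the threshold.
    """
    regions = []
    run_start = None
    for start in range(len(seq) - window + 1):
        high = score(seq[start:start + window]) >= threshold
        if high and run_start is None:
            run_start = start                      # a high-scoring run begins
        elif not high and run_start is not None:
            # the previous window (start - 1) was the last high-scoring one
            regions.append((run_start, start - 1 + window))
            run_start = None
    if run_start is not None:                      # run extends to the sequence end
        regions.append((run_start, len(seq)))
    return regions
```

Any other scoring function that takes a window string and returns a number can be passed as `score`.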

Page 28: Pharmacogenomics and Bioinformatics

Combining measures

• Programs rarely use one measure to predict genes

• Different values are combined (using probabilistic methods, discriminant analysis, neural net methods, etc.) to produce one "score" for the entire window

Page 29: Pharmacogenomics and Bioinformatics

Drawbacks to window-based evaluation

• A sequence length of at least 100 b.p. is required before significant information can be gained from the analysis

• Results in a +/- 100 b.p. uncertainty in the start site of predicted coding regions, unless an unambiguous pattern can also be found to indicate the start.

Page 30: Pharmacogenomics and Bioinformatics

Most are web-based, but...

• Submit sequence; input sequence length may be limited

• Select parameters, if any

• Interpret results

• Most software is first or second generation; results come in non-graphical formats.

• GeneMark, GenScan, Glimmer

Page 31: Pharmacogenomics and Bioinformatics

How is annotation done?

• This is done by comparing the DNA sequences of the genes to known genes in a database. If the sequences are similar, then a similar function is assumed.

• The comparison is done using sequence comparison tools such as BLAST

Page 32: Pharmacogenomics and Bioinformatics

Database Searching for Similar Sequences

• Database searching for similar sequences is ubiquitous in bioinformatics.

• Databases are large and getting larger• Need fast methods

Page 33: Pharmacogenomics and Bioinformatics

Types of Searches

• Sequence similarity search with query sequence

• Alignment search with profile (scoring matrix with gap penalties)

• Search with position-specific scoring matrix representing ungapped sequence alignment

• Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search

• Search query sequence for patterns representative of protein families

From Bioinformatics by Mount

Page 34: Pharmacogenomics and Bioinformatics

DNA vs Protein Searches

• DNA sequences consist of 4 characters (nucleotides)

• Protein sequences consist of 20 characters (amino acids)

• Hence, it is easier to detect patterns in protein sequences than DNA sequences

• Better to convert DNA sequences to protein sequences for searches.

Page 35: Pharmacogenomics and Bioinformatics

Database Searching Efficacy

• To evaluate searching methods, selectivity and sensitivity need to be considered.

• Selectivity is the ability of the method not to find members known to be of another group (i.e. false positives).

• Sensitivity is the ability of the method to find members of the same protein family as the query sequence.
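
A small sketch of how these two measures might be computed for one search. The definitions used here are assumptions: sensitivity is taken as the fraction of true family members that the search found, and selectivity as the fraction of reported hits that really belong to the family (so false positives lower it). Other texts define selectivity differently, so treat the function as illustrative.

```python
def search_performance(hits, family):
    """Return (sensitivity, selectivity) for a set of reported hits.

    hits   : identifiers returned by the search
    family : identifiers known to belong to the query's protein family
    """
    hits, family = set(hits), set(family)
    true_positives = len(hits & family)
    sensitivity = true_positives / len(family) if family else 0.0
    selectivity = true_positives / len(hits) if hits else 0.0
    return sensitivity, selectivity
```

For example, a search that returns three sequences, two of which belong to a four-member family, has sensitivity 0.5 and selectivity of about 0.67 under these definitions.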

Page 36: Pharmacogenomics and Bioinformatics

Protein Searches

• Easier to identify protein families by sequence similarity rather than structural similarity. (same structure does not mean same sequence)

• Use the appropriate gap penalty scorings

• Evaluate results for statistical significance.

Page 37: Pharmacogenomics and Bioinformatics

History

• Historically dynamic programming was used for database sequence similarity searching.

• Computer memory, disk space, and CPU speed were limiting factors.

• Speed is still a factor due to the larger databases and increased number of searches.

• FASTA and BLAST allow fast searching.

Page 38: Pharmacogenomics and Bioinformatics

History

• The PAM250 matrix was used for a long time. It corresponds to a period of time where only 20% of the amino acids have remained unchanged.

• BLOSUM has replaced PAM250 in most applications. BLAST uses the BLOSUM62 matrix. FASTA uses the BLOSUM50 matrix.

Page 39: Pharmacogenomics and Bioinformatics

Search Tools

• Similarity Search Tools
– Smith-Waterman Searching

• Heuristic Search Tools
– FASTA
– BLAST
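
To make the dynamic programming approach concrete, here is a minimal Smith-Waterman local alignment sketch. It uses a simple match/mismatch score and a linear gap penalty, which are assumptions chosen for brevity; real protein searches would use a substitution matrix such as BLOSUM62 and usually affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]          # scoring matrix, first row/column zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The matrix has (len(a)+1) × (len(b)+1) cells, so comparing a query against every database sequence this way is slow, which is why heuristic tools such as FASTA and BLAST are used for large searches.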

Page 40: Pharmacogenomics and Bioinformatics

Malaria Vaccine

• A German and American team used reverse genetics, i.e. they used the sequenced genome, deduced the candidate genes, and then knocked out a particular gene (Uis3).

• This gives 30-day immunity in mice, which is better than vaccines made by traditional methods

Page 41: Pharmacogenomics and Bioinformatics

Microarray Data Analysis

Gene chips allow the simultaneous monitoring of the expression level of thousands of genes. Many statistical and computational methods are used to analyze this data. These include:
– statistical hypothesis tests for differential expression analysis
– principal component analysis and other methods for visualizing high-dimensional microarray data
– cluster analysis for grouping together genes or samples with similar expression patterns
– hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (diseased, i.e. cancerous, vs. normal)

Page 42: Pharmacogenomics and Bioinformatics

What is Microarray Data?

In spite of the ability to simultaneously monitor the expression of thousands of genes, there are some liabilities with microarray data. Each microarray is very expensive, the statistical reproducibility of the data is relatively poor, and there are a lot of genes and complex interactions in the genome.

 

Microarray data is often arranged in an n x m matrix M with rows for the n genes and columns for the m biological samples in which gene expression has been monitored. Hence, m_ij is the expression level of gene i in sample j. A row e_i is the gene expression pattern of gene i over all the samples. A column s_j is the expression level of all genes in a sample j and is called the sample expression pattern.

Page 43: Pharmacogenomics and Bioinformatics

Types of Microarrays

• cDNA microarray

• Nylon membrane and plastic arrays (by Clontech)

• Oligonucleotide silicon chips (by Affymetrix)

• Note: Each new version of a microarray chip is at least slightly different from the previous version. This means that the measures are likely to change. This has to be taken into account when analyzing data.

Page 44: Pharmacogenomics and Bioinformatics

cDNA Microarray

• The expression level e_ij of a gene i in sample j is expressed as a log ratio, log(r_ij/g_i), of its actual expression level r_ij in this sample over its expression level g_i in a control.

• When this data is visualized, e_ij is color coded as red (r_ij >> g_i), green (r_ij << g_i), or a mixture of the two in between.
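
The log ratio can be computed directly from the raw values. The sketch below assumes base-2 logarithms (the slide does not state a base) and strictly positive intensities; the function name and argument layout are hypothetical.

```python
import math


def log_ratios(samples, controls):
    """Compute e_ij = log2(r_ij / g_i).

    samples[i][j] is r_ij, the expression of gene i in sample j;
    controls[i] is g_i, the expression of gene i in the control.
    All values are assumed to be strictly positive.
    """
    return [[math.log2(r / g) for r in row] for row, g in zip(samples, controls)]
```

With base 2, a value of 1 means a two-fold increase over the control, -1 a two-fold decrease, and 0 no change.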

Page 45: Pharmacogenomics and Bioinformatics

Nylon Membrane and Plastic Arrays (by Clontech)

• A raw intensity and a background value are measured for each gene.

• The analyst is free to choose the raw intensity or can adjust it by subtracting the background intensity.

Page 46: Pharmacogenomics and Bioinformatics

Oligonucleotide Silicon Chips (by Affymetrix)

• These arrays produce a variety of numbers derived from 16-20 pairs of perfect match (PM) and mismatch (MM) probes.

• There are several statistics related to gene expression that can be derived from this data. The most commonly used one is the average difference (AVD), which is derived from the differences of PM-MM in the 16-20 probe pairs.

• The next most commonly used method is the log absolute value (LAV), which comes from the ratios PM/MM in the probe pairs.

• Note: The Affymetrix gene-chip software has an absent/present call for each gene on a chip. According to Jagota, the method is complex and arbitrary so they usually ignore it.
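
As a rough illustration of the average difference, the sketch below takes the plain mean of PM-MM over the probe pairs of one gene. This is a simplification: the vendor's software may apply additional steps (for example, handling of outlier probe pairs) that are not reproduced here, and the function name is hypothetical.

```python
def average_difference(pm, mm):
    """Plain mean of PM - MM over the probe pairs of one gene.

    pm and mm are equal-length, non-empty sequences of perfect match and
    mismatch intensities.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)
```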

Page 47: Pharmacogenomics and Bioinformatics

For What Do We Use Microarray Data?

• Genes with similar expression patterns over all samples – We can compare the expression patterns e_i and e_i' of two genes i and i' over all samples.

• If we use cluster analysis, we can separate the genes into groups of genes with similar expression patterns (trees).

• This will allow us to find what unknown genes have altered expression in a particular disease by comparing the pattern to genes known to be affiliated with a disease.

• It can also find genes that fit a certain pattern such as a particular pattern of change with time.

• It can also characterize broad functional classes of new genes from the known classes of genes with similar expression.

Page 48: Pharmacogenomics and Bioinformatics

For What Do We Use Microarray Data?

• Genes with unusual expression levels in a sample – In contrast to standard statistical methods where we ignore outliers, here outliers might have particular importance. Hence, we look for genes whose expression levels are very different from the others.

• Genes whose expression levels vary across samples – We can compare gene expression levels of a particular gene or set of genes in different samples. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.

Page 49: Pharmacogenomics and Bioinformatics

For What Do We Use Microarray Data?

• Samples that have similar expression patterns – We might want to compare the expression patterns of all genes between two samples. We might cluster the genes into groups of genes with similar expression patterns to help with the comparison. This can be used to compare normal and diseased tissues or diseased tissue before and after treatment.

• Tissues that might be cancerous (diseased) – We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or not diseased tissue.

Page 50: Pharmacogenomics and Bioinformatics

Statistical Methods Can Help

• Experimental Design – Since using microarrays is costly and time consuming, we want to design experiments to use the minimal number of microarrays that will give a statistically significant result.

• Data Pre-processing – It is sometimes useful to preprocess the data prior to visualization. An example of this is the log ratio mentioned earlier. It is often necessary to rescale data from different microarrays so that they can be compared. This is due to chip-to-chip variation in intensity. Another type of preprocessing is subtracting the mean and dividing by the standard deviation.
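
The last preprocessing step mentioned above can be sketched as follows. This version subtracts the mean and divides by the population standard deviation, the usual convention for standardizing an expression pattern; the function name and the choice to return zeros for a constant pattern are assumptions.

```python
import statistics


def standardize(pattern):
    """Zero-center an expression pattern and scale it to unit standard deviation."""
    mean = statistics.mean(pattern)
    sd = statistics.pstdev(pattern)
    if sd == 0:
        # a constant pattern carries no variation to scale
        return [0.0] * len(pattern)
    return [(x - mean) / sd for x in pattern]
```

After this transformation, patterns measured on chips with different overall intensity are on a comparable scale.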

Page 51: Pharmacogenomics and Bioinformatics

Statistical Methods Can Help

• Data Visualization – Principal component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two and three dimensions. This allows us to visualize it.

• Cluster Analysis – By associating genes with similar expression patterns, we might be able to draw conclusions about their functional expression.

• Probability Theory – We can use statistical modeling and inference to analyze our data. Probability theory is the basis for these.

Page 52: Pharmacogenomics and Bioinformatics

Statistical Methods Can Help

• Statistical Inference – This is the formulation and statistical testing of a hypothesis and alternative hypothesis.

• Classifiers for the Data – We can construct classes from data, such as diseased vs. non-diseased tissue. We can build a model (such as a hidden Markov model) that fits known data for the different classes. This can then be used to classify previously unclassified data.

Page 53: Pharmacogenomics and Bioinformatics

Preprocessing Microarray Data

• Before microarray data can be analyzed or stored, a number of procedures or transformations must be applied to it.

• In order to analyze the data correctly, it is important to understand what the transformations might be doing to the data.

Page 54: Pharmacogenomics and Bioinformatics

Preprocessing Microarray Data

• Ratioing the data

• Log-transforming ratioed data

• Alternative to ratioing the data

• Differencing the data

• Scaling data across chips to account for chip-to-chip difference

• Zero-centering a gene or a sample expression pattern

• Weighting the components of a gene or sample expression pattern differently

• Handling missing data

• Variation filtering expression patterns

• Discretizing expression data

Page 55: Pharmacogenomics and Bioinformatics

Cluster Analysis of Microarray Data

• Recall that microarray data can be thought of as gene expression patterns or sample expression patterns. Each of these can be considered to be a vector. The first thing we have to do before applying cluster analysis is to find a distance between the various expression pattern vectors. This is done using similarity/dissimilarity measures such as Euclidean distance, Mahalanobis distance, or linear correlation coefficients. Once a distance matrix is computed, the following clustering algorithms can be used. The clusters formed can differ significantly depending upon the distance measure used.
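
Two of the measures named above are easy to write out. The sketch below gives Euclidean distance and the Pearson (linear) correlation coefficient for two expression pattern vectors; the function names are hypothetical, and the Mahalanobis distance is omitted because it needs a covariance matrix.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length expression patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Linear correlation coefficient; assumes neither pattern is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)
```

The patterns [1, 2, 3] and [10, 20, 30] are far apart in Euclidean terms but perfectly correlated, which shows why the choice of measure can change the clusters.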

Page 56: Pharmacogenomics and Bioinformatics

Cluster Analysis of Microarray Data

• Hierarchical Clustering
– Assume each data point is in a singleton cluster.
– Find the two clusters that are closest together. Combine these to form a new cluster.
– Compute the distance from all clusters to the new cluster using some form of averaging.
– Find the two closest clusters and repeat.
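
The steps above correspond to agglomerative clustering. The following sketch uses average linkage with Euclidean distance as the "form of averaging"; both choices, the function name and the `n_clusters` stopping rule are assumptions. It recomputes distances from scratch on each pass, which is simple but slow for large data sets.

```python
import math
from itertools import combinations


def average_linkage(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges): clusters is a list of lists of point indices,
    merges records the pair of clusters joined at each step.
    """
    clusters = [[i] for i in range(len(points))]   # start with singleton clusters
    merges = []

    def dist(c1, c2):
        # average pairwise Euclidean distance between members of two clusters
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(1, n_clusters):
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges
```

Running the loop down to a single cluster and reading `merges` in order gives the tree (dendrogram) structure.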

Page 57: Pharmacogenomics and Bioinformatics

Cluster Analysis of Microarray Data

• k-Means Clustering – An alternative method of clustering, called k-means clustering, partitions the data into k clusters and finds a cluster mean μ_i for each cluster. In our case, the means will also be vectors. Usually, the number of clusters k is fixed in advance. To choose k, something must be known about the data. There might be a range of possible k values. To decide which is best, one optimizes a quantity that measures cluster tightness, i.e. minimizes the distances between points in a cluster.
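
A minimal sketch of k-means, using the common iterative scheme of alternating assignment and mean update. Taking the first k points as initial means is an assumption made to keep the example deterministic (random initialization is more usual), and `within_cluster_ss` is one possible tightness measure for comparing candidate values of k.

```python
import math


def k_means(points, k, iterations=100):
    """Partition points into k clusters; assumes 1 <= k <= len(points).

    Returns (centers, labels) where labels[i] is the cluster index of points[i].
    """
    centers = [tuple(p) for p in points[:k]]       # simplistic initialization
    labels = None
    for _ in range(iterations):
        # assignment step: each point goes to its nearest mean
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break                                   # assignments stable: converged
        labels = new_labels
        # update step: each mean becomes the centroid of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels


def within_cluster_ss(points, centers, labels):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels))
```

Smaller values of `within_cluster_ss` indicate tighter clusters, but the value always decreases as k grows, so it should be compared across a sensible range of k rather than minimized blindly.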

Page 58: Pharmacogenomics and Bioinformatics

Cluster Analysis of Microarray Data

• Self-organizing Maps – This is basically an application of neural networks to microarray data. Assume that there is a 2-dimensional grid of cells and a map from a given set of expression data vectors in R^n, i.e. there are n nodes in the input layer and a connection from each of these to each cell. Each cell (i, j) gets its own weight from each of the n input neurons. The weight vector m_ij is the mean of the cluster associated with cell (i, j). Each data vector d gets mapped to the cell (i, j) whose weight vector is closest to d using Euclidean distance. In order to train the network, the mean vectors m_ij for the cells (i, j) must be learned.

Page 59: Pharmacogenomics and Bioinformatics

Sample Microarray

Page 60: Pharmacogenomics and Bioinformatics

Correlations

Page 61: Pharmacogenomics and Bioinformatics

Clustering of Genes

Page 62: Pharmacogenomics and Bioinformatics

Personalized Medicine

• There is a new buzzword called personalized medicine.

• The idea is to develop medicine and a treatment plan based on an individual's genetic make-up.

Page 63: Pharmacogenomics and Bioinformatics

Proteomics

• Understanding protein function

• Functional genomics

• Multiple approaches – structure, expression levels, biochemistry, modeling etc.

• Combining technologies is necessary to understand in vivo protein function

Page 64: Pharmacogenomics and Bioinformatics

Approach

• Use data to determine pathway.

• Use biochemistry to figure out kinetics and concentrations.

• Use new proteomic approaches to determine relative concentrations.

• Apply pathway model to determine functional consequence.

Page 65: Pharmacogenomics and Bioinformatics

Pathway Data

• Using molecular biological techniques we can determine what proteins make up a biochemical pathway.

[Figure: pathway diagram with components A, B, C and D]

Page 66: Pharmacogenomics and Bioinformatics

Pathways

• Biochemical Pathways form complex biochemical reaction networks.

• There might be multiple ways to get from A to B.

• The path chosen depends on biochemical kinetics.

Page 67: Pharmacogenomics and Bioinformatics

Biochemistry

• Classical biochemistry isolates proteins from tissue or cells.

• Modern molecular biology allows the production of purified protein.

• The concentration of the protein is determined

• The kinetic properties of the proteins are determined by biochemical assay – rates of reactions, modulating factors, etc.

Page 68: Pharmacogenomics and Bioinformatics

Pathway Modeling Methods

• Boolean Models

• Metabolic Control Theory – Flux Balance Analysis

• Biochemical Systems Analysis

• Kinetic Modeling Approach
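
As a small illustration of the kinetic modeling approach, the sketch below simulates a hypothetical two-step pathway A → B → C with first-order mass-action kinetics, integrated with a simple forward Euler scheme. The pathway, the rate constants k1 and k2 and the step size are all assumptions chosen for the example; real pathway models use measured kinetics and usually a more robust ODE solver.

```python
def simulate_chain(a0, k1, k2, dt=0.001, t_end=10.0):
    """Integrate dA/dt = -k1*A, dB/dt = k1*A - k2*B, dC/dt = k2*B.

    Starts from A = a0, B = C = 0 and returns the concentrations (A, B, C)
    at time t_end using forward Euler steps of size dt.
    """
    a, b, c = a0, 0.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        v1 = k1 * a                # flux through the first reaction
        v2 = k2 * b                # flux through the second reaction
        a -= v1 * dt
        b += (v1 - v2) * dt
        c += v2 * dt
    return a, b, c
```

Changing k1 and k2 shifts how much intermediate B accumulates, which is the kind of functional consequence a kinetic model is used to explore.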

Page 69: Pharmacogenomics and Bioinformatics

Disorders of Thrombophilia

• The functional consequences of nonsynonymous SNPs can be predicted by comparison of protein structures.

• There are various SNPs known
– Activated protein C resistance by Arg 506 to Gln
– Prothrombin polymorphism (G20210A) causing elevated prothrombin levels
– Protein C deficiency
– Protein S deficiency
– Antithrombin deficiency
– Elevated factor VIII levels

Page 70: Pharmacogenomics and Bioinformatics

Fibrinogen Abnormalities

• Various polymorphisms found in the long arm of chromosome 4

• Two dimorphisms of the β-chain gene are of major importance and in linkage disequilibrium with each other.

• These affect plasma fibrinogen levels

Page 71: Pharmacogenomics and Bioinformatics

Prothrombin G20210A Polymorphism

• Replacement of a G by A at nucleotide 20210 in the untranslated section of the prothrombin gene increases translation without altering transcription of the gene.

• This results in elevated synthesis and secretion of prothrombin by the liver.

• This results in increased thrombin levels

Page 72: Pharmacogenomics and Bioinformatics

Activated protein C resistance

• Factor V Leiden R506Q mutation occurs in 8% of the population.

• It is a G→A substitution at nucleotide 1691 in the gene for factor V.

• Factor V is cleaved less efficiently by activated protein C

• Results in deep vein thrombosis, early kidney transplant loss, recurrent miscarriages and other disorders