Bioc 300: Bioinformatics

www.geneticsplace.com

Goals of the Course

• Understand Methods and Research Questions
• Analyze Real Data
• Engage in a Realistic Learning Environment
• Utilize Online Databases
• Appreciate Complexity of Research Systems
• Integrate Different Types of Information
• Reconsider Cells as Intracellular Ecosystems
• Integrate Bioinformatics with Biology

What is bioinformatics?

"Bioinformatics is the term coined for the new field that merges biology, computer science, and information technology to manage and analyze the data, with the ultimate goal of understanding and modeling living systems." Genomics and Its Impact on Medicine and Society - A 2001 Primer U.S. Department of Energy Human Genome Program

Bioinformatics also represents a paradigm shift for molecular biology. Instead of taking a reductionist approach, the sub-disciplines of bioinformatics are more expansionist: they attempt to study the entire complement of a particular cellular molecule or process.

The “omics” revolution

Genomics: The study of the entire DNA complement of an organism

Genome Sequence Information

Applied Research

Basic Research

•Acquiring Sequence•Human Genome Draft•Evolution

•Identification of Biological Unknowns•Biomedical Research

Genomic Variations

Ethics

Human Variations

Ecology•Tracking Ivory Sales•Diatoms and Global Warming

•SNPs•Disease Analysis

•GMO’s•Genetic Testing

DNA Microarrays

Applied Research

Basic Research•Introduction to Method•Data Analysis

•Cancer •Pharmacogenomics

The “omics” revolution

Genomics: The study of the entire DNA complement of an organism

Proteomics: The study of the entire set of proteins in a particular cell type

Proteomics

Identification and Quantification

Protein-Protein Interactions

Cellular Roles

permission from Stan Fields

permission from Benno Schwikowski

The “omics” revolution

Genomics: The study of the entire DNA complement of an organism

Proteomics: The study of the entire set of proteins in a particular cell type

Transcriptomics: The study of all mRNA transcripts in a particular cell type

Metabolomics: The study of all metabolites in a particular cell type

Glycomics: The study of all polysaccharides in a particular cell type

Variomics: The study of all possible drug targets in a particular cell type

Genomic Circuits

Integrated Circuits

Toggle Switches

Single Gene Circuit

www.bio.davidson.edu/courses/genomics/circuits.html

Sequencing of Whole Genomes

Three Phases of Genome Sequencing:

1) Preliminary sequencing
2) Finishing
3) Annotating

Preliminary sequencing 1970’s

Maxam-Gilbert sequencing (chemical cleavage)

Sanger sequencing (dideoxy method)

You could sequence 100’s of bases per day!

Autorad

Genomics “took off” with automated sequencing 1990’s

Leroy Hood made modifications to dideoxy sequencing:

ddNTPs were coupled to fluorescent dyes (instead of radioactivity)

DNA fragments were separated via capillary gel electrophoresis

The sequence was read by lasers, and the data were recorded directly into a computer

Now, instead of an autorad, we have a:

Chromat!

The newest DNA sequencers can determine millions of bases of sequence in a day!

The increasing ease of obtaining sequence data has led to exponential growth of GenBank, the main repository of sequence data, which is housed at the National Library of Medicine at NIH.

Growth of Genbank

Sequencing Entire Organisms

Before the 1990’s, sequencing was somewhat haphazard. Depending on the researcher, different pieces of different organisms’ genomes had been sequenced. No concerted effort had been made to sequence the entire genome of an organism.

HUGO changed all of that. Its mission was to sequence the human genome, as well as the genomes of a number of model organisms. While small genomes could be sequenced directly, larger genomes were first mapped out.

Mapping large genomes

Sequencers needed some reference sequences to know what part of a genome they were dealing with.

STSs - sequence tagged sites
These are defined by a pair of PCR primers that amplify only one segment of a genome (i.e., a unique sequence).

ESTs - expressed sequence tags
These are short sequences of cDNA that indicate where genes are located within the genome.

Now genomes could be cut into pieces, sequenced, and the pieces reassembled.

Cutting up genomes

Vectors designed to carry large pieces of DNA include:

BACs - bacterial artificial chromosomes - can carry about 150 kb of insert

YACs - yeast artificial chromosomes - can carry up to 1.5 Mb of insert

BACs or YACs containing overlapping DNA can be assembled into contiguous overlapping fragments.

“Shotgun” sequencing

While HUGO was busy mapping large genomes and sequencing some small genomes, Craig Venter founded TIGR. TIGR took a completely different approach. Instead of mapping a genome, they simply cut it into thousands of pieces, sequenced the pieces, and reassembled the data using overlapping fragments. It was TIGR, not HUGO, that produced the world’s first completed genome in 1995: H. influenzae.
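The "cut, sequence, reassemble by overlap" idea can be illustrated with a toy example. The sketch below is a simple greedy merge written for illustration only. It is not the assembler TIGR used, and the function names, the `min_len` threshold, and the example reads are all hypothetical. It also ignores sequencing errors, reverse complements, and repeats, which real assemblers must handle.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases.
    Each string left at the end is one contig; more than one contig
    means a gap remains between them.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up overlapping reads collapse into a single contig.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

If one of the reads shares no overlap with the others, the function returns two contigs, which corresponds to a gap in the assembled sequence.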

Finishing a Genomic Sequence

A “finished” sequence is defined as one that contains no more than 1 error in 10,000 bases.

Finishing a sequence involves aligning a number of preliminary sequences and correcting any inconsistencies.

Overlapping segments are combined into larger assemblies of contiguous DNA (contigs).

If contigs do not overlap, a gap remains in the sequence.

Finishing continued

The human “draft” sequence, published in 2001, contained 147,821 gaps.

The “finished” sequence, published in 2004, contained 341 gaps.

A gap usually contains highly repetitive DNA that complicates attempts to clone and sequence it.

Finishing is a very expensive process, so many genomes have not been finished.

Annotating Genomes

Annotation involves the identification of functionally important sections of a genome.

This includes, but is not limited to, making an educated guess about what kind of protein is encoded by a given coding sequence.

Annotation is performed using various computer programs.

Locating genes within a genome

The process is different in prokaryotes vs. eukaryotes

• Prokaryotes contain ORFs with no introns and very little intergenic sequence.
• Eukaryotes contain introns, complex promoters, and enhancers.
• Introns range between 70 and 30,000 bp.
• One eukaryotic gene can encode more than one protein via alternate splicing mechanisms.
• Eukaryotes also contain pseudogenes, ORFs which have been rendered nonfunctional by mutation.
• Mammalian genomes contain about 23% pseudogenes.
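Because prokaryotic ORFs have no introns, a first-pass gene search can be as simple as scanning for a start codon followed in frame by a stop codon. The sketch below assumes that simplest case. It scans only the forward strand, treats ATG as the only start codon, and uses a hypothetical `min_codons` cutoff. It is not how GeneMark, GenScan, or Glimmer work, and it would not be adequate for eukaryotic genes with introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Find ATG...stop stretches in the three forward reading frames.

    Returns (start, end, orf_sequence) tuples. `start` is 0-based, `end`
    is exclusive, and the ORF sequence includes the stop codon.
    `min_codons` counts every codon in the ORF, stop codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, seq[start:pos + 3]))
                start = None                     # look for the next ATG
    return orfs


# A made-up sequence with one short ORF in the third reading frame.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```

A fuller version would also scan the reverse complement, giving six reading frames in total.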

Tools for gene hunting

• GeneMark - originally created for prokaryotes but adapted for some model eukaryotes
• GenScan - accepts up to 1 million bp of sequence online, more if downloaded
• Glimmer & GlimmerM - developed by TIGR, accepts up to 200 kb online, more if downloaded

Once a genome is annotated…

• One can use a genome browser to locate specific loci on specific chromosomes
• One can then use resources such as GeneCards to find out more about a specific gene

Progress of Genome Sequencing

Sequenced Euk. Genomes: Yeast, Drosophila, C. elegans, Arabidopsis, Mosquito, Human, Mouse, Rat, Chicken, Dog, Zebrafish

Euk. Genomes in Progress: Xenopus, Cow, Cat, Horse, Kangaroo, Honey Bee, Turkey, Lobster, Bat, Hedgehog, and others…

Genomic Search Engines include:

• BLAST - searches sequence information, either nucleotide (BLASTn) or protein (BLASTp)
• BLAST2 - aligns two sequences, checking similarity
• Entrez - searches databases for textual information
• PubMed - searches scientific literature for text
• ORF finder - finds Open Reading Frames (genes)
• PREDATOR - predicts secondary structure of proteins
• ExPASy - analysis of protein sequence and structure as well as 2D gel information

Tools had to be developed to make sense of the flood of genomic data being produced

Calculating E(expect)-values

E-values measure the “significance” of a match: the smaller the E-value, the better

E-values are calculated using:
1) S, the bit score, a measure of the similarity between the hit and the query
2) m, the length of the query
3) n, the size of the database

$E = mn \cdot 2^{-S}$

So, how do you get the bit score?

S is calculated from the raw score, R

R = aI + bX - cO - dG

Where I is the # of identities, X is the # of mismatched nucleotides, O is the # of gaps, and G is the # of spaces in the gaps. The coefficients a, b, c, and d are the rewards and penalties for each of these variables. Their defaults are set at 1, -3, 5, and 2, respectively.

These values can be changed on the “Other advanced” line.

Now that we have a raw score, the bit score can be obtained by normalizing the data:

$S = (\lambda R - \ln K)/\ln 2$

(where $\lambda$ and K are the normalizing parameters)

These parameters are printed at the bottom of a BLAST report.

Normalization enables a direct comparison of E-values and bit scores, even if the reward and penalty variables have been changed by the user.
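The three steps (raw score, bit score, E-value) can be chained in a few lines of Python. This is a minimal sketch of the arithmetic only, using the normalization S = (λR - ln K)/ln 2 with λ and K as the normalizing parameters. The function names are made up for illustration, and the λ, K, query length, and database size in the usage example are placeholders rather than values from a real BLAST report.

```python
import math


def raw_score(identities: int, mismatches: int, gap_openings: int,
              gap_spaces: int, a: float = 1, b: float = -3,
              c: float = 5, d: float = 2) -> float:
    """R = aI + bX - cO - dG, with the default rewards and penalties."""
    return a * identities + b * mismatches - c * gap_openings - d * gap_spaces


def bit_score(raw: float, lam: float, k: float) -> float:
    """S = (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_size: int) -> float:
    """E = m * n * 2^(-S)."""
    return query_len * db_size * 2 ** (-bits)


# Hypothetical hit: 20 identities, 1 mismatch, no gaps.
r = raw_score(identities=20, mismatches=1, gap_openings=0, gap_spaces=0)
print(r)  # 17

# Placeholder parameters; read the real ones from the bottom of a BLAST report.
s = bit_score(r, lam=1.37, k=0.71)
print(e_value(s, query_len=100, db_size=1_000_000))
```

Because E shrinks by half for every additional bit of score, small changes in the alignment can move the E-value by orders of magnitude.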

More databases of interest:

• SwissProt - protein sequence database
• PDB - contains protein structural information
• OMIM - catalogs human disease genes
• TIGR - many searchable genomes, esp. bacterial ones
• GeneCards - genomic, proteomic and phenotypic info.
• UniGene - catalogs human ESTs
• Human map viewer - shows chromosomal location of genes

Protein structure and function

For most researchers, the final goal of genomic research is not the genomic data itself but an understanding of the proteins encoded by a genome.

Steps to determining protein structure and function:

• Find ORFs, or coding sequences (CDSs)
• Translate ORFs
• Predict hydropathy using a Kyte-Doolittle plot
• Check if 3D structure has been determined
• Predict secondary structure of your protein
• Is this a known protein? If not, find protein orthologs, similar proteins in different species
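Two of the steps above, translating an ORF and predicting hydropathy, are easy to sketch in Python. The code below uses the standard genetic code and the published Kyte-Doolittle hydropathy scale. The function names and the default window of 9 residues are choices made for this illustration, and the input is assumed to contain only unambiguous A, C, G, and T bases.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def translate(orf: str) -> str:
    """Translate a DNA ORF codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def hydropathy_profile(protein: str, window: int = 9) -> list[float]:
    """Average Kyte-Doolittle score over each full sliding window."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


print(translate("ATGAAATTTTAG"))  # MKF
print(hydropathy_profile("MKFLLLLLLLLLKDE", window=9))
```

Plotting the profile against residue position gives the familiar Kyte-Doolittle plot. Sustained stretches of high positive values are often read as candidate membrane-spanning segments.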

What do we mean by function?

Why = biological process. The objective toward which this protein contributes.

What = molecular function. The biochemical activity that the protein accomplishes.

Where = cellular component. The location of protein activity.

The term “function” is too simplistic and is somewhat outdated. A consortium called “Gene Ontology” decided that a complete description of function must include not only “why?” but also “what?” and “where?”

One example: isocitrate dehydrogenase (IDH)

• OMIM - IDH3A
• COG - functional categories, dendrograms
• isoforms - distinct genes encoding similar proteins
• Enzyme Commission, “EC” numbers
• Swiss-Prot
• Phylogenetic trees - rooted vs. unrooted

Terms used to describe phylogeny

• paralogs - genes which arose from a common ancestral gene within one species (isoforms)
• orthologs - genes from two organisms which arose from a common ancestral gene
• synteny - genetic loci located on the same chromosome (or multiple genetic loci from different species which are located on a chromosomal region of common ancestry)
• homology - sequences which are similar due to a common evolutionary origin
• similarity or identity - terms used to describe sequences without regard to evolutionary relationships
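Identity, unlike homology, is a plain measurement on an alignment and can be computed directly. The sketch below assumes the two sequences have already been aligned to equal length with "-" marking gaps, and it counts identity over ungapped columns only. Other tools use other denominators (for example, the full alignment length), so this is one convention among several, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Made-up alignment: 4 matches over 5 ungapped columns.
print(percent_identity("ACGT-A", "ACCTGA"))  # 80.0
```

A high percent identity is evidence for homology but does not by itself establish a common evolutionary origin.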

Searching for related proteins

PSI-BLAST allows one to search outward in a spiraling pattern from a central starting point.

First iteration- finds proteins with similar sequences.

Second iteration- can be performed using a consensus sequence computed from your first iteration. More iterations can be performed as desired. Or, one can choose a species and perform another first iteration using the results of the original search.

This approach can be used to annotate ORFs from a newly sequenced genome

Alternate Splicing

60% of human genes produce more than 1 mRNA

Only about 22% of genes in C. elegans fit into this category

Epigenetic Control

It is not just the coding regions which matter.

Methylation, such as that found in heterochromatin and CpG islands, also plays a role in gene expression.

At any given time, there are 400,000 mC in a given cell. Since there are about 100 different human cell types, this totals 40 million methylation events in our methylome. Nonmammalian animals lack this form of epigenetic control.

The # of CpG islands correlates with the # of genes on a chromosome

CpGs are usually associated with genes
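One common way to flag a candidate CpG island is to compare the number of CpG dinucleotides observed in a window with the number expected from its C and G content. The sketch below computes that ratio along with GC content. The function name is hypothetical, and the thresholds mentioned afterward are a widely quoted rule of thumb that does not come from these slides.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for a sequence.

    Observed/expected = (#CpG * length) / (#C * #G).
    Both values are 0.0 when they cannot be computed.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself
    gc_content = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp


# A made-up, extremely CpG-rich stretch.
print(cpg_stats("CGCGCGCG"))  # (1.0, 2.0)
```

A frequently used heuristic calls a region a CpG island when it is at least 200 bp long, has GC content above 50%, and has an observed/expected ratio above 0.6. Treat those cutoffs as an assumption here.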

Imprinting

About 20 mammalian genes are known to be methylated during gametogenesis in either the paternal or maternal copy.

Imprinting may represent a “genetic tug-of-war” between male and female interests.

For example, the insulin-like growth factor 2 gene, Igf2, is expressed only from the paternal allele. Igf2 promotes the growth of the developing embryo.

The expression of its receptor, Igf2r, is controlled by the maternally inherited allele.

Expression of Paternal Allele of Igf2 in embryo and placenta

How does silencing work?

What is the effect of a loss of imprinting?

Loss of Igf2 imprinting can lead to colorectal cancer and Beckwith-Wiedemann Syndrome

There is a cluster of CpG islands in an insulator region near Igf2

CTCF is a protein which only binds to unmethylated DNA.

17/20 tumor samples taken from cancer patients were found to be hypermethylated in this region.

What about the rest of our genome?

Since only 1-2% of our genome is coding sequence, what does the rest do?

A majority of our DNA is repetitive sequence

There are 5 classes of repetitive sequence:
1) transposon derived
2) pseudogenes
3) simple repeats such as VNTRs
4) segmental duplications
5) heterochromatic regions

The first category alone accounts for 45% of our genome!

Transposons

Transposons fall into 4 categories:

1) SINEs, short interspersed elements, such as Alu comprise 13% of our genome

These may help a cell cope with stress: RNA produced from them binds to an inhibitor of translation.

2) LINEs, long interspersed elements, comprise 21% of our genome

3) LTR retrotransposons comprise 8% of our genome

4) Other DNA transposons comprise 3% of our genome

More Transposon Facts

About 50 genes appear to be derived from transposons, including RAG1 and RAG2, necessary for antibody diversity.

The X chromosome has the highest concentration of transposons: one 525 kb section is 89% transposon-derived. The Y chromosome has the highest concentration of LINEs. It is the most gene-poor of the chromosomes and probably tolerates insertions well.
