introduction to bioinformatics cpsc 265

Introduction to Bioinformatics

CPSC 265

Interface of biology and computer science

Analysis of proteins, genes and genomes using computer algorithms and computer databases

Genome informatics: making sense of the billions of base pairs of DNA that are sequenced by genomics projects.

Mostly, it’s about protein and DNA sequences

What is bioinformatics?

What do bioinformatics researchers do?

Process large data outputs from new technologies

Turn sequence data into whole-genome sequences

Interpret genome sequences in terms of genes and their expression

Find genes that control crop and animal traits, disease, etc.

Model evolution in genomes and proteins

Model and predict 3D structures of proteins

Growth of GenBank

[Figure: GenBank growth by year, 1982 to 2002. Axes: base pairs of DNA (billions) and sequences (millions). Fig. 2.1, page 17]

Updated 8-12-04: >40 billion base pairs

Cost of sequencing is falling exponentially

DNA sequence analysis

Could be like those from our experiment last week

Or, a lot bigger, like the whole human genome.

Some have chromatogram or “quality” data, some don’t.

DNA makes RNA makes protein

Hard to sequence RNA

Very hard to sequence protein

We can deduce RNA sequence from DNA (in bacteria, it is as easy as turning Ts to Us. In eukarya, we also need to figure out where the introns are)
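The bacterial case above can be sketched in a couple of lines. This is a minimal illustration, not a full transcription model: the function name `transcribe` is made up for this example, and it assumes the input is the coding strand written 5’ to 3’ with no introns to remove.

```python
def transcribe(dna):
    """Return the RNA sequence for a coding-strand DNA sequence.

    Assumption: no introns (the bacterial case), so the only change
    is replacing every T with U.
    """
    return dna.upper().replace("T", "U")


# First 16 bases of the example sequence used later in these slides.
print(transcribe("CATCAATGACAACTCC"))
```

For eukaryotic genes this step alone would not be enough, because the intron positions have to be worked out first.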

We can deduce protein sequence from RNA, using the Universal Genetic Code

Conceptual Translation

In a computer, take each set of three RNA letters, and then figure out what amino acid they code for.

Professional biologists use the SINGLE-LETTER CODE
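As a rough sketch of conceptual translation, the snippet below reads an RNA string three letters at a time and looks each codon up in the standard genetic code, writing the result in the single-letter amino acid code. The names `CODON_TABLE` and `translate` are illustrative choices, and stopping at the first stop codon is an assumption made for this example.

```python
# Standard genetic code, written compactly: codons are enumerated with
# the bases in the order U, C, A, G at each position; "*" marks a stop.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with U
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(rna):
    """Translate RNA into single-letter amino acids, stopping at a stop codon."""
    protein = []
    # Step through complete codons only; a trailing 1 or 2 bases are ignored.
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGACAACUCCAAAG"))
```

The input must contain only the letters A, C, G and U; anything else raises a `KeyError` in this sketch.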

DNA potentially encodes six proteins

5’ CAT CAA    5’ ATC AAC    5’ TCA ACT    (three frames on the top strand)

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

5’ GTG GGT    5’ TGG GTA    5’ GGG TAG    (three frames on the bottom strand)

We call these READING FRAMES
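The six reading frames can be listed with a short script: three offsets on the given strand, and three on its reverse complement (the other strand read 5’ to 3’). This is only a sketch; `reverse_complement` and `six_frames` are names chosen for this example.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return six lists of codons: frames 1-3 forward, then frames 1-3 reverse."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            # Keep complete codons only.
            frames.append(
                [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            )
    return frames


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(seq):
    # Print the first two codons of each frame.
    print(" ".join(frame[:2]))
```

Each of the six codon lists could then be translated separately, which is why one stretch of DNA potentially encodes six proteins.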

5’ CAT CAA 5’ ATC AAT 5’ TCA ATG


5’ CATCAATGACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTACTGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

All proteins start with M (ATG)

TAG, TAA and TGA are all STOP

This can help narrow it down

5’ CAT CAA 5’ ATC AAT 5’ TCA ATG


5’ CATCAATGACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTACTGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
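One way to see how the start and stop rules narrow things down is to scan each forward frame of the sequence above for an ATG and then read codons until a stop (or the end of the sequence). This is a simplified sketch: `find_orf` is a made-up helper, it only looks at the top strand, and it only considers the first ATG in each frame.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def codons(dna, offset):
    """Split dna into complete codons, starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def find_orf(dna, offset):
    """Return the codons from the first ATG up to the next stop, or None if no ATG."""
    frame = codons(dna, offset)
    if "ATG" not in frame:
        return None
    orf = []
    for codon in frame[frame.index("ATG"):]:
        if codon in STOP_CODONS:
            break
        orf.append(codon)
    return orf


seq = "CATCAATGACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for offset in range(3):
    orf = find_orf(seq, offset)
    label = "no ATG" if orf is None else " ".join(orf[:4]) + " ..."
    print("frame", offset + 1, ":", label)
```

On this sequence only the third frame contains an ATG, and the first frame runs into TGA at its third codon, so the candidate list shrinks quickly.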

Once you know the sequence of the protein, you can figure out if it has been studied already.

You may even be able to track down a likely structure

There are three major public DNA databases

GenBank: housed at NCBI (National Center for Biotechnology Information)

EMBL: housed at EBI (European Bioinformatics Institute)

DDBJ: housed in Japan

Page 16

www.ncbi.nlm.nih.gov

PubMed is…

• National Library of Medicine's search service

• 12 million citations in MEDLINE

• links to participating online journals

• PubMed tutorial (via “Education” on side bar)

BLAST is…

• Basic Local Alignment Search Tool

• NCBI's sequence similarity search tool

• supports analysis of DNA and protein databases

• 80,000 searches per day

TaxBrowser is…

• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)

• taxonomy information such as genetic codes

• molecular data on extinct organisms

From the NCBI homepage, type “lectin” and hit “Search”

PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries.

It has 12 million records dating back to 1966.

Page 35

PubMed

BLAST

BLAST looks for similarity between your favorite query sequence and other known protein or DNA sequences.

Applications include

• identifying homologs (orthologs and paralogs)

• discovering new genes or proteins

• discovering variants of genes or proteins

• investigating expressed sequence tags (ESTs)

• exploring protein structure and function

page 88

Four components to a BLAST search

(1) Obtain the sequence (query)

(2) Select the BLAST program

(3) Enter sequence

(4) Choose optional parameters

Then click “BLAST”

page 88

Step 2: Choose the BLAST program

blastn (nucleotide BLAST)

blastp (protein BLAST)

tblastn (translated BLAST)

blastx (translated BLAST)

tblastx (translated BLAST)

DNA potentially encodes six proteins

5’ CAT CAA 5’ ATC AAC 5’ TCA ACT


5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Choose the BLAST program

Program   Input     Database   Reading-frame comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
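The comparison counts for the five programs follow from the six reading frames: whenever a DNA sequence is translated before comparison, all six frames are tried. The sketch below only illustrates that arithmetic. The dictionary layout and the name `frame_comparisons` are assumptions made for this example and are not part of BLAST itself.

```python
# Reading frames tried for each kind of sequence in a comparison.
# "translated_dna" means DNA that is translated in all six frames first.
FRAMES = {"dna": 1, "protein": 1, "translated_dna": 6}

# For each program: (how the query is treated, how the database is treated).
BLAST_PROGRAMS = {
    "blastn": ("dna", "dna"),
    "blastp": ("protein", "protein"),
    "blastx": ("translated_dna", "protein"),
    "tblastn": ("protein", "translated_dna"),
    "tblastx": ("translated_dna", "translated_dna"),
}


def frame_comparisons(program):
    """Number of query-frame by database-frame combinations for a program."""
    query_kind, database_kind = BLAST_PROGRAMS[program]
    return FRAMES[query_kind] * FRAMES[database_kind]


for name in BLAST_PROGRAMS:
    print(name, frame_comparisons(name))
```

tblastx translates both the query and the database, so it has 6 × 6 = 36 combinations to compare, which is why it is the slowest of the five.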

Step 3: choose the database

nr = non-redundant protein (most general database)

Also can search specific organisms and DNA rather than protein (although ALL DNA is going to take a long time…)

filtering

So now you can

• Find any sequence in the database

• Find relevant publications

• Match DNA to protein sequence

• Find database matches to DNA or protein

• Find conserved domains in protein

• Find the 3D structure of a protein

…Without doing any experiments!
