Lecture 1: Perl for Bioinformatics. Davide Pisani & James Cotton

Bioinformatics Programming

(Perl Programming)

2010

Davide Pisani

Bioinformatics

• Using computers to store, organise and interpret biological data

• In particular, data from high-throughput technologies (-omics)

High-throughput technologies

• DNA & protein sequences and structure (genomics & proteomics)

• Yeast two-hybrid screens (interactomics)

• Microarrays (transcriptomics)

• Metabolic networks (metabolomics)

How much sequence data is there?

1371 published complete genomes

188 ongoing archaeal genomes

4941 ongoing bacterial genomes

1599 ongoing eukaryotic genomes

242 metagenomes

How much data in each genome?

ftp://ftp.ncbi.nih.gov/refseq/release/

The human genome

etc.

(70 base pairs per line, 57 lines per page = 3990 bases/page)

Chromosome 1 is (about) 247,249,719 bases long, i.e. about 62,000 pages

Whole genome (3.2 x 10^9 bases) = about 802,000 pages
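
The page-count arithmetic above can be checked with a few lines of code. This is a minimal sketch in Python, not part of the original slides; the names BASES_PER_PAGE and pages_needed are hypothetical and chosen only for illustration, and the genome lengths are the approximate figures quoted above.

```python
# Page-count arithmetic from the slide, assuming the sequence is printed
# as plain text with a fixed number of bases per line and lines per page.
BASES_PER_LINE = 70
LINES_PER_PAGE = 57
BASES_PER_PAGE = BASES_PER_LINE * LINES_PER_PAGE  # 70 * 57 = 3990


def pages_needed(n_bases):
    """Return the (unrounded) number of printed pages for n_bases bases."""
    return n_bases / BASES_PER_PAGE


chromosome_1 = 247_249_719  # approximate length of human chromosome 1
whole_genome = 3.2e9        # approximate size of the human genome

print(f"{BASES_PER_PAGE} bases/page")
print(f"Chromosome 1: about {pages_needed(chromosome_1):,.0f} pages")
print(f"Whole genome: about {pages_needed(whole_genome):,.0f} pages")
```

Rounded to the nearest thousand, the two results agree with the figures of roughly 62,000 and 802,000 pages.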

Genome                      Base pairs     No. of genes

Phi-X 174                   5,386          10
Nanoarchaeum equitans       490,885        552
E. coli                     4,639,221      4,377
Saccharomyces cerevisiae    12,495,682     5,800
Drosophila melanogaster     122,653,977    13,379
Homo sapiens                3.2 x 10^9     30,000
Protopterus aethiopicus     1.3 x 10^11    ?
Psilotum nudum              2.5 x 10^11    ?20-25,000
Amoeba dubia                6.7 x 10^11    ?

GenBank contains much more than just sequence data

Information on the organism, the gene, where it is expressed and so forth.

Protein Structure

PDB: Protein Structure
