Lecture 1: Perl for Bioinformatics, Davide Pisani & James Cotton
Bioinformatics
Programming
(Perl Programming)
2010
Davide Pisani
Bioinformatics
• Using computers to store, organise and
interpret biological data
• In particular, data from high-throughput
technologies (-omics)
High-throughput technologies
• DNA & protein sequences and structure
(genomics & proteomics)
• Yeast two-hybrid screens (interactomics)
• Microarrays (transcriptomics)
• Metabolic networks (metabolomics)
How much sequence data is there?
1371 published complete genomes
188 ongoing archaeal genomes
4941 ongoing bacterial genomes
1599 ongoing eukaryotic genomes
242 metagenomes
How much data in each genome?
ftp://ftp.ncbi.nih.gov/refseq/release/
The human genome
etc.
(70 base pairs per line, 57 lines per page = 3990 bases/page)
Chromosome 1 is (about) 247,249,719 bases long,
i.e. about 62,000 pages
Whole genome (3.2 x 10^9 bases) = about 802,000 pages
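
The page counts above come from a single division. A minimal sketch of that arithmetic in Python, assuming the slide's layout of 70 bases per line and 57 lines per page. The function name `pages_needed` is hypothetical, and the last page is allowed to be only partly filled:

```python
# Printed-page estimate for a genome, using the layout from the slide.
BASES_PER_LINE = 70
LINES_PER_PAGE = 57
BASES_PER_PAGE = BASES_PER_LINE * LINES_PER_PAGE  # 3990 bases per page


def pages_needed(n_bases: int) -> int:
    """Return the number of pages needed to print n_bases.

    Integer ceiling division: a partly filled final page still counts.
    """
    return (n_bases + BASES_PER_PAGE - 1) // BASES_PER_PAGE


chromosome_1 = 247_249_719    # approximate length of human chromosome 1
whole_genome = 3_200_000_000  # 3.2 x 10^9 bases

print(pages_needed(chromosome_1))  # 61968
print(pages_needed(whole_genome))  # 802006
```

Rounded to the nearest thousand, these are the 62,000 and 802,000 pages quoted on the slide.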
Genome                      Base pairs      No. of genes
Phi-X 174                   5,386           10
Nanoarchaeum equitans       490,885         552
E. coli                     4,639,221       4,377
Saccharomyces cerevisiae    12,495,682      5,800
Drosophila melanogaster     122,653,977     13,379
Homo sapiens                3.2 x 10^9      30,000
Protopterus aethiopicus     1.3 x 10^11     ?
Psilotum nudum              2.5 x 10^11     ?20-25,000
Amoeba dubia                6.7 x 10^11     ?
GenBank contains much more
than just sequence data
Information on the organism, the
gene, where it is expressed, and so
forth.
Protein Structure
PDB: Protein Structure