binf634 fall15 - lecture 41 topics logical expression string functions: substr and index random...

36
BINF634 FALL15 - LECTURE 4 1 Topics Logical expression string functions: substr and index random numbers and mutation hashes Transcription, translation, genetic code Quiz 2

Upload: kelley-matthews

Post on 03-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

BINF634 FALL15 - LECTURE 4 1

Topics Logical expression string functions: substr and index random numbers and mutation hashes Transcription, translation, genetic code Quiz 2

BINF634 FALL15 - LECTURE 4 2

Logical Value of Expressions Any expression in Perl can be

interpreted as a logical value true or false

Scalar context: false if 0, "", or undefined otherwise true

List context: false if () or undefined true otherwise

my $x; # or: my $x = 0;if ($x) { $x++ }else { $x = 17 }print "$x\n";17

my $x = 2;while ($x) {

print $x--, "\n";}21

my @a = ("A", "T");while (@a) {

print shift @a, "\n";}AT

BINF634 FALL15 - LECTURE 4 3

Logical Operators

$a and $b $a if $a is false, $b otherwise

$a or $b $a if $a is true, $b otherwise

not $a true if $a is not true

$a xor $b True if $a or $b is true, but not both

Logical operator "short-circuit": only evaluate second argument if necessary

Example:

open(GRADES, "grades") or die "Can't open file grades\n";

See Wall (p 109-110) for discussion of C-style &&, || and ! operators

BINF634 FALL15 - LECTURE 4 4

String Functions: substr

substr EXPR, OFFSET, LENGTH, REPLACEMENTReturn substring of string EXPR at position OFFSET and length LENGTH

$s = "hello, world!";$x = substr $s, 1, 1; # $x = "e"$x = substr $s, 0, 3; # $x = "hel"$x = substr $s, 7; # $x = "world!"

# The substring is replaced by REPLACEMENT if used:substr($s,0,5,"goodbye"); # $s = "goodbye, world!";

# This does the same thing:$s = "hello, world!";substr($s,0,5) = "goodbye";

Note: Predefined Perl functions may be used with or without parentheses around their arguments

# From example7-4.pl

# matching_percentage## Subroutine to calculate the percentage of identical bases in two# equal length DNA sequences

sub matching_percentage {my($string1, $string2) = @_;

# we assume that the strings have the same length my $length = length($string1); my $count = 0;

for (my $position=0; $position < $length ; ++$position) { if(substr($string1,$position,1) eq substr($string2,$position,1)) { ++$count; } } return $count / $length;}

5BINF634 FALL15 - LECTURE 4

#!/usr/bin/perluse strict;use warnings;my $dna = "AAAAATTTTTGGGGGTTTTT";print_sequence(dna,8);exit;

# A subroutine to format and print sequence data

sub print_sequence { my($sequence, $length) = @_;

# Print sequence in lines of $length for (my $pos = 0 ; $pos < length($sequence); $pos += $length ) { print substr($sequence, $pos, $length), "\n"; }}

AAAAATTTTTGGGGGCCCCC

6BINF634 FALL15 - LECTURE 4

BINF634 FALL15 - LECTURE 4 7

String Functions: index

index STR, SUBSTR, OFFSETReturn the position of the first occurrence of SUBSTR in STR; If OFFSET is given, skip this many letters before lookingReturns -1 if SUBSTR not found

$dna = "GATGCCATGAAATGC";$pos = index $dna, "ATG";print "ATG found at position $pos\n"; # answer: 1

pos = -1;while (($pos = index($dna, "ATG", $pos)) > -1) {

print "ATG found at position $pos\n";$pos++;

}OUTPUT:ATG found at position 1ATG found at position 6ATG found at position 11

BINF634 FALL15 - LECTURE 4 8

Random Number Generation

$r = rand; # $r is random between 0 and 1 (0<=$r< 1.0)

$r = rand(100); # random between 0 and 100

$d = int rand(101); # random integer from 0 to 100

srand($seed); # seeds the random number generator

If a program doesn't call srand(), then it generates different random numbers each time it is run.

If program does call srand($seed), we get the same sequence of random number each time the program is run with the same value for $seed.

BINF634 FALL15 - LECTURE 4 9

An Example to Generate Random Stories - I#!/usr/bin/perl -w# Example 7-1 Children's game, demonstrating primitive artificial intelligence,# using a random number generator to randomly select parts of sentences.

use strict;use warnings;

# Declare the variablesmy $count;my $input;my $number;my $sentence;my $story;

# Here are the arrays of parts of sentences:my @nouns = ('Dad','TV','Mom','Groucho','Rebecca','Harpo','Robin Hood','Joe and Moe',);

BINF634 FALL15 - LECTURE 4 10

An Example to Generate Random Stories - IImy @verbs = ('ran to','giggled with','put hot sauce into the orange juice

of','exploded','dissolved','sang stupid songs with','jumped with',);

my @prepositions = ('at the store','over the rainbow','just for the fun of it','at the beach','before dinner','in New York City',

'in a dream','around the world',);

# Seed the random number generator.# time|$$ combines the current time

with the current process id# in a somewhat weak attempt to come up

with a random seed.srand(time|$$);

# This do-until loop composes six-sentence "stories".

# until the user types "quit".do { # (Re)set $story to the empty

string each time through the loop $story = '';

# Make 6 sentences per story. for ($count = 0; $count < 6;

$count++) {

BINF634 FALL15 - LECTURE 4 11

An Example to Generate Random Stories - III# Notes on the following statements: # 1) scalar @array gives the

number of elements in the array. # 2) rand returns a random number

greater than 0 and # less than scalar(@array). # 3) int removes the fractional

part of a number. # 4) . joins two strings

together. $sentence =

$nouns[int(rand(scalar @nouns))] . " " .

$verbs[int(rand(scalar @verbs))] . " " .

$nouns[int(rand(scalar @nouns))] . " " .

$prepositions[int(rand(scalar @prepositions))]

. '. ';

$story .= $sentence;

}

# Print the story. print "\n",$story,"\n";

# Get user input. print "\nType \"quit\" to quit, or

press Enter to continue: ";

$input = <STDIN>;

# Exit loop at user's request} until($input =~ /^\s*q/i);

exit;

BINF634 FALL15 - LECTURE 4 12

Randomly Selecting and Element of an Array $verbs[int(rand(scalar @verbs)))]

verbs[int(rand(7))] #Why??

What does rand(7) do?

How about int(rand(7))?

BINF634 FALL15 - LECTURE 4 13

An Alternative Way to Randomly Select the Elements in an Array $verbs[int rand scalar @verbs]

This will actually work $verbs[rand @verbs] #How does this work?

# From example7-2.pl# A subroutine to perform a mutation in a string of DNA#

sub mutate { my($dna) = @_;

my(@nucleotides) = ('A', 'C', 'G', 'T');

# Pick a random position in the DNA my($position) = randomposition($dna);

# Pick a random nucleotide my($newbase) = randomnucleotide(@nucleotides);

# Insert the random nucleotide into the random position in the DNA substr($dna,$position,1,$newbase);

return $dna;}

14BINF634 FALL15 - LECTURE 4

# A subroutine to randomly select a position in a string.sub randomposition { my($string) = @_; # rand returns a decimal number between 0 and its argument. # int returns the integer portion of a decimal number. # # The whole expression returns a random number between 0 and # length-1 return int rand length $string;}

# Select at random one of the four nucleotidessub randomnucleotide { my(@nucleotides) = ('A', 'C', 'G', 'T'); return randomelement(@nucleotides);}

# randomly select an element from an arraysub randomelement { my(@array) = @_;

# return $array[int(rand(scalar @array)]; return $array[rand @array];}

15BINF634 FALL15 - LECTURE 4

# Subroutine to perform a mutation in a string of DNA-version 2, in# which it is guaranteed that one base will change on each call.

sub mutate_better { my($dna) = @_; my(@nucleotides) = ('A', 'C', 'G', 'T');

# Pick a random position in the DNA my($position) = randomposition($dna);

# Pick a random nucleotide my($newbase);

do { $newbase = randomnucleotide(@nucleotides);

# Make sure it's different than the nucleotide we're mutating }until ( $newbase ne substr($dna, $position,1) );

# Insert the random nucleotide into the random position in the DNA substr($dna,$position,1,$newbase);

return $dna;}

16BINF634 FALL15 - LECTURE 4

BINF634 FALL15 - LECTURE 4 17

Hashes(Associative Arrays)

A Hash is a collection of zero or more pairs of scalar values,called keys and values

Hash variable names begin with a percent sign (%)%genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA");

The values are indexed by the keys Given a key, the hash returns the corresponding value$seq = $genes{"gene2"}; # $seq = "CTGCCATGA" Note that $genes{"gene2"} is a scalar, so it starts with $

BINF634 FALL15 - LECTURE 4 18

Hashes

Hashes can be assigned values use key=>value notation:%genes = ( "gene1", "ATTCGT", "gene2", "CTGCCATGA");

%genes = ( gene1=>"ATTCGT", gene2=>"CTGCCATGA");

Hash elements can be created/altered by assignment statements:$genes{"gene1"} = "ATTCGT";

$genes{gene2} = "CTGCCATGA"; # note: no quotes in key

BINF634 FALL15 - LECTURE 4 19

Hashes (Associative Arrays)

%genomes = ( ); # creates an empty hash

# two ways to do the same thing:

%genomes = ( "virus", 31, "bacteria", 89, "plants", 5 );

%genomes = ( virus => 31, bacteria => 89, plants => 5 );

$genomes{mammals} = 2; # adds a new pair to the hash

@genome_list = keys %genomes;

# @genome list is now ("plants" , "mammals", "bacteria", "virus")

@genome_counts = values %genomes;

# @genome_counts is now (5, 2, 89, 31)

# keys and values are not guaranteed to return the data is same order

# as it was entered, but they are guaranteed to return the data in the

# same order as each other.

BINF634 FALL15 - LECTURE 4 20

HashesThe keys function returns a list of all keys in a hash (in some random order)

%genes = (gene2=>"CTGCCATGA", gene1=>"ATTCGT");

@key_list = keys(%genes);

print "@key_list\n"; # prints: gene1 gene2

# often used to loop through a hash:

foreach $key (@key_list) {

print "The value of $key is $genes{$key}\n";

}

Output:

The value of gene1 is ATTCGT

The value of gene2 is CTGCCATGA

BINF634 FALL15 - LECTURE 4 21

Hashes# exists($H{$key}) returns TRUE if $key occurs in hash %H# print all the words in a file with their countsmy ($file) = @ARGV;open FH, $file;my @lines = <FH>;close FH;my %count = ();for my $line (@lines) {

for my $word (split " ", $line) {if (not exists $count{$word}) {

# initialize the count for a new word $count{$word} = 1;

}else {

# update the count for an existing word$count{$word}++;

}}

}for my $key (sort keys %count) {

print "$key $count{$key}\n";}

What does each line in this program do?

Hashes% cat testfile2a text file with lots of wordssome words occur once and somewords occur more than once

% wordcount.pl testfile2a 1and 1file 1lots 1more 1occur 2of 1once 2some 2text 1than 1with 1words 3

22BINF634 FALL15 - LECTURE 4

BINF634 FALL15 - LECTURE 4 23

Hashes (Associative Arrays)

The each function returns a two-element list containing one key from the hash and its associated value

Subsequent calls to each will return another pair, until all pairs have been returned (at which point an empty array is returned)

while ( ($genome, $count) = each %genomes ) ) {

print “$genome $count\n”;}

OUTPUT: (possibly not in this order)plants 5virus 31bacteria 89mammals 2

BINF634 FALL15 - LECTURE 4 24

More on Hashes (Associative Arrays)

Assigning the return value from values or keys to a scalar gives the number of pairs in a hash:

$genome_count = keys %genomes; # $genome_count is now 4$genome_count = values %genomes; # $genome_count is now 4

The delete function removes a pair from a hash

delete $genomes{bacteria};$genome_count = keys %genomes; # $genome_count is now 3

BINF634 FALL15 - LECTURE 4 25

Transcription and Translation DNA is transcribed to mRNA, mRNA is translated to protein

Start with double stranded DNAATTCGAGCATGACATCATCGGTA (sense strand)TAAGCTCGTACTGTAGTAGCCAT (complement strand)

DNA double helix separates, allowing polymerase to transcribe one strand:ATTCGAGCATGACATCATCGGTA (sense strand)AUUCGAGCAUGACAUCAUCGGUA (mRNA)| | | | | | |TAAGCTCGTACTGTAGTAGCCAT (complement strand)

mRNA codons translated into protein:AUU CGA GCA UGA CAU CAU CGG … (mRNA)

BINF634 FALL15 - LECTURE 4 26

Codons

BINF634 FALL15 - LECTURE 4 27

Reading Frames

A genomic sequence has 6 reading frames, corresponding to the six possible ways of translating the sequence into three-letter codons.

Frame 1 treats each group of three bases as a codon, starting from the first base

Frame 2 starts at the second base Frame 3 starts at the third base

Start with the sense strand of DNA: ATTCGAGCATGACATCATCGGTA Reading frame 1: ATT CGA GCA TGA CAT CAT CGG TA Reading Frame 2: TTC GAG CAT GAC ATC ATC GGT A Reading Frame 3: TCG AGC ATG ACA TCA TCG GTA

Each reading frame can be translated into a different protein sequence Frames 4, 5 and 6 are defined in a similar way, but refer to the opposite strand,

which is the reverse complement of the first strand.

BINF634 FALL15 - LECTURE 4 28

Reading Frames Perl code to process first three reading frames:

$dna = ...;for (my $r = 1; $r <= 3; $r++) {

my $frame = substr($dna, ?, ?); # fill in the blanks!# process reading frame $r here

}

# exercise: write a subroutine to return the nth reading frame for a DNA string

29

BINF634 FALL15 - LECTURE 4

BINF634 FALL15 - LECTURE 4 30

Using the Genetic Code

We’ll look at four versions of translating DNA using the genetic code:

Look up the codon using if-then-else Same as above, but use patterns to reflect redundancy of

genetic code Use a hash to look up each codon Same as above, but more efficiently

See BeginPerlBioinfo.pm

# codon2aa# A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa_v1 { my($codon) = @_;

if ( $codon =~ /TCA/i ) { return "S" } # Serine elsif ( $codon =~ /TCC/i ) { return "S" } # Serine elsif ( $codon =~ /TCG/i ) { return "S" } # Serine elsif ( $codon =~ /TCT/i ) { return "S" } # Serine elsif ( $codon =~ /TTC/i ) { return "F" } # Phenylalanine elsif ( $codon =~ /TTT/i ) { return "F" } # Phenylalanine elsif ( $codon =~ /TTA/i ) { return "L" } # Leucine elsif ( $codon =~ /TTG/i ) { return "L" } # Leucine elsif ( $codon =~ /TAC/i ) { return "Y" } # Tyrosine elsif ( $codon =~ /TAT/i ) { return "Y" } # Tyrosine elsif ( $codon =~ /TAA/i ) { return "_" } # Stop elsif ( $codon =~ /TAG/i ) { return "_" } # Stop elsif ( $codon =~ /TGC/i ) { return "C" } # Cysteine elsif ( $codon =~ /TGT/i ) { return "C" } # Cysteine elsif ( $codon =~ /TGA/i ) { return "_" } # Stop elsif ( $codon =~ /TGG/i ) { return "W" } # Tryptophan . . elsif ( $codon =~ /GGT/i ) { return "G" } # Glycine else { print STDERR "Bad codon \"$codon\"!!\n"; exit; }}

Problem: takes longer to look up codons near end of list.

Version 1

31

BINF634 FALL15 - LECTURE 4

Remember in relating to the translation table on slide 29 that the T’s become U’s

# codon2aa# A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa_v2 { my($codon) = @_;

if ( $codon =~ /GC./i) { return "A" } # Alanine elsif ( $codon =~ /TG[TC]/i) { return "C" } # Cysteine elsif ( $codon =~ /GA[TC]/i) { return "D" } # Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return "E" } # Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return "F" } # Phenylalanine elsif ( $codon =~ /GG./i) { return "G" } # Glycine elsif ( $codon =~ /CA[TC]/i) { return "H" } # Histidine elsif ( $codon =~ /AT[TCA]/i) { return "I" } # Isoleucine elsif ( $codon =~ /AA[AG]/i) { return "K" } # Lysine elsif ( $codon =~ /TT[AG]|CT./i) { return "L" } # Leucine elsif ( $codon =~ /ATG/i) { return "M" } # Methionine elsif ( $codon =~ /AA[TC]/i) { return "N" } # Asparagine elsif ( $codon =~ /CC./i) { return "P" } # Proline elsif ( $codon =~ /CA[AG]/i) { return "Q" } # Glutamine elsif ( $codon =~ /CG.|AG[AG]/i) { return "R" } # Arginine elsif ( $codon =~ /TC.|AG[TC]/i) { return "S" } # Serine elsif ( $codon =~ /AC./i) { return "T" } # Threonine elsif ( $codon =~ /GT./i) { return "V" } # Valine elsif ( $codon =~ /TGG/i) { return "W" } # Tryptophan elsif ( $codon =~ /TA[TC]/i) { return "Y" } # Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i) { return "_" } # Stop else { print STDERR "Bad codon \"$codon\"!!\n"; exit; }}

Still takes longer to look up codons near end of list.

Version 2

32

BINF634 FALL15 - LECTURE 4

# codon2aa# Translate codon using hash lookup (pp 160-162 and BeginPerlBioinfo.pm)

sub codon2aa_v3 { my($codon) = @_;

$codon = uc $codon;

my(%genetic_code) = ( "TCA" => "S", # Serine "TCC" => "S", # Serine "TCG" => "S", # Serine "TCT" => "S", # Serine "TTC" => "F", # Phenylalanine "TTT" => "F", # Phenylalanine "TTA" => "L", # Leucine "TTG" => "L", # Leucine "TAC" => "Y", # Tyrosine "TAT" => "Y", # Tyrosine "TAA" => "_", # Stop "TAG" => "_", # Stop

. . . "GGT" => "G", # Glycine );

if(exists $genetic_code{$codon}) { return $genetic_code{$codon}; }

else{ die "Bad codon: $codon";}}

Efficiency Problem: the hash is redefined for every codon lookup.

Version 3

Robustness Problem: should the program die if the codon is unknown?

33BINF634 FALL15 - LECTURE 4

BINF634 FALL15 - LECTURE 4 34

More Efficient Solution

Define the genetic code hash just once (outside subroutine) Use subroutine to perform lookup

my(%genetic_code) = ( "TCA" => "S", # Serine "TCC" => "S", # Serine "TCG" => "S", # Serine "TCT" => "S", # Serine "TTC" => "F", # Phenylalanine "TTT" => "F", # Phenylalanine "TTA" => "L", # Leucine "TTG" => "L", # Leucine "TAC" => "Y", # Tyrosine "TAT" => "Y", # Tyrosine "TAA" => "_", # Stop "TAG" => "_", # Stop

. . . "GGA" => "G", # Glycine "GGC" => "G", # Glycine "GGG" => "G", # Glycine "GGT" => "G", # Glycine );

# codon2aa# Translate a DNA 3-character codon to an amino acid# using hash lookup passed as argument

sub codon2aa { my($codon) = @_;

$codon = uc $codon;

if (exists $genetic_code{$codon}) { return $genetic_code{$codon}; } else {

return "X";# alternative error response:# die "Bad codon: $codon";

}}

The exists function return true if an element with the given key occurs in the hash.

Version 4

35BINF634 FALL15 - LECTURE 4

BINF634 FALL15 - LECTURE 4 36

Wrap up Program #1 was due tonight. Please talk to me privately

either via email or in person if you did not turn it in.

Program #2 is posted on course website and will be submitted via Blackboard

hint: Read about BeginPerlBioinfo.pm in chapter 8 Read the code in BeginPerlBioinfo.pm Start NOW!

Read: Tisdall Ch. 9 Wall Ch. 5