PYTHON STRINGS

CHAPTER 8 FROM THINK PYTHON: HOW TO THINK LIKE A COMPUTER SCIENTIST

STRINGS

A string is a sequence of characters. You can access the individual characters one at a time with the bracket operator.

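>>> name = 'Simpson'
>>> FirstLetter = name[0]    # 'S'

For 'Simpson', name[0] is 'S', name[1] is 'i' and name[4] is 's'. The last character can be written three ways: name[-1] == name[6] == name[len(name)-1], which is 'n'.

Also remember that len(name) is 7 (the number of characters), so the valid indices run from 0 to 6.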
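As a quick check of the index arithmetic above, here is a small sketch. It is only an illustration: the variable names first, last and same_last are made up for this example.

```python
# print(x) with a single argument works in Python 2 and Python 3.
name = 'Simpson'

# Positive indices count from the left, starting at 0.
first = name[0]
# Negative indices count from the right, starting at -1.
last = name[-1]
same_last = name[len(name) - 1]

print(first)
print(last)
print(last == same_last)

# len(name) itself is one past the last valid index, so it raises IndexError.
try:
    name[len(name)]
    past_end = 'no error'
except IndexError:
    past_end = 'IndexError'
print(past_end)
```

The try/except at the end is there to show why the last index is len(name)-1 and not len(name).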

TRAVERSING A STRING SEVERAL WAYS

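name = 'Richard Simpson'

# 1. while loop with an explicit index
index = 0
while index < len(name):
    letter = name[index]
    print letter
    index = index + 1

# 2. for loop over the indices
for i in range(len(name)):
    letter = name[i]
    print letter

# 3. for loop directly over the characters
for char in name:
    print char

# 4. for loop over the indices, without the temporary variable
for i in range(len(name)):
    print name[i]

Does this make sense?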
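When both the position and the character are needed, the built-in enumerate() is one more way to traverse a string. This is a sketch that is not part of the original slides; the list called pairs is made up for the example.

```python
# print(x) with a single argument works in Python 2 and Python 3.
name = 'Richard Simpson'

# enumerate() yields (index, character) pairs, so no manual counter is needed.
pairs = []
for i, char in enumerate(name):
    pairs.append((i, char))

print(pairs[0])     # first pair
print(pairs[-1])    # last pair
print(len(pairs) == len(name))
```

This combines the convenience of "for char in name" with the index that the while loop keeps track of by hand.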

CONCATENATION

# The + operator is used to concatenate two strings
first = 'Monty'
second = 'Python'
full = first + second
print full

MontyPython

# Reversing a string
word = 'Hello Monty'
rev_word = ''
for char in word:
    rev_word = char + rev_word
print rev_word

ytnoM olleH
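The same reversal can also be written with a slice that uses a step of -1 (slices are covered in the next section). The sketch below builds the reversed string both ways so the two results can be compared; rev_loop and rev_slice are names chosen for this example.

```python
# print(x) with a single argument works in Python 2 and Python 3.
word = 'Hello Monty'

# Loop version: put each new character in front of what we have so far.
rev_loop = ''
for char in word:
    rev_loop = char + rev_loop

# Slice version: a step of -1 walks the string from the end to the start.
rev_slice = word[::-1]

print(rev_loop)
print(rev_slice)
print(rev_loop == rev_slice)
```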

STRING SLICES

A slice is a contiguous segment (a substring) of a string.

s = 'Did you say shrubberies? '

a = s[0:7]     # characters 0 to 6 (7 is not included); a is 'Did you'
b = s[8:11]    # characters 8 to 10; b is 'say'
c = s[12:]     # from 12 to the end (returns a suffix); c is 'shrubberies? '
d = s[:3]      # from 0 to 2 (returns a prefix); d is 'Did'

Note that c keeps the question mark and the trailing space, because the slice runs to the very end of s.
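To check the slice boundaries, the sketch below runs the four slices and uses repr() so that any trailing space becomes visible. The split position n is an arbitrary value chosen for the example.

```python
# print(x) with a single argument works in Python 2 and Python 3.
s = 'Did you say shrubberies? '

a = s[0:7]
b = s[8:11]
c = s[12:]
d = s[:3]

# repr() shows the quotes, so the '? ' at the end of c is easy to see.
print(repr(a))
print(repr(b))
print(repr(c))
print(repr(d))

# A prefix and the matching suffix always rebuild the original string.
n = 7
rebuilt = s[:n] + s[n:]
print(rebuilt == s)
```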

STRINGS ARE IMMUTABLE

You can only build new strings; you CANNOT modify an existing one. You can, however, assign a new string to the same name. For example:

name = 'Superman'
name[0] = 's'            # generates an error (TypeError)
name = 's' + name[1:]    # this works: it builds a new string
print name

superman
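The failed assignment can be observed safely by catching the exception. In this sketch the names outcome and new_name are made up, and the new string is stored under a different name so the original stays visible.

```python
# print(x) with a single argument works in Python 2 and Python 3.
name = 'Superman'

# Item assignment on a string raises TypeError because strings are immutable.
try:
    name[0] = 's'
    outcome = 'modified in place'
except TypeError:
    outcome = 'TypeError'

# Building a new string from pieces of the old one is always allowed.
new_name = 's' + name[1:]

print(outcome)
print(new_name)
print(name)    # the original string is unchanged
```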

METHODS VS FUNCTIONS

type.do_something()       # here do_something is a method

'Hello'.upper()           # returns 'HELLO'
value.isdigit()           # returns True if value is non-empty and all its characters are digits
name.startswith('Har')    # returns True if name begins with 'Har'

do_something(type)        # here do_something is a function

Examples:

len("TATATATA")           # returns the length of the string, here 8
math.sqrt(34.5)           # returns the square root of 34.5 (needs "import math" first)

STRING METHODS

Methods are similar to functions, except that they are called with a different syntax: dot notation.

word = 'rabbit'
uword = word.upper()    # returns a copy of the string in upper case: 'RABBIT'

In word.upper(), word is the string, upper is the method, and the empty parentheses show that it takes no arguments.

There are a lot of string methods. Here is another:

string.capitalize()    Returns a copy of the string with only its first character capitalized.

http://docs.python.org/release/2.5.2/lib/string-methods.html
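The sketch below puts a few methods next to a few functions so the two calling styles can be compared directly. The sample values '2024' and 'Harold' are made up for the illustration.

```python
# print(x) with a single argument works in Python 2 and Python 3.
import math

word = 'rabbit'

# Methods: called on a value with dot notation.
upper_word = word.upper()
cap_word = word.capitalize()
all_digits = '2024'.isdigit()
starts = 'Harold'.startswith('Har')

# Functions: the value is passed in as an argument.
length = len(word)
root = math.sqrt(16.0)

print(upper_word)
print(cap_word)
print(all_digits)
print(starts)
print(length)
print(root)
```

Note that upper() and capitalize() return new strings; word itself is still 'rabbit' afterwards.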

FIND() A STRING METHOD

string.find(sub[, start[, end]])

Return the lowest index in the string where substring sub is found, such that sub is contained in the range [start, end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 if sub is not found.

# Double quotes are needed here because the text contains an apostrophe
statement = "What makes you think she's a witch? Well she turned me into a newt"
index = statement.find('witch')
print index
index2 = statement.find('she')
print index2
index3 = statement.find('she', index)    # start searching at the position of 'witch'
print index3

>>>
29
21
41
>>>
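Because find() accepts a start position, calling it repeatedly can locate every occurrence of a substring. The helper find_all below is a hypothetical function written for this illustration; it is not a built-in string method.

```python
# print(x) with a single argument works in Python 2 and Python 3.
def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    positions = []
    index = text.find(sub)
    while index != -1:
        positions.append(index)
        # Resume the search one character past the last match.
        index = text.find(sub, index + 1)
    return positions

statement = "What makes you think she's a witch? Well she turned me into a newt"
she_positions = find_all(statement, 'she')
witch_positions = find_all(statement, 'witch')
duck_positions = find_all(statement, 'duck')

print(she_positions)
print(witch_positions)
print(duck_positions)
```

Restarting at index + 1 (rather than index + len(sub)) means overlapping matches are reported too, which matters for repetitive text such as DNA.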

THE IN OPERATOR WITH STRINGS

The word in is a boolean operator that takes two strings and returns True if the first appears as a substring in the second.

>>> 'a' in 'King Arthur'
False
>>> 'Art' in 'King Arthur'
True

What does this function do?

def mystery(word1, word2):
    for letter in word1:
        if letter in word2:
            print letter

It prints each letter of word1 that also occurs in word2.
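A variation on the mystery function returns the shared letters instead of printing them, and skips repeats. The name common_letters and the sample words are assumptions made for this sketch.

```python
# print(x) with a single argument works in Python 2 and Python 3.
def common_letters(word1, word2):
    """Return the letters of word1 that also occur in word2, without repeats."""
    result = ''
    for letter in word1:
        # 'in' tests membership in word2; 'not in' avoids adding a letter twice.
        if letter in word2 and letter not in result:
            result = result + letter
    return result

shared = common_letters('apples', 'oranges')
nothing = common_letters('abc', 'xyz')

print(shared)
print(repr(nothing))
```

Returning a string rather than printing makes the result usable in later code, for example in a comparison or a test.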

MORE EXAMPLES

>>> 'TATA' in 'TATATATATATA'
True
>>> 'AA' in 'TATATATATATATATA'
False
>>> 'AC' + 'TG'
'ACTG'
>>> 5 * 'TA'
'TATATATATA'
>>> 'MNKMDLVADVAEKTDLS'[1:4]
'NKM'
>>> 'MNKMDLVADVAEKTDLS'[8:-1]
'DVAEKTDL'
>>> 'MNKMDLVADVAEKTDLS'[-5:-4]
'K'
>>> 'MNKMDLVADVAEKTDLS'[10:]
'AEKTDLS'
>>> 'MNKMDLVADVAEKTDLS'[5:5]
''
>>> 'MNKMDLVADVAEKTD'.find('LV')
5

STRING COMPARISONS

The relational operators also work here

if word == 'bananas':
    print 'Yes I want one'

Put words in alphabetical order

if word1 < word2:
    print word1, word2
else:
    print word2, word1

NOTE: In Python all upper case letters come before lower case, i.e. 'Hello' comes before 'hello'.
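Because upper case sorts before lower case, a plain comparison can give a surprising order for mixed-case words. One common workaround, sketched here with made-up sample words, is to compare lower-cased copies.

```python
# print(x) with a single argument works in Python 2 and Python 3.
word1 = 'hello'
word2 = 'Zebra'

# Plain comparison: 'Z' sorts before 'h', so 'hello' < 'Zebra' is False.
plain = word1 < word2

# Comparing lower-cased copies gives ordinary dictionary order.
ignoring_case = word1.lower() < word2.lower()

# sorted() can apply the same idea through its key argument.
ordered = sorted([word2, word1], key=str.lower)

print(plain)
print(ignoring_case)
print(ordered)
```

The original strings are not changed; lower() only supplies the values used for the comparison.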

LET'S DOWNLOAD A BOOK AND ANALYZE IT

Go to http://www.gutenberg.org/ and download the first edition of On the Origin of Species by Charles Darwin. Be sure to download the plain text file version and save it as oots.txt

(http://www.gutenberg.org/files/1228/1228.txt)

This little program will read in the file and print it to the screen.

file = open('oots.txt', 'r')    # open for reading
print file.read()

NOTE: read() pulls the entire file into the computer's memory and returns it as one string. The name file refers to the open file object.

See: http://www.pythonforbeginners.com/systems-programming/reading-and-writing-files-in-python/


I DON’T WANT THE WHOLE FILE!

The readline() method reads from a file line by line (rather than pulling the entire file in at once).

Use readline() when you want to get the first line of the file; subsequent calls to readline() return successive lines.

Basically, it reads a single line from the file and returns it as a string, including the \n at the end.

# prints first line
file = open('newfile.txt', 'r')
print file.readline()

# prints first 100 lines
file = open('oots.txt', 'r')
for i in range(100):
    print file.readline()

# prints first line, storing it in a variable first
file = open('newfile.txt', 'r')
line = file.readline()
print line

# prints entire file using a for loop over the file
file = open('oots.txt', 'r')
for line in file:
    print line
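The readline() pattern can be wrapped in a small function that stops early if the file is shorter than expected. This is a sketch under stated assumptions: first_lines is a hypothetical helper, and the demo writes its own small temporary file so it does not depend on oots.txt being present.

```python
# print(x) with a single argument works in Python 2 and Python 3.
import os
import tempfile

def first_lines(path, count):
    """Return up to count lines from the file at path, without the newlines."""
    lines = []
    # The with statement closes the file automatically, even on errors.
    with open(path, 'r') as infile:
        for i in range(count):
            line = infile.readline()
            if line == '':          # readline() returns '' at end of file
                break
            lines.append(line.rstrip('\n'))
    return lines

# Build a three-line demo file, read from it, then delete it.
handle, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(handle)
with open(demo_path, 'w') as outfile:
    outfile.write('line one\nline two\nline three\n')

first_two = first_lines(demo_path, 2)
all_lines = first_lines(demo_path, 100)
os.remove(demo_path)

print(first_two)
print(all_lines)
```

With the book, the call would look like first_lines('oots.txt', 100). Asking for 100 lines from the three-line demo file simply returns the three lines that exist.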

DOES THE ORIGIN HAVE THE WORD EVOLUTION IN IT?

# Searches for the word 'evolution' in the file. It checks every
# line individually in this program. This saves space over
# reading the entire book into memory.
file = open('oots.txt', 'r')
for line in file:
    if line.find('evolution') != -1:    # find returns -1 if it is not in the line
        print line
print 'done'

# Is this true of the 6th edition? Check it out.

What if we want to know which line the string occurs in?
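One possible answer is to count the lines while looping, which enumerate() does for us. In this sketch lines_containing is a hypothetical helper, and the sample list holds invented lines (not text from the book) so the example runs without the file.

```python
# print(x) with a single argument works in Python 2 and Python 3.
def lines_containing(lines, target):
    """Return (line_number, text) pairs for every line that contains target."""
    matches = []
    # Start counting at 1 so the numbers match what a text editor shows.
    for number, line in enumerate(lines, 1):
        if target in line:
            matches.append((number, line.rstrip('\n')))
    return matches

# Invented sample lines standing in for an open file.
sample = [
    'first line of text\n',
    'nothing to see here\n',
    'still nothing\n',
    'a line that mentions evolution\n',
]

found = lines_containing(sample, 'evolution')
print(found)
```

Since a file object can be looped over line by line, the same function should work as lines_containing(open('oots.txt', 'r'), 'evolution').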

LET'S DOWNLOAD SOME DNA

Where do we get DNA? http://en.wikipedia.org/wiki/List_of_biological_databases contains a nice list.

Let's use this one: http://www.ncbi.nlm.nih.gov/

Under Nucleotide, type in Neanderthal and download KC879692.1 (it was the fifth one in my search). This is the entire mitochondrial sequence for a Neanderthal found in Denisova Cave (http://en.wikipedia.org/wiki/Denisova_Cave) in the Altai Mountains.

Here it is:

http://www.ncbi.nlm.nih.gov/nuccore/KC879692.1

Here is the Denisovan mitochondrial sequence:

http://www.ncbi.nlm.nih.gov/nuccore/FR695060.1

STRIPPING THE ANNOTATION INFO

The annotation info for a GenBank file is everything written above the ORIGIN line. Let's get rid of this stuff using a flag variable.

file = open('neanderMito.gb', 'r')
fileout = open("stripNeander.txt", "w")

# This code strips all lines above and including the ORIGIN line
# It uses a flag variable called originFlag
originFlag = False
for line in file:
    if originFlag == True:
        print line,                   # The comma suppresses the extra line feed
        fileout.write(line)
    if line.find('ORIGIN') != -1:     # Once ORIGIN has been seen, start copying
        originFlag = True             # the following lines to the output file
fileout.close()                       # An absolute requirement to flush the buffer

STRIPPING THE ANNOTATION INFO 2

The annotation info for a GenBank file is everything written above the ORIGIN line. Here is another method.

file = open('neanderMito.gb', 'r')
fileout = open("stripNeander.txt", "w")

line = file.readline()
while line and not line.startswith('ORIGIN'):    # skip up to ORIGIN
    line = file.readline()

line = file.readline()                           # step past the ORIGIN line itself
while line and not line.startswith('//'):        # '//' marks the end of the record
    print line,
    fileout.write(line)
    line = file.readline()

fileout.close()    # An absolute requirement to flush the buffer

startswith() is another string method. Look it up!

NOW WE HAVE

1 gatcacaggt ctatcaccct attaaccact cacgggagct ctccatgcat ttggtatttt

61 cgtctggggg gtgtgcacgc gatagcattg cgagacgctg gagccggagc accctatgtc

121 gcagtatctg tctttgattc ctgccccatc ctattattta tcgcacctac gttcaatatt

181 acagacgagc atacctacta aagtgtgtta attaattaat gcttgtagga cataataata

241 acgattaaat gtctgcacag ccgctttcca cacagacatc ataacaaaaa atttccacca

301 aacccccccc ctccccccgc ttctggccac agcacttaaa catatctctg ccaaacccca

361 aaaacaaaga accctaacac cagcctaacc agatttcaaa ttttatcttt tggcggtata

421 cacttttaac agtcaccccc taactaacac attattttcc cctcccactc ccatactact

481 aatctcatca atacaacccc cgcccatcct acccagcaca caccgctgct aaccccatac

541 cccgagccaa ccaaacccca aagacacccc ccacagttta tgtagcttac ctcctcaaag

We want to get rid of the numbers and spaces. How does one do this?

What types of characters are left in this file?

Digits, the letters a, c, t, g, spaces, and newlines (CRs).

SO LET'S STRIP EVERYTHING BUT A,C,T,G

file = open("stripNeander.txt", "r")
fileout = open('neanderMitostripped.txt', 'w')

# This code strips all characters but a,c,t,g
for line in file:
    for char in line:
        if char in ['a','c','t','g']:    # I'm using a list here
            fileout.write(char)
fileout.close()

What is in the fileout now?

One very long line, i.e. there are NO spaces or CR’s

WHAT IF WE WANT TO DO ALL THIS ON A LOT OF FILES?

The easiest way would be to turn the previous processing into a function. Then we can use the function on the files.

# This function strips all characters but a,c,t,g from the sequence part
# of the GenBank file called name and returns the result as a string
def stripGenBank(name):
    file = open(name, "r")
    sequence = ''
    originFlag = False
    for line in file:
        if originFlag == True:
            for char in line:
                if char in ['a','c','t','g']:     # I'm using a list here
                    sequence = sequence + char    # attach the new char on the end
        if line.find('ORIGIN') != -1:
            originFlag = True
    file.close()
    return sequence

print stripGenBank('neanderMito.gb')
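Adding one character at a time to a long string can be slow for large genomes. A common alternative is to collect the pieces in a list and join them once at the end. The sketch below is a variation and not the version from the slides: strip_genbank_lines is a hypothetical name, it takes any sequence of lines instead of a file name, and the sample record is invented for the demo.

```python
# print(x) with a single argument works in Python 2 and Python 3.
def strip_genbank_lines(lines):
    """Return only the a, c, g, t characters found after the ORIGIN line."""
    pieces = []
    in_sequence = False
    for line in lines:
        if in_sequence:
            # Keep just the bases; digits, spaces and newlines are dropped.
            pieces.append(''.join([ch for ch in line if ch in 'acgt']))
        elif 'ORIGIN' in line:
            in_sequence = True
    # One join at the end instead of many string concatenations.
    return ''.join(pieces)

# Invented miniature record in the same layout as a GenBank file.
sample = [
    'LOCUS       EXAMPLE\n',
    'ORIGIN\n',
    '        1 gatcacaggt ctatcaccct\n',
    '       21 attaaccact\n',
    '//\n',
]

sequence = strip_genbank_lines(sample)
print(sequence)
```

Passing an open file works too, for example strip_genbank_lines(open('neanderMito.gb', 'r')), because a file can be looped over line by line.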

LET'S COMPARE NEANDERTHAL WITH DENISOVAN

neander = stripGenBank('neanderMito.gb')
denison = stripGenBank('denosovanMito.gb')

index = 0    # stays 0 if no difference is found
# Compare position by position, never running past the shorter sequence
for i in range(min(len(neander), len(denison), 10000)):
    if neander[i] != denison[i]:
        print '\nFiles first differ at location ', i
        index = i + 1
        print 'Neanderthal is ', neander[i], ' and Denisovan is ', denison[i]
        break

print neander[:index]    # Dump up to where they differ
print denison[:index]

THE OUTPUT OF THIS COMPARISON

Files first differ at location 145

Neanderthal is c and Denisovan is t

gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtcgcagtatctgtctttgattcctgccc

gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtcgcagtatctgtctttgattcctgcct
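The comparison stops at the first mismatch. A natural follow-up is to list every position where two sequences differ. The sketch below does this for two short made-up strings; difference_positions is a hypothetical helper. It compares position by position, so it assumes the sequences are already lined up: a single inserted or deleted base would make everything after it look different.

```python
# print(x) with a single argument works in Python 2 and Python 3.
def difference_positions(seq1, seq2):
    """Return the indices at which seq1 and seq2 hold different characters."""
    positions = []
    # zip() stops at the end of the shorter sequence.
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            positions.append(i)
    return positions

# Short invented sequences for the demo.
diffs = difference_positions('gattaca', 'gactaca')
none = difference_positions('acgt', 'acgt')

print(diffs)
print(none)
```

With the real data this would be called as difference_positions(neander, denison), and len() of the result gives the number of differing positions.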
