python strings chapter 8 from think python how to think like a computer scientist
PYTHON STRINGS CHAPTER 8
FROM
THINK PYTHON
HOW TO THINK LIKE A COMPUTER SCIENTIST
STRINGS
A string is a sequence of characters. You can access the individual characters one at a time with the bracket operator.
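>>> name = 'Simpson'
>>> FirstLetter = name[0]    # FirstLetter is 'S'
Indexing the string 'Simpson':
name[0] is 'S', name[1] is 'i', name[4] is 's', and name[-1] == name[6] is 'n'.
The last character can also be written name[len(name)-1].
Also remember that len(name) is 7 # number of characters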
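As a quick check of the indexing rules above, here is a minimal sketch. It uses print() with a single argument so that it should run under both Python 2 and Python 3; the variable name last is introduced only for illustration.

```python
name = 'Simpson'

# Three equivalent ways to reach the last character
last = name[len(name) - 1]
print(name[0])                        # first character
print(last == name[-1] == name[6])    # all three agree

# An index equal to len(name) is one past the end
try:
    name[len(name)]
except IndexError:
    print('index out of range')
```

Valid indices run from 0 to len(name)-1, or from -1 down to -len(name) when counting from the end.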
TRAVERSING A STRING SEVERAL WAYS
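name = 'Richard Simpson'
# 1. A while loop with an explicit index
index = 0
while index < len(name):
    letter = name[index]
    print letter
    index = index + 1
# 2. A for loop over the index values
for i in range(len(name)):
    letter = name[i]
    print letter
# 3. A for loop directly over the characters
for char in name:
    print char
# 4. A for loop over the index values, without a temporary variable
for i in range(len(name)):
    print name[i]
Does this make sense?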
CONCATENATION
# The + operator concatenates two strings
first = 'Monty'
second = 'Python'
full = first + second
print full
MontyPython
# Reversing a string
word = 'Hello Monty'
rev_word = ''
for char in word:
    rev_word = char + rev_word    # put each new character in front
print rev_word
ytnoM olleH
STRING SLICES
A slice is a contiguous segment (substring) of a string.
s = 'Did you say shrubberies? '
a = s[0:7]     # slice from index 0 to 6 (not including 7)
a is 'Did you'
b = s[8:11]    # slice from index 8 to 10
b is 'say'
c = s[12:]     # slice from index 12 to the end (returns a suffix)
c is 'shrubberies? '
d = s[:3]      # slice from index 0 to 2 (returns a prefix)
d is 'Did'
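Two slice properties follow from the "start included, end excluded" rule. The sketch below illustrates them on the same string; the variable names n and tail are assumptions added for the example.

```python
s = 'Did you say shrubberies? '

# s[i:j] contains j - i characters when 0 <= i <= j <= len(s)
b = s[8:11]
print(len(b))                # 3

# Splitting at any index n and rejoining gives back the original string
n = 3
print(s[:n] + s[n:] == s)

# Slice bounds past the end are clipped instead of raising an error
tail = s[12:1000]
print(repr(tail))
```

Unlike a single out-of-range index, an out-of-range slice bound does not raise an IndexError.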
STRINGS ARE IMMUTABLE
You can only build new strings; you CANNOT modify an existing one. You can, however, rebind the name to a new string. For example:
name = 'Superman'
name[0] = 's'            # generates a TypeError
name = 's' + name[1:]    # this works
print name
superman
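The following sketch shows the error being caught rather than stopping the program. The names changed and new_name are illustrative only.

```python
name = 'Superman'

# Item assignment on a string raises TypeError
try:
    name[0] = 's'
    changed = True
except TypeError:
    changed = False
print(changed)

# Building a new string from a literal and a slice leaves the original intact
new_name = 's' + name[1:]
print(new_name)
print(name)
```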
METHODS VS FUNCTIONS
type.do_something()      # here do_something is a method
'Hello'.upper()          # returns 'HELLO'
value.isdigit()          # returns True if all characters are digits
name.startswith('Har')   # returns True if name begins with 'Har'
do_something(type)       # here do_something is a function
Examples:
len("TATATATA")          # returns the length of a string
math.sqrt(34.5)          # returns the square root of 34.5 (requires import math)
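A runnable sketch of the calls listed above. The sample values '2024' and 'Harold' are assumptions chosen for illustration; they do not come from the slides.

```python
import math

value = '2024'      # hypothetical sample value
name = 'Harold'     # hypothetical sample value

# Methods: called on an object with dot notation
print('Hello'.upper())
print(value.isdigit())
print(name.startswith('Har'))

# Functions: the object is passed as an argument
print(len('TATATATA'))
root = math.sqrt(34.5)
print(root)
```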
STRING METHODS
Methods are similar to functions, except that they are called with a different syntax: dot notation.
word = 'rabbit'
uword = word.upper()    # returns a copy of the string in upper case
Here word is the string, upper is the method, and the empty parentheses show that it takes no arguments.
There are many string methods. Here is another:
string.capitalize()     # returns a copy of the string with only its first character capitalized
FIND() A STRING METHOD
string.find(sub[, start[, end]])
Return the lowest index in the string where substring sub is found, such that sub is contained in the slice string[start:end]. Optional arguments start and end are interpreted as in slice notation. Returns -1 if sub is not found.
statement = "What makes you think she's a witch? Well she turned me into a newt"
index = statement.find('witch')
print index
index2 = statement.find('she')
print index2
index3 = statement.find('she', index)
print index3
>>>
29
21
41
>>>
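The optional start argument makes it possible to collect every occurrence of a substring by restarting the search just past the previous hit. This is a minimal sketch; the helper name find_all is hypothetical and is not a built-in string method.

```python
def find_all(text, sub):
    """Return a list of every index at which sub starts in text."""
    positions = []
    index = text.find(sub)
    while index != -1:
        positions.append(index)
        # Resume the search one character past the last match
        index = text.find(sub, index + 1)
    return positions

statement = "What makes you think she's a witch? Well she turned me into a newt"
print(find_all(statement, 'she'))
print(find_all('TATATA', 'TA'))
```

Restarting at index + 1 rather than index + len(sub) means overlapping matches are also reported.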
THE IN OPERATOR WITH STRINGS
The word in is a boolean operator that takes two strings and returns True if the first appears as a substring in the second.
>>> 'a' in 'King Arthur'
False
>>> 'Art' in 'King Arthur'
True
What does this function do?
def mystery(word1, word2):
    for letter in word1:
        if letter in word2:
            print letter
It prints the letters of word1 that also occur in word2.
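A variant of the mystery function that returns its result instead of printing it can be easier to experiment with. This is only a sketch; the name common_letters and the sample words are assumptions made for the example.

```python
def common_letters(word1, word2):
    """Return the letters of word1 that also appear in word2, in order."""
    result = []
    for letter in word1:
        if letter in word2:
            result.append(letter)
    return result

print(common_letters('apples', 'oranges'))
```

A letter that occurs twice in word1 is reported twice, just as the printing version would print it twice.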
MORE EXAMPLES
>>> 'TATA' in 'TATATATATATA'
True
>>> 'AA' in 'TATATATATATATATA'
False
>>> 'AC' + 'TG'
'ACTG'
>>> 5 * 'TA'
'TATATATATA'
>>> 'MNKMDLVADVAEKTDLS'[1:4]
'NKM'
>>> 'MNKMDLVADVAEKTDLS'[8:-1]
'DVAEKTDL'
>>> 'MNKMDLVADVAEKTDLS'[-5:-4]
'K'
>>> 'MNKMDLVADVAEKTDLS'[10:]
'AEKTDLS'
>>> 'MNKMDLVADVAEKTDLS'[5:5]
''
>>> 'MNKMDLVADVAEKTD'.find('LV')
5
STRING COMPARISONS
The relational operators also work on strings.
if word == 'bananas':
    print 'Yes I want one'
Put words in alphabetical order:
if word1 < word2:
    print word1, word2
else:
    print word2, word1
NOTE: In Python all upper case letters come before all lower case letters, i.e. 'Hello' comes before 'hello'.
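Because upper case letters sort first, a plain comparison puts 'Zebra' before 'apple'. One common workaround, sketched here on the assumption that only plain ASCII words are compared, is to compare lower-cased copies. The helper name in_order is hypothetical.

```python
def in_order(word1, word2):
    """Return the two words as a pair in case-insensitive alphabetical order."""
    if word1.lower() < word2.lower():
        return word1, word2
    return word2, word1

# Plain comparison: upper case 'Z' sorts before lower case 'a'
print('Zebra' < 'apple')

# Comparison of lower-cased copies ignores case
print(in_order('Zebra', 'apple'))
```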
LET'S DOWNLOAD A BOOK AND ANALYZE IT
Go to http://www.gutenberg.org/ and download the first edition of On the Origin of Species by Charles Darwin. Be sure to download the plain text version and save it as oots.txt
(http://www.gutenberg.org/files/1228/1228.txt)
This little program will read in the file and print it to the screen.
file = open('oots.txt', 'r')    # open for reading
print file.read()
NOTE: file.read() reads the entire file into memory and returns it as one string. The name file refers to the open file object, not to the text itself.
See: http://www.pythonforbeginners.com/systems-programming/reading-and-writing-files-in-python/
I DON’T WANT THE WHOLE FILE!
The readline() method reads from a file one line at a time (rather than pulling in the entire file at once).
Use readline() when you want the first line of the file; subsequent calls to readline() return successive lines.
Each call reads a single line from the file and returns a string containing the characters up to and including \n.
# prints first line
file = open('newfile.txt', 'r')
print file.readline()
# prints first 100 lines
file = open('oots.txt', 'r')
for i in range(100):
    print file.readline()
# prints first line, storing it in a variable first
file = open('newfile.txt', 'r')
line = file.readline()
print line
# prints entire file using a for ... in loop
file = open('oots.txt', 'r')
for line in file:
    print line
DOES THE ORIGIN HAVE THE WORD EVOLUTION IN IT?
# Searches for the word 'evolution' in the file. This program checks
# every line individually, which saves space compared with reading
# the entire book into memory.
file = open('oots.txt', 'r')
for line in file:
    if line.find('evolution') != -1:    # find returns -1 if the word is not in the line
        print line
print 'done'
# Is this true of the 6th edition? Check it out.
What if we want to know which line the string occurs in?
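One possible answer, sketched with hypothetical names: count the lines while looping over them. The function lines_containing and the three-line sample list are assumptions made for this example. With a real book, the open file object can be passed in place of the list, for example lines_containing(open('oots.txt', 'r'), 'evolution').

```python
def lines_containing(lines, word):
    """Return the 1-based numbers of the lines in which word occurs."""
    hits = []
    # enumerate(..., 1) pairs each line with a counter starting at 1
    for number, line in enumerate(lines, 1):
        if word in line:
            hits.append(number)
    return hits

# Stand-in for a file: any sequence of lines works
sample = ['On the origin of species\n',
          'by means of natural selection\n',
          'the origin again\n']
print(lines_containing(sample, 'origin'))
```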
LET'S DOWNLOAD SOME DNA
Where do we get DNA? http://en.wikipedia.org/wiki/List_of_biological_databases contains a nice list.
Let's use this one: http://www.ncbi.nlm.nih.gov/
Under Nucleotide, type in Neanderthal and download KC879692.1 (it was the fifth one in my search). This is the entire mitochondrial sequence for a Neanderthal found in the Denisova cave in the Altai mountains.
Here it is:
http://www.ncbi.nlm.nih.gov/nuccore/KC879692.1
Here is the Denisovan mitochondrial sequence.
STRIPPING THE ANNOTATION INFO
The annotation info for a GenBank file is everything written above the ORIGIN line. Let's get rid of this stuff using a flag variable.
file = open('neanderMito.gb', 'r')
fileout = open("stripNeander.txt", "w")
# This code strips all lines above and including the ORIGIN line
# It uses a flag variable called originFlag
originFlag = False
for line in file:
    if originFlag == True:
        print line,                  # the comma suppresses the extra newline
        fileout.write(line)
    if line.find('ORIGIN') != -1:    # once the ORIGIN line has been seen, set the flag
        originFlag = True            # so that all later lines go to the output file
fileout.close()                      # an absolute requirement: flushes the buffer to disk
STRIPPING THE ANNOTATION INFO 2
The annotation info for a GenBank file is everything written above the ORIGIN line. Another method:
file = open('neanderMito.gb', 'r')
fileout = open("stripNeander.txt", "w")
line = file.readline()
while line and not line.startswith('ORIGIN'):    # skip up to ORIGIN (stop at end of file)
    line = file.readline()
line = file.readline()                           # first line after ORIGIN
while line and not line.startswith('//'):        # '//' marks the end of the record
    print line,
    fileout.write(line)
    line = file.readline()
fileout.close()                                  # an absolute requirement: flushes the buffer to disk
startswith is another string method. Look it up!
NOW WE HAVE
1 gatcacaggt ctatcaccct attaaccact cacgggagct ctccatgcat ttggtatttt
61 cgtctggggg gtgtgcacgc gatagcattg cgagacgctg gagccggagc accctatgtc
121 gcagtatctg tctttgattc ctgccccatc ctattattta tcgcacctac gttcaatatt
181 acagacgagc atacctacta aagtgtgtta attaattaat gcttgtagga cataataata
241 acgattaaat gtctgcacag ccgctttcca cacagacatc ataacaaaaa atttccacca
301 aacccccccc ctccccccgc ttctggccac agcacttaaa catatctctg ccaaacccca
361 aaaacaaaga accctaacac cagcctaacc agatttcaaa ttttatcttt tggcggtata
421 cacttttaac agtcaccccc taactaacac attattttcc cctcccactc ccatactact
481 aatctcatca atacaacccc cgcccatcct acccagcaca caccgctgct aaccccatac
541 cccgagccaa ccaaacccca aagacacccc ccacagttta tgtagcttac ctcctcaaag
We want to get rid of the numbers and spaces. How does one do this?
What types of characters are left in this file?
Digits, the letters a, c, t, g, spaces, and newlines (CRs).
SO LET'S STRIP EVERYTHING BUT A,C,T,G
file = open("stripNeander.txt", "r")
fileout = open('neanderMitostripped.txt', 'w')
# This code strips all characters but a,c,t,g
for line in file:
    for char in line:
        if char in ['a','c','t','g']:    # I'm using a list here
            fileout.write(char)
fileout.close()
What is in the output file now?
One very long line, i.e. there are NO spaces or newlines.
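The same filtering can be tried without touching any files. This sketch tests membership in the string 'acgt' instead of a list and uses join to build the result. The helper name keep_bases is hypothetical, and the two sample lines are copied from the sequence listing shown earlier.

```python
def keep_bases(text):
    """Return only the characters a, c, g and t from text."""
    return ''.join(char for char in text if char in 'acgt')

# Two sample lines in the numbered, space-separated layout
lines = ['1 gatcacaggt ctatcaccct\n',
         '61 cgtctggggg gtgtgcacgc\n']

sequence = ''
for line in lines:
    sequence = sequence + keep_bases(line)
print(sequence)
print(len(sequence))
```

Since the digits, spaces and newlines are all dropped, the two lines come out as one unbroken run of 40 bases.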
WHAT IF WE WANT TO DO ALL THIS ON A LOT OF FILES?
The easiest way is to turn the previous processing into a function. Then we can call the function on each file.
# This function strips all characters but a,c,t,g from the file called name and returns them as a string
def stripGenBank(name):
    file = open(name, "r")
    sequence = ''
    originFlag = False
    for line in file:
        if originFlag == True:
            for char in line:
                if char in ['a','c','t','g']:     # I'm using a list here
                    sequence = sequence + char    # attach the new char on the end
        if line.find('ORIGIN') != -1:
            originFlag = True
    file.close()
    return sequence
print stripGenBank('neanderMito.gb')
LET'S COMPARE NEANDERTHAL WITH DENISOVAN
neander = stripGenBank('neanderMito.gb')
denison = stripGenBank('denosovanMito.gb')
index = 0                 # stays 0 if no difference is found
for i in range(10000):    # assumes both sequences have at least 10000 bases
    if neander[i] != denison[i]:
        print '\nFiles first differ at location ', i
        index = i + 1
        print 'Neanderthal is ', neander[i], ' and Denisovan is ', denison[i]
        break
print neander[:index]     # dump up to where they differ
print denison[:index]
THE OUTPUT OF THIS COMPARISON
Files first differ at location 145
Neanderthal is c and Denisovan is t
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtcgcagtatctgtctttgattcctgccc
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctggagccggagcaccctatgtcgcagtatctgtctttgattcctgcct
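The comparison loop scans a fixed 10000 positions. Here is a minimal sketch of the same idea wrapped in a function that stops at the end of the shorter sequence. The name first_difference and the short sample strings (the last nine characters of each output line above) are illustrative and not part of the original slides.

```python
def first_difference(seq1, seq2):
    """Return the first index at which the sequences differ, or -1 if
    they agree over the whole length of the shorter one."""
    for i in range(min(len(seq1), len(seq2))):
        if seq1[i] != seq2[i]:
            return i
    return -1

neander_tail = 'ttcctgccc'    # sample ending of the Neanderthal prefix
denison_tail = 'ttcctgcct'    # sample ending of the Denisovan prefix
position = first_difference(neander_tail, denison_tail)
print(position)
print(neander_tail[position] + ' vs ' + denison_tail[position])
```

Returning -1 when no difference is found mirrors the convention used by find.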