python for biomedicine - uefwhamalai/pybio/pybio_ravantti_lectures_v1.0.pdf · python-packages for...

Post on 23-Jun-2020


Python for Biomedicine
janne.ravantti@helsinki.fi

Basic information

Time: spring term 2019, 5 contact sessions plus self-study, lasting about 6 weeks.

Contact teaching: 15.4. 4h (12-16), 16.4. 4h (10-14), 29.4. 2h (12-14), 9.5. 4h (12-16), 10.5. 4h (10-14).

The deadline for project work will be at the end of May or the beginning of June.

Teachers:

Lecturer: Janne Ravantti, adjunct professor, Faculty of Medicine, UH

Assistants: Wilhelmiina Hämäläinen & Vittorio Fortino, Institute of Biomedicine, UEF

This is a beginners class...

Today & tomorrow

● the course goals
● how to pass the course
● why this course might be good for you
● what is programming in general
● working Python-programming environment installed
● basics of Python-programming
  ○ “theory”
  ○ the first program
  ○ exercises
  ○ more “theory”

Ask questions!

Bioinformatics & Python?

[Figure: a bar labelled “Python” spanning the whole range of: computer science, statistics and mathematics; bioinformatics methods development; biology; processing biological data]

Programming 1/2

Programming requires a peculiar way of thinking (but it can be learned!)

Programming 2/2

Good* way to learn programming is to program!

*The Best?

Programming & bioinformatics

Goodness of your program is (mostly) defined by the biological question*

*Wilhelmiina might disagree...

Opinionated tips for programming

● Start small (e.g. not aligning 1000-genomes humans!) and one step at a time

● Don’t worry (about errors) (too much - testing is important, but...)

● Think! What...:

○ is the biological question?

○ is the data?

○ the program is supposed to do (methods, algorithms, ...)?

○ input (DNA-sequence? Set of RNA-seq data, names of plants, …)

○ can go wrong => then what (disk full, memory full, bad methods, too little data, ...)?

● Learn to save your code (naming, locations, even something like git)

Caveats

● Everything changes...
  ○ Data (WXS => WGS => WGBS; RNA-seq, …; HG37 vs. HG38...)
  ○ Methods (bowtie => bowtie2 => bwa mem => minimap2 => …)
  ○ Links go stale (404 Not Found)
  ○ Python 2.7 => 3.6+
  ○ Python-libraries (Standard library, Biopython, ...)
  ○ Operating systems / platforms
  ○ System libraries

=> Do not get stuck with the old unless absolutely necessary, but don’t worry too much about newest trends!

Learning to program depends on you!

(The best way to learn programming is to program...)

Please, ask questions! We can pace the course according to your preferences

Learning outcomes

● to make the computer work for you!
● understand what kind of data processing tasks can be automated and how
● a working programming environment that can be used with your own research

Advantages

● Study credits :)

● Automate your own data cleaning, analysis and reporting

● After understanding the general ideas behind programming languages, new languages (R, java, …) are easier to learn and use

● Jobs - also from other fields - biomedicine and biosciences are nowadays naturally “Big Data” and “Data Science”-oriented

Questions?

Programming

● A program is only(...) a detailed “recipe” for what the programmable object (computer, phone, refrigerator, car, …) must (should…) do
● E.g. instructions for a robot to warm up dinner:
  1. Put pizza on the plate
  2. Put plate into the microwave
  3. Warm up for a minute
  4. If pizza is not warm, go to step 3
  5. Feed to Master
● Is this detailed enough? What could go wrong?

Basic programming concepts 1/3

● Program is collection of commands / rules (“statements”) to the computer (here: with Python-interpreter)
● Commands in the different programming languages can look quite different, but in the end they do same things:
  ○ Data input / output (“read, write”)
  ○ Internal data handling: variables and data structures (“list, dictionary, ...”) and their manipulation (“operators”)
  ○ Control flow statements
    ■ Conditional statements (“if then else”)
    ■ Loops (“for, while, goto, ...”)
  ○ Functions / subroutines / methods (“sqrt(x), sin(x),...”)

Basic programming concepts 2/3

● A programming language’s syntax (i.e. rules and regulations) causes problems in the beginning (and later in life…)
  ○ Capital and small letters matter (“print” vs. “Print”)
  ○ Commas (“,”) matter more than in natural languages
● Errors can come from
  ○ Syntax => program doesn’t even start
  ○ Program(mer)’s logic => program might stop (“crash”) during the run or do the wrong things

=> which one is worse?

https://en.wikibooks.org/wiki/Computer_Programming/Hello_world

Basic programming concepts 3/3

● Programming requires its own(?) way of thinking
● Think before writing even one line
  ○ What information / data does the program need?
  ○ What should the program do with the input and what should it produce as output?
  ○ Is the input/output coming from / going to a file (or files)?
  ○ What separate (recurring) parts does the program have?
● What could go wrong? What to do then?
● Especially larger programming projects should be split into smaller pieces (in your mind, on paper, …) => “divide and conquer”

Exercise

In your own words, how would you make a program that reads all text files in a folder on the computer and reports those in which the string “TTAGGG” occurs(*)?

What information does the program need as input?

How would you report the files?

How should the program proceed?

Any repeatable parts?

What can go wrong? What then?

(*)https://en.wikipedia.org/wiki/Microsatellite

Documentation

● Should have instructions on how to run the program, without requiring the reader to understand the inner workings of the program
● Should describe how the program works, regardless of the language used
● Should help other programmers to understand step-by-step how the program works

● Documentation is not only the program listing!

● Has to be clearly and correctly written

Questions?

Python programming language

● Named after the TV show Monty Python’s Flying Circus
● Two main versions
  ○ 2.7+: older, still in use, no new development
  ○ 3.x: newer, actively developed, but a stable language

● There is a lot of learning material based on version 2.x. Be careful out there

● Python 3.7 is used in this course

https://docs.python.org/3/tutorial/index.html

Python’s characteristics

● Easy to read (for a programming language...)

● Open Source, available to all kinds of machines / platforms

● Large standard library (“Batteries included” philosophy)

● Lots of material/libraries/modules in the Internet (web-programming, graphics, bioinformatics, …)

● Fast enough for scientific computing
  ○ large numerical and scientific libraries
  ○ can be used with other (faster) languages (“C”, “C++”, …)

Python in the wild

● In teaching...
● Scientific computing
  ○ Bioinformatics (e.g. Biopython-package - later!)
  ○ Machine learning
  ○ Visualization
  ○ Robotics
  ○ “Big Data”, “Data Science”
● System administration
  ○ ILM, Weta, Rackspace,...

● Mobile programming (e.g. http://kivy.org/docs/gettingstarted/intro.html)

Anaconda / Python

● Anaconda is Continuum Analytics’ Python-distribution package

● Works in Mac- / Win- / Linux-environments

● Contains over 330 (later 720-something...) Python-packages for all kinds of programming tasks

● Free version also for commercial use

● Smaller installation package “miniconda”

● conda-program has become a general package manager e.g. for bioinformatics’ programs (samtools, bwa, emboss, ...)

Let’s install Anaconda!

https://www.anaconda.com/

Python programming environments

● Many, many, many options
  ○ IDLE = Integrated DeveLopment Environment
  ○ Spyder
  ○ Jupyter (lab)
● Interactive command interpreter
  ○ Takes commands (1+1, print("ATGC"), ...)
  ○ Shows results of the commands / programs
  ○ Prints contents of the data structures and variables
● There is usually also an editor for writing and editing programs
  ○ Use one with syntax highlighting / coloring
● Let’s try...

Python vs. IDE

● Many, many, many options
  ○ IDLE = Integrated DeveLopment Environment
  ○ Spyder
  ○ Jupyter (lab)
● Python itself understands only text files, reads them and executes them line-by-line. Everything else is made to make programming easier

● In this course, we are learning Python!

The first program

Let’s program the machine to greet us!

In the command line

Write the following and press Enter:

print("What is thy bidding my master?")

In the IDE (idle3, Spyder, jupyter, …):

Write the same print-command and select e.g. Run -> Run Module

Save the program e.g. as “greeting.py”

The first program - cont’d

Saved programs can naturally be opened, edited and rerun

The “#” character starts a comment, i.e. free text that the Python-interpreter does not care about, for example:

# this is my first program
print("What is thy bidding my master?")

Reasonable and informative comments help to understand the code/program later!

Saving programs

● As with any writing - save often!

● Use informative filenames (“test.py” vs. “calculate_GC_ratio.py”)

● Understand the difference between files saved by default in IDEs (e.g. jupyter notebook) vs. text file containing the actual Python-code (usually “something.py” - file)

● The Python-code is portable and can be run anywhere with the same(ish) Python-environment

Exercise

Make a program that prints the following:

Nucleobases in DNA are:

* cytosine [C]
* guanine [G]
* adenine [A]
* thymine [T]

Remember to save your program!

Error messages

● The interpreter will complain about possible errors
● Start time errors
  ○ Typos (missing parenthesis, commas, …)
  ○ Uninitialized variables (will come back to this later)
● Runtime errors
  ○ Division by zero
  ○ Type errors (will come back to this later)

● Understanding error messages can be difficult - patience!
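To get used to reading error messages, it can help to trigger a few errors on purpose. The sketch below is not from the slides: the helper name describe_error is made up for illustration, and it uses eval and try/except, which the lectures have not introduced at this point. It only reports which kind of error Python raises for a given piece of code.

```python
def describe_error(expression):
    """Evaluate a Python expression given as text and return the error type name."""
    try:
        eval(expression)
    except Exception as error:
        # type(error).__name__ is the first word of the error message Python prints
        return type(error).__name__
    return "no error"


print(describe_error("print('ATGC'"))    # SyntaxError: missing closing parenthesis
print(describe_error("gene_count + 1"))  # NameError: gene_count was never assigned
print(describe_error("1 / 0"))           # ZeroDivisionError
print(describe_error('"a" + 3.14'))      # TypeError: string plus float
print(describe_error("1 + 1"))           # no error
```

The last line of a real error message has the same structure: the error type first, then a short explanation.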

This far...

● Introduction to programming & Python
● Some ideas how programming problems can be approached e.g. with “divide & conquer”

● You have a working Python-programming environment installed

● You have written your first program!

Questions?

Exercise

Describe at a general level a program that reads two Genbank-formatted(*) files and counts how many genes they have in common

What input does the program need & what output does it produce?

How does the program know if two genes are the same?

What repeatable parts does the program have?

What can go wrong? Then what?

(*)http://scikit-bio.org/docs/0.5.2/generated/skbio.io.format.genbank.html

(*)http://www.insdc.org/files/feature_table.html

Exercise

Make a program that prints the following:

*-------*
|       |
|       |
|       |
*-------*

Variables

● Programming can be thought of as manipulating the input into output, using statements and variables as temporary holders of internal state
● Variables are “boxes” that hold different kinds of information. In Python, a variable always has some type
  ○ Integer (1, 5, -100)
  ○ Float (3.14, -1.06e-10)
  ○ Character / String ("ATGGGA", "1234567890")
  ○ Boolean (True, False)
● Naming is case sensitive but quite free. Sensible variable names make more sense especially later (“xxx_ver_001” vs. “total_sum”)

Assignment (statement) 1/2

● Equal sign (“=”) assigns value (right hand side) to a variable (left hand side)
● Variables can be combined
  ○ a = 1
  ○ b = 2
  ○ c = a + b
● Variables’ values can be printed with print, separated by commas
  ○ print(a, b, c)
● Arithmetic operations (“+”, “-”, “/”, “*”) work for numerical values; “+” and “*” also work for characters and strings
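A minimal sketch of the operators above, applied to both numbers and strings. The variable names other than a, b and c are made up for this illustration.

```python
a = 1
b = 2
c = a + b            # integer addition
print(a, b, c)       # prints: 1 2 3

ratio = 7 / 2        # "/" always gives a float in Python 3
print(ratio)         # prints: 3.5

codon = "AT" + "G"   # "+" joins strings
repeat = 3 * "TA"    # "*" repeats a string
print(codon, repeat) # prints: ATG TATATA
```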

Assignment (statement) 2/2

● Equal sign (“=”) assigns value (right hand side) to a variable (left hand side)

● Read assignment “a = b + 2” as: “a gets the value b plus two”, not: “a equals b plus two”

● The reason is that “a = a + 2” makes little sense mathematically, but makes sense when read as: “assign a new value to a, which is the sum of the previous value of a and two”

● Equality is tested with operator “==” (more later)

Operators

● Operators do “something” to variables
● Most common operators are
  ○ arithmetic operators (“+”, “-”, “*”, “/”)
  ○ comparison operators (“>”, “<”, “==”)
  ○ logical operators (“and”, “or”, “not”)

● We’ll return to these later...

Exercise

● Make a program that calculates how many exons transcript variants X1, X2 and X3 have, when
  ○ X1 has 23 exons
  ○ X2 has four exons fewer than X1
  ○ X3 has half the exons of X2
● Use assignments and variables in calculations

Type conversions

● Combining different types of variables can give surprising results or (semi-)cryptic error messages ("a"+3.14, 10.0+1, 10*"TA")
● The type of a value can be converted (if it makes sense)
  ○ int(a) => the value of a converted to an integer
  ○ float(a) => the value of a converted to a float
  ○ str(a) => the value of a converted to a string
● E.g. (let’s try!)

character_1 = "1"
number_1 = int(character_1)
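The conversions above can be tried out with a short sketch like the following. The variable names total, mixed, repeated and text are assumptions made for this example only.

```python
character_1 = "1"
number_1 = int(character_1)   # the string "1" becomes the integer 1
total = number_1 + 1          # now arithmetic works
print(total)                  # prints: 2

mixed = 10.0 + 1              # float + integer gives a float
print(mixed)                  # prints: 11.0

repeated = 10 * "TA"          # integer * string repeats the string
print(len(repeated))          # prints: 20

# "a" + 3.14 would stop with a TypeError; convert the number first
text = "a" + str(3.14)
print(text)                   # prints: a3.14
```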

Reading input

● Python interpreter reads user input with input-command(*) as a string
● input takes an argument that will be shown to the user (e.g. a question)
● input returns a string, which can be assigned directly to a variable, but the type might need a conversion
● Examples
  ○ gene_name = input("Enter gene name: ")
  ○ genome_length = int(input("Enter genome length: "))

(*) “input” is actually a function, but more later...

Exercises

● Make a program that asks for two gene lengths and prints out their mean
  ○ Ask for the lengths one by one
● Make a program that calculates an area, given height and length

Test your programs with different inputs - what can go wrong?

Intermission

● If you have extra time in the lectures (& tonight)
  ○ https://www.practicepython.org/
● Libraries (e.g.)
  ○ https://docs.python.org/3/library/index.html
  ○ https://matplotlib.org/
  ○ https://pydata.org/
    ■ https://pandas.pydata.org/
    ■ https://seaborn.pydata.org/index.html
    ■ https://biopython.org/
    ■ ...

This far you can...

● Input data to your program

● Manipulate data using variables and operators

● Print out variables and/or results

Questions?

Programming...

● Program is collection of commands / rules (“statements”) to the computer (here: with Python-interpreter)
● Commands in the different programming languages can look quite different, but in the end they mostly do same things:
  ○ Data input / output (“read, write”)
  ○ Internal data handling: variables and data structures (“list, dictionary, ...”) and their manipulation (“operators”)
  ○ Control flow statements
    ■ Conditional statements (“if then else”)
    ■ Loops (“for, while, goto, ...”)
  ○ Functions / subroutines / methods (“sqrt(x), sin(x),...”)

Conditional statements 1/4

● Control structures direct the order of execution of the statements in a program
● In Python, if statement handles the decisions
● if understands only boolean types True and False
● The colon (“:”) ends the if-statement - do not forget it!

if statement_is_True:
    do_something

Conditional statements 2/4

● The statement is whatever “thing” that produces a True/False result, e.g.

a = 1
b = "Hello"
c = False

a > 2         (False)
b == "Hello"  (True)
c == True     (False)

Conditional statements 3/4

● Statements can be constructed using constants, variables and operators
  ○ Arithmetic operators (+, -, *, /, %)
  ○ Comparison operators (==, <, >, !=, >=, <=)
  ○ Logical operators (and, or, not)

E.g.

if a + b > 5 and c == 5:
    do_something

if not c == 5:
    do_something

Conditional statements 4/4

● If the statement is True, then the lines in the same block are executed

if statement_is_true:
    do_something
    do_something_more
continue_the_program

The indentation marks the code block.

Code blocks

● Blocks group together (one or more) statements (in if-statements, but also in loops, functions etc.) that are executed in the same part of the program.
● The block is marked with indentation, which must be consistent: use either tab characters or spaces (the official style guide, PEP 8, recommends four spaces) - especially do not mix both! Editors help with indentation

if a > 100:
    a = a / 2
    b = a + 10
print(a, b)
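A runnable version of the block above, as a sketch with an assumed starting value of 250 for a. One addition that is not on the slide: b is given a starting value before the if-statement, because otherwise print(a, b) would stop with a NameError whenever a is 100 or less and the indented block is skipped.

```python
a = 250   # assumed example value
b = 0     # initialized so that b exists even if the block below is skipped

if a > 100:
    # these two lines form one block: both run, or neither runs
    a = a / 2
    b = a + 10

# not indented, so this line runs in every case
print(a, b)   # prints: 125.0 135.0
```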

Exercise

Make a program that asks for three genome lengths and prints the longest one. If the length is zero or negative, print a warning.

What could go wrong? How?

Conditional statements - more

● If-statement understands only boolean values True and False
● If statement is True, then the following code block is executed
● else-statement with colon (“:”) can follow the if-statement’s code block. If the original statement was False, then else code block is executed

if statement_is_True:
    do_something
else:  # <= the statement was False
    do_something_else
continue_the_program

Conditional statements - even more...

● If-statement can also have extra elif-statements
● elif == “else if”

if statement_1_is_True:
    do_something
elif statement_1_is_False_and_statement_2_is_True:
    do_something_else
else:  # <= all previous statements False
    do_something_else_2
continue_normal_program
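As a concrete sketch of an if / elif / else chain, the example below labels a GC percentage. The thresholds (40 and 60) and the labels are arbitrary values chosen for illustration, not biological standards.

```python
gc_percent = 62.5   # assumed example value

if gc_percent < 0 or gc_percent > 100:
    label = "invalid"       # a percentage outside 0-100 makes no sense
elif gc_percent < 40:
    label = "AT-rich"
elif gc_percent > 60:
    label = "GC-rich"
else:
    # reached only when all the conditions above were False
    label = "balanced"

print(gc_percent, label)    # prints: 62.5 GC-rich
```

Only the first branch whose condition is True is executed; the remaining elif and else branches are skipped.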

Exercise

● Make a program that asks the user for a number and prints out whether the number is even or odd
● Use if- and else-statements
● The % (modulo) operator yields the remainder from the division of the first argument by the second

Tips for programming

● Divide & conquer - split program into smaller, understandable pieces - even before writing a single line!

● Name programs and variables with reason

● Always initialize (i.e. give starting value) all variables

● Smart commenting pays off!

● Print intermediate results when in doubt - those can be commented out later

Exercise

Make a program that asks the user for three numbers in any order, subtracts 5 from the largest, then adds 8 to the smallest, and finally prints the old and new numbers from smallest to largest

What is the output with input:

2, 4, 8

10, 11, 11

This far...

● I/O (Input/Output)

● Comparisons

Questions?

Strings 1/2

● Python has many, many, many tools (operators, functions, methods) to manipulate strings

E.g.

gene = "AUG" + "AATGAATCTGGA" + "TAG"
print(len(gene))         # length
print(gene.find("TGA"))  # index of the first "TGA"
if "TAG" in gene:
    print("Stop codon found!")

https://docs.python.org/3/library/stdtypes.html#string-methods

Strings 2/2

● Remember case sensitivity!

codon = "ACT"
if codon == "act":
    print("Threonine found")

The “in” operator tests whether the exact substring is found in the other string:

if "ct" in codon:
    print("CT found")

Exercise

Make a program that asks for two nucleotide sequences and prints the longer one, and also the shorter one, but only if it is not found in the longer one.

From strings to lists 1/2

● Strings are immutable lists containing characters
● Lists are (“linear”) collections of any kind of data not unlike lockers
● Length of the list tells how many individual lockers there are
● E.g. characters of word “protein” can be stored in seven different lockers

“locker”: |p|r|o|t|e|i|n|
           0 1 2 3 4 5 6

From strings to lists 2/2

● Each character in the string can be accessed using indices and brackets “[”, “]”
● Every element in a list has its own “serial number” (index)
● The only(?) tricky part is that indexing starts at 0 (zero)

E.g.

gene = "titin"
len(gene)  # 5, the length of the word, not the protein!
gene[0]    # the first letter "t"
gene[4]    # the last letter "n"

Strings, lists and indices

● The only(?) tricky part in indexing is to remember to start from 0 (zero) and that the last position is length of the list - 1
● Negative index means counting from the end
  ○ -1 is the index of the last element

E.g.

gene = "titin"
gene[-1]                    # the last letter "n"
last_index = len(gene) - 1
gene[last_index]            # the last letter again

Exercise

Make a program that asks for the name of a gene and prints out its first, middle and last letter

Lists & slices

● Lists (and strings) in general can be sliced, i.e. a part of the list can be returned
● General form (“syntax”) is: my_list[start:end:step], where start can be omitted if the slice begins at index 0, and end can be omitted if the slice runs to the end of the list
● Note that the slice ends at index end-1!

E.g.

sequence = "ATTTGTAAAGTCCCCCG"

sequence[1:2]  # returns "T"
sequence[1:3]  # "TT"
sequence[7:]   # "AAGTCCCCCG"
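A few more slices of the same sequence, including the step part of the syntax, as a small sketch. The variable names on the left are made up for this example.

```python
sequence = "ATTTGTAAAGTCCCCCG"

first_codon = sequence[:3]         # start omitted: begins at index 0
tail = sequence[7:]                # end omitted: runs to the end
last_three = sequence[-3:]         # negative start counts from the end
every_third = sequence[::3]        # step 3: indices 0, 3, 6, 9, 12, 15
reversed_sequence = sequence[::-1] # step -1 walks backwards

print(first_codon)        # prints: ATT
print(tail)               # prints: AAGTCCCCCG
print(last_three)         # prints: CCG
print(every_third)        # prints: ATAGCC
print(reversed_sequence)  # prints: GCCCCCTGAAATGTTTA
```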

Lists 1/3

● List index range is from 0 to len(list)-1
● Lists can contain any data types freely
  ○ collection = ["A", 400, "3.14", 4.6, "Hello World", [1, 2, 3]]
● append-method adds elements to the end of the list
  ○ hiv_genes = ["gag", "pol", "env", "tat", "rev", "nef", "vpr", "vif"]
  ○ hiv_genes.append("vpu")
    => ["gag", "pol", "env", "tat", "rev", "nef", "vpr", "vif", "vpu"]
● append does not work with strings!

Lists 2/3

● Strings are immutable and single characters cannot be changed individually, i.e. think of them as constants like the integer 123
● List elements can be manipulated freely e.g. using indices

hiv_genes = ["gag", "pol", "env", "tat", "rev", "nef", "vpr", "vif", "vpu"]
hiv_structural_proteins = hiv_genes[:3]  # ["gag", "pol", "env"]
hiv_accessory = hiv_genes[5:]

# change the name of "env" to "ENV"
hiv_genes[2] = "ENV"
print(hiv_genes)
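The difference between immutable strings and mutable lists can be seen in a short sketch. It uses try/except, which the lectures have not covered at this point, only so that the example keeps running after the expected error; the variable string_changed is made up for this illustration.

```python
gene = "titin"
try:
    gene[0] = "T"            # strings cannot be changed in place
    string_changed = True
except TypeError:
    string_changed = False   # Python raised a TypeError instead
print(string_changed)        # prints: False

# the usual workaround: build a new string and assign it to the same name
gene = "T" + gene[1:]
print(gene)                  # prints: Titin

# lists, in contrast, can be changed element by element
hiv_genes = ["gag", "pol", "env"]
hiv_genes[2] = "ENV"
hiv_genes.append("tat")
print(hiv_genes)             # prints: ['gag', 'pol', 'ENV', 'tat']
```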

Lists 3/3

● Python has many ways to manipulate lists
  ○ https://docs.python.org/3/tutorial/datastructures.html

E.g.

a = [1, 2, 3]
b = [5, 4, -11]
c = a + b      # => c = [1, 2, 3, 5, 4, -11]
d = sorted(c)  # => d = [-11, 1, 2, 3, 4, 5]
…
max(a)
min(a)

Exercise

Make a program that asks for three gene names and prints them in alphabetical order, each with the length of the gene name

This far...

● I/O (Input/Output)

● Comparisons

● Strings & their manipulation

● Compound data type List

Questions?

Programming...

● Program is collection of commands / rules (“statements”) to the computer (here: with Python-interpreter)
● Commands in the different programming languages can look quite different, but in the end they mostly do same things:
  ○ Data input / output (“read, write”)
  ○ Internal data handling: variables and data structures (“list, dictionary, ...”) and their manipulation (“operators”)
  ○ Control flow statements
    ■ Conditional statements (“if then else”)
    ■ Loops (“for”, “while”, “goto”, …)
    ■ Functions / subroutines / methods (“sqrt(x), sin(x),...”)

List traversal 1/2

● in-operator tests if variable (or constant) is found in the list

hiv_genes = ["gag", "pol", "env", "tat", "rev", "nef", "vpr", "vif", "vpu"]
if "GAG" in hiv_genes:
    print("GAG found!")

● Lists can be accessed element by element with for-statement (loop!)

for gene in hiv_genes:  # remember the ":"
    print("found gene:", gene)

● Syntax is the same as in if-statements

List traversal 2/2

● It is often useful to access a list element by element with an index (variable)
● Indices can be generated with range-function
● Syntax: range(start, end, step). N.B. start, end and step work the same way as in slicing lists and strings
● Start is by default 0 (zero) and step is 1, so range(10)==range(0,10,1)

E.g.

my_list = [1, -1, 5, 6, 10, 33, 1, 2.0, 3, 19, -2, 0, 0, "END"]
for index in range(0, 9, 2):
    print(index, my_list[index])  # what are we printing?
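To see which indices range produces, it can be wrapped in list(), as in this sketch. The variable names are assumptions for the example.

```python
first_ten = list(range(10))          # start defaults to 0, step to 1
even_indices = list(range(0, 9, 2))  # end (9) itself is not included
countdown = list(range(5, 0, -1))    # a negative step counts downwards

print(first_ten)     # prints: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(even_indices)  # prints: [0, 2, 4, 6, 8]
print(countdown)     # prints: [5, 4, 3, 2, 1]

# a common pattern: visit every index of a list
bases = ["A", "C", "G", "T"]
for index in range(len(bases)):
    print(index, bases[index])
```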

Exercise

Make a program that traverses the previous hiv_genes list one element at a time and prints the list out like:

1. first_gene_name
2. second_gene_name
...

Nested loops

● for-statements can be used like any other commands inside loops - note the indentation!

for i in range(10):
    for j in range(10):
        print(i*j)  # what is the output?

oncoviruses = ["HBV", "HCV", "HTLV", "HPV", "HHV-8", "MCPyV", "EBV"]
for virus_1 in oncoviruses:
    for virus_2 in oncoviruses:
        print("Testing:", virus_1, "vs.", virus_2)  # output?

while-statement

● while-statement repeats associated code block as long as the specific condition is True
● General syntax is

while condition_True:
    do_something
    do_something_else
continue_the_program

E.g.

a = 1
while a < 10:
    print(a)
print("Loop done!")

while cont’d

● while-statement repeats associated code block as long as the specific condition is True
● If the condition never becomes False, the loop never ends (“infinite loop”)
● So...

a = 1
while a < 10:
    print(a)
    a = a + 1  # if this is what we want...
print("Loop done!")

Exercise

Make a program that asks for gene names one at a time until the user inputs “STOP”. After that, the program prints out all the names

Exercise

Make a program that prints out the multiplication table for two integers that the user inputs

E.g.

number_1 = 3
number_2 = 2
=>
1*1 = 1
2*1 = 2
3*1 = 3
1*2 = 2
...

This far...

● I/O (Input/Output)

● Comparisons

● Strings & their manipulation

● Compound data type List

● Loops for and while

Questions?

Extra exercise

Make a program that calculates and prints the row and column sums of a 3x3 matrix from user input

E.g.

1 2 3   =>   1   2   3 ; 6
3 5 -1  =>   3   5  -1 ; 7
7 5 2   =>   7   5   2 ; 14
            --- --- ---
             11  12   4

Input data and format the output any way you wish

If you have problems, at least use comments to sketch out the program flow

Dictionary data type / structure 1/2

● List is sequential data structure where elements are indexed
● Dictionary keeps data accessible using keys
● Syntax: my_dict = {key1:data1, key2:data2, ...}
● Assignment & accessing data: my_dict[key] = data

E.g.

hiv_gene_sequence = {}  # empty dictionary
hiv_gene_sequence["gag"] = "MGARASVLSGGELDRWEKIRLRPGGK…"
hiv_gene_sequence["pol"] = "FFREDLAFLQGKAREFSSEQTRANSPTRRE"
...

Dictionary data type / structure 2/2

● Key can be any immutable variable (lists will not do) or constant
● Dictionary can be accessed key-by-key with for-loop

for gene in hiv_gene_sequence:
    print(gene, hiv_gene_sequence[gene])

● in-operator tests, if a key exists

if "vif" in hiv_gene_sequence:
    print("vif gene sequence:", hiv_gene_sequence["vif"])
else:
    print("vif not found!")
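As a further sketch of dictionary look-ups, the example below translates a short DNA sequence codon by codon. The codon_table here is deliberately tiny (five entries of the standard genetic code), and the input sequence is made up for illustration; codons missing from the table are marked with "?".

```python
# a very small part of the standard genetic code: codon -> amino acid
codon_table = {"ATG": "Met", "TGG": "Trp", "TAA": "Stop", "TAG": "Stop", "TGA": "Stop"}

sequence = "ATGTGGCCCTAA"   # assumed example sequence
amino_acids = []

for start in range(0, len(sequence), 3):
    codon = sequence[start:start + 3]
    if codon in codon_table:                     # test the key before using it
        amino_acids.append(codon_table[codon])
    else:
        amino_acids.append("?")                  # codon not in our small table

print(amino_acids)   # prints: ['Met', 'Trp', '?', 'Stop']
```

Testing with in first avoids the KeyError that codon_table["CCC"] would otherwise raise.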

Exercise

Make a program that asks for protein sequences until “STOP” is encountered. After “STOP”, the program prints how many times each sequence was seen in the input.

What can go wrong? Why? What to do?

More on strings 1/3

● Python has many, many useful functions / methods for string-manipulation

(see: https://docs.python.org/3.6/library/stdtypes.html#string-methods)

● E.g. split(delimiter)-method splits a string into a list of words, based on the delimiter

input_line = input("Give gene name and length: ")
input_words = input_line.split()  # ['nef', '71']

file_path = "/home/ravantti/Downloads"
path_parts = file_path.split("/")
print("root-directory:", path_parts[1])  # WHY?

More on strings 2/3

● E.g. replace(old_string, new_string)-method replaces all matching instances of old_string with new_string - handy for data cleaning

pi_string = "3,14"
pi = pi_string.replace(",", ".")

pi = float(pi)  # vs. float(pi_string)

More on strings 3/3

● E.g. strip() removes “extra” spaces and tabulator-characters from beginning and end of a string

gene_name = " ABCB1 "
gene_name = gene_name.strip()  # cleaned up version
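The three methods are often combined when cleaning one line of input. A minimal sketch, assuming a made-up input line in which a gene name and a value are separated by a semicolon and the value uses a decimal comma:

```python
raw_line = "  nef ;  71,5 \n"   # hypothetical messy input line

# strip() removes the surrounding spaces and the line feed,
# split(";") then cuts the line at the semicolon
parts = raw_line.strip().split(";")

gene_name = parts[0].strip()                       # "nef"
value_text = parts[1].strip().replace(",", ".")    # "71.5"
value = float(value_text)

print(gene_name, value)   # prints: nef 71.5
```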

Exercise

Make a program that asks for genome and gene names, one pair per line, until “STOP”, and then prints the genomes and the genes, both sorted alphabetically

Hint: data structures can have data structures

Odds and sods

● Write a small program for every data handling task => easier to reproduce

● Use code editor with syntax coloring & bracket matching

● KISS (at least in the beginning)

● Think about the naming of program files & variables

● Good comments make program better, bad comments make program dangerous - comments are not executable code!

Files 1/2

● Our programs have asked user to input the data by hand => laborious

● Most(?) programs read & write at least some data from/to files

● Files are operated through file objects

● Files are opened with open-statement and closed with close-method

● open returns a file object (handle) that is used for reading and writing

Files 2/2

Open for reading from a file:

file_handle = open("file_name", "rt")

Open for writing to a file:

file_handle = open("file_name", "wt")

N.B. opening an existing file for writing destroys its original contents!

Closing the file with close-method - note the parentheses

file_handle.close()

Reading files

file_handle = open("file_name", "rt")  # open as a text file

line = file_handle.readline()    # reads one line to a string

lines = file_handle.readlines()  # reads all the lines into a list of strings

● for-statement can traverse through file line-by-line

for line in file_handle:
    do_something_with_the_line

● Each line includes the line feed (“\n”) character at the end!
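A self-contained sketch of reading a file line by line. So that it can be run anywhere, it first writes a small demo file into a temporary folder; the file name genes_demo.txt and its contents are made up for this example and are not the course's genes.txt.

```python
import os
import tempfile

# create a small demo file in a temporary folder (assumed name and contents)
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "genes_demo.txt")
file_handle = open(file_name, "wt")
file_handle.write("gag\npol\nenv\n")
file_handle.close()

# read the file back line by line
genes = []
file_handle = open(file_name, "rt")
for line in file_handle:
    # every line still ends with "\n"; rstrip("\n") removes it
    genes.append(line.rstrip("\n"))
file_handle.close()

print(genes)   # prints: ['gag', 'pol', 'env']
```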

Exercise

Make a program that asks the user for a file name and then prints out each line of the file

You can use the file “genes.txt” from the course’s web-page

Writing files 1/2

file_handle = open("file_name", "wt")  # open text file for writing

file_handle.write("PRD1 is a bacteriophage")  # writing a string

The line feed character (“\n”) must be added by you, if needed

Remember to close the file with

file_handle.close()

Writing files 2/2

E.g.

fusion_gene_part_1 = "BCR"
fusion_gene_part_2 = "ABL1"

fusion_file = open("my_fusion_genes.txt", "wt")

fusion_file.write(fusion_gene_part_1)
fusion_file.write(fusion_gene_part_2)

fusion_file.close()

In the file: BCRABL1 <= why?
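For comparison, here is a sketch of the same write with explicit line feeds, read back to check the result. The file goes into a temporary folder under the made-up name my_fusion_genes_demo.txt so that nothing in the working directory is overwritten.

```python
import os
import tempfile

file_name = os.path.join(tempfile.mkdtemp(), "my_fusion_genes_demo.txt")

fusion_file = open(file_name, "wt")
fusion_file.write("BCR" + "\n")    # write() adds nothing by itself,
fusion_file.write("ABL1" + "\n")   # so the line feed is added by hand
fusion_file.close()

# read the whole file back as one string to see what was stored
check_file = open(file_name, "rt")
content = check_file.read()
check_file.close()

print(repr(content))   # prints: 'BCR\nABL1\n'
```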

Exercise

Make a program that asks the user for gene names and writes them to a file, one per line

Example 1/3

48,904,912 rows in a file: merged_bams.sam

903878_1_165_3.1_0.503 16 1 79062 69 165M * 0 0 CTGCCCCCCACCTGACGACTTCAATAAGAAGTAGCAGCATTTCTCCAAGGAGGAAATACCAGAGTCAATTCACAACCACTGCAATTGCAGTGGTACCACCATAACAGCCCTTGGGCTGCAGAAGGAACTAAGAGTCTAGTCACTACAGTGGCACCTTCAGCAC * NM:i:0 AS:i:990 RG:Z:fixed_XXX_sorted
179915_1_164_5.4_0.366 16 1 85929 100 99M1X36M1X17M * 0 0 GGATTGGCAATGCGTTCTTAGATAATACACCAAATACAAGCATGAAACAAACAAATGCAGCCAAAATGTACCAGAATCTGAAAACATCTATTATCTACGAAGAATTAGAGGGGAATTTGGTGAAAGAAATATGGCAGAATGGGACATTGCTCTGTGAATGCT * NM:i:2 AS:i:936 RG:Z:YYY_sorted
305674_1_344_4.9_0.459 0 1 87516 100 342M * 0 0 CAGAGATGAGTTTGTTTATTTTTTTATTTTTTAAAAAATTGCTAATTTACAGAACATGGAGATGAGTATGTTTTGAAGGCTTGGAAGCATGCAAGTGGGAGAAGAAAGGAGTCAGCTACATTCTGGCTGTGTGCAGAGGCAGGTCACTGTGGTGGGAGTGTTCCTGTTTCATGGACTCTGCAAATCGCAATGCTTGGCATGGCCTCCCGACCCTGATGGCAGAGAAGCAAACCAGTCGGAGAGCTGGGGTCCTCCCAGCCCTCTTGGCCCTGTGGCCAATTTTTTCTTCAATAGCCTCATAAAATCACATTATTTGAGTGCCCATGGCTCCAAAACAAGCAG * NM:i:0 AS:i:2064 RG:Z:fixed_ZZZ_sorted

...

The 1st column is wrong: it must include the name of the sample, taken from the end of the row

Example 2/3

XXX_903878_1_165_3.1_0.503 16 1 7962 69 165M * 0 0 CTGCCCCCCACCTGACGACTTCAATAAGAAGTAGCCCAGCATTTCTCCAAGGAGGAAATACCAGAGTCAATTCACAACCACTGCAATTGCAGTGGTACCACCATAACAGCCCTTGGGCTGCAGAAGGAACTAAAGTCTAGTCACTACAGTGGCACCTTCAGCAC * NM:i:0 AS:i:990 RG:Z:fixed_XXX_sorted
YYY_179915_1_164_5.4_0.366 16 1 85929 100 99M1X36M1X27M * 0 0 GGATTGGCAATGCGTTCTTAGATAATACACCAAAAATACAAGCATGAAACAAACAAATGCAGCCAAAATGTACCAGAATCTGAAAACATCTATTATCTACGAAGAATTAGAGGGGAATTTGAAAGAAATATGGCAGAATGGGACATTGCTCTGTGAATGCT * NM:i:2 AS:i:936 RG:Z:fixed_YYY_sorted
ZZZ_305674_1_344_4.9_0.459 0 1 87516 100 34M * 0 0 CAGAGATGAGTTTGTTTATTTTTTTATTTTTTAAAAAATTGCTAATTTACAGAACATGGAGATGAGTATGTTTTGAAGGCTTGGAAGCATGCAAGTGGGAGAAGAAAGGAGTCAGCTACATTCTGGCTGGCAGAGGCAGGTCACTGTGGTGGGAGTGTTCCTGTTTCATGGACTCTGCAAATCGCAATGCTTGGCATGGCCTCCCGACCCTGATGGCAGAGAAGCAAACACCAGTCGGAGAGCTGGGGTCCTCCCAGCCCTCTTGGCCCTGTGGCCAATTTTTTCTTCAATAGCCTCATAAAATCACATTATTTGAGTGCCCATGGCTCCAAAACAAGCAG * NM:i:0 AS:i:2064 RG:Z:fixed_ZZZ_sorted

...

Example 3/3

fix_sam.py:

#903878_1_165_3.1_0.503 16 1 78062 69 195M * 0 0 CTGCCCCCCACCTGACGACAATAAGAAGTAGCCCAGCATTTCTCCAAGGAGGAAATACCAGAGTCAATTCACAACCACTGCAATTGCAGTGGTACCACCATAACAGCCCTTGGGCTGCAGAAGGAACAAGAGTCTAGTCACTACAGTGGCACCTTCAGCAC * NM:i:0 AS:i:990 RG:Z:fixed_XXX_sorted

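file_handle = open("merged_bams.sam", "rt")
for line in file_handle:
    w = line.split()                      # columns of one row
    sample_name = w[-1].split("_")[1]     # e.g. "XXX" from the last column
    read_name = sample_name + "_" + w[0]  # new name for the 1st column
    w[0] = read_name
    print("\t".join(w))                   # row back together, tab-separated
file_handle.close()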
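The same fix can also be sketched as a function that handles one row at a time, which makes it easy to try on a single example row before running it on the whole file. The function name add_sample_to_read_name is made up, the example row is a shortened version of the first row above (the sequence is cut to five letters), and the code assumes that the last column always has the form RG:Z:fixed_<sample>_sorted.

```python
def add_sample_to_read_name(line):
    """Return the row with the sample name added in front of the read name."""
    fields = line.split()
    # assumption: last column looks like "RG:Z:fixed_XXX_sorted"
    sample_name = fields[-1].split("_")[1]
    fields[0] = sample_name + "_" + fields[0]
    return "\t".join(fields)


# shortened, hypothetical example row (tab-separated)
example = "903878_1_165_3.1_0.503\t16\t1\t79062\t69\t165M\t*\t0\t0\tCTGCC\t*\tNM:i:0\tAS:i:990\tRG:Z:fixed_XXX_sorted"
fixed = add_sample_to_read_name(example)
print(fixed.split("\t")[0])   # prints: XXX_903878_1_165_3.1_0.503
```

A last column without the "fixed_" part would give the wrong sample name with this indexing, so rows like that would need to be checked separately.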

Exercise

● Make a program that writes the sequences that occur in both of two separate files into a third file
● You can assume that there is only one sequence per line
● You can use files seq_1.txt and seq_2.txt from the course’s webpage
● Think first what your program is supposed to do
  ○ Logic - program flow?
  ○ Inputs?
  ○ Output?
  ○ What could go wrong?

This far...

● I/O (Input/Output)
● Comparisons
● Strings & their manipulation
● Compound data types list & dictionary

● Loops

● File I/O

Questions?

Programming...

● Program is collection of commands / rules (“statements”) to the computer (here: with Python-interpreter)
● Commands in the different programming languages can look quite different, but in the end they mostly do same things:
  ○ Data input / output (“read, write”)
  ○ Internal data handling: variables and data structures (“list, dictionary, ...”) and their manipulation (“operators”)
  ○ Control flow statements
    ■ Conditional statements (“if then else”)
    ■ Loops (“for”, “while”, “goto”, …)
    ■ Functions / subroutines / methods (“sqrt(x), sin(x),...”)

Functions 1/3

● Splitting a program into (smallish) logical pieces usually pays off
  ○ testing different pieces is easier
  ○ parts of a program can be modified without changing the whole program
  ○ parts can be reused in other programs
● In Python the separate parts are called functions (or classes, which we are not using in this course)
● Define functions to do one thing and one thing only

Functions 2/3

● Self-defined functions are called like any other function in Python

● Functions are defined with the def statement

def double_number(my_number):
    print("Double of", my_number, "is", 2*my_number)

● Functions can be used after they are defined

input_number = input("Give number to be doubled: ")
double_number(input_number)

What will go wrong and why?

Functions 3/3

● The def statement starts the function definition, and the function ends when the indentation returns to the previous level

● Syntax: def function_name(parameter1, parameter2, …):
    ○ After the function’s name, list the parameters, separated by commas
    ○ Parameters carry information (variables, data structures) into the function

● The return statement carries information back from a function

● Functions must be defined before they are used

● Functions can contain other function definitions, call other functions and call themselves...
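The points above could look like this in practice. This is a minimal sketch: double_number is rewritten here to return its result instead of printing it, and factorial is an extra, made-up example of a function that calls itself.

```python
def double_number(my_number):
    # return hands the result back to the caller instead of printing it
    return 2 * my_number

def factorial(n):
    # a function calling itself (recursion); assumes n is a non-negative integer
    if n <= 1:
        return 1
    return n * factorial(n - 1)

# both functions are defined above, so they can be used here
result = double_number(21)
print("Double of 21 is", result)
print("5! is", factorial(5))
```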

Exercise

Write a function that returns the cube (x*x*x) of a number

Write a program that reads one number per line from a file and prints out the number and its cube using the previous function

More on functions 1/2

● Always write comments about
    ○ What the function does
    ○ What parameters the function needs
    ○ What the function returns, if anything

● Functions can return multiple values, e.g. as a tuple (an immutable list)

● Returned values can be unpacked with a single assignment

More on functions 2/2

E.g.

def min_max(num1, num2):
    # this function returns two numbers in increasing order: smaller, larger
    if num1 > num2:
        return (num2, num1)
    else:
        return (num1, num2)

### The Main Program ###
small, big = min_max(100, 50) # <= Here we unpack the return values to vars
print("Smaller was:", small, "and larger was:", big)

Exercise

Write a function that returns the median of the numbers given to it

Write a program that calculates the quartiles (*) of a set of numbers

(*) https://en.wikipedia.org/wiki/Quartile

Libraries / modules 1/2

● Libraries bring new commands/functions to the programming environment

● Python has “everything and the kitchen sink” type of standard library https://docs.python.org/3/library/

● There are also widely used and accepted libraries (e.g. matplotlib, numpy and biopython - more on these in the latter part of the course)

● Users can write their own libraries (we’ll skip that in this course)

Libraries / modules 2/2

● A library is activated with the import statement

● Functions within a library are called as library.function

E.g.

import random # get access to various random number functions
print("A random integer between [0,10] just for you:", random.randint(0,10))
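The same library.function pattern works for any module of the standard library. The sketch below uses math and random as examples; the seed value 2019 is an arbitrary choice made here to show that seeding makes the "random" numbers repeatable.

```python
import math
import random

# library.function: sqrt lives in the math library
print("Square root of 16:", math.sqrt(16))

# seeding the generator makes the sequence of random numbers repeatable
random.seed(2019)
first = random.randint(0, 10)
random.seed(2019)
second = random.randint(0, 10)
print("Same seed, same number:", first, second)
```

Without the random.seed calls the two numbers would usually differ from run to run.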

Exercise

Write a program that draws seven different numbers from the range [1, 40]

Use the random library

Run the program several times - do the results change?

A few words about objects in Python

● Object-Oriented Programming (OOP) is one way to program with Python

● However, in this course, we stick with procedural programming (sequences of commands)

● Objects are entities that encapsulate both attributes (≅ data) and methods (≅ functions) that manipulate the data

● Objects are defined with the class statement

● Attributes and methods are referred to with a dot (“.”) after the object’s name
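Although classes are not used in this course, a very small example may make the terms concrete. This is just a sketch: the class Sequence and its contents are invented for illustration.

```python
class Sequence:
    def __init__(self, name, bases):
        # attributes: the data stored in the object
        self.name = name
        self.bases = bases

    def length(self):
        # method: a function that works on the object's own data
        return len(self.bases)

my_seq = Sequence("test_seq", "ATGCATGC")
print(my_seq.name)      # attribute, referred to with the dot
print(my_seq.length())  # method, called with the dot
```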

Objects - in action

● We have already seen objects, methods and attributes e.g. with file I/O and string manipulation

E.g.

my_file = open("test.txt", "rt")
my_file.closed # attribute / state of the file => False
my_file.close() # method to close the file
my_file.closed # testing again => True

my_name_is = "janne"
my_name_is.capitalize() # string method => "Janne"

Libraries 1/3

import numpy
a = numpy.random.randint(15, size=(2,5))
b = numpy.random.randint(15, size=(2,5))

print("a:", a, "\n")
print("b:", b, "\n")

print("a+b:", a+b, "\n")
print("a*b:", a*b, "\n") # per each element in the table

print("a[0]*b[0]:", a[0]*b[0], "\n") # for the first rows only

https://docs.scipy.org/doc/numpy/user/index.html

Libraries 2/3

import datetime

today = datetime.date.today()
next_time = datetime.date(2019, 5, 9)

print("Time between now and the next time:", next_time - today)

https://docs.python.org/3/library/datetime.html
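Subtracting two dates gives a datetime.timedelta object, which can be inspected or added to other dates. The sketch below uses two fixed session dates from the course schedule instead of today's date so that the result does not depend on when it is run; the variable names are chosen for this example.

```python
import datetime

first_session = datetime.date(2019, 4, 15)
next_time = datetime.date(2019, 5, 9)

gap = next_time - first_session   # a datetime.timedelta object
print("Days between the sessions:", gap.days)

# a timedelta can also be added to a date
one_week_later = first_session + datetime.timedelta(days=7)
print("One week after the first session:", one_week_later)
```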

Libraries 3/3

Try in the notebook (assuming that matplotlib is installed!)

# magic command to inline graphics, note the "%" sign
%matplotlib inline

import matplotlib.pyplot

x = [1,2,3,4]
y = [-3, 11, 4, 9]

matplotlib.pyplot.plot(x,y)

https://matplotlib.org/users/pyplot_tutorial.html

So far

● I/O, including file I/O

● conditionals / comparisons

● data structures list & dictionary

● Loops for & while

● Functions, methods & libraries

=> you can program! Now it is only practise, practise, practise (& read more about Python, libraries, …)

Tips for programming 1/3

● Split the problem / program into smaller logical pieces even before writing a single line of code (simulate the program flow in your mind / pen & paper)

● The whole program can be completed piece-by-piece. Start e.g. by reading the data into some reasonable data structure

● Learn to comment your code well! It will pay off later.

● Outline the program using comments
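One way to follow these tips is to write the outline as numbered comments first and then fill in the code under each comment. The sketch below counts how often each sequence occurs; the data is a small made-up list standing in for lines read from a file.

```python
# 1. get the data (here a small made-up list instead of a file)
lines = ["ATGC\n", "GGCC\n", "ATGC\n", "\n"]

# 2. store the data in a reasonable data structure
counts = {}
for line in lines:
    seq = line.strip()
    if seq == "":
        continue          # skip empty lines
    counts[seq] = counts.get(seq, 0) + 1

# 3. report the result
for seq in counts:
    print(seq, counts[seq])
```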

Tips for programming 2/3

● What does the program read in? What is the wanted output?

● What choices does the program need to make?

● What data structures are needed?

● Are there any repeated parts? Loops and/or functions?

● What can go wrong? What then?

● Naming and initialization: variables, data structures, files, ...

Tips for programming 3/3

● The program can print (& save to a file!) intermediate results

● Extra output can be commented out later

● Test your programs with simple inputs

● Python has many, many, many functions and libraries for common tasks - read the documentation!

● Beware of incompatible libraries, Python 2.7+ code snippets, ...
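Testing with simple inputs can be as light as a few assert lines with answers that are easy to check by hand. This is one possible way to do it, shown with the min_max function from the earlier slide; assert stops the program with an error if the condition is not true.

```python
def min_max(num1, num2):
    # returns two numbers in increasing order: smaller, larger
    if num1 > num2:
        return (num2, num1)
    else:
        return (num1, num2)

# simple inputs whose correct answers are obvious
assert min_max(100, 50) == (50, 100)
assert min_max(1, 2) == (1, 2)
assert min_max(3, 3) == (3, 3)
print("All simple tests passed")
```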

The next time...

1. Python-programming warm up :)
2. Anaconda Distribution revisited
3. Python as an Integration language
4. Libraries...
    4.1. Numpy & Scipy
    4.2. Matplotlib
    4.3. Pandas
    4.4. Biopython
    4.5. ...

Recap

print("The End")

print("KIITOS!")
