CSI5180. Machine Learning for Bioinformatics Applications
Essential Bioinformatics Skills
by
Marcel Turcotte
Version November 6, 2019
Preamble
Essential Bioinformatics Skills
The lecture gives an overview of the available resources that are essential for bioinformatics projects. This includes the main databases, software applications, programming languages and computing environments. We also emphasize the skills that are essential to produce robust and reproducible results.
General objective: Summarize the essential resources for conducting a bioinformatics project

Learning objectives
- Describe the best practices for handling large bioinformatics projects
- Introduce essential tools
- Present the major repositories and file formats, along with the command line and REST API access
Reading: See below
Plan
1. Preamble
2. Literature
3. Guidelines
4. Computing Environment
5. Data
6. REST
7. Prologue
Literature

Bioinformatics Data Skills
Vince Buffalo. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. O’Reilly Media, 2015.

A Practical Introduction to...
Röbbe Wünschiers. Computational Biology - A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R. Springer, 2013. (https://link.springer.com/book/10.1007/978-3-642-34749-8)

The Biostar Handbook
The Biostar Handbook: Bioinformatics data analysis guide, 2019. https://biostar.myshopify.com

Ten (10) simple rules for...
- Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9, (2013).
- Boulesteix, A.-L. Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11, e1004191 (2015).
- Prlic, A. & Procter, J. B. Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8, e1002802 (2012).
- Perez-Riverol, Y. et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol 12, e1004947 (2016).
- Sholler, D. et al. Ten simple rules for helping newcomers become contributors to open projects. PLoS Comput Biol 15, e1007296 (2019).
- Rule, A. et al. Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks. PLoS Comput Biol 15, e1007007 (2019).

Ten (10) simple rules for...
- Osborne, J. M. et al. Ten simple rules for effective computational research. PLoS Comput Biol 10, e1003506 (2014).
- Elofsson, A. et al. Ten simple rules on how to create open access and reproducible molecular simulations of biological systems. PLoS Comput Biol 15, e1006649 (2019).
- Lee, B. D. Ten simple rules for documenting scientific software. PLoS Comput Biol 14, e1006561 (2018).
- Carey, M. A. & Papin, J. A. Ten simple rules for biologists learning to program. PLoS Comput Biol 14, e1005871 (2018).
- Zook, M. et al. Ten simple rules for responsible big data research. PLoS Comput Biol 13, e1005399 (2017).

(One more) Definition
“Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.”
Luscombe, N. M., Greenbaum, D. & Gerstein, M. What is bioinformatics? A proposed definition and overview of the field. Methods of Information in Medicine 40, 346-358 (2001).
Guidelines

Robust research (Vince Buffalo)
- Pay attention to your experimental design
- Write code for humans, write code for computers
- Let the computer do the work
- Write down your assumptions and test them (unit testing)
- Use existing libraries
- Treat data as read-only
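
The advice to write down assumptions and test them can be illustrated with a minimal Python sketch. The helper `reverse_complement` and its checks are hypothetical examples chosen for illustration; they do not come from the lecture or from Buffalo's book.

```python
# Hypothetical helper used only to illustrate "test your assumptions".
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Assumption made explicit: only unambiguous DNA bases are accepted.
    unexpected = set(seq) - set("ACGTacgt")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return seq.translate(COMPLEMENT)[::-1]


def test_reverse_complement() -> None:
    assert reverse_complement("ATGC") == "GCAT"
    assert reverse_complement("") == ""
    # Applying the operation twice must give back the input.
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"


if __name__ == "__main__":
    test_reverse_complement()
    print("all assumptions hold")
```

Keeping such checks next to the code means they can be rerun every time the analysis changes.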

Reproducible research (Vince Buffalo)
- Share your source code and your data
- Meta-data:
  - Versions of the software and databases you are using
  - Write down the parameters or better yet, make it a script
  - One README file per directory
- Make figures, statistics, and tables from scripts
- Not only is this more scientific, it is almost certain that you will need to redo your analyses!
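
One possible way to record the versions of the software in use is to generate the metadata from a script. The sketch below is an assumption about how this could be done in Python; the function name `collect_versions` and the package list are illustrative only.

```python
import json
import platform
import sys
from importlib import metadata


def collect_versions(packages):
    """Return the interpreter, platform, and package versions as a dict."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            info["packages"][name] = None
    return info


if __name__ == "__main__":
    # The package list is an assumption; adapt it to the project.
    json.dump(collect_versions(["numpy", "requests"]), sys.stdout, indent=2)
```

The output can be redirected to a file stored next to the results, alongside the README of that directory.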
Computing Environment

UNIX
- Both Bioinformatics and Machine Learning favour UNIX
- Quoting François Chollet (Deep Learning with Python): “You’ll need access to a UNIX machine; it’s possible to use Windows, too, but I don’t recommend it”
- Compute Canada (https://docs.computecanada.ca)
  - Cedar - 58,416 CPU cores and 584 GPU devices
  - Graham - 36,160 cores and 320 GPU devices
  - Béluga - 34,880 cores and 688 GPU devices
  - Niagara - 61,920 cores

Access to UNIX
- Your laptop or workstation
  - As primary or secondary OS (dual boot, USB key, etc.)
  - In a virtual machine (VMWare is free for EECS students, VirtualBox is also free)
  - Windows Subsystem for Linux Installation Guide for Windows 10 (https://docs.microsoft.com/en-us/windows/wsl/install-win10)
- Cloud
  - I have vouchers for Google Cloud Platform and Amazon (just ask me)
- Ubuntu is a popular distribution, but there are many others

UNIX key concepts
- Modularity
  - “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” (Doug McIlroy)
- The file system plays a central role
  - /dev/null, /dev/random, /dev/zero
    $ head -c 10 /dev/zero > test10bytes.dat
- The command line
    $ grep -c '^>' input.fasta
- Shell (anatomy of a script, the magic line, and more)
- Redirection
- Pipe
- https://www.ks.uiuc.edu/Training/Tutorials/Reference/unixprimer.html

Conda/Anaconda/Bioconda
- https://conda.io
  - Conda is a package, dependency, and environment manager for any programming language (Python, R, Ruby, Lua, Scala, Java, and more)
- https://anaconda.org
  - Anaconda is a package management service, primarily for Python and R, with hundreds of packages such as numpy, scipy, scikit-learn, keras, tensorflow
- https://bioconda.github.io
  - Bioconda is a channel for the conda package manager specializing in bioinformatics software.

Using conda/anaconda/bioconda
$ conda create -n csi5180
$ conda install -n csi5180 keras
$ conda activate csi5180
$ conda install bwa
$ conda deactivate
$ conda update --all

Other considerations
- Consider using a (distributed) version control system
- Git/GitHub has become the de facto standard
- Features
  - Manage changes in your documents
  - In a distributed version control system, each developer has their own version of the source code
  - Multiple contributors
  - Creating/merging multiple branches
- https://git-scm.com/doc
Data

Major repositories
- Annotated/assembled nucleotide sequence
  - National Center for Biotechnology Information (NCBI): https://www.ncbi.nlm.nih.gov
  - European Bioinformatics Institute (EBI): https://www.ebi.ac.uk
  - DNA Data Bank of Japan (DDBJ): https://www.ddbj.nig.ac.jp/
- See also: International Nucleotide Sequence Database Collaboration (http://www.insdc.org)

Major repositories (continued)
- GenBank: annotated and identified DNA sequence information
- SRA (Sequence Read Archive, formerly Short Read Archive): measurements from high-throughput sequencing experiments
- UniProt (Universal Protein Resource): protein sequence data
- PDB (Protein Data Bank): 3D structural information of macromolecules

Other data sources?
- UCSC Genome Browser
- FlyBase (Drosophila [fruit fly]), WormBase (nematode), SGD: Saccharomyces Genome Database, TAIR (Arabidopsis), EcoCyc (Encyclopedia of E. coli Genes and Metabolic Pathways), etc.
- RNAcentral: meta-database

Nucleic Acids Research (NAR)
Each year, NAR, a high-impact journal, publishes its “database issue”: https://academic.oup.com/nar/issue/47/D1

Major file formats (biostar)
1. Data that captures prior knowledge (aka reference: FASTA, GFF, BED)
2. Experimentally obtained data (aka sequencing reads: FASTQ)
3. Data generated by the analysis (aka results: BAM, VCF, formats from point 1 above, and many nonstandard formats)

Entrez Direct
$ conda install -c bioconda entrez-direct

GENBANK
$ efetch -db nuccore -id NM_000020 -format gb | less

LOCUS       NM_000020               4177 bp    mRNA    linear   PRI 16-SEP-2019
DEFINITION  Homo sapiens activin A receptor like type 1 (ACVRL1), transcript
            variant 1, mRNA.
ACCESSION   NM_000020
VERSION     NM_000020.3
KEYWORDS    RefSeq; RefSeq Select.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4177)
  AUTHORS   Leng H, Zhang Q and Shi L.
  TITLE     [Gene diagnosis and treatment of hereditary hemorrhagic
(...)

GENBANK (continued)
(...)
FEATURES             Location/Qualifiers
     source          1..4177
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="12"
                     /map="12q13.13"
     gene            1..4177
                     /gene="ACVRL1"
                     /gene_synonym="ACVRLK1; ALK-1; ALK1; HHT; HHT2; ORW2;
                     SKR3; TSR-I"
                     /note="activin A receptor like type 1"
                     /db_xref="GeneID:94"
                     /db_xref="HGNC:HGNC:175"
                     /db_xref="MIM:601284"
     exon            1..192
                     /gene="ACVRL1"
                     /gene_synonym="ACVRLK1; ALK-1; ALK1; HHT; HHT2; ORW2;
(...)

GENBANK (continued)
(...)
ORIGIN
        1 cccagtcccg ggaggctgcc gcgccagctg cgccgagcga gcccctcccc ggctccagcc
       61 cggtccgggg ccgcgcccgg accccagccc gccgtccagc gctggcggtg caactgcggc
      121 cgcgcggtgg aggggaggtg gccccggtcc gccgaaggct agcgccccgc cacccgcaga
      181 gcgggcccag agggaccatg accttgggct cccccaggaa aggccttctg atgctgctga
      241 tggccttggt gacccaggga gaccctgtga agccgtctcg gggcccgctg gtgacctgca
(...)
     4081 aaattacact tctcgtacct ggagacgctg tttgtgggag cactgggctc atgcctggca
     4141 cacaataggt ctgcaataaa ccatggttaa atcctga
//
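
The ORIGIN block stores the sequence in groups of ten bases, each line starting with the position of its first base, and the record ends with `//`. As a rough sketch of how that layout can be turned back into a plain sequence, the hypothetical function `origin_to_sequence` below drops the position column and the spaces. It is only an illustration; for real work an existing library would be the safer choice.

```python
def origin_to_sequence(lines):
    """Concatenate the bases of an ORIGIN block, dropping positions and spaces."""
    bases = []
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # end of the GenBank record
        if in_origin:
            # The first field is the position of the first base on the line.
            bases.extend(line.split()[1:])
    return "".join(bases).upper()


# The last two sequence lines of the record shown above.
record_tail = [
    "ORIGIN",
    "     4081 aaattacact tctcgtacct ggagacgctg tttgtgggag cactgggctc atgcctggca",
    "     4141 cacaataggt ctgcaataaa ccatggttaa atcctga",
    "//",
]
print(origin_to_sequence(record_tail))
```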

FASTA
$ efetch -db nuccore -id NM_000020 -format fasta | less

>NM_000020.3 Homo sapiens activin A receptor like type 1 (ACVRL1), transcript variant 1, mRNA
CCCAGTCCCGGGAGGCTGCCGCGCCAGCTGCGCCGAGCGAGCCCCTCCCCGGCTCCAGCCCGGTCCGGGGCCGCGCCCGGACCCCAGCCCGCCGTCCAGCGCTGGCGGTGCAACTGCGGCCGCGCGGTGGAGGGGAGGTGGCCCCGGTCCGCCGAAGGCTAGCGCCCCGCCACCCGCAGAGCGGGCCCAGAGGGACCATGACCTTGGGCTCCCCCAGGAAAGGCCTTCTGATGCTGCTGATGGCCTTGGTGACCCAGGGAGACCCTGTGAAGCCGTCTCGGGGCCCGCTGGTGACCTGCACGTGTGAGAGCCCACATTGCAAGGGGCCTACCTGCCGGGGGGCCTGGTGCACAGTAGTGCTGGTGCGGGAGGAGGGGAGGCACCCCCAGGAACATCGGGGCTGCGGGAACTTGCACAGGGAGCTCTGCAGGGGGCGCCCCACCGAGTTCGTCAACCACTACTGCTGCGACAGCCACCTCTGCAACCACAACGTGTCCCTGGTGCTGGAGGCCACCCAACCTCCTTCGGAGCAGCCGGGAACAGATGGCCAGCTGGCCCTGATCCTGGGCCCCGTGCTGGCCTTGCTGGCCCTGGTGGCCCTGGGTGTCCTGGGCCTGTGGCATGTCCGAC
(...)
GGCCCAATGGCCAGGGAGTGAAGGAGGTGGCGTTGCTGAGAGCAGTCTGCACATGCTTCTGTCTGAGTGCAGGAAGGTGTTCCAGGGTCGAAATTACACTTCTCGTACCTGGAGACGCTGTTTGTGGGAGCACTGGGCTCATGCCTGGCACACAATAGGTCTGCAATAAACCATGGTTAAATCCTGA
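
A FASTA record is a header line starting with `>` followed by one or more sequence lines. The sketch below shows one way to read such records in Python. The function name `read_fasta` and the toy records are assumptions made for illustration; an established library would normally be preferred, in line with the "use existing libraries" guideline.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input, not real data.
example = """>seq1 toy record
ACGT
ACGT
>seq2
GG
""".splitlines()

for name, seq in read_fasta(example):
    print(name, len(seq))
```

Counting the pairs produced by `read_fasta` gives the same number as counting the lines that start with `>`.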

GFF/GTF/BED
- Interval formats
- Tab delimited
- Chromosomal coordinate, start, end, strand, and more
- https://useast.ensembl.org/info/website/upload/gff3.html

BED
3 columns:
chr7 127471196 127472363
chr7 127472363 127473530
chr7 127473530 127474697
6 columns:
chr1 134212701 134230065 Nuak2 8 +
chr1 134212701 134230065 Nuak2 7 +
chr1 33510655 33726603 Prim2, 14 -
chr1 25124320 25886552 Bai3, 31 -
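
To make the column layout concrete, here is a minimal Python sketch that parses a BED line with three to six columns. The names `BedRecord` and `parse_bed_line` are hypothetical. The sketch relies on the BED convention that the start coordinate is 0-based and the end coordinate is exclusive, so the interval length is simply end minus start. It splits on any whitespace so that the space-separated lines shown above also parse; real BED files are tab-delimited.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int                    # 0-based, inclusive
    end: int                      # exclusive
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed_line(line: str) -> BedRecord:
    """Parse the first three to six columns of a BED line."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}")
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), *fields[3:6])


record = parse_bed_line("chr1 134212701 134230065 Nuak2 8 +")
print(record.name, record.strand, record.length)
```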

Bedtools
“Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.”
$ conda install -c bioconda bedtools
https://www.biostars.org/p/17162/

.2bit
$ conda install -c bioconda ucsc-twobittofa
$ URL=http://hgdownload.cse.ucsc.edu/goldenpath/mm9/bigZips/mm9.2bit
$ twoBitToFa -udcDir=. $URL stdout > mm9.fa
$ URL=http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.chrom.sizes
$ curl $URL > mm9.chromsizes

Bedtools (continued)
Given genes.bed:
chr1 134212701 134230065 Nuak2 8 +
chr1 134212701 134230065 Nuak2 7 +
chr1 33510655 33726603 Prim2 14 -
chr1 25124320 25886552 Bai3 31 -
$ bedtools flank -i genes.bed -g mm9.chromsizes -l 2000 -r 0 -s
chr1 134210701 134212701 Nuak2 8 +
chr1 134210701 134212701 Nuak2 7 +
chr1 33726603 33728603 Prim2 14 -
chr1 25886552 25888552 Bai3 31 -
$ bedtools getfasta -fi mm9.fa -bed promoters.bed -fo promoters.fa
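
The effect of `-l 2000 -r 0 -s` in the flank command above can be reproduced with a few lines of Python: with `-s`, "left" means upstream with respect to the strand, so the flank of a reverse-strand gene starts at its end coordinate. The function `upstream_flank` is a hypothetical sketch, not part of bedtools; the clamping to the chromosome boundaries is an assumption about how edge cases are handled, and `CHR1_SIZE` is a placeholder rather than the real mm9 value.

```python
CHR1_SIZE = 200_000_000  # placeholder chromosome size, not the real mm9 value


def upstream_flank(start, end, strand, chrom_size, size=2000):
    """Return the (start, end) of the strand-aware upstream flank."""
    if strand == "-":
        # Upstream of a reverse-strand feature lies after its end coordinate.
        return end, min(end + size, chrom_size)
    # Forward strand: upstream lies before the start coordinate.
    return max(start - size, 0), start


genes = [
    (134212701, 134230065, "+"),  # Nuak2
    (33510655, 33726603, "-"),    # Prim2
    (25124320, 25886552, "-"),    # Bai3
]
for start, end, strand in genes:
    print(upstream_flank(start, end, strand, CHR1_SIZE), strand)
```

For the three genes, the computed intervals match the coordinates printed by bedtools above.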

promoters.fa
>chr1:134210701-134212701
TTCTGGCACTTGGTTGTTCT...GTTTTATAGCAATTCGGAAC
>chr1:134210701-134212701
TTCTGGCACTTGGTTGTTCT...GTTTTATAGCAATTCGGAAC
>chr1:33726603-33728603
TCTCCCAGTGGCGGGAGAGT...ATTTATTTTTATGTTTATAA
>chr1:25886552-25888552
TTGCGCCTTATCCAAGTGAA...TCCCAGGAACAAATCACCAG
Creating a script automating our work
Let’s now create a script capturing all this information

Magic line (shebang)
In a Unix-like operating system, the content of an executable script is passed to the interpreter designated on the magic line.
#!/bin/bash
I am saving this to a file called 01_get_data.sh
Then, I make it executable
$ chmod u+x 01_get_data.sh

Test your assumptions
You can test for the presence or absence of a file or a directory
#!/bin/bash

INPUT=genes.bed

if [ ! -f "$INPUT" ]; then
    echo "file not found: $INPUT"
    exit 1
fi
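
The same check can be written in Python, which becomes relevant once part of the pipeline moves out of the shell. This is a sketch under assumptions: `require_file` is a hypothetical helper, and the demonstration uses a throwaway file rather than the real genes.bed.

```python
import tempfile
from pathlib import Path


def require_file(path):
    """Return the path if it is an existing regular file, else raise."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"file not found: {p}")
    return p


# Demonstration with a throwaway file instead of the real input.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "genes.bed"
    present.write_text("chr1\t0\t10\n")
    print(require_file(present).name)
    try:
        require_file(Path(tmp) / "missing.bed")
    except FileNotFoundError as err:
        print("caught:", err)
```

Like `[ -f ... ]` in Bash, `Path.is_file()` is false for directories, so a directory with the expected name is also rejected.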

Temporary space
Sometimes you don’t want to create temporary files in your user account. These temporary files might be big and you don’t want them to be saved by the backup system or your quota might not allow you to save them in your user space.
- Do not use /tmp/, this is temporary storage for the operating system, and sometimes the partition is rather small.
- Use /var/tmp/ or a designated space, such as /scratch.
Beware! The system will automatically remove those files after a given period of time.
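
In Python, the standard `tempfile` module can create a uniquely named working directory under a chosen location and remove it afterwards. The sketch below is one possible approach; `scratch_dir` is a hypothetical helper, and the default `base=None` (the platform's default temporary location) is used only so that the example runs anywhere. On a shared system one would pass a designated space such as /var/tmp or /scratch.

```python
import tempfile
from pathlib import Path


def scratch_dir(project, base=None):
    """Create a uniquely named temporary directory under `base`."""
    # With base=None, Python picks the platform default location.
    return tempfile.TemporaryDirectory(prefix=f"{project}-", dir=base)


with scratch_dir("csi5180-demo") as tmp:
    work = Path(tmp) / "intermediate.txt"
    work.write_text("large intermediate result\n")
    print(work.exists())  # the file exists while the block is active
# The directory and its content are removed when the block ends.
```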

#!/bin/bash

# Sample Bash script to download a genome and extract information

INPUT=genes.bed

if [ ! -f "$INPUT" ]; then
    echo "file not found: $INPUT"
    exit 1
fi

PROJECT=csi5180-demo

# Time stamp and process ID as suffix
TMP_DIR=/var/tmp/$PROJECT-$(date +"%FT%H%M%S")-$$

if [ -d "$TMP_DIR" ]; then
    echo "$TMP_DIR exists!"
    exit 1
fi

# Creating the temporary directory
mkdir "$TMP_DIR"

# The URL where the mouse genome version 9 (MM9) can be found
MM9_URL=http://hgdownload.cse.ucsc.edu/goldenpath/mm9/bigZips/mm9.2bit

# Where to save the mouse genome as a fasta file
MM9_FILE_NAME=$TMP_DIR/mm9.fa

# Download and uncompress the genome
twoBitToFa -udcDir="$TMP_DIR" "$MM9_URL" stdout > "$MM9_FILE_NAME"

# URL of the file containing the size of each chromosome
MM9_SIZE_URL=http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.chrom.sizes

MM9_SIZE_FILE_NAME=$TMP_DIR/mm9.chromsizes

# Downloading the size file (to the temporary directory)
curl "$MM9_SIZE_URL" > "$MM9_SIZE_FILE_NAME"

# Calculating the coordinates of the promoter regions
bedtools flank -i "$INPUT" -g "$MM9_SIZE_FILE_NAME" -l 2000 -r 0 -s > promoters.bed

# Extracting the promoters
bedtools getfasta -fi "$MM9_FILE_NAME" -bed promoters.bed -fo promoters.fa

# Cleaning
rm -rf "$TMP_DIR"

# EOF
REST

Representational state transfer (REST)
- Client and server interactions using HTTP (hypertext transfer protocol)
- Madeira, F. et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res 47, W636-W641 (2019).
- Tarkowska, A. et al. Eleven quick tips to build a usable REST API for life sciences. PLoS Comput Biol 14, e1006542 (2018).
- https://www.ebi.ac.uk/training/online/course/ensembl-rest-api
- https://www.ncbi.nlm.nih.gov/home/develop/api/
- https://rest.ensembl.org
- https://www.encodeproject.org/help/rest-api/
- Examples:
  - /sequence/id/ENST00000288602?type=cds;content-type=text/x-fasta
  - /sequence/id/ENST00000288602?type=cds;content-type=text/x-fasta;start=10;end=110
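
The example paths above follow a simple pattern: a resource path, then parameters separated by semicolons. A small Python sketch can assemble such URLs from their parts. The function `sequence_url` and its parameter names are hypothetical; the server address and the parameters (`type`, `content-type`, `start`, `end`) are the ones shown on the slides. No request is sent, the function only builds the string.

```python
SERVER = "https://rest.ensembl.org"


def sequence_url(stable_id, seq_type="cds", content_type="text/x-fasta",
                 start=None, end=None):
    """Build a sequence/id URL with ';'-separated parameters, as on the slide."""
    params = [("type", seq_type), ("content-type", content_type)]
    if start is not None:
        params.append(("start", start))
    if end is not None:
        params.append(("end", end))
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{SERVER}/sequence/id/{stable_id}?{query}"


print(sequence_url("ENST00000288602"))
print(sequence_url("ENST00000288602", start=10, end=110))
```

Building the URL in one place makes the parameters of each query explicit, which helps when the queries have to be written down for reproducibility.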
ENSEMBL: GET sequence/id/:id
https://rest.ensembl.org/documentation/info/sequence_id

    import sys

    import requests

    # Ensembl REST server and the endpoint for the cDNA of one transcript
    server = "https://rest.ensembl.org"
    ext = "/sequence/id/ENST00000288602?type=cdna"

    # Ask for the sequence as FASTA text
    r = requests.get(server + ext, headers={"Content-Type": "text/x-fasta"})

    # Stop with an error if the request failed
    if not r.ok:
        r.raise_for_status()
        sys.exit()

    print(r.text)
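The script prints the response body, which is FASTA text. A natural next step is to split that text into a header and a sequence. The following is a minimal sketch of that step, not part of the Ensembl documentation: the function name `parse_fasta` is hypothetical, and the sample string is made up to stand in for `r.text` so that the example runs without network access.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Close the last record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up FASTA text standing in for r.text (not real Ensembl output)
example = ">demo_record description\nATGGCG\nGCTTAA\n"
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

For real projects, an established parser such as the one in Biopython is usually a better choice than a hand-written one.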
A Python script can also be made executable

    #!/usr/bin/env python3
    # The line above tells the shell which interpreter runs this file

    import sys

    import requests

    server = "https://rest.ensembl.org"
    ext = "/sequence/id/ENST00000288602?type=cdna"

    r = requests.get(server + ext, headers={"Content-Type": "text/x-fasta"})

    if not r.ok:
        r.raise_for_status()
        sys.exit()

    print(r.text)
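Besides the `#!` line, the file also needs execute permission before the shell will run it directly. As a sketch, assuming a POSIX system, the permission can be set from Python itself. The helper name `make_executable` is hypothetical, and the demonstration uses a throwaway temporary file rather than the script above.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add the owner-execute bit to a file, like `chmod u+x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)


# Demonstration on a temporary file (assumes a POSIX file system)
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
    handle.write("#!/usr/bin/env python3\nprint('hello')\n")
    script_path = handle.name

make_executable(script_path)
executable_bit_set = bool(os.stat(script_path).st_mode & stat.S_IXUSR)
print(executable_bit_set)
os.remove(script_path)
```

On the command line, the usual equivalent is `chmod u+x` followed by the file name, after which the script can be started by its path.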
Pipelines
https://www.encodeproject.org/pipelines/
https://www.encodeproject.org/chip-seq/transcription_factor/
https://github.com/ENCODE-DCC/chip-seq-pipeline
Discussion groups
https://bioinformatics.stackexchange.com/
https://www.biostars.org/
Tutorials
https://www.nihlibrary.nih.gov/services/bioinformatics-support/online-bioinformatics-tutorials
https://www.biostars.org/
Prologue
Summary
Strive to make your research robust and reproducible
UNIX is the preferred environment for bioinformatics and machine learning
Conda/Anaconda/Bioconda will simplify your life tremendously
NCBI/EBI/DDBJ are the major repositories for bioinformatics data
There are many specialized bioinformatics repositories
GenBank, FASTA, and BED are examples of file formats
Entrez Direct/REST
Pipelines
Next module
Fundamentals of Machine Learning
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa