Computational Tools for Data Science Week 2: The UNIX Shell & Version Control & Amazon EC2

Post on 24-Aug-2020


Evaluations for week 1

• The lecture was too long. Break it up with exercises or have students read beforehand.
• Focus on the most important aspects and explain them in more depth.
• Lectures on syntax do not work.

Evaluations for week 1

• The slides for week 1 were nowhere to be found.
• Clear information is needed about which exercises the students are expected to finish. This information should be on the webpage.

Response: I expect you to finish all exercises unless otherwise stated. For the weeks where I have not created the material, I will give further instructions.

Evaluations for week 1

• A clearer description of the requirements for passing the course is needed. What actually happens at the final presentation, and does the work in lectures/exercises count?
• Is the final presentation a project, and can we start working on it earlier?

Response: The final presentation is a presentation. It does not have to be a large project. You can start working now if you want. Your work during the lectures/exercises does not count towards passing or failing the course.

Ideas for final presentation

‣ Awk
‣ Julia
‣ Vowpal Wabbit
‣ BigTable
‣ Cool library for Python
‣ Matlab tricks
‣ Apache Mahout
‣ A cool deep learning library
‣ Hashing Tricks
‣ Locality-sensitive hashing
‣ Feature hashing

The UNIX Shell

Which commands?

Standard tools: top, screen, chmod, diff, find, which, apt-get, ssh, wget, curl

Working with files: head, tail, more, less, cat, grep, sed, cut, sort, uniq, awk

Editors: nano, vi, emacs

Covered here

Standard tools: top, screen, chmod, diff, find, which, apt-get, ssh, wget, curl

Working with files: head, tail, more, less, cat, grep, sed, cut, sort, uniq, awk

Editors: nano, vi, emacs

SSH

~ ssh [email protected]
The authenticity of host 'hald.gbar.dtu.dk (192.38.95.41)' can't be established.
RSA key fingerprint is 78:74:43:13:9d:23:02:95:78:18:48:24:47:cf:6d:05.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hald.gbar.dtu.dk,192.38.95.41' (RSA) to the list of known hosts.
Password:
~
gray1(dawi) $ pwd
/zhome/5b/c/51358

Secure Shell (SSH) is a cryptographic network protocol for secure data communication, remote command-line login, remote command execution, and other secure network services between two networked computers.

We could do a whole lecture on RSA keys for SSH.

Pipes and redirections

>    Redirect output from a command to a file on disk.
>>   Append output from a command to a file on disk.
<    Read a command’s input from a file on disk.
|    Pass the output of one command to another for further processing.
tee  Redirect output to a file and pass it to further commands.
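
The operators above can be combined in a few lines. The following is a minimal sketch, not part of the original slides: the file names letters.txt and sorted.txt and the temporary working directory are assumptions made for illustration.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'b\na\n' > letters.txt      # > creates (or overwrites) letters.txt
printf 'c\n' >> letters.txt        # >> appends a third line
sorted=$(sort < letters.txt)       # < feeds the file to sort's standard input

# | chains the commands; tee saves a copy in sorted.txt and passes the data on.
# tr strips the padding that some wc implementations add around the count.
line_count=$(sort letters.txt | tee sorted.txt | wc -l | tr -d ' ')

echo "$sorted"
echo "$line_count"
```

Using tee in the middle of a pipeline is handy when an intermediate result is worth keeping while the processing continues.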

CAT

~ cat > temp1
This will go in temp1
~ cat > temp2
This will go in temp2
~ cat temp1
This will go in temp1
~ cat temp1 temp2
This will go in temp1
This will go in temp2
~ cat temp1 temp2 > temp3
~ cat temp3
This will go in temp1
This will go in temp2

The cat program is a utility that outputs the contents of the files it is given; it can be used to concatenate and list files.

CAT

~ cat temp3
This will go in temp1
This will go in temp2
~ cat >> temp3
This will also go in temp3
~ cat temp3
This will go in temp1
This will go in temp2
This will also go in temp3
~ cat -n temp3
     1  This will go in temp1
     2  This will go in temp2
     3  This will also go in temp3
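
In the sessions above the file contents are typed interactively and input is ended with Ctrl-D. In a script the same steps could be written with here-documents. This is a sketch under that assumption; it reuses the file names from the slides but runs in a temporary directory, which is not part of the original.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A here-document stands in for typing the text and pressing Ctrl-D.
cat > temp1 <<'EOF'
This will go in temp1
EOF
cat > temp2 <<'EOF'
This will go in temp2
EOF

cat temp1 temp2 > temp3                       # concatenate both files into temp3
echo 'This will also go in temp3' >> temp3    # append without overwriting

# Count the numbered lines; tr strips any padding added by wc.
numbered_lines=$(cat -n temp3 | wc -l | tr -d ' ')
cat -n temp3
```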

CUT

~ cat > temp
This is a temp file
1234567890
With some content
~ cut -c4 temp
s
4
h
~ cut -c4,6 temp
si
46
hs

CUT

~ cut -c4-6 temp
s i
456
h s
~ cut -c-6 temp
This i
123456
With s
~ cut -d' ' -f2 temp
is
1234567890
some
~ cut -d' ' -f2,3 temp
is a
1234567890
some content
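
Note that in the field examples the line 1234567890 is printed unchanged: cut passes through any line that does not contain the delimiter. The sketch below recreates the file from the slides in a temporary directory (an assumption) and adds the -s option, which the slides do not use, to drop such lines.

```shell
workdir=$(mktemp -d)
printf 'This is a temp file\n1234567890\nWith some content\n' > "$workdir/temp"

chars=$(cut -c4-6 "$workdir/temp")            # characters 4 to 6 of every line
fields=$(cut -d' ' -f2 "$workdir/temp")       # second space-separated field
strict=$(cut -d' ' -s -f2 "$workdir/temp")    # -s drops lines with no delimiter

printf '%s\n' "$strict"
```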

SORT

~ cat > temp
1001 Søren
33 Emil
25 Helge
1001 David
~ sort temp
1001 David
1001 Søren
25 Helge
33 Emil
~ sort -n temp
25 Helge
33 Emil
1001 David
1001 Søren

~ sort -k2 temp
1001 David
33 Emil
25 Helge
1001 Søren
~ sort -k2 -r temp
1001 Søren
25 Helge
33 Emil
1001 David
~ sort -k1,1 -k2,2 temp
1001 David
1001 Søren
25 Helge
33 Emil
~ sort -k1,1 -k2,2r temp
1001 Søren
1001 David
25 Helge
33 Emil

SORT

~ cat > temp2
David,1
Helge,10
Søren,5
~ sort -t',' -nk2 temp2
David,1
Søren,5
Helge,10
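
The difference between lexical and numeric ordering, and the use of -t and -k for delimited data, can be checked with a short script. This is a sketch only: the file names people and scores, the score values, and the use of LC_ALL=C to pin the collation order are assumptions that do not come from the slides.

```shell
workdir=$(mktemp -d)
printf '1001 David\n33 Emil\n25 Helge\n' > "$workdir/people"
printf 'David,1\nHelge,10\nEmil,5\n' > "$workdir/scores"   # hypothetical scores

# LC_ALL=C pins the collation order so the result does not depend on the locale.
# paste -s joins the resulting lines into one space-separated string.
lexical=$(LC_ALL=C sort "$workdir/people" | cut -d' ' -f1 | paste -s -d' ' -)
numeric=$(LC_ALL=C sort -n "$workdir/people" | cut -d' ' -f1 | paste -s -d' ' -)

# -t sets the field separator; -k2,2n sorts on the second field only, numerically.
by_score=$(LC_ALL=C sort -t',' -k2,2n "$workdir/scores" | cut -d',' -f1 | paste -s -d' ' -)

echo "$lexical"
echo "$numeric"
echo "$by_score"
```

Without -n the key 1001 sorts before 25 because the comparison is character by character.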

UNIQ

~ cat > temp
1
1
2
3
2
~ uniq temp
1
2
3
2
~ sort -n temp | uniq
1
2
3
~ sort -n temp | uniq -c
   2 1
   2 2
   1 3
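
As the session shows, uniq only collapses adjacent duplicate lines, which is why it is usually preceded by sort. The sketch below recreates the file in a temporary directory (an assumption) and builds a small frequency table; the awk step that reformats the counts is an addition for readability, not something from the slides.

```shell
workdir=$(mktemp -d)
printf '1\n1\n2\n3\n2\n' > "$workdir/temp"

# The second 2 survives because it is not adjacent to the first one.
adjacent_only=$(uniq "$workdir/temp" | paste -s -d' ' -)
distinct=$(sort -n "$workdir/temp" | uniq | paste -s -d' ' -)

# Frequency table, most common value first; awk normalizes the padding of uniq -c.
counts=$(sort -n "$workdir/temp" | uniq -c | sort -k1,1nr -k2,2n \
  | awk '{print $2 "=" $1}' | paste -s -d' ' -)

echo "$adjacent_only"
echo "$distinct"
echo "$counts"
```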

GREP

~ cat > temp
big
bad bug
bag
bigger
boogy
~ grep big temp
big
bigger
~ grep b.g temp
big
bad bug
bag
bigger
~ grep "b.*g" temp
big
bad bug
bag
bigger
boogy
~ grep -w big temp
big

GREP

~ cat > temp1
this is a line in the first file
~ cat > temp2
this is a line in the second file
~ grep this temp*
temp1:this is a line in the first file
temp2:this is a line in the second file

Display N lines after match
grep -A <N> "string" FILENAME

Display N lines before match
grep -B <N> "string" FILENAME

Search recursively in folders
grep -r "string" *

Invert match
grep -v "string" FILENAME

Count number of matching lines
grep -c "string" FILENAME

Display only the file names
grep -l "string" FILENAME
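
Several of these options can be tried on the five-line file from the previous slide. This is a minimal sketch that recreates the file in a temporary directory (an assumption); the variable names are illustrative only.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'big\nbad bug\nbag\nbigger\nboogy\n' > temp

matching_lines=$(grep -c 'big' temp)    # -c counts matching lines, not occurrences
whole_word=$(grep -w 'big' temp)        # -w matches "big" only as a whole word
not_matching=$(grep -v 'b.g' temp)      # -v keeps the lines that do NOT match
with_context=$(grep -A 1 'bag' temp | paste -s -d' ' -)   # match plus one following line
files=$(grep -l 'bug' temp)             # -l prints only names of files with a match

echo "$matching_lines"
echo "$not_matching"
echo "$with_context"
```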

SED

~ cat > temp
one two three, one two three
four three two one
one hundred
~ sed 's/one/ONE/' temp
ONE two three, one two three
four three two ONE
ONE hundred
~ sed 's_one_ONE_' temp
ONE two three, one two three
four three two ONE
ONE hundred
~ sed 's/[a-z]*/LOL/' temp
LOL two three, one two three
LOL three two one
LOL hundred
~ sed 's/[a-z]*/(&)/' temp
(one) two three, one two three
(four) three two one
(one) hundred
~ sed 's/[a-z]*/(&)/g' temp
(one) (two) (three),() (one) (two) (three) ()
(four) (three) (two) (one)
(one) (hundred)
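
The empty "()" pairs in the last command appear because [a-z]* can also match the empty string. The sketch below walks through the same substitutions on a single line and uses [a-z][a-z]* as one possible way to require at least one letter; that variant and the variable names are additions, not part of the slides.

```shell
line='one two three, one two three'

first_only=$(printf '%s\n' "$line" | sed 's/one/ONE/')     # first match on the line only
every_match=$(printf '%s\n' "$line" | sed 's/one/ONE/g')   # g replaces every match
other_delim=$(printf '%s\n' "$line" | sed 's_one_ONE_')    # _ used as delimiter instead of /

# & stands for the matched text. [a-z][a-z]* needs at least one letter, so
# unlike [a-z]* it cannot match the empty string and yields no empty "()".
wrapped=$(printf '%s\n' "$line" | sed 's/[a-z][a-z]*/(&)/g')

echo "$first_only"
echo "$every_match"
echo "$wrapped"
```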

Combinations

~ head temp
Id,Prediction
17000,0
17001,0
17002,1
17003,1
17004,1
17005,0
17006,0
17007,0
17008,0
~ cut -d',' -f2 temp | grep -c '0'
1983
~ cut -d',' -f2 temp | grep -c '1'
2075
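
The two counts above could also be produced in a single pass by combining tools from the earlier slides. This is a sketch on a hypothetical five-row miniature of the prediction file, not the real data; the awk step only reformats the output of uniq -c.

```shell
workdir=$(mktemp -d)
# Hypothetical miniature of the prediction file shown on the slide.
printf 'Id,Prediction\n17000,0\n17001,0\n17002,1\n17003,1\n17004,1\n' > "$workdir/temp"

# tail -n +2 skips the header line; sort groups equal labels so that
# uniq -c can count them; awk prints each one as label:count.
label_counts=$(tail -n +2 "$workdir/temp" | cut -d',' -f2 | sort | uniq -c \
  | awk '{print $2 ":" $1}' | paste -s -d' ' -)

echo "$label_counts"
```

The grep -c '0' approach counts every line that contains a 0 anywhere, so it gives the right answer here only because the column holds single-digit labels; counting with uniq -c does not depend on that.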

Version Control

“Git is a distributed revision control and source code management system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.”


GitHub is a web-based hosting service for software development projects that use the Git revision control system.


➜ ~ mkdir temp_git
➜ ~ cd temp_git
➜ temp_git cat > file1
This is in file1
➜ temp_git cat > file2
This is in file2
➜ temp_git git init
Initialized empty Git repository in /Users/dawi/temp_git/.git/
➜ temp_git git:(master) ✗ git add file1 file2
➜ temp_git git:(master) ✗ git commit -m "first commit"
 2 files changed, 2 insertions(+)
 create mode 100644 file1
 create mode 100644 file2
➜ temp_git git:(master) git status
On branch master
nothing to commit, working directory clean
➜ temp_git git:(master) git remote add origin https://github.com/utdiscant/ctfds.git
➜ temp_git git:(master) git push -u origin master
Username for 'https://github.com': utdiscant
Password for 'https://[email protected]':
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 284 bytes | 0 bytes/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/utdiscant/ctfds.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.

➜ n-62-14-11(dawi) $ git clone https://github.com/utdiscant/ctfds.git
Initialized empty Git repository in /zhome/5b/c/51358/git_temp/ctfds/.git/
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0)
Unpacking objects: 100% (4/4), done.
➜ n-62-14-11(dawi) $ ls
ctfds
➜ n-62-14-11(dawi) $ cd ctfds/
➜ n-62-14-11(dawi) $ ls
file1 file2
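
The init, add, commit and clone steps can be rehearsed without a GitHub account. The sketch below makes several assumptions that are not in the slides: it runs in a temporary directory, clones from a local path instead of the GitHub URL, names the clone temp_clone, and passes a throwaway identity with -c so that the commit works on a machine where Git has not been configured.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir temp_git
cd temp_git || exit 1
echo 'This is in file1' > file1
echo 'This is in file2' > file2

git init -q
git add file1 file2
# -c supplies a throwaway identity for this one command (hypothetical values).
git -c user.name='Example User' -c user.email='user@example.com' \
  commit -q -m 'first commit'

pending=$(git status --porcelain)       # empty when nothing is left to commit
commits=$(git rev-list --count HEAD)    # number of commits reachable from HEAD

cd "$workdir" || exit 1
git clone -q temp_git temp_clone        # a local path stands in for the GitHub URL
cloned_files=$(ls temp_clone | paste -s -d' ' -)

echo "$commits"
echo "$cloned_files"
```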

Amazon EC2

Free tier

As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:

• 750 hours of EC2 running Linux, RHEL, or SLES t2.micro instance usage
• 750 hours of Elastic Load Balancing plus 15 GB data processing
• 30 GB of Amazon Elastic Block Storage in any combination of General Purpose (SSD) or Magnetic, plus 2 million I/Os (with Magnetic) and 1 GB of snapshot storage
• 15 GB of bandwidth out aggregated across all AWS services
• 1 GB of Regional Data Transfer
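
A quick piece of arithmetic puts the 750-hour allowance in perspective. This is a rough sketch that only compares the quoted figure with the number of hours in a 31-day month; how AWS actually meters usage is not covered by the slide, and the variable names are illustrative.

```shell
# Hours one instance accumulates when it runs non-stop through a 31-day month.
hours_in_longest_month=$((31 * 24))
free_hours=750    # monthly t2.micro allowance quoted on the slide

# Integer division: how many always-on instances fit within the allowance.
always_on_instances=$((free_hours / hours_in_longest_month))
spare_hours=$((free_hours - always_on_instances * hours_in_longest_month))

echo "$hours_in_longest_month $always_on_instances $spare_hours"
```

Under these assumptions the allowance covers one instance running around the clock, while two instances running continuously would exceed it.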


Paid

http://aws.amazon.com/ec2/pricing/