Doing research with Jupyter Notebooks Georgi Karadzhov @G_Karadzhov https://gkaradzhov.com


Post on 06-Oct-2020


Doing research with Jupyter Notebooks

Georgi Karadzhov

@G_Karadzhov

https://gkaradzhov.com


What is this all about

● Writing code is not reserved for computer scientists
● Researchers in all fields write software daily
● Researchers leverage interactive computing environments, such as Jupyter and RMarkdown
● In pursuit of “open science”, we share our code/data/models, hoping our research can be used as a stepping stone for further advances


Yet poor quality code is a common occurrence


We share everything, yet nothing is reproducible


We strive for rapid prototyping, but sometimes we spend days finding bugs in our code


About me

● Professional software developer for the past 6 years
● Researcher in the field of natural language processing, focusing on fact-checking and rumour detection
● Heavy notebook user


About me

● Data Scientist @ SiteGround Hosting


Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code


Let’s address the elephant in the room


It’s OK if you don’t use or like Python! This talk is *mostly* about software engineering.


Code examples will be in Python and Jupyter, but the concepts are transferable.


Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code


Automation of processes

● Preprocessing data
● Data collection
● Calculating result measures


Data analysis

● Calculating correlations in data
● Finding outliers
● Time-series analysis


Data modeling

● Predictive modeling
● Machine learning
● Knowledge extraction


What tools do we need


Fast prototyping


Easy-to-write


Reproducible


What else?


Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code


Why do I like Python?

1. Easy-to-learn
2. Easy-to-read
3. Batteries included + Third Party Modules
   a. NumPy, SciPy, Pandas, Matplotlib, Seaborn
   b. Sklearn, StatsModels
   c. Tensorflow, Keras, PyTorch


The Zen of Python, by Tim Peters

>>> import this
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


Less time for development and more time for experiments!

● Most of the code we write is boilerplate that is tedious to write and easy to mess up

● With Python we try to minimise the code we write, but maximise the things it does

● The unexpected effectiveness of Python in science


Java:
class Hello {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

C:
#include <stdio.h>

int main(void) {
    printf("Hello World\n");
    return 0;
}

Python:
print("Hello World")


Jupyter notebooks


Jupyter notebooks

● Interactive Python environment
● Open source
● Large community
● Jupyter Lab in development


Jupyter notebooks

● Client-server architecture
● Doesn’t require any installation to use, if the server is hosted elsewhere (or if using Google Colab or similar services)


1_jupyter_intro.ipynb


Writing research code

● Exploratory code
● Data processing/modeling pipelines
● Data visualisation


2_research_python.ipynb


Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code


“What if my code is a bit messy?”


“I just play around with my data here, my code is not supposed to be perfect.”


3_buggy_notebook.ipynb


What if I have bugs in my code?


Preprocessing data

● Inconsistent preprocessing
● Missing data
● Duplicate data
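
Missing and duplicate records can often be counted with a few lines of code before any modelling starts. The following is only a minimal sketch, assuming the records are plain dictionaries; the `audit_records` helper and the field names are hypothetical and do not come from the talk's notebooks.

```python
from collections import Counter


def audit_records(records, required_fields):
    """Count records with missing required fields and exact duplicates."""
    missing = sum(
        1
        for rec in records
        if any(rec.get(field) in (None, "") for field in required_fields)
    )
    # Dicts are not hashable, so build an order-independent key per record.
    keys = Counter(tuple(sorted(rec.items())) for rec in records)
    duplicates = sum(count - 1 for count in keys.values())
    return {"total": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical toy data for illustration only.
records = [
    {"id": 1, "text": "claim A", "label": "true"},
    {"id": 2, "text": "", "label": "false"},        # missing text
    {"id": 1, "text": "claim A", "label": "true"},  # exact duplicate
]
report = audit_records(records, required_fields=["text", "label"])
print(report)
```

Printing such a report right after loading the data makes it harder for these problems to go unnoticed until evaluation time.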


Data collection

● Implicit selection bias
● Wrong data collected
● Slow and inefficient


Modeling pipelines

● Unexpected errors
● Slow research (we don’t want that)
● Suboptimal results
● Too-good-to-be-true results


Calculating result measures

● Inadequate evaluation
● Wrong or misleading results


All this leads to:

● Loss of productivity and time
● Debugging code is frustrating


Finding out that your results are wrong 2 hours before the paper deadline is even more frustrating


Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code


Fix it now

● Probably the cheesiest advice you will ever receive
● BUT: 5 minutes now typically saves 30 minutes later


Write unit tests

● Traditionally used by software developers
● Ensure that functions have the desired behaviour
● If used properly, can identify bugs early
● If your code requires a modification, you can verify that no additional bugs are introduced
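
As an illustration of the points above, here is a minimal sketch of a unit test that can run inside a notebook cell using the standard-library `unittest` module. The `normalise_text` function is a hypothetical preprocessing helper invented for this example; it is not taken from the talk's `4_unittests.ipynb` notebook.

```python
import unittest


def normalise_text(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return " ".join(text.lower().split())


class NormaliseTextTest(unittest.TestCase):
    def test_lowercases(self):
        self.assertEqual(normalise_text("Hello World"), "hello world")

    def test_collapses_whitespace(self):
        self.assertEqual(normalise_text("  a \t b\n"), "a b")

    def test_empty_string(self):
        self.assertEqual(normalise_text(""), "")


# Build and run the suite explicitly, which works in a notebook cell as well
# as in a script and does not try to exit the interpreter afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseTextTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

If `normalise_text` is later modified, re-running the cell shows immediately whether the old behaviour still holds.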


4_unittests.ipynb


Sanity checks in the pipeline

● Print before and after processing
● Check your data at each step
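
One possible way to print before and after each step without repeating the same `print` calls is a small decorator. This is a sketch under stated assumptions: each pipeline step takes a list and returns a list, and `checked_step`, `drop_empty` and `lowercase` are hypothetical names, not the contents of `5_sanity_checks.ipynb`.

```python
import functools


def checked_step(func):
    """Report the number of items before and after a pipeline step."""
    @functools.wraps(func)
    def wrapper(items):
        result = func(items)
        print(f"{func.__name__}: {len(items)} -> {len(result)} items")
        # Fail early instead of passing an empty dataset down the pipeline.
        assert len(result) > 0, f"{func.__name__} removed every item"
        return result
    return wrapper


@checked_step
def drop_empty(texts):
    """Remove strings that are empty or contain only whitespace."""
    return [t for t in texts if t.strip()]


@checked_step
def lowercase(texts):
    """Lowercase every string."""
    return [t.lower() for t in texts]


raw = ["Fact one", "   ", "RUMOUR two", ""]
clean = lowercase(drop_empty(raw))
```

A step that silently drops most of the data then shows up in the printed counts instead of in the final results.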


5_sanity_checks.ipynb


Validate the output of the notebook

● https://github.com/computationalmodelling/nbval
● Install with:

    pip install nbval

● Execute with:

    pytest --nbval 3_buggy_notebook.ipynb
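
Because nbval compares the outputs stored in the notebook with freshly computed ones, a cell whose output legitimately changes between runs (timings, timestamps) would be reported as a failure. nbval's documentation describes comment markers for such cells; the sketch below uses `# NBVAL_IGNORE_OUTPUT`, which asks nbval to execute the cell but skip the output comparison. The `timed_sum` function is a hypothetical example.

```python
# NBVAL_IGNORE_OUTPUT
# The printed timing differs on every run, so the stored output cannot match.
import time


def timed_sum(values):
    """Return the sum of the values and the seconds it took to compute it."""
    start = time.perf_counter()
    total = sum(values)
    return total, time.perf_counter() - start


total, elapsed = timed_sum(range(1000))
print(f"sum={total}, took {elapsed:.6f}s")
```

The cell is still executed, so an exception raised inside it would still make the check fail.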


If you copy-paste similar code within a notebook more than 3 times - extract it into a function
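
A minimal sketch of this rule, assuming the repeated snippet was an accuracy calculation pasted once per experiment. The `accuracy` function and the toy labels are hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels differ in length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy labels for illustration only.
gold = ["true", "false", "true", "true"]
baseline_preds = ["true", "true", "true", "true"]
model_preds = ["true", "false", "true", "true"]

# One call per experiment instead of a pasted copy of the same loop.
baseline_acc = accuracy(baseline_preds, gold)
model_acc = accuracy(model_preds, gold)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```

A bug fixed in the function is fixed for every experiment at once, which is not the case for pasted copies.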


If you reuse similar code between more than 3 notebooks - extract it into an external file
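
A sketch of what such an external file might look like. The file name `research_utils.py` and the `tokenize` helper are assumptions made for illustration; any module name placed next to the notebooks would work the same way.

```python
# research_utils.py (hypothetical file name, saved next to the notebooks)
"""Helpers shared by several notebooks."""
import re

_TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return _TOKEN_PATTERN.findall(text.lower())


if __name__ == "__main__":
    # Quick manual check when the file is run directly.
    print(tokenize("Fact-checking is HARD."))

# In each notebook, the pasted copy is then replaced by a single import:
#     from research_utils import tokenize
```

While the module is still changing, IPython's `autoreload` extension (`%load_ext autoreload` followed by `%autoreload 2`) can pick up edits without restarting the kernel.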


Use version control systems

● Keeping track of different code versions
● It will be easier to reproduce previous results
● Track changes
● Enables code sharing
● Easy to set up
● Have a free tier:
  ○ https://github.com/
  ○ https://bitbucket.org/product


Use version control systems

● The learning curve for Git (or other version control systems) is steep, but it pays off
● There are a lot of good tutorials online:
  ○ https://backlog.com/git-tutorial/what-is-git/
  ○ https://try.github.io/


Code Reviews

● Once you have made a significant change to your code - ask someone to review it
● It may be your friend, colleague, or supervisor
● If your code is not proprietary you can ask someone from outside your lab for help (reddit/stackoverflow)
● You can go one step further and contribute to open-source projects on GitHub


Use virtual environments

● Self-contained
● Can execute different projects with different versions of your packages
● Easy to create and use:

    python3 -m venv name_of_your_venv

    source name_of_your_venv/bin/activate
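
A notebook kernel does not necessarily run in the environment that was activated in the shell, so it can be worth checking from inside the notebook. The following is a small sketch using only the standard library; `in_virtual_environment` is a hypothetical helper name.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python installation it was made from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

If the printed interpreter path is not inside `name_of_your_venv`, the kernel is using a different set of packages than expected.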


Save requirements.txt

● One line of code:

pip freeze > requirements.txt


Save requirements.txt

appnope==0.1.0
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
certifi==2019.3.9
chardet==3.0.4
cycler==0.10.0
decorator==4.3.2
defusedxml==0.5.0
entrypoints==0.3
idna==2.8
ipykernel==5.1.0
ipympl==0.2.1
ipython==7.3.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.3
Jinja2==2.10
jsonschema==3.0.1
jupyter==1.0.0
jupyter-client==5.2.4
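
The same kind of listing can also be produced from inside a notebook, so the package versions are recorded next to the results. This is a sketch using the standard-library `importlib.metadata` module (Python 3.8 or later); it does not replace `pip freeze`, and `pinned_requirements` is a hypothetical helper name.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


requirements = pinned_requirements()
# Show only the first few entries to keep the notebook output short.
print("\n".join(requirements[:5]))
```

The list reflects the environment of the running kernel, which is the one that actually produced the results.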


Use a rich directory structure


Before:


After:


Write the docs

● Write README files
● Write system descriptions


Save your code revision (or notebook) + requirements.txt + data + README and archive it.
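
One way to do the archiving step is a small helper built on the standard-library `zipfile` module. This is a sketch only: `archive_experiment` and all file names are hypothetical, and the demonstration uses throwaway placeholder files in a temporary directory rather than real experiment data.

```python
import tempfile
import zipfile
from pathlib import Path


def archive_experiment(paths, archive_path):
    """Bundle the given files into one zip archive and return its path."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in map(Path, paths):
            if not path.is_file():
                raise FileNotFoundError(f"cannot archive missing file: {path}")
            # Store each file under its bare name, without directories.
            zf.write(path, arcname=path.name)
    return archive_path


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    names = ["analysis.ipynb", "requirements.txt", "data.csv", "README.md"]
    for name in names:
        (tmp / name).write_text("placeholder\n")
    bundle = archive_experiment([tmp / n for n in names], tmp / "experiment.zip")
    with zipfile.ZipFile(bundle) as zf:
        archived = sorted(zf.namelist())
print(archived)
```

Raising on a missing file is a deliberate choice here: an archive that silently lacks the data or the requirements cannot be used to reproduce the result later.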


Share your code with others

● Open science and whatnot
● High accountability
● Empirically proven that code that is shared has fewer bugs


Fixing issues in our code is hard. We do it anyway!


Questions?

Slides and additional information:

https://gkaradzhov.com/research-code-with-jupyter-notebooks/

Ask now or ping me later at:

@G_Karadzhov or [email protected]