Post on 06-Oct-2020

Doing research with Jupyter Notebooks

Georgi Karadzhov

@G_Karadzhov

https://gkaradzhov.com

What is this all about

● Writing code is not reserved for computer scientists
● Researchers in all fields write software daily
● Researchers leverage interactive computing environments, such as Jupyter and RMarkdown
● In pursuit of “open science”, we share our code/data/models hoping our research can be used as a stepping stone for further advances

Yet poor quality code is a common occurrence

We share everything, yet nothing is reproducible

We strive for rapid prototyping, but sometimes we spend days finding bugs in our code

About me

● Professional software developer for the past 6 years
● Researcher in the field of natural language processing, focusing on fact-checking and rumour detection
● Heavy notebook user

About me

● Data Scientist @ SiteGround Hosting

Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code

Let’s address the elephant in the room

It’s OK if you don’t use or like Python! This talk is *mostly* about software engineering.

Code examples will be in Python and Jupyter, but the concepts are transferable.

Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code

Automation of processes

● Preprocessing data
● Data collection
● Calculating result measures

Data analysis

● Calculating correlations in data
● Finding outliers
● Time-series analysis

Data modeling

● Predictive modeling
● Machine learning
● Knowledge extraction

What tools do we need

Fast prototyping

Easy-to-write

Reproducible

What else?

Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code

Why do I like Python?

1. Easy-to-learn
2. Easy-to-read
3. Batteries included + Third Party Modules
   a. NumPy, SciPy, Pandas, Matplotlib, Seaborn
   b. Sklearn, StatsModels
   c. Tensorflow, Keras, PyTorch

The Zen of Python, by Tim Peters

>>> import this
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Less time for development and more time for experiments!

● Most of the code we write is boilerplate code that is tedious to write and easy to mess up

● With Python we try to minimise the code we write, but maximise the things it does

● The unexpected effectiveness of Python in science

Java:
class Hello {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

C:
#include <stdio.h>

int main(void) {
    printf("Hello World\n");
    return 0;
}

Python:
print("Hello World")

Jupyter notebooks

Jupyter notebooks

● Interactive Python environment
● Open source
● Large community
● Jupyter Lab in development

Jupyter notebooks

● Client-server architecture
● Doesn’t require any installation to use if the server is hosted elsewhere (or if using Google Colab or similar services)

1_jupyter_intro.ipynb

Writing research code

● Exploratory code
● Data processing/modeling pipelines
● Data visualisation

2_research_python.ipynb

Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code

“What if my code is a bit messy?”

“I just play around with my data here, my code is not supposed to be perfect.”

3_buggy_notebook.ipynb

What if I have bugs in my code?

Preprocessing data

● Inconsistent preprocessing
● Missing data
● Duplicate data
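
Missing and duplicate records like those listed above can often be caught with a few lines of checking before any modelling starts. The following is only a minimal sketch, assuming the records are plain dictionaries; the `audit_records` helper and the field names are hypothetical and not taken from the talk's notebooks.

```python
from collections import Counter

def audit_records(records, required_fields):
    """Return (indices of records with missing fields, list of duplicated records)."""
    missing = [
        i for i, rec in enumerate(records)
        if any(rec.get(field) in (None, "") for field in required_fields)
    ]
    # Records are dicts, so turn each into a hashable, order-independent key.
    counts = Counter(tuple(sorted(rec.items())) for rec in records)
    duplicates = [dict(key) for key, n in counts.items() if n > 1]
    return missing, duplicates

# Hypothetical toy data: one missing label, one exact duplicate.
records = [
    {"text": "claim A", "label": "true"},
    {"text": "claim B", "label": None},
    {"text": "claim A", "label": "true"},
]
missing, duplicates = audit_records(records, required_fields=["text", "label"])
print("records with missing fields:", missing)
print("duplicated records:", duplicates)
```

With pandas, `DataFrame.isna()` and `DataFrame.duplicated()` cover the same ground; the point is to look before trusting the data.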

Data collection

● Implicit selection bias
● Wrong data collected
● Slow and inefficient

Modeling pipelines

● Unexpected errors
● Slow research (we don’t want that)
● Suboptimal results
● Too-good-to-be-true results
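
One possible cause of too-good-to-be-true results is test examples that also appear in the training data. The sketch below is an assumption about how such a check might look, not something shown in the talk; `find_overlap` and the toy sentences are hypothetical.

```python
def find_overlap(train_texts, test_texts):
    """Return test examples that also appear in the training data."""
    def normalise(text):
        # Collapse case and whitespace so trivial variants count as the same example.
        return " ".join(text.lower().split())

    seen = {normalise(t) for t in train_texts}
    return [t for t in test_texts if normalise(t) in seen]

# Hypothetical toy split with one leaked example.
train = ["Cats are mammals", "Paris is in France"]
test = ["cats are  mammals", "Water boils at 100 C"]
leaked = find_overlap(train, test)
print(f"{len(leaked)} of {len(test)} test examples also occur in train: {leaked}")
```

An empty result does not prove the evaluation is sound, but a non-empty one is worth investigating before reporting numbers.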

Calculating result measures

● Inadequate evaluation
● Wrong or misleading results

All this leads to:

● Loss of productivity and time
● Debugging code is frustrating

Finding out that your results are wrong 2 hours before the paper deadline is even more frustrating

Outline

● Writing research code
● Python and Jupyter notebooks for research
● What can possibly go wrong
● Tools and processes for writing reliable research code

Fix it now

● Probably the cheesiest advice you will ever receive
● BUT: 5 minutes now typically saves 30 minutes later

Write unit tests

● Traditionally used by software developers
● Ensure that functions have the desired behaviour
● If used properly, can identify bugs early
● If your code requires a modification, you can verify that no additional bugs are introduced

4_unittests.ipynb
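
As a rough illustration of the idea (not the contents of the notebook above), here is a minimal sketch using Python's built-in `unittest` module. The `normalise_text` helper is hypothetical and stands in for any small preprocessing function.

```python
import unittest

def normalise_text(text):
    """Lower-case the text and collapse runs of whitespace (hypothetical helper)."""
    return " ".join(text.lower().split())

class NormaliseTextTest(unittest.TestCase):
    def test_lowercases(self):
        self.assertEqual(normalise_text("Hello World"), "hello world")

    def test_collapses_whitespace(self):
        self.assertEqual(normalise_text("  a \t b\n"), "a b")

    def test_empty_string(self):
        self.assertEqual(normalise_text(""), "")

# Load and run the tests explicitly so this also works inside a notebook cell,
# where a bare unittest.main() would try to parse the kernel's command-line
# arguments and then exit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseTextTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

If `normalise_text` is later modified, re-running the same cell shows whether the old behaviour still holds.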

Sanity checks in the pipeline

● Print before and after processing
● Check your data at each step

5_sanity_checks.ipynb
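
One way to make such checks routine is to wrap each pipeline step in a small helper that prints sizes and asserts basic invariants. This is a sketch under assumptions: `checked_step` and `drop_short_texts` are hypothetical names, and the invariants shown only suit filtering steps.

```python
def drop_short_texts(texts, min_words=3):
    """Keep only texts with at least `min_words` words (hypothetical step)."""
    return [t for t in texts if len(t.split()) >= min_words]

def checked_step(step, data, **kwargs):
    """Run one pipeline step, printing sizes and asserting basic invariants."""
    print(f"{step.__name__}: {len(data)} items before")
    out = step(data, **kwargs)
    print(f"{step.__name__}: {len(out)} items after")
    # A filtering step should never invent data or silently drop everything.
    assert len(out) <= len(data), "step produced more items than it received"
    assert len(out) > 0, "step removed every item; check the threshold"
    return out

texts = ["too short", "this one is long enough", "so is this one"]
kept = checked_step(drop_short_texts, texts, min_words=3)
```

A failed assertion stops the notebook at the step that misbehaved, instead of letting the problem surface in the final results.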

Validate the output of the notebook

● https://github.com/computationalmodelling/nbval
● Install with:
    pip install nbval
● Execute with:
    py.test --nbval 3_buggy_notebook.ipynb
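
nbval re-runs the notebook and compares the new outputs with the stored ones. A much lighter check, shown here only as a complementary sketch and not as a replacement, is to confirm that the saved cells were executed top to bottom. It assumes the standard `.ipynb` JSON layout (`cells`, `cell_type`, `execution_count`); the `cells_ran_in_order` helper and the in-memory notebooks are hypothetical.

```python
import json

def cells_ran_in_order(notebook):
    """True if executed code cells have execution counts 1, 2, 3, ... in document order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and cell.get("execution_count") is not None
    ]
    return counts == list(range(1, len(counts) + 1))

# Hypothetical in-memory notebooks; a real .ipynb file is JSON and could be
# read from disk with json.load instead.
clean = {"cells": [
    {"cell_type": "code", "execution_count": 1, "source": "x = 1", "outputs": []},
    {"cell_type": "markdown", "source": "notes"},
    {"cell_type": "code", "execution_count": 2, "source": "x + 1", "outputs": []},
]}
messy = json.loads(json.dumps(clean))  # deep copy via a JSON round-trip
messy["cells"][0]["execution_count"] = 7  # first cell was re-run later

print(cells_ran_in_order(clean))
print(cells_ran_in_order(messy))
```

Out-of-order execution counts suggest that the stored outputs depend on hidden kernel state, so a restart-and-run-all may give different results.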

If you copy-paste similar code within a notebook more than 3 times, extract it into a function
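
For example, a scoring snippet that would otherwise be pasted once per model can become a single function. A minimal sketch; `accuracy` and the toy predictions are hypothetical.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

gold = ["true", "false", "true", "true"]
# The same call replaces three copy-pasted scoring snippets.
for name, preds in {
    "baseline": ["true", "true", "true", "true"],
    "model_a": ["true", "false", "false", "true"],
    "model_b": ["true", "false", "true", "true"],
}.items():
    print(name, accuracy(preds, gold))
```

A bug fixed inside the function is fixed for every model at once, which is not true of pasted copies.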

If you reuse similar code between more than 3 notebooks, extract it into an external file
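
The sketch below shows the mechanics of importing a shared module from a notebook. To stay self-contained it writes the module to a temporary directory first; in a real project the file would simply sit next to the notebooks. The module name `text_utils` and its `normalise` function are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# For the sake of a self-contained example, write the shared module to a
# temporary directory; in a real project it would live next to the notebooks.
project_dir = Path(tempfile.mkdtemp())
(project_dir / "text_utils.py").write_text(
    'def normalise(text):\n'
    '    """Lower-case the text and collapse whitespace."""\n'
    '    return " ".join(text.lower().split())\n',
    encoding="utf-8",
)

# Make the directory importable, then import the module as any notebook would.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()
text_utils = importlib.import_module("text_utils")

print(text_utils.normalise("  Shared   HELPER "))
```

Every notebook that imports the module now runs the same code, and the file can be unit-tested and version-controlled on its own.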

Use version control systems

● Keeping track of different code versions
● It will be easier to reproduce previous results
● Track changes
● Enables code sharing
● Easy to set up
● Have a free tier:
  ○ https://github.com/
  ○ https://bitbucket.org/product
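
To make previous results easier to reproduce, one option is to record the current commit hash next to every set of results. This is a sketch, assuming the notebook lives inside a Git repository and the `git` command is on the path; `current_commit` is a hypothetical helper.

```python
import subprocess

def current_commit():
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or this directory is not inside a repository.
        return None
    return completed.stdout.strip()

commit = current_commit()
print("results produced by commit:", commit or "unknown (not a Git repository)")
```

Storing that hash in the results file makes it possible to check out exactly the code that produced a given number.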

Use version control systems

● The learning curve for Git (or other version control) is steep, but it pays off
● There are a lot of good tutorials online:
  ○ https://backlog.com/git-tutorial/what-is-git/
  ○ https://try.github.io/

Code Reviews

● Once you have made a significant change to your code, ask someone to review it
● It may be your friend, colleague, or supervisor
● If your code is not proprietary, you can ask someone from outside your lab for help (Reddit/Stack Overflow)
● You can go one step further and contribute to open-source projects on GitHub

Use virtual environments

● Self-contained
● Can execute different projects with different versions of your packages
● Easy to create and use:
    python3 -m venv name_of_your_venv
    source name_of_your_venv/bin/activate
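
A quick way to confirm from inside a notebook that the kernel really runs in the activated environment is to compare `sys.prefix` with `sys.base_prefix`. A minimal sketch; `in_virtualenv` is a hypothetical helper, and the check targets environments created with `venv` (conda environments, for instance, are not detected this way).

```python
import sys

def in_virtualenv():
    """True when the interpreter runs inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

Printing `sys.executable` as well shows which Python the notebook kernel is actually using, which is a common source of "package not found" confusion.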

Save requirements.txt

● One line of code:
    pip freeze > requirements.txt

Save requirements.txt

appnope==0.1.0
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
certifi==2019.3.9
chardet==3.0.4
cycler==0.10.0
decorator==4.3.2
defusedxml==0.5.0
entrypoints==0.3
idna==2.8
ipykernel==5.1.0
ipympl==0.2.1
ipython==7.3.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.3
Jinja2==2.10
jsonschema==3.0.1
jupyter==1.0.0
jupyter-client==5.2.4
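
A similar list can be produced from inside a running notebook, which records what the kernel actually sees. This sketch uses the standard-library `importlib.metadata` module (Python 3.8+); `installed_requirements` is a hypothetical helper, and its output may differ in detail from `pip freeze`, for example for editable installs.

```python
from importlib import metadata

def installed_requirements():
    """Return sorted 'name==version' lines for installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:  # skip broken installs with incomplete metadata
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)

requirements = installed_requirements()
print("\n".join(requirements[:5]))
```

Writing `"\n".join(requirements)` to a file at the end of a notebook run keeps the package versions together with the results.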

Use a rich file directory structure

Before:

After:

Write the docs

● Write README files
● Write system descriptions

Save your code revision (or notebook) + requirements.txt + data + README and archive it.
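
The archiving step can be scripted so it is never skipped. A minimal sketch using the standard-library `zipfile` module; `archive_experiment` and the placeholder file names are hypothetical, and storing only the file names assumes no two files share a name.

```python
import tempfile
import zipfile
from pathlib import Path

def archive_experiment(files, archive_path):
    """Write the given files into one zip archive and return the stored names."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Store only the file name so the archive has no machine-specific paths.
            zf.write(path, arcname=Path(path).name)
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()

# Hypothetical stand-ins for the real notebook, requirements, data and README.
workdir = Path(tempfile.mkdtemp())
names = ["analysis.ipynb", "requirements.txt", "data.csv", "README.md"]
for name in names:
    (workdir / name).write_text(f"placeholder for {name}\n", encoding="utf-8")

stored = archive_experiment([workdir / n for n in names], workdir / "experiment.zip")
print(stored)
```

Reading the names back from the finished archive doubles as a sanity check that everything needed to rerun the experiment is inside.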

Share your code with others

● Open science and whatnot
● High accountability
● Empirically proven that code that is shared has fewer bugs

Fixing issues in our code is hard. We do it anyway!

Questions?

Slides and additional information:
https://gkaradzhov.com/research-code-with-jupyter-notebooks/

Ask now or ping me later at:
@G_Karadzhov or georgi.m.karadjov@gmail.com
