
Page 1

Tommaso Marcelli
Introduction to Data Science
June 25, 2021

Page 2

About me

I currently work at Kotuko as Chief Technology Officer. In recent years, I have been working as a full-stack developer on several web and mobile projects.

Before that, I spent around 10 years doing academic research and teaching in experimental and computational organic chemistry.

Tommaso Marcelli
Developer and chemist

Page 3

About

This lecture aims to give an overview of the exciting field of Data Science, highlighting some of the key concepts and providing relevant information and resources.

In the first part, we will focus on the definition of Data Science and on the underlying concepts, with as little math as possible.

We will then move to notebooks and look at the most common tools which are used to clean, manipulate and visualize data, with a case study.

Finally, we will look at an interactive example, where we will load a real-world dataset and try to work with it, in order to extract meaningful information and visualize relevant data.

Page 5

What is Data Science?

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 6

Data Science vs ML vs AI

Data Science
● The process of using advanced analytics to extract relevant information from data
● Involves a wide range of analytics applications that can aid decision-making in organizations
● Incorporates predictive modeling, statistical analysis, machine learning and other functions

Machine Learning
● A form of analytics in which software programs learn about data and find patterns or insights
● Uses a variety of algorithms and analytical models to support different types of data analysis
● Includes a combination of supervised, unsupervised and reinforcement learning mechanisms

Artificial Intelligence
● Development of computerized applications that simulate human intelligence and interaction
● Handles specific tasks now; general intelligence and cognitive capabilities are a future goal
● Uses algorithms for machine learning, natural language processing, automation and more

https://searchbusinessanalytics.techtarget.com/feature/Data-science-vs-machine-learning-vs-AI-How-they-work-together

Page 7

Data Science Applications

https://www.edureka.co/blog/what-is-data-science/

Page 8

https://theblog.okcupid.com/the-most-important-questions-on-okcupid-32e80bad0854

Page 10

http://brussels-diversity.jetpack.ai/

Page 11

The GIGO principle

The principle of Garbage In, Garbage Out states that computers simply process the data they are fed.

Garbage is a broad and expanding category in data science: poorly labeled or inaccurate data, data that reflects underlying human prejudices, incomplete data.

This means the quality of the output depends on the quality of the input. With bad data, applications will produce results that are inaccurate, incomplete or incoherent.

https://towardsdatascience.com/data-quality-garbage-in-garbage-out-df727030c5eb

Page 12

COVID data overflow

Since the beginning of the COVID pandemic, a stunning amount of data has been produced and made available to the general public.

Experts, journalists and data nerds have been trying to make sense of these data and to rationalize the numbers coming in from all around the world.

● Is the case-fatality rate a good metric for comparing countries?

● What is the underlying assumption for it to make any sense?

● What about the number of deaths?

https://coronavirus.jhu.edu/data/mortality

Page 13

COVID data overflow

https://www.ft.com/content/6bd88b7d-3386-4543-b2e9-0d5c6fac846c

Page 14

Jargon

Data set: A collection of data

Model: A mathematical representation of a real-world process

Outlier: An observation that lies an unusual distance from other values

Bias: A systematic difference from the population parameter being considered

Feature: An individual independent variable that acts as an input to your system

Label: The variable which needs to be predicted by a model
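As a small illustration of features, label and outlier, here is a hypothetical data set (all values are invented for this example):

import pandas as pd

# Hypothetical housing data: two features and one label (values invented)
df = pd.DataFrame({
    "size_m2":    [50, 75, 60, 300],    # feature
    "rooms":      [2, 3, 2, 10],        # feature
    "price_keur": [120, 180, 150, 900]  # label: the variable a model would predict
})

# The last observation lies far from the others: a candidate outlier
print(df.describe())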

Page 15

The process

https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492

Page 16

How do you get data?

Databases
Data can be obtained from a variety of databases, both relational (MySQL, Oracle) and non-relational (MongoDB). Most likely, a series of queries needs to be executed to return data which can be used in the subsequent steps of a data science project.

API
API stands for Application Programming Interface and indicates a way for a service to expose data and functionality. This approach involves a series of requests to a service to obtain the data required before moving to the next step.

Scraping
Scraping indicates an automated process in which many web pages are downloaded and analyzed programmatically to extract data. In some cases it is illegal, and the scraping code should be robust enough to avoid collecting incomplete or corrupted data.
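A minimal sketch of the three acquisition routes, assuming the requests and beautifulsoup4 packages are installed; the database file, API endpoint and page URL below are placeholders, not real services:

import sqlite3
import requests
from bs4 import BeautifulSoup

# Database: execute a query and fetch the resulting rows
conn = sqlite3.connect("example.db")  # placeholder database file
rows = conn.execute("SELECT name, value FROM measurements").fetchall()

# API: request JSON data from a (hypothetical) endpoint
response = requests.get("https://api.example.com/v1/measurements",
                        params={"city": "Torino"}, timeout=10)
data = response.json()

# Scraping: download a page and extract its table cells programmatically
html = requests.get("https://example.com/table-page", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]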

Page 17

Kaggle

https://www.kaggle.com/datasets

Page 18

Public Open Data

http://aperto.comune.torino.it/dataset

Page 19

Real data are messy

Page 20

Exploring datasets

Once a dataset has been cleaned, simple data visualization can be valuable to get a rough idea of trends.

Plotting the distribution of the values of each feature (on the main diagonal of a pair plot) can help identify problems with the dataset, such as an unexpected distribution or capping.

Plotting the relation between the values for each pair of features shows the level of correlation between them.

This can be used to obtain insights and to identify redundant features in the dataset.
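For example, pandas can draw such a scatter-matrix (pair) plot directly; the file name below is a placeholder:

import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.read_csv("cleaned_dataset.csv")  # placeholder file name

# Histograms of each feature on the main diagonal,
# pairwise scatter plots everywhere else
scatter_matrix(df.select_dtypes("number"), figsize=(8, 8), diagonal="hist")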

Page 21

Data Modeling

https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
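A minimal sketch of polynomial regression using NumPy, on synthetic data invented for this illustration (not the exact example from the linked article):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic, noisy quadratic data (values invented for this sketch)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.5, size=x.size)

# Fit a degree-2 polynomial model to the observations
model = np.poly1d(np.polyfit(x, y, deg=2))

plt.scatter(x, y, label="data")
plt.plot(x, model(x), color="red", label="degree-2 fit")
plt.legend()
plt.show()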

Page 22

Train, test, validate

https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
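A common way to obtain the three sets with scikit-learn is a double split; the data and the 60/20/20 proportions below are arbitrary choices made for this sketch:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and labels, just to make the split concrete
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve out 40% of the data, then split it evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10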

Page 23

Underfitting and Overfitting

https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning
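The effect can be made concrete by comparing training and validation errors of polynomial models of increasing degree; the data and the degrees below are invented for this sketch:

import numpy as np

# Noisy samples of a smooth function (synthetic data for illustration)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Hold out every second point to measure generalization
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

for degree in (1, 4, 12):
    model = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    val_mse = np.mean((model(x_val) - y_val) ** 2)
    # Degree 1 underfits (both errors high); a very high degree tends to
    # overfit (training error keeps dropping, validation error does not)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")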

Page 24

Model interpretation

Complex models (such as neural networks) are notoriously difficult to interpret ("black boxes"). Model interpretability is necessary to verify that what the model is doing is in line with what is expected; it also helps create trust with users and eases the transition from manual to automated processes.

Why is it essential to do an in-depth analysis of your models?

● Identify and mitigate bias
● Account for the context of the problem
● Improve generalisation and performance
● Ethical and legal reasons

https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f

Page 25

Human Bias

Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model's predictions susceptible to bias.

Common types of human bias

● Reporting Bias
● Automation Bias
● Selection Bias
● Group Attribution Bias
● Implicit Bias

https://developers.google.com/machine-learning/crash-course/fairness/types-of-bias

Page 26

Bias / Variance Tradeoff

The generalization of a model is affected by three different types of error:

Bias: Due to wrong assumptions. High-bias models underfit the training data.

Variance: Due to excessive sensitivity to small variations. High-variance models overfit the training data.

Irreducible error: The part of the error due to noise in the data.

Increasing a model's complexity increases variance and reduces bias (and vice versa).

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

Page 27

Why Python?

In recent years, Python has quickly become the language of choice for machine learning and for data science in general.

Python is considered to have a gentle learning curve and to be quite intuitive to read, which makes it a popular choice for beginners.

Also, when compared to other languages, Python offers a stunning ecosystem of libraries and software which can be used for data science.

Although a basic understanding of the language is fundamental, it may be wiser to focus efforts on the underlying theory and on the use of popular libraries.

https://www.python.org/about/gettingstarted/

~]# python3 variable-args.py
<class 'tuple'>
Output of our function is: 40
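The output above could come from a script like the following; the function name and the argument values are assumptions made for this reconstruction:

# variable-args.py -- a possible reconstruction, not the original script
def add_all(*args):
    print(type(args))           # *args is collected into a tuple -> <class 'tuple'>
    return sum(args)

result = add_all(10, 20, 7, 3)  # hypothetical values that happen to sum to 40
print(f"Output of our function is: {result}")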

Page 28

Markdown

https://guides.github.com/features/mastering-markdown/

Page 29

Jupyter

Jupyter notebooks are the de facto standard for prototyping and sharing data science projects.

Python Jupyter notebooks (.ipynb files) are documents made up of a list of cells, each of which can contain either Markdown or Python code.

In Jupyter, the Python code can be executed inline, displaying the results directly under code blocks.

Markdown cells can be used to complement the code by adding comments and documentation to the content of the notebook.

Jupyter can be installed and run on a local machine (see the docs for your platform).

https://jupyter-notebook.readthedocs.io/en/stable/notebook.html

Page 30

Google Colab

Google Colaboratory (usually abbreviated as Colab) is a free cloud-hosted Jupyter Notebook service, which can be accessed using a Google account.

Google Colab offers storage and access to modest CPU, GPU or TPU resources to carry out relatively simple analyses directly in the browser.

Colab files are saved in Google Drive and the software is nicely integrated with the storage service, allowing easy inclusion of files in a notebook.

Artifacts can be easily downloaded from Colab.

https://colab.research.google.com/notebooks/basic_features_overview.ipynb
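For example, a notebook can mount Google Drive and read a file from it; the CSV path below is a placeholder:

# Runs only inside a Colab notebook
from google.colab import drive
drive.mount('/content/drive')

# After mounting, Drive files appear under /content/drive/MyDrive/
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/example.csv')  # placeholder path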

Page 31

NumPy

NumPy is the leading Python scientific library for working with arrays and matrices.

NumPy vectors (one-dimensional arrays) are significantly easier to work with than Python lists of numbers, their native counterparts.

Machine learning (and data science in general) requires many mathematical operations on vectors: NumPy has optimized implementations of all the most common operations and, in general, performs significantly better than Python lists.

In addition to that, NumPy provides methods to solve common mathematical problems (for instance solving linear equations).

https://numpy.org/doc/stable/user/quickstart.html
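A minimal sketch of typical NumPy operations (the numbers are arbitrary):

import numpy as np

# Element-wise vector arithmetic without explicit Python loops
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(v + w, v * w, v @ w)    # element-wise sum, element-wise product, dot product

# Solving the linear system A x = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))  # [2. 3.]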

Page 32

Pandas

Pandas is the best-known Python library for working with data sets.

It allows loading, manipulating and visualizing information from large data sets, providing useful metrics (for instance, the number of missing values).

Results can be rendered inline to get a quick overview of what a data set looks like (Scrub and Explore steps).

When working with Pandas, the key data structure is the DataFrame, defined as two-dimensional, size-mutable, potentially heterogeneous tabular data.

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
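A few typical first steps with a DataFrame; the file name is a placeholder:

import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder file name

df.info()                # column types and non-null counts (prints directly)
print(df.head())         # quick look at the first rows
print(df.isna().sum())   # number of missing values per column
print(df.describe())     # basic statistics for the numeric columns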

Page 33

Matplotlib

Matplotlib is a powerful data-visualization library for Python, which is particularly useful for getting a graphical representation of data.

Matplotlib allows detailed customization of the produced charts through a number of parameters; however, the library also has sensible defaults, which let beginners get results quickly.

Matplotlib handles a stunning number of different visualizations, from standard charts (scatter plots, pie charts, histograms) up to very complex representations, giving full control over colors, scales, legends and other graphical details.

https://matplotlib.org/stable/gallery/index.html
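A small sketch of two standard chart types, using data invented for this example:

import numpy as np
import matplotlib.pyplot as plt

# Invented data for the two example charts
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y, s=10, color="steelblue")
ax1.set_title("Scatter plot")
ax2.hist(x, bins=20, color="darkorange")
ax2.set_title("Histogram")
plt.tight_layout()
plt.show()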

Page 36

Other Resources

Kaggle Courses: https://www.kaggle.com/learn

Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course

Page 37

Thank you for listening!