
Page 1

Tommaso Marcelli
Introduction to Data Science
June 25, 2021

Page 2

About me

I currently work at Kotuko as Chief Technology Officer. In recent years, I have been working as a full-stack developer on several web and mobile projects.

Before that, I spent around 10 years doing academic research and teaching in experimental and computational organic chemistry.

Tommaso Marcelli
Developer and chemist

Page 3

About

This lecture aims to give an overview of the exciting field of Data Science, highlighting some of the key concepts and providing relevant information and resources.

In the first part, we will focus on the definition of Data Science and on the underlying concepts, with as little math as possible.

We will then move to notebooks and look at the most common tools which are used to clean, manipulate and visualize data, with a case study.

Finally, we will look at an interactive example, where we will load a real-world dataset and try to work with it, in order to extract meaningful information and visualize relevant data.

Page 5

What is Data Science?

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 6

Data Science vs ML vs AI

Data Science
● The process of using advanced analytics to extract relevant information from data
● Involves a wide range of analytics applications that can aid decision-making in organizations
● Incorporates predictive modeling, statistical analysis, machine learning and other functions

Machine Learning
● A form of analytics in which software programs learn about data and find patterns or insights
● Uses a variety of algorithms and analytical models to support different types of data analysis
● Includes a combination of supervised, unsupervised and reinforcement learning mechanisms

Artificial Intelligence
● Development of computerized applications that simulate human intelligence and interaction
● Handles specific tasks now; general intelligence and cognitive capabilities are a future goal
● Uses algorithms for machine learning, natural language processing, automation and more

https://searchbusinessanalytics.techtarget.com/feature/Data-science-vs-machine-learning-vs-AI-How-they-work-together

Page 7

Data Science Applications

https://www.edureka.co/blog/what-is-data-science/

Page 8

https://theblog.okcupid.com/the-most-important-questions-on-okcupid-32e80bad0854

Page 10

http://brussels-diversity.jetpack.ai/

Page 11

The GIGO principle

The principle of Garbage In, Garbage Out states that computers simply process the data they are fed.

Garbage is a broad and expanding category in data science: poorly labeled or inaccurate data, data that reflects underlying human prejudices, incomplete data.

This means the quality of the output depends on the quality of the input. With bad data, applications will produce results that are inaccurate, incomplete or incoherent.

https://towardsdatascience.com/data-quality-garbage-in-garbage-out-df727030c5eb

Page 12

COVID data overflow

Since the beginning of the COVID pandemic, a stunning amount of data has been produced and made available to the general public.

Experts, journalists and data nerds have been trying to make sense of these data and to rationalize the numbers coming in from all around the world.

● Is the case-fatality rate a good metric for comparing countries?

● What is the underlying assumption for it to make any sense?

● What about the number of deaths?

https://coronavirus.jhu.edu/data/mortality

Page 13

COVID data overflow

https://www.ft.com/content/6bd88b7d-3386-4543-b2e9-0d5c6fac846c

Page 14

Jargon

Data set: A collection of data

Model: A mathematical representation of a real-world process

Outlier: An observation that lies an unusual distance from other values

Bias: A systematic difference from the population parameter being considered

Feature: An individual independent variable that acts as an input to your system

Label: The variable which needs to be predicted by a model
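As a small illustration of features, label and outlier, here is a hypothetical data set (all values are invented for this example):

import pandas as pd

# Hypothetical housing data: two features and one label (values invented)
df = pd.DataFrame({
    "size_m2":    [50, 75, 60, 300],    # feature
    "rooms":      [2, 3, 2, 10],        # feature
    "price_keur": [120, 180, 150, 900]  # label: the variable a model would predict
})

# The last observation lies far from the others: a candidate outlier
print(df.describe())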

Page 15

The process

https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492

Page 16

How do you get data?

Databases
Data can be obtained from a variety of databases, both relational (MySQL, Oracle) and non-relational (MongoDB). Most likely, a series of queries needs to be executed to return data which can be used in the subsequent steps of a data science project.

API
API stands for Application Programming Interface and indicates a way for a service to expose data and functionality. This approach involves a series of requests to a service to obtain the data required before moving to the next step.

Scraping
Scraping indicates an automated process in which many web pages are downloaded and analyzed programmatically to extract data. In some cases it is illegal, and the scraping code should be robust enough to avoid collecting incomplete or corrupted data.
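A minimal sketch of the three acquisition routes, assuming the requests and beautifulsoup4 packages are installed; the database file, API endpoint and page URL below are placeholders, not real services:

import sqlite3
import requests
from bs4 import BeautifulSoup

# Database: execute a query and fetch the resulting rows
conn = sqlite3.connect("example.db")  # placeholder database file
rows = conn.execute("SELECT name, value FROM measurements").fetchall()

# API: request JSON data from a (hypothetical) endpoint
response = requests.get("https://api.example.com/v1/measurements",
                        params={"city": "Torino"}, timeout=10)
data = response.json()

# Scraping: download a page and extract its table cells programmatically
html = requests.get("https://example.com/table-page", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]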

Page 17

Kaggle

https://www.kaggle.com/datasets

Page 18

Public Open Data

http://aperto.comune.torino.it/dataset

Page 19

Real data are messy

Page 20

Exploring datasets

Once a dataset has been cleaned, simple data visualization can be valuable to get a rough idea of trends.

Plotting the distribution of the values of each feature (on the main diagonal of a pair plot) can help identify problems with the dataset, such as an unexpected distribution or capping.

Plotting the relation between the values for each pair of features shows the level of correlation between them.

This can be used to obtain insights and to identify redundant features in the dataset.
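For example, pandas can draw such a scatter-matrix (pair) plot directly; the file name below is a placeholder:

import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.read_csv("cleaned_dataset.csv")  # placeholder file name

# Histograms of each feature on the main diagonal,
# pairwise scatter plots everywhere else
scatter_matrix(df.select_dtypes("number"), figsize=(8, 8), diagonal="hist")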

Page 21

Data Modeling

https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
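A minimal sketch of polynomial regression using NumPy, on synthetic data invented for this illustration (not the exact example from the linked article):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic, noisy quadratic data (values invented for this sketch)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.5, size=x.size)

# Fit a degree-2 polynomial model to the observations
model = np.poly1d(np.polyfit(x, y, deg=2))

plt.scatter(x, y, label="data")
plt.plot(x, model(x), color="red", label="degree-2 fit")
plt.legend()
plt.show()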

Page 22

Train, test, validate

https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
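A common way to obtain the three sets with scikit-learn is a double split; the data and the 60/20/20 proportions below are arbitrary choices made for this sketch:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and labels, just to make the split concrete
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve out 40% of the data, then split it evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10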

Page 23

Underfitting and Overfitting

https://datascience.foundation/sciencewhitepaper/underfitting-and-overfitting-in-machine-learning
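The effect can be made concrete by comparing training and validation errors of polynomial models of increasing degree; the data and the degrees below are invented for this sketch:

import numpy as np

# Noisy samples of a smooth function (synthetic data for illustration)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Hold out every second point to measure generalization
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

for degree in (1, 4, 12):
    model = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    val_mse = np.mean((model(x_val) - y_val) ** 2)
    # Degree 1 underfits (both errors high); a very high degree tends to
    # overfit (training error keeps dropping, validation error does not)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")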

Page 24

Model interpretation

Complex models (such as neural networks) are notoriously difficult to interpret ("black boxes"). Model interpretability is necessary to verify that what the model is doing is in line with what is expected; it also helps create trust with users and eases the transition from manual to automated processes.

Why is it essential to do an in-depth analysis of your models?

● Identify and mitigate bias
● Account for the context of the problem
● Improve generalisation and performance
● Ethical and legal reasons

https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f

Page 25

Human Bias

Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model's predictions susceptible to bias.

Common types of human bias

● Reporting Bias
● Automation Bias
● Selection Bias
● Group Attribution Bias
● Implicit Bias

https://developers.google.com/machine-learning/crash-course/fairness/types-of-bias

Page 26

Bias / Variance Tradeoff

The generalization of a model is affected by three different types of error:

Bias: Due to wrong assumptions. High-bias models underfit the training data.

Variance: Due to excessive sensitivity to small variations. High-variance models overfit the training data.

Irreducible error: The part of the error due to noise in the data.

Increasing a model's complexity increases variance and reduces bias (and vice versa).

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

Page 27

Why Python?

In recent years, Python has quickly become the language of choice for machine learning and for data science in general.

Python is considered to have a gentle learning curve and to be quite intuitive to read, which makes it a popular choice for beginners.

Also, when compared to other languages, Python offers a stunning ecosystem of libraries and software which can be used for data science.

Although a basic understanding of the language is fundamental, it may be wiser to focus efforts on the underlying theory and on the use of popular libraries.

https://www.python.org/about/gettingstarted/

~]# python3 variable-args.py
<class 'tuple'>
Output of our function is: 40
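The output above could come from a script like the following; the function name and the argument values are assumptions made for this reconstruction:

# variable-args.py -- a possible reconstruction, not the original script
def add_all(*args):
    print(type(args))           # *args is collected into a tuple -> <class 'tuple'>
    return sum(args)

result = add_all(10, 20, 7, 3)  # hypothetical values that happen to sum to 40
print(f"Output of our function is: {result}")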

Page 28

Markdown

https://guides.github.com/features/mastering-markdown/

Page 29

Jupyter

Jupyter notebooks are the de facto standard for prototyping and sharing data science projects.

Python Jupyter notebooks (.ipynb files) are documents made up of a list of cells, each of which can contain either Markdown or Python code.

In Jupyter, the Python code can be executed inline, displaying the results directly under code blocks.

Markdown cells can be used to complement the code by adding comments and documentation to the content of the notebook.

Jupyter can be installed and run on a local machine (see the docs for your platform).

https://jupyter-notebook.readthedocs.io/en/stable/notebook.html

Page 30

Google Colab

Google Colaboratory (usually abbreviated as Colab) is a free cloud-hosted Jupyter Notebook service, which can be accessed using a Google account.

Google Colab offers storage and access to modest CPU, GPU or TPU resources to carry out relatively simple analyses directly in the browser.

Colab files are saved in Google Drive and the software is nicely integrated with the storage service, allowing easy inclusion of files in a notebook.

Artifacts can be easily downloaded from Colab.

https://colab.research.google.com/notebooks/basic_features_overview.ipynb
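For example, a notebook can mount Google Drive and read a file from it; the CSV path below is a placeholder:

# Runs only inside a Colab notebook
from google.colab import drive
drive.mount('/content/drive')

# After mounting, Drive files appear under /content/drive/MyDrive/
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/example.csv')  # placeholder path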

Page 31

NumPy

NumPy is the leading Python scientific library for working with arrays and matrices.

NumPy vectors (one-dimensional arrays) are significantly easier to work with than Python lists of numbers, their native counterparts.

Machine learning (and data science in general) requires many mathematical operations on vectors: NumPy has optimized implementations of all the most common operations and, in general, performs significantly better than Python lists.

In addition to that, NumPy provides methods to solve common mathematical problems (for instance solving linear equations).

https://numpy.org/doc/stable/user/quickstart.html
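A minimal sketch of typical NumPy operations (the numbers are arbitrary):

import numpy as np

# Element-wise vector arithmetic without explicit Python loops
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
print(v + w, v * w, v @ w)    # element-wise sum, element-wise product, dot product

# Solving the linear system A x = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))  # [2. 3.]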

Page 32

Pandas

Pandas is the best-known Python library for working with data sets.

It allows loading, manipulating and visualizing information from large data sets, providing useful metrics (for instance, the number of missing values).

Results can be rendered inline to get a quick overview of what a data set looks like (Scrub and Explore steps).

When working with Pandas, the key data structure is the DataFrame, defined as two-dimensional, size-mutable, potentially heterogeneous tabular data.

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
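A few typical first steps with a DataFrame; the file name is a placeholder:

import pandas as pd

df = pd.read_csv("dataset.csv")  # placeholder file name

df.info()                # column types and non-null counts (prints directly)
print(df.head())         # quick look at the first rows
print(df.isna().sum())   # number of missing values per column
print(df.describe())     # basic statistics for the numeric columns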

Page 33

Matplotlib

Matplotlib is a powerful data-visualization library for Python, which is particularly useful for getting a graphical representation of data.

Matplotlib allows detailed customization of the produced charts through a number of parameters; however, the library also has sensible defaults, which let beginners get results quickly.

Matplotlib handles a stunning number of different visualizations, from standard charts (scatter plots, pie charts, histograms) up to very complex representations, giving full control over colors, scales, legends and other graphical details.

https://matplotlib.org/stable/gallery/index.html
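A small sketch of two standard chart types, using data invented for this example:

import numpy as np
import matplotlib.pyplot as plt

# Invented data for the two example charts
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y, s=10, color="steelblue")
ax1.set_title("Scatter plot")
ax2.hist(x, bins=20, color="darkorange")
ax2.set_title("Histogram")
plt.tight_layout()
plt.show()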

Page 36

Other Resources

Kaggle Courses: https://www.kaggle.com/learn

Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course

Page 37

Thank you for listening!