Automatic Data Validation & Cleaning with PySemantic
Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd
About Me
● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack
/ jaidevd
Typical Data Pipeline
The Problem
● Curating the data and standardizing it across the team
● Data quality problems:
○ Unstructured data
○ Unorganized data
○ Duplicated data
○ Irrelevant data
● Communication problems:
○ Large and distributed teams
○ “What has happened to get the dataset to the current stage?”
○ Messier data means more communication.
HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?
PySemantic
Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines
● Not a parser
● A loader of features for machine learning tasks
● A logger for all operations on a dataset
PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules (a rough sketch of such a check follows below)
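To make the second point concrete, rule-based validation boils down to checking a loaded dataframe against a handful of declared expectations. The sketch below does this in plain pandas; the rule names (required_columns, min_rows) and the validate helper are made up for illustration and are not PySemantic's schema keys or API.

import pandas as pd

# Hypothetical rule set -- the keys are illustrative, not PySemantic's schema keys.
rules = {"required_columns": ["col_a", "col_b"], "min_rows": 10}

def validate(df, rules):
    """Raise if the dataframe violates any of the declared rules."""
    missing = set(rules["required_columns"]) - set(df.columns)
    if missing:
        raise ValueError("missing required columns: {}".format(missing))
    if len(df) < rules["min_rows"]:
        raise ValueError("expected at least {} rows, got {}".format(rules["min_rows"], len(df)))

validate(pd.read_csv("/path/to/mydataset.csv"), rules)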
How it works
$ semantic add mydictionary.yaml
mydataset:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c
>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset")
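Because PySemantic is described above as a wrapper around the pandas parsers, the schema fields in the data dictionary correspond roughly to pandas.read_csv arguments. A minimal sketch of that correspondence for the example schema above (an illustration only, not PySemantic's internal code):

import pandas as pd

# Roughly what loading the example dataset amounts to:
# path -> the file path, nrows -> nrows, use_columns -> usecols.
df = pd.read_csv(
    "/path/to/mydataset.csv",
    nrows=100,
    usecols=["col_a", "col_b", "col_c"],
)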
PySemantic Internals
● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any (see the sketch after this list)
● Log everything
● After loading a dataset, apply common preprocessing methods by default
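As a rough illustration of the second and third points, the sketch below retries pandas.read_csv with progressively more forgiving arguments when a parser error is raised, logging each attempt. The fallback list and the load_with_retries helper are hypothetical, and on_bad_lines assumes pandas 1.3 or newer; this is not PySemantic's actual traits-based implementation.

import logging
import pandas as pd
from pandas.errors import ParserError

logging.basicConfig(level=logging.INFO)

def load_with_retries(path, **kwargs):
    """Hypothetical helper: relax parser arguments until the file loads."""
    fallbacks = [
        {},                                            # arguments exactly as given
        {"engine": "python"},                          # slower but more tolerant engine
        {"engine": "python", "on_bad_lines": "skip"},  # drop malformed rows entirely
    ]
    for extra in fallbacks:
        try:
            logging.info("read_csv(%r, **%r)", path, {**kwargs, **extra})
            return pd.read_csv(path, **{**kwargs, **extra})
        except ParserError as err:
            logging.warning("parser error: %s -- retrying with relaxed arguments", err)
    raise ValueError("could not parse {} with any fallback arguments".format(path))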
Software Development Practices
● Fully test-driven
● Fully documented
● Pylint score > 9.0
Limitations
● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing
Feedback, Issues, PRs Welcome!
http://github.com/jaidevd/pysemantic