Automatic Data Validation & Cleaning with PySemantic
Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd
About Me
● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack
/ jaidevd
Typical Data Pipeline
The Problem
● Curating the data and standardizing it across the team
● Data quality problems:
○ Unstructured data
○ Unorganized data
○ Duplicated data
○ Irrelevant data
● Communication problems:
○ Large and distributed teams
○ “What has happened to get the dataset to the current stage?”
○ Messier data means more communication.
HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?
PySemantic
Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines
● Not a parser
● A loader of features for machine learning tasks
● A logger for all operations on a dataset
PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules (a rough sketch of such a check follows below)
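To make the second point concrete, rule-based validation boils down to checking a loaded dataframe against a handful of declared expectations. The sketch below does this in plain pandas; the rule names (required_columns, min_rows) and the validate helper are made up for illustration and are not PySemantic's schema keys or API.

import pandas as pd

# Hypothetical rule set -- the keys are illustrative, not PySemantic's schema keys.
rules = {"required_columns": ["col_a", "col_b"], "min_rows": 10}

def validate(df, rules):
    """Raise if the dataframe violates any of the declared rules."""
    missing = set(rules["required_columns"]) - set(df.columns)
    if missing:
        raise ValueError("missing required columns: {}".format(missing))
    if len(df) < rules["min_rows"]:
        raise ValueError("expected at least {} rows, got {}".format(rules["min_rows"], len(df)))

validate(pd.read_csv("/path/to/mydataset.csv"), rules)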
How it works
$ semantic add mydictionary.yaml
mydataset:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c
>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset")
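Because PySemantic is described above as a wrapper around the pandas parsers, the schema fields in the data dictionary correspond roughly to pandas.read_csv arguments. A minimal sketch of that correspondence for the example schema above (an illustration only, not PySemantic's internal code):

import pandas as pd

# Roughly what loading the example dataset amounts to:
# path -> the file path, nrows -> nrows, use_columns -> usecols.
df = pd.read_csv(
    "/path/to/mydataset.csv",
    nrows=100,
    usecols=["col_a", "col_b", "col_c"],
)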
PySemantic Internals
● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any (see the sketch after this list)
● Log everything
● After loading a dataset, apply common preprocessing methods by default
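As a rough illustration of the second and third points, the sketch below retries pandas.read_csv with progressively more forgiving arguments when a parser error is raised, logging each attempt. The fallback list and the load_with_retries helper are hypothetical, and on_bad_lines assumes pandas 1.3 or newer; this is not PySemantic's actual traits-based implementation.

import logging
import pandas as pd
from pandas.errors import ParserError

logging.basicConfig(level=logging.INFO)

def load_with_retries(path, **kwargs):
    """Hypothetical helper: relax parser arguments until the file loads."""
    fallbacks = [
        {},                                            # arguments exactly as given
        {"engine": "python"},                          # slower but more tolerant engine
        {"engine": "python", "on_bad_lines": "skip"},  # drop malformed rows entirely
    ]
    for extra in fallbacks:
        try:
            logging.info("read_csv(%r, **%r)", path, {**kwargs, **extra})
            return pd.read_csv(path, **{**kwargs, **extra})
        except ParserError as err:
            logging.warning("parser error: %s -- retrying with relaxed arguments", err)
    raise ValueError("could not parse {} with any fallback arguments".format(path))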
Software Development Practices
● Fully test-driven
● Fully documented
● Pylint score > 9.0
Limitations
● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing
Feedback, Issues, PRs Welcome!
http://github.com/jaidevd/pysemantic