Data-driven code analysis: Learning from others' mistakes
Andreas Dewes (@japh44)
13.04.2015
PyCon 2015 – Montreal
About
Physicist and Python enthusiast
CTO of a spin-off of the
University of Munich (LMU):
We develop software for data-driven code analysis.
Our mission
Tools & Techniques for Ensuring Code Quality
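Two axes: static vs. dynamic, automated vs. manual
- Manual, dynamic: debugging, profiling, ...
- Manual, static: manual code reviews
- Automated, static: static analysis / automated code reviews
- Automated, dynamic: unit testing, integration testing, system testing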
Discovering problems in code
def encode(obj):
    """Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON."""
    e = {}
    for key,value in obj:
        if isinstance(value,dict):
            e[key] = encode(value)
        elif isinstance(value,complex):
            e[key] = {'type' : 'complex',
                      'r' : value.real,
                      'i' : value.imaginary}
    return e

d = {'a' : 1j+4,'s' : {'d' : 4+5j}}
print encode(d)
Iterating over obj yields only the keys of the dictionary (obj.items() is needed).
value.imaginary does not exist (value.imag would be correct).
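For comparison, here is a sketch of encode with just the two fixes named above applied. It is written for Python 3 (print as a function), whereas the slides use Python 2; nothing else about the function is changed, so values that are neither dicts nor complex numbers are still dropped.

```python
def encode(obj):
    """Encode a (possibly nested) dictionary containing complex values
    into a form that can be serialized using JSON."""
    e = {}
    for key, value in obj.items():       # fix 1: iterate over (key, value) pairs
        if isinstance(value, dict):
            e[key] = encode(value)
        elif isinstance(value, complex):
            e[key] = {'type': 'complex',
                      'r': value.real,
                      'i': value.imag}   # fix 2: the attribute is called .imag
    return e


d = {'a': 1j + 4, 's': {'d': 4 + 5j}}
print(encode(d))
```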
Dynamic Analysis (e.g. unit testing)
def test_encode():
    d = {'a' : 1j+4,
         's' : {'d' : 4+5j}}
    r = encode(d) #this will fail...
    assert r['a'] == {'type' : 'complex', 'r' : 4,'i' : 1}
    assert r['s']['d'] == {'type' : 'complex', 'r' : 4,'i' : 5}
Static Analysis (for humans)
encode is a function with 1 parameter which always returns a dict.
(I) obj should be an iterable/list of tuples with two elements.
encode gets called with a dict, which does not satisfy (I).
A value of type complex does not have an .imaginary attribute!
encode is called with a dict, which again does not satisfy (I).
How static analysis tools work (short version)
1. Parse the code into a data structure, typically an abstract syntax tree (AST).
2. (Optionally) annotate it with additional information to make analysis easier.
3. Traverse the AST to find problems.
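The steps above can be sketched with Python's standard ast module. This is a minimal illustration, not how any particular tool is implemented: it skips the annotation step, so it flags every .imaginary access without knowing whether the object is really a complex number. The helper name find_attribute_uses is made up for this example.

```python
import ast

SOURCE = '''
def encode(obj):
    e = {}
    for key, value in obj:
        if isinstance(value, complex):
            e[key] = {'r': value.real, 'i': value.imaginary}
    return e
'''


def find_attribute_uses(source, attr_name):
    """Return (line, column) of every `<expr>.<attr_name>` access in source."""
    tree = ast.parse(source)              # step 1: source text -> AST
    hits = []
    for node in ast.walk(tree):           # step 3: visit every node in the tree
        if isinstance(node, ast.Attribute) and node.attr == attr_name:
            hits.append((node.lineno, node.col_offset))
    return hits


hits = find_attribute_uses(SOURCE, 'imaginary')
print(hits)
```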
Python Tools for Static Analysis
PyLint (most comprehensive tool): http://www.pylint.org/
PyFlakes (smaller, less verbose): https://pypi.python.org/pypi/pyflakes
Pep8 (style and some structural checks): https://pypi.python.org/pypi/pep8
(... and many others)
Limitations of current tools & technologies
Checks are hard to create / modify... (example: PyLint code for analyzing 'try/except' statements)
Long feedback cycles
Rethinking code analysis for Python
Our approach
1. Code is data! Let's not keep it in text files but store it in a useful form that we can work with easily (e.g. a graph).
2. Make it super-easy to specify errors and bad code patterns.
3. Make it possible to learn from user feedback and publicly available code.
Building the Code Graph
[Figure: the code as a graph. Nodes: functiondef, for, assign, name, dict. Edges: body, targets, iterator.]
[Figure: the code graph with data attached to its elements, e.g. {name: 'encode', args: [...]}, {id: 'e'}, {i: 0}, {i: 1}; additional edge label: value.]
[Figure: the code graph with nodes identified by hashes (e4fa76b..., a76fbc41..., c51fa291..., 74af219...) and a type annotation ($type: dict).]
Example: Tornado Project
[Figure: graph of 10 modules from the tornado project, showing modules, classes and functions.]
Advantages
- Simple detection of (exact) duplicates
- Semantic diffing of modules, classes, functions, ...
- Semantic code search on the whole tree
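The first advantage can be illustrated with a small sketch: if every function is identified by a hash of its syntax tree, exact duplicates are simply functions with the same hash. How the hashes in the talk are actually computed is not stated; the choice of SHA-1 over ast.dump output and the helper names below are assumptions made for this example.

```python
import ast
import hashlib


def function_hashes(source):
    """Map each function name to a hash of its arguments and body."""
    hashes = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # ast.dump leaves out line/column info by default, so formatting
            # and comments do not influence the hash; the name is excluded.
            parts = [ast.dump(node.args)] + [ast.dump(s) for s in node.body]
            digest = hashlib.sha1('\n'.join(parts).encode('utf-8')).hexdigest()
            hashes[node.name] = digest
    return hashes


def exact_duplicates(source):
    """Group the names of functions that are structurally identical."""
    groups = {}
    for name, digest in function_hashes(source).items():
        groups.setdefault(digest, []).append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


SOURCE = '''
def area(w, h):
    return w * h

def size(w, h):
    # same code, different name, spacing and comment
    return w*h

def perimeter(w, h):
    return 2 * (w + h)
'''

print(exact_duplicates(SOURCE))
```

Because the hash depends only on tree structure, the same idea extends to comparing two versions of a function, which is one way to approach the semantic diffing mentioned above.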
Describing Code Errors / Anti-Patterns
Code issues = patterns on the graph
[Figure: the bug as a graph pattern: an attribute node with attr {id: imaginary} whose value is a name node {id: value} with $type complex.]
Using YAML to describe graph patterns
node_type: attribute
value:
  $type: complex
attr: imaginary
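To show how such a pattern could be applied, here is a rough sketch that turns a Python AST into nested dicts and searches it for a pattern written as a plain dict (the form a YAML loader would produce). It is not the matching engine from the talk: to_dict, matches and find are hypothetical helpers, and since the sketch has no type inference, the $type: complex condition is replaced by a match on the variable name.

```python
import ast


def to_dict(node):
    """Convert an ast node into nested dicts/lists tagged with a node_type."""
    if isinstance(node, ast.AST):
        d = {'node_type': type(node).__name__.lower()}
        for field, value in ast.iter_fields(node):
            d[field] = to_dict(value)
        return d
    if isinstance(node, list):
        return [to_dict(item) for item in node]
    return node


def matches(node, pattern):
    """True if every key in pattern is present in node with a matching value."""
    if isinstance(pattern, dict):
        return isinstance(node, dict) and all(
            key in node and matches(node[key], sub)
            for key, sub in pattern.items())
    return node == pattern


def find(node, pattern):
    """Yield every sub-dict of node (at any depth) that matches pattern."""
    if isinstance(node, dict):
        if matches(node, pattern):
            yield node
        for child in node.values():
            yield from find(child, pattern)
    elif isinstance(node, list):
        for child in node:
            yield from find(child, pattern)


# The slide's pattern as a dict; the name check stands in for `$type: complex`.
pattern = {'node_type': 'attribute',
           'value': {'node_type': 'name', 'id': 'value'},
           'attr': 'imaginary'}

tree = to_dict(ast.parse("x = {'r': value.real, 'i': value.imaginary}"))
found = list(find(tree, pattern))
print(len(found))
```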
Generalizing patterns
node_type: attribute
value:
  $type: complex
attr:
  $not:
    $or: [real, imag]
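One possible reading of the $not and $or operators, sketched as a small matcher over hand-built node dicts. The operator semantics, the way $type is stored on a node, and the helper attribute_node are all assumptions for illustration; the example also assumes the two allowed attribute names are real and imag.

```python
def matches(node, pattern):
    """Match a node (nested dicts and plain values) against a pattern."""
    if isinstance(pattern, dict):
        if '$not' in pattern:
            # negation: succeeds when the inner pattern fails
            return not matches(node, pattern['$not'])
        if '$or' in pattern:
            # alternatives: succeeds when any one of them matches
            return any(matches(node, alt) for alt in pattern['$or'])
        return isinstance(node, dict) and all(
            key in node and matches(node[key], sub)
            for key, sub in pattern.items())
    return node == pattern


pattern = {'node_type': 'attribute',
           'value': {'$type': 'complex'},
           'attr': {'$not': {'$or': ['real', 'imag']}}}


def attribute_node(attr):
    """Hypothetical graph node for `value.<attr>`, with value known to be complex."""
    return {'node_type': 'attribute',
            'attr': attr,
            'value': {'node_type': 'name', 'id': 'value', '$type': 'complex'}}


print(matches(attribute_node('imaginary'), pattern))
print(matches(attribute_node('imag'), pattern))
```

With only real and imag listed, a legitimate access such as value.conjugate would be flagged as well, which is the kind of false positive a pattern has to be refined against.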
Learning from feedback / false positives
"else" in for loop without break statement
node_type: for
body:
  $not:
    $anywhere:
      node_type: break
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
else:
    print "didn't find 'baz'!"
Learning from false positives (I)
values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        return value
else:
    print "didn't find 'baz'!"

node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
      - $anywhere:
          node_type: return
orelse:
  $anything: {}
Learning from false positives (II)
node_type: for
body:
  $not:
    $or:
      - $anywhere:
          node_type: break
          exclude:
            node_type:
              $or: [while,for]
      - $anywhere:
          node_type: return
orelse:
  $anything: {}

values = ["foo", "bar", ... ]
for i,value in enumerate(values):
    if value == 'baz':
        print "Found it!"
        for j in ...:
            #...
            break
else:
    print "didn't find 'baz'!"
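For comparison, the refined rule (a for loop with an else branch whose body contains neither a return nor a break of its own) could be hand-coded against Python's ast module roughly as follows. This is a sketch in Python 3 with made-up helper names; like the pattern, it ignores everything below a nested loop when looking for break.

```python
import ast

LOOPS = (ast.For, ast.AsyncFor, ast.While)


def contains_exit(stmts):
    """True if stmts contain a `return`, or a `break` outside any nested loop."""
    def visit(node, in_nested_loop):
        if isinstance(node, ast.Break):
            return not in_nested_loop     # a nested loop's break is not ours
        if isinstance(node, ast.Return):
            return True
        nested = in_nested_loop or isinstance(node, LOOPS)
        return any(visit(child, nested) for child in ast.iter_child_nodes(node))
    return any(visit(stmt, False) for stmt in stmts)


def suspicious_for_else(source):
    """Return the line numbers of `for ... else` loops that always run the else."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.For) and node.orelse and not contains_exit(node.body):
            lines.append(node.lineno)
    return sorted(lines)


NO_EXIT = '''
for value in values:
    if value == 'baz':
        print("Found it!")
else:
    print("didn't find 'baz'!")
'''

WITH_RETURN = '''
def search(values):
    for value in values:
        if value == 'baz':
            return value
    else:
        print("didn't find 'baz'!")
'''

NESTED_BREAK = '''
for value in values:
    if value == 'baz':
        for j in range(3):
            break
else:
    print("didn't find 'baz'!")
'''

for sample in (NO_EXIT, WITH_RETURN, NESTED_BREAK):
    print(suspicious_for_else(sample))
```

Each refinement that is a few lines of YAML in the pattern becomes extra traversal logic here, which is the trade-off the pattern-based approach is meant to avoid.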
patterns vs. code
(no exception type specified)
handlers:
  node_type: excepthandler
  type: null
node_type: tryexcept

(empty exception handler)
handlers:
- body:
  - node_type: pass
  node_type: excepthandler
node_type: tryexcept
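As a point of comparison, the same two checks written by hand against Python's ast module might look like this. It is a sketch, not PyLint's implementation: the function name check_handlers is invented, Python 3's ast.Try stands in for the tryexcept node type used in the patterns, and "empty" is taken to mean a handler whose body is a single pass.

```python
import ast


def check_handlers(source):
    """Return (line, message) pairs for bare and empty exception handlers."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Try):
            continue
        for handler in node.handlers:
            if handler.type is None:
                # `except:` without an exception class
                issues.append((handler.lineno, 'no exception type specified'))
            if len(handler.body) == 1 and isinstance(handler.body[0], ast.Pass):
                # handler that silently swallows the exception
                issues.append((handler.lineno, 'empty exception handler'))
    return sorted(issues)


SOURCE = '''
try:
    risky()
except:
    recover()

try:
    risky()
except ValueError:
    pass
'''

for line, message in check_handlers(SOURCE):
    print(line, message)
```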
Summary & Feedback
1. Storing code as a graph opens up many interesting possibilities. Let's stop thinking of code as text!
2. We can learn from user feedback or even use machine learning to create and adapt code patterns!
3. Everyone can write code checkers! => crowd-source code quality!
Thanks!
www.quantifiedcode.com
https://github.com/quantifiedcode
@quantifiedcode
Andreas Dewes (@japh44)
Visit us at booth 629!