MACHINE LEARNING IN PHP
The roots of education are bitter, but the fruit is sweet
Verona, Italia, 2016
AGENDA
How to teach tricks to your PHP
Application : searching for code in comments
Complex learning
SPEAKER
Damien Seguy
Exakat CTO
Static analysis of PHP code
MACHINE LEARNING
Teaching the machine
Supervised learning : learning, then applying
The application builds its own model : training phase
It applies its model to real cases : applying phase
APPLICATIONS
Play go, chess, tic-tac-toe and beat everyone else
Fraud detection and risk analysis
Automated translation or automated transcription
OCR and face recognition
Medical diagnostics
Walk, welcome guests at hotels, play football
Finding good PHP code
PHP APPLICATIONS
Recommendation systems
Predicting user behavior
Spam detection
Converting users to customers
ETA (estimated time of arrival)
Detect code in comments
REAL USE CASE
Identify code in comments
Classic problem
Good problem for machine learning
Complex, no simple solution
A lot of data and expertise are available
SUPERVISED TRAINING
History data → Training → Model
Real data → Model → Results
THE FANN EXTENSION
ext/fann (https://pecl.php.net/package/fann)
Fast Artificial Neural Network
http://leenissen.dk/fann/wp/
Neural networks in PHP
Works on PHP 7, thanks to the hard work of Jakub Zelenka
https://github.com/bukka/php-fann
NEURAL NETWORKS
Imitation of nature
Input layer
Output layer
Intermediate layers
INITIALIZATION
<?php
// Three layers in total: input layer, one hidden layer, output layer.
// fann_create_standard() expects one size argument per layer.
$num_layers = 3;
$num_input = 5;
$num_neurons_hidden = 3;
$num_output = 1;
$ann = fann_create_standard($num_layers, $num_input, $num_neurons_hidden, $num_output);

// Activation function
fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);
PREPARING DATA
Raw data → Extract → Filter → Human review → FANN-ready data
EXPERT AT WORK
// Test if the if is in a compressed format
// none need yet
// icon
// There is a parser specified in `Parser::$KEYWORD_PARSERS`
// $result should exist, regardless of $_message
// $a && $b and multidimensional
// numGlyphs + 1
// TODO : fix this; var_dump($var);
// if(ob_get_clean()){
//$annots .= ' /StructParent ';
// $cfg['Servers'][$i]['controlpass'] = 'pmapass';
INPUT VECTOR
'length' : size of the comment
'countDollar' : number of $
'countEqual' : number of =
'countObjectOperator' : number of -> operators ($o->p)
'countSemicolon' : number of semi-colon ;
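The five characteristics above can be computed with a few string functions. This is a minimal sketch, not the code from the talk: the name makeVector() appears later on the APPLICATION slide, but its body is never shown, so everything inside it is an assumption. That includes the choice to count every '=' character, even those inside '==' and '=>'.

```php
<?php
// Hypothetical implementation of makeVector(); the talk only shows its name.
// Returns the five characteristics in the order listed on the slide.
function makeVector(string $comment): array
{
    return [
        strlen($comment),              // 'length' : size of the comment, in bytes
        substr_count($comment, '$'),   // 'countDollar'
        substr_count($comment, '='),   // 'countEqual' (also counts == and =>)
        substr_count($comment, '->'),  // 'countObjectOperator'
        substr_count($comment, ';'),   // 'countSemicolon'
    ];
}

$comment = '//$this->errors[] = $this->language->get(\'error_permission\');';
print implode(' ', makeVector($comment)) . "\n"; // prints: 61 2 1 3 1
```

For this comment, the sketch yields the same five values as one of the rows on the INPUT DATA slide.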
INPUT DATA
46 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...
* This file is part of Exakat. * * Exakat is free software: you can redistribute it and/or modify * it under the terms of the GNU Affero General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * Exakat is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License for more details. * * You should have received a copy of the GNU Affero General Public License * along with Exakat. If not, see <http://www.gnu.org/licenses/>. * * The latest code can be found at <http://exakat.io/>. * */
// $x[3] or $x[] and multidimensional
//if ($round == 3) { die('Round '.$round);}
//$this->errors[] = $this->language->get('error_permission');
First line of the file : number of training cases, number of input values per case, number of output values per case
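A FANN training file starts with a header line (number of cases, number of inputs, number of outputs), followed by one line of inputs and one line of expected outputs for each case. The sketch below shows one way to build that text from human-reviewed samples. The helper name buildTrainingData() and the sample structure are assumptions made for illustration. The four vectors are the ones visible on the slide.

```php
<?php
// Hypothetical helper: turns reviewed samples into the text of a FANN training file.
// Each sample is [input vector, expected output], for example [[37, 2, 0, 0, 0], 0].
function buildTrainingData(array $samples): string
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one sample is required');
    }

    $numInputs = count($samples[0][0]);
    // Header line: number of cases, number of inputs, number of outputs
    $lines = [count($samples) . ' ' . $numInputs . ' 1'];

    foreach ($samples as [$vector, $isCode]) {
        if (count($vector) !== $numInputs) {
            throw new InvalidArgumentException('All vectors must have the same size');
        }
        $lines[] = implode(' ', $vector);   // input line
        $lines[] = (string) (int) $isCode;  // expected output line: 1 = code, 0 = not code
    }

    return implode("\n", $lines) . "\n";
}

// The four cases visible on the slide
$samples = [
    [[825, 0, 0, 0, 1], 0],
    [[37, 2, 0, 0, 0], 0],
    [[55, 2, 2, 0, 1], 1],
    [[61, 2, 1, 3, 1], 1],
];
print buildTrainingData($samples);
// To feed fann_train_on_file(), write the text to disk:
// file_put_contents('incoming.data', buildTrainingData($samples));
```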
TRAINING
$max_epochs = 500000;
$epochs_between_reports = 1000; // not defined on the slide; assumed value
$desired_error = 0.001;

// the actual training
if (fann_train_on_file($ann, 'incoming.data', $max_epochs, $epochs_between_reports, $desired_error)) {
    fann_save($ann, 'model.out');
}
fann_destroy($ann);
?>
TRAINING
47 cases
5 characteristics
3 hidden neurons
+ 5 input + 1 output
Duration : 5.711 s
APPLICATION
History data → Training → Model
Real data → Model → Results
APPLICATION
<?php
$ann = fann_create_from_file('model.out');

$comment = '//$gvars = $this->getGraphicVars();';

// makeVector() builds the five-value input vector described under INPUT VECTOR
$input = makeVector($comment);
$results = fann_run($ann, $input);

if ($results[0] > 0.8) {
    print "\"$comment\" -> $results[0] \n";
}
?>
RESULTS > 0.8
Answer between 0 and 1
Values range from -14 to 0.999
The closer to 1, the safer it is to call it code. The closer to 0, the safer it is to call it a plain comment.
Is this a percentage? Is this a carrot count?
It's a mix of counts…
[Chart of the result values : axis ticks from -16 to 0 and from 60 to 100]
REAL CASES
Tested on 14093 comments
Duration 367.01ms
Found 1960 issues (14%)
0.99999893 // $cfg['Servers'][$i]['controlhost'] = '';
0.99999928 //$_SESSION['Import_message'] = $message->getDisplay();
0.99999928 /* if (defined('SESSIONUPLOAD')) {
    // write sessionupload back into the loaded PMA session
    $sessionupload = unserialize(SESSIONUPLOAD);
    foreach ($sessionupload as $key => $value) {
        $_SESSION[$key] = $value;
    }
    // remove session upload data that are not set anymore
    foreach ($_SESSION as $key => $value) {
        if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX)) == UPLOAD_PREFIX
            && ! isset($sessionupload[$key])
        ) {
            unset($_SESSION[$key]);
        }
    }
}
0.98780382 //LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232
0.99361396 // We have server(s) => apply default configuration
0.98383027 // Duration = as configured
0.99999928 // original -> translation mapping
0.97590065 // = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
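Found by FANN vs. target
True positive : in the target and found by FANN
False positive : found by FANN but not in the target
True negative : not in the target and not found by FANN
False negative : in the target but not found by FANN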
0.99999923 // $cfg['Servers'][$i]['table_coords'] = 'pma__table_coords';
0.73295981 //(isset($attribs['height'])?$attribs['height']: 1);
0.99999851 // if ($key != null) did not work for index "0"
0.2104115 // the PASSWORD() function
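Sorting results into true/false positives and negatives only takes the score, the 0.8 threshold used earlier, and a human verdict for each comment. The sketch below counts the four outcomes. The function name confusionCounts() is hypothetical. The four scores are the ones shown above, but the true/false labels attached to them are an assumption made for illustration, not a statement from the talk.

```php
<?php
// Counts the four outcomes for a list of [score, isReallyCode] pairs.
function confusionCounts(array $cases, float $threshold = 0.8): array
{
    $counts = ['TP' => 0, 'FP' => 0, 'TN' => 0, 'FN' => 0];
    foreach ($cases as [$score, $isCode]) {
        $found = $score > $threshold;   // "found by FANN"
        if ($found && $isCode) {
            $counts['TP']++;
        } elseif ($found) {
            $counts['FP']++;
        } elseif ($isCode) {
            $counts['FN']++;
        } else {
            $counts['TN']++;
        }
    }
    return $counts;
}

// Scores from the slide; the labels are an assumed manual review.
$cases = [
    [0.99999923, true],
    [0.73295981, true],
    [0.99999851, false],
    [0.2104115, false],
];
$c = confusionCounts($cases);
$precision = $c['TP'] / max(1, $c['TP'] + $c['FP']);
$recall = $c['TP'] / max(1, $c['TP'] + $c['FN']);
printf("precision %.2f, recall %.2f\n", $precision, $recall);
```

Precision tells how many reported issues are real; recall tells how many real issues were reported. Tracking both makes it easier to compare two trained models.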
RESULTS
1960 issues
More than 50% false positives
After an easy cleanup, 822 issues reported
14k comments, analyzed in 367 ms
Total time of coding : 27 mins.
// = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
/* vim: set expandtab sw=4 ts=4 sts=4: */
LEARN BETTER, NOT HARDER
Better training data
Improve characteristics
Configure the neural network
Change algorithm
Automate learning
Update constantly
History data → Training → Model
Real data → Model → Results
Feedback (results feed the next round of training)
BETTER TRAINING DATA
More data, more data, more data
Varied situations, real case situations
Include specific cases
Experience is essential
https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
IMPROVE CHARACTERISTICS
Add new characteristics
Remove the ones that are less useful
Find the right set of characteristics
NETWORK CONFIGURATION
Input vector
Intermediate neurons
Activation function
Output vector
[Chart : time of training (ms), from 0 to 20000, plotted against values 1 to 10, with one series each for 1, 2, 3 and 4 layers]
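Trying several network shapes is easier when the layer sizes are generated rather than typed by hand. The sketch below assumes that "layers" on the chart means hidden layers of equal size. The helper layerSizes() is hypothetical. fann_create_standard_array() is the php-fann variant that takes the layer sizes as an array, and it is only called when the extension is loaded.

```php
<?php
// Hypothetical helper: sizes of all layers for a network with $hiddenLayers
// hidden layers of $neurons neurons each, between the input and output layers.
function layerSizes(int $inputs, int $hiddenLayers, int $neurons, int $outputs): array
{
    return array_merge([$inputs], array_fill(0, $hiddenLayers, $neurons), [$outputs]);
}

// 5 inputs, 2 hidden layers of 10 neurons, 1 output
$layers = layerSizes(5, 2, 10, 1);
print implode('-', $layers) . "\n";

if (extension_loaded('fann')) {
    $ann = fann_create_standard_array(count($layers), $layers);
    fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);
    // Train and time the network here, then compare configurations.
    fann_destroy($ann);
}
```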
CHANGE ALGORITHM
First add more data before changing algorithm
Try cascade2 algorithm from FANN
0.6 => 0 found
0.5 => 2 found
Not found by the first algorithm
FINDING THE BEST
Test with 2-4 layers, 10 neurons
Measure results
[Chart : values from 0 to 9000 plotted against 1 to 13, with one series each for 1, 2, 3 and 4 layers]
DEEP LEARNING
Chaining the neural networks
Auto-encoders
Unsupervised Learning
Genetic algorithms, ant colony algorithms
OTHER TOOLS
PHP ext/fann
R language
https://github.com/kachkaev/php-r
Scikit-learn
https://github.com/scikit-learn/scikit-learn
Mahout
https://mahout.apache.org/
OTHER CONFIGURATIONS
Activation function
FANN_SIGMOID_SYMMETRIC
FANN_LINEAR
FANN_THRESHOLD
FANN_SIN_SYMMETRIC
Linear
Threshold
Tangent
Gaussian
Quadratic
Sigmoid
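To get a feel for what these constants change, the four activation functions named above can be written in plain PHP. This is a sketch based on the formulas given in the FANN documentation. The function activation() is hypothetical and does not call the extension, and the steepness value of 0.5 is an assumption.

```php
<?php
// Plain-PHP versions of four FANN activation functions.
// $s is the steepness; 0.5 is an assumed value for this sketch.
function activation(string $name, float $x, float $s = 0.5): float
{
    switch ($name) {
        case 'FANN_LINEAR':
            return $x * $s;               // unbounded
        case 'FANN_THRESHOLD':
            return $x < 0 ? 0.0 : 1.0;    // step: 0 or 1
        case 'FANN_SIGMOID_SYMMETRIC':
            return tanh($s * $x);         // between -1 and 1
        case 'FANN_SIN_SYMMETRIC':
            return sin($x * $s);          // periodic, between -1 and 1
    }
    throw new InvalidArgumentException("Unknown activation: $name");
}

$names = ['FANN_LINEAR', 'FANN_THRESHOLD', 'FANN_SIGMOID_SYMMETRIC', 'FANN_SIN_SYMMETRIC'];
foreach ($names as $name) {
    printf("%-24s %+.4f %+.4f\n", $name, activation($name, -2.0), activation($name, 2.0));
}
```

The range of the output activation function decides what the threshold applied to the results means, so the two are best chosen together.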
WHICH APPLICATIONS?
Non-deterministic
Rule out everything that can be found systematically
Access to expertise and to characteristic vectors
Final layer after the results
Classification, prioritization, fast approximation
REINFORCEMENT LEARNING
Software
Real world
Reward, Action, Reaction
BAYESIAN FILTERS
GENETIC ALGORITHMS
Population → Selection → Reproduction → Variations → Population
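The population, selection, reproduction and variation cycle can be sketched on a toy problem. Everything here is an assumption made for illustration: the problem (maximise the number of 1 bits in a list), the function names, and the population and gene sizes. None of it comes from the talk.

```php
<?php
// Toy genetic algorithm. Fitness is the number of 1 bits in the gene list.
function fitness(array $genes): int
{
    return array_sum($genes);
}

// Expects at least 2 individuals, each with at least 2 genes.
function evolve(array $population, int $generations): array
{
    $size = count($population);
    for ($g = 0; $g < $generations; $g++) {
        // Selection: keep the better half
        usort($population, fn($a, $b) => fitness($b) <=> fitness($a));
        $parents = array_slice($population, 0, max(1, intdiv($size, 2)));

        // Reproduction: one-point crossover between two random parents
        $children = [];
        while (count($parents) + count($children) < $size) {
            $a = $parents[mt_rand(0, count($parents) - 1)];
            $b = $parents[mt_rand(0, count($parents) - 1)];
            $cut = mt_rand(1, count($a) - 1);
            $child = array_merge(array_slice($a, 0, $cut), array_slice($b, $cut));

            // Variation: flip one random gene
            $i = mt_rand(0, count($child) - 1);
            $child[$i] = 1 - $child[$i];
            $children[] = $child;
        }

        // New population: parents survive, children fill the remaining slots
        $population = array_merge($parents, $children);
    }
    usort($population, fn($a, $b) => fitness($b) <=> fitness($a));
    return $population;
}

mt_srand(42); // fixed seed so that runs are repeatable
$population = [];
for ($i = 0; $i < 20; $i++) {
    $population[] = array_map(fn() => mt_rand(0, 1), array_fill(0, 16, 0));
}
$before = max(array_map('fitness', $population));
$after = fitness(evolve($population, 30)[0]);
print "best fitness before: $before, after: $after\n";
```

Because the selected parents are carried over unchanged, the best fitness can never drop from one generation to the next.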