MACHINE LEARNING IN PHP
The roots of education are bitter, but the fruit is sweet
PHPtek, Saint Louis, MO, USA, 2016
Agenda
• How to teach tricks to your PHP
• Application: searching for code in comments
• Complex learning
Speaker
• Damien Seguy
• Exakat CTO
• Static analysis of PHP code
Machine Learning
• Teaching the machine
• Supervised learning: learning, then applying
• The application builds its own model: training phase
• It applies its model to real cases: applying phase
Applications
• Play go, chess, tic-tac-toe and beat everyone else
• Fraud detection and risk analysis
• Automated translation or automated transcription
• OCR and face recognition
• Medical diagnostics
• Walk, welcome guests at hotels, play football
• Finding good PHP code
PHP Applications
• Recommendation systems
• Predicting user behavior
• Spam detection
• Converting users to customers
• ETA
• Detect code in comments
Real use case
• Identify code in comments
• Classic problem
• Good problem for machine learning
• Complex, no simple solution
• A lot of data and expertise are available
Supervised Training
History data -> Training -> Model
Real data -> Model -> Results
The Fann Extension
• ext/fann (https://pecl.php.net/package/fann)
• Fast Artificial Neural Network
• http://leenissen.dk/fann/wp/
• Neural networks in PHP
• Works on PHP 7, thanks to the hard work of Jakub Zelenka
• https://github.com/bukka/php-fann
NEURAL NETWORKS
• Imitation of nature
• Input layer
• Output layer
• Intermediate layers
Initialisation
<?php
// 3 layers in total: input, one hidden layer, output
$num_layers = 3;
$num_input = 5;
$num_neurons_hidden = 3;
$num_output = 1;
$ann = fann_create_standard($num_layers, $num_input, $num_neurons_hidden, $num_output);

// Activation functions
fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);
Preparing data
Raw data -> Extract -> Filter -> Human review -> FANN ready
Expert at work
// Test if the if is in a compressed format
// none need yet
// There is a parser specified in `Parser::$KEYWORD_PARSERS`
// $result should exist, regardless of $_message
// $a && $b and multidimensional
// numGlyphs + 1
// TODO : fix this; var_dump($var);
// if(ob_get_clean()){
//$annots .= ' /StructParent ';
// $cfg['Servers'][$i]['controlpass'] = 'pmapass';
Input vector
• 'length': size of the comment
• 'countDollar': number of $
• 'countEqual': number of =
• 'countObjectOperator': number of -> operators ($o->p)
• 'countSemicolon': number of semicolons ;
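The application code later in the talk calls makeVector(), but the slides do not show its body. A minimal sketch, assuming every characteristic is a plain character or substring count taken in the order listed above:

```php
<?php
/**
 * Turn a comment into the 5-value input vector.
 * Assumption: each characteristic is a simple count on the raw comment text.
 */
function makeVector(string $comment): array
{
    return [
        strlen($comment),             // 'length'
        substr_count($comment, '$'),  // 'countDollar'
        substr_count($comment, '='),  // 'countEqual'
        substr_count($comment, '->'), // 'countObjectOperator'
        substr_count($comment, ';'),  // 'countSemicolon'
    ];
}

// One of the sample comments from the training data
print implode(' ', makeVector('// $x[3] or $x[] and multidimensional')) . "\n";
// prints "37 2 0 0 0"
```

With this definition, "==" counts as two equal signs and "=>" as one, which may or may not match what the original helper did.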
Input data
46 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...
 * This file is part of Exakat.
 *
 * Exakat is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Exakat is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with Exakat. If not, see <http://www.gnu.org/licenses/>.
 *
 * The latest code can be found at <http://exakat.io/>.
 *
*/
// $x[3] or $x[] and multidimensional
//if ($round == 3) { die('Round '.$round);}
//$this->errors[] = $this->language->get('error_permission');
Header: number of cases, number of inputs, number of outputs
1 5 1
37 2 0 0 0
0
// $x[3] or $x[] and multidimensional
ext/Fann
It's a comment
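The file layout above (a header line, then one line of inputs and one line of expected outputs per case) can be generated from reviewed data. A minimal sketch, where buildTrainingData() and the [inputs, outputs] pair structure are hypothetical names chosen for illustration:

```php
<?php
/**
 * Build the contents of a FANN training file.
 * $cases is a list of [inputs, outputs] pairs (hypothetical structure).
 */
function buildTrainingData(array $cases): string
{
    if ($cases === []) {
        throw new InvalidArgumentException('At least one case is required');
    }
    [$firstInputs, $firstOutputs] = $cases[0];
    // Header: number of cases, inputs per case, outputs per case
    $lines = [count($cases) . ' ' . count($firstInputs) . ' ' . count($firstOutputs)];
    foreach ($cases as [$inputs, $outputs]) {
        $lines[] = implode(' ', $inputs);
        $lines[] = implode(' ', $outputs);
    }
    return implode("\n", $lines) . "\n";
}

$data = buildTrainingData([
    [[37, 2, 0, 0, 0], [0]], // reviewed as a plain comment
    [[61, 2, 1, 3, 1], [1]], // reviewed as commented-out code
]);
print $data;
```

The resulting string could then be written with file_put_contents('incoming.data', $data) before training.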
Training
$max_epochs = 500000;
$epochs_between_reports = 1000; // assumed value: the slide does not show it
$desired_error = 0.001;

// the actual training
if (fann_train_on_file($ann, 'incoming.data', $max_epochs, $epochs_between_reports, $desired_error)) {
    fann_save($ann, 'model.out');
}
fann_destroy($ann);
?>
TRAINING
• 47 cases
• 5 characteristics
• 3 hidden neurons
• Plus 5 input neurons and 1 output neuron
• Duration: 5.711 s
Application
<?php
$ann = fann_create_from_file('model.out');
$comment = '//$gvars = $this->getGraphicVars();';
$input = makeVector($comment);
$results = fann_run($ann, $input);
if ($results[0] > 0.8) {
    print "\"$comment\" -> $results[0] \n";
}
?>
Results > 0.8
• Answer between 0 and 1
• Values range from -14 to 0.999
• The closer to 1, the safer the "code" verdict. The closer to 0, the safer the "comment" verdict.
• Is this a percentage? Is this a carrot count?
• It's a mix of counts…
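Since the output is not a probability, the 0.8 cut-off is only a tunable threshold. A minimal sketch of applying it to a batch of scores, where likelyCode() is a hypothetical helper and the scores in the usage example are illustrative, not taken from the talk:

```php
<?php
/**
 * Keep the comments whose score is strictly above the threshold,
 * highest score first. $scores maps comment text => network output.
 */
function likelyCode(array $scores, float $threshold = 0.8): array
{
    $kept = array_filter($scores, function ($score) use ($threshold) {
        return $score > $threshold;
    });
    arsort($kept); // sort by score, descending, keeping the keys
    return $kept;
}

// Illustrative scores only
$kept = likelyCode([
    '// $a = $b->c();' => 0.95,
    '// see the manual' => 0.21,
    '//unset($x);'      => 0.99,
]);
print implode("\n", array_keys($kept)) . "\n";
```

Raising the threshold trades missed code for fewer false alarms, so it is worth re-checking after each retraining.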
REAL CASES
• Tested on 14093 comments
• Duration 68.01ms
• Found 1960 issues (14%)
0.99999893 // $cfg['Servers'][$i]['controlhost'] = '';
0.99999928 //$_SESSION['Import_message'] = $message->getDisplay();
0.99999928 /*
if (defined('SESSIONUPLOAD')) {
    // write sessionupload back into the loaded PMA session
    $sessionupload = unserialize(SESSIONUPLOAD);
    foreach ($sessionupload as $key => $value) {
        $_SESSION[$key] = $value;
    }
    // remove session upload data that are not set anymore
    foreach ($_SESSION as $key => $value) {
        if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX)) == UPLOAD_PREFIX
            && ! isset($sessionupload[$key])
        ) {
            unset($_SESSION[$key]);
        }
    }
}
0.98780382 //LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232
0.99361396 // We have server(s) => apply default configuration
0.98383027 // Duration = as configured
0.99999928 // original -> translation mapping
0.97590065 // = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
True positive False positive
True negative False negative
Found by FANN
Target
// $cfg['Servers'][$i]['table_coords'] = 'pma__table_coords';
//(isset($attribs['height'])?$attribs['height']: 1);
// if ($key != null) did not work for index "0"
// the PASSWORD() function
0.99999923
0.73295981
0.99999851
0.2104115
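The four outcomes can be counted once a human has reviewed a sample of scored comments. A minimal sketch, assuming each case is a [score, isCode] pair; confusionCounts() and precision() are hypothetical helpers and the sample values are illustrative:

```php
<?php
/**
 * Count the four outcomes for a given threshold.
 * $cases is a list of [score, isCode] pairs; isCode is the human verdict.
 */
function confusionCounts(array $cases, float $threshold = 0.8): array
{
    $counts = ['TP' => 0, 'FP' => 0, 'TN' => 0, 'FN' => 0];
    foreach ($cases as [$score, $isCode]) {
        if ($score > $threshold) {
            $counts[$isCode ? 'TP' : 'FP']++; // flagged by the network
        } else {
            $counts[$isCode ? 'FN' : 'TN']++; // not flagged
        }
    }
    return $counts;
}

/** Share of flagged comments that really are code. */
function precision(array $counts): float
{
    $flagged = $counts['TP'] + $counts['FP'];
    return $flagged === 0 ? 0.0 : $counts['TP'] / $flagged;
}

// Illustrative sample, one case per outcome
$counts = confusionCounts([
    [0.99, true],  // true positive
    [0.73, true],  // false negative
    [0.99, false], // false positive
    [0.21, false], // true negative
]);
printf("precision: %.2f\n", precision($counts));
```

Tracking these counts across retraining runs shows whether a change actually reduced false positives.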
RESULTS
• 1960 issues
• More than 50% false positives
• After a simple cleanup, 822 issues reported
• 14k comments analyzed in 68 ms (367 ms in PHP 5)
• Total coding time: 27 min
// = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
/* vim: set expandtab sw=4 ts=4 sts=4: */
Learn better, not harder
• Better training data
• Improve characteristics
• Configure the neural network
• Change algorithm
• Automate learning
• Update constantly
History data -> Training -> Model
Real data -> Model -> Results
Results -> Feedback -> Training
Better training data
• More data, more data, more data
• Varied situations, real case situations
• Include specific cases
• Experience is capital
• https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Improve characteristics
• Add new characteristics
• Remove the ones that are less interesting
• Find the right set of characteristics
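One way to experiment with the set of characteristics is to keep them in a table of named functions, so that adding or removing one is a one-line change. A minimal sketch; vectorFrom() and the 'countParenthesis' characteristic are hypothetical and not taken from the talk:

```php
<?php
// Each characteristic is a name mapped to a function of the comment.
$characteristics = [
    'length'         => function (string $c): int { return strlen($c); },
    'countDollar'    => function (string $c): int { return substr_count($c, '$'); },
    'countSemicolon' => function (string $c): int { return substr_count($c, ';'); },
    // Hypothetical extra characteristic, added for illustration only
    'countParenthesis' => function (string $c): int {
        return substr_count($c, '(') + substr_count($c, ')');
    },
];

/** Apply every characteristic to the comment, keeping the names as keys. */
function vectorFrom(array $characteristics, string $comment): array
{
    $vector = [];
    foreach ($characteristics as $name => $measure) {
        $vector[$name] = $measure($comment);
    }
    return $vector;
}

print_r(vectorFrom($characteristics, '//if ($a) { b(); }'));
```

Changing the number of characteristics changes the size of the input layer, so the network has to be recreated and retrained afterwards.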
Network Configuration
• Input vector
• Intermediate neurons
• Activation function
• Output vector
[Chart: time of training (ms), x-axis from 1 to 10, one series each for 1, 2, 3 and 4 layers]
Change algorithm
• Add more data before changing the algorithm
• Try the cascade2 algorithm from FANN
• 0.6 => 0 found
• 0.5 => 2 found
• Not found by the first algorithm
Finding the BEST
• Test with 2 to 4 layers, 10 neurons
• Measure results
[Chart: x-axis from 1 to 13, y-axis from 0 to 9000, one series each for 1, 2, 3 and 4 layers]
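Searching for the best configuration means enumerating network shapes, then training and measuring each one. A minimal sketch of the enumeration step only, assuming the search covers 1 to 4 hidden layers with 1 to 10 neurons each; layerSizes() is a hypothetical helper:

```php
<?php
/**
 * Layer sizes for a network with $hiddenLayers hidden layers of $neurons each.
 * Defaults match the talk's 5 inputs and 1 output.
 */
function layerSizes(int $hiddenLayers, int $neurons, int $inputs = 5, int $outputs = 1): array
{
    return array_merge([$inputs], array_fill(0, $hiddenLayers, $neurons), [$outputs]);
}

// Every configuration to try (the ranges are an assumption)
$configurations = [];
foreach (range(1, 4) as $hiddenLayers) {
    foreach (range(1, 10) as $neurons) {
        $configurations[] = layerSizes($hiddenLayers, $neurons);
    }
}
printf("%d configurations\n", count($configurations));
```

Each entry is a plain list of layer sizes, which is the shape that ext/fann's fann_create_standard_array() is documented to take along with the layer count, so a benchmark loop could create, train, time and score one network per entry.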
DEEP LEARNING
• Chaining the neural networks
• Auto-encoders
• Unsupervised Learning
• Genetic algorithms, ant colony optimization, random forests, naive Bayes
Other tools
• PHP ext/fann
• R language
• https://github.com/kachkaev/php-r
• Scikit-learn
• https://github.com/scikit-learn/scikit-learn
• Mahout
• https://mahout.apache.org/
Conclusion
• Machine learning is about data, not code
• There are tools to use it with PHP
• Fast to try: easy results or fast failure
• Use it for complex problems that tolerate errors