Machine learning in PHP

Post on 26-Jan-2017

MACHINE LEARNING IN PHP

The roots of education are bitter, but the fruit is sweet

Verona, Italia, 2016

AGENDA

How to teach tricks to your PHP

Application: searching for code in comments

Complex learning

SPEAKER

Damien Seguy

Exakat CTO

Static analysis of PHP code

MACHINE LEARNING

Teaching the machine

Supervised learning: learn first, then apply

The application builds its own model: the training phase

It then applies that model to real cases: the application phase

APPLICATIONS

Play go, chess, tic-tac-toe and beat everyone else

Fraud detection and risk analysis

Automated translation or automated transcription

OCR and face recognition

Medical diagnostics

Walk, welcome guests at hotels, play football

Finding good PHP code

PHP APPLICATIONS

Recommendation systems

Predicting user behavior

Spam detection

Converting users into customers

ETA

Detect code in comments

REAL USE CASE

Identify code in comments

Classic problem

Good problem for machine learning

Complex, no simple solution

A lot of data and expertise are available

SUPERVISED TRAINING

History data → Training → Model

Real data → Model → Results

THE FANN EXTENSION

ext/fann (https://pecl.php.net/package/fann)

Fast Artificial Neural Network

http://leenissen.dk/fann/wp/

Neural networks in PHP

Works on PHP 7, thanks to the hard work of Jakub Zelenka

https://github.com/bukka/php-fann

NEURAL NETWORKS

Imitation of nature

Input layer

Output layer

Intermediate layers
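To make the three kinds of layers concrete, here is a minimal sketch of what a small fully connected network computes, assuming 5 inputs, 3 intermediate neurons and 1 output, with a tanh-style symmetric sigmoid as activation. The function names and every weight below are hypothetical placeholders for illustration; they are not FANN internals and not values learned from real data.

```php
<?php
// Sketch only: a hand-written forward pass, not the FANN implementation.

/** Symmetric sigmoid: squashes any real number into the range -1..1. */
function activate(float $sum): float
{
    return tanh($sum);
}

/**
 * Computes one layer: each neuron adds its bias to the weighted sum
 * of its inputs, then applies the activation function.
 *
 * @param float[]   $inputs  values coming from the previous layer
 * @param float[][] $weights one row per neuron, one column per input
 * @param float[]   $biases  one bias per neuron
 * @return float[] one output value per neuron
 */
function layer(array $inputs, array $weights, array $biases): array
{
    $outputs = [];
    foreach ($weights as $n => $row) {
        $sum = $biases[$n];
        foreach ($row as $i => $weight) {
            $sum += $weight * $inputs[$i];
        }
        $outputs[] = activate($sum);
    }
    return $outputs;
}

/** Input layer -> intermediate layer -> output layer. */
function forward(array $inputs, array $hiddenWeights, array $hiddenBiases,
                 array $outputWeights, array $outputBiases): array
{
    $hidden = layer($inputs, $hiddenWeights, $hiddenBiases);
    return layer($hidden, $outputWeights, $outputBiases);
}

// Hypothetical weights: 3 intermediate neurons x 5 inputs, then 1 output x 3.
$hiddenWeights = [
    [0.01, 0.20, 0.10, 0.30, 0.20],
    [0.00, -0.10, 0.25, 0.15, 0.05],
    [-0.02, 0.05, 0.10, -0.20, 0.30],
];
$hiddenBiases  = [0.0, 0.1, -0.1];
$outputWeights = [[0.5, -0.4, 0.3]];
$outputBiases  = [0.1];

$output = forward([35.0, 2.0, 1.0, 1.0, 1.0],
                  $hiddenWeights, $hiddenBiases, $outputWeights, $outputBiases);
print "Network output: {$output[0]}\n";
```

Training consists of adjusting those weights and biases until the output matches the expected answers on the training cases.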

INITIALIZATION

<?php

// Three layers in total: input, hidden and output
$num_layers = 3;
$num_input = 5;
$num_neurons_hidden = 3;
$num_output = 1;
$ann = fann_create_standard($num_layers, $num_input, $num_neurons_hidden, $num_output);

// Activation function
fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);

PREPARING DATA

Raw data → Extract → Filter → Human review → FANN-ready data

EXPERT AT WORK

// Test if the if is in a compressed format

// none need yet

// icon

// There is a parser specified in `Parser::$KEYWORD_PARSERS`

// $result should exist, regardless of $_message

// $a && $b and multidimensional

// numGlyphs + 1

// TODO : fix this; var_dump($var);

// if(ob_get_clean()){

//$annots .= ' /StructParent ';

// $cfg['Servers'][$i]['controlpass'] = 'pmapass';

INPUT VECTOR

'length' : size of the comment

'countDollar' : number of $

'countEqual' : number of =

'countObjectOperator' : number of -> operators ($o->p)

'countSemicolon' : number of semi-colon ;
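The five characteristics above can be computed with plain string functions. The following is a minimal sketch of a `makeVector()` helper; the slides call a function of that name but never show its body, so this implementation is an assumption. It counts the whole comment, markers included, which matches the sample vector `61 2 1 3 1` given in the input data for the `error_permission` comment.

```php
<?php
/**
 * Hypothetical implementation of makeVector(): turns a raw comment
 * into the five-value input vector described above.
 *
 * @return int[] [length, countDollar, countEqual, countObjectOperator, countSemicolon]
 */
function makeVector(string $comment): array
{
    return [
        strlen($comment),              // 'length': size of the comment in bytes
        substr_count($comment, '$'),   // 'countDollar'
        substr_count($comment, '='),   // 'countEqual' ('==' counts twice)
        substr_count($comment, '->'),  // 'countObjectOperator'
        substr_count($comment, ';'),   // 'countSemicolon'
    ];
}

$comment = '//$this->errors[] = $this->language->get(\'error_permission\');';
print implode(' ', makeVector($comment)) . "\n"; // prints "61 2 1 3 1"
```

Every value is a cheap count, so building vectors for thousands of comments stays fast.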

INPUT DATA

46 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...

 * This file is part of Exakat.
 *
 * Exakat is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Exakat is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with Exakat.  If not, see <http://www.gnu.org/licenses/>.
 *
 * The latest code can be found at <http://exakat.io/>.
 *
 */

// $x[3] or $x[] and multidimensional

//if ($round == 3) { die('Round '.$round);}

//$this->errors[] = $this->language->get('error_permission');

First line of the file: number of training cases, number of input values per case, number of output values per case
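The training file layout (a header line, then one line of inputs followed by one line of expected outputs for each case) is easy to generate. Here is a minimal sketch; the function name `buildTrainingData()` is hypothetical, and the two cases reuse vectors shown in the input data above.

```php
<?php
/**
 * Builds the text of a FANN training file.
 *
 * @param array $cases list of [inputs, outputs] pairs, e.g. [[825, 0, 0, 0, 1], [0]]
 * @return string header line, then an input line and an output line per case
 */
function buildTrainingData(array $cases): string
{
    if ($cases === []) {
        throw new InvalidArgumentException('At least one training case is required');
    }

    $numInputs  = count($cases[0][0]);
    $numOutputs = count($cases[0][1]);

    // Header: number of cases, inputs per case, outputs per case
    $lines = [count($cases) . ' ' . $numInputs . ' ' . $numOutputs];

    foreach ($cases as [$inputs, $outputs]) {
        if (count($inputs) !== $numInputs || count($outputs) !== $numOutputs) {
            throw new InvalidArgumentException('All cases must have the same shape');
        }
        $lines[] = implode(' ', $inputs);
        $lines[] = implode(' ', $outputs);
    }

    return implode("\n", $lines) . "\n";
}

$cases = [
    [[825, 0, 0, 0, 1], [0]], // license header: not code
    [[61, 2, 1, 3, 1], [1]],  // commented-out statement: code
];
print buildTrainingData($cases);
```

The returned string can be written to disk with `file_put_contents()` under whatever file name the training step expects.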

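TRAINING

$max_epochs = 500000;
$desired_error = 0.001;
// Assumption: the slide does not show this value; report progress every 1000 epochs
$epochs_between_reports = 1000;

// the actual training
if (fann_train_on_file($ann, 'incoming.data', $max_epochs, $epochs_between_reports, $desired_error)) {
    fann_save($ann, 'model.out');
}
fann_destroy($ann);
?>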

TRAINING

47 cases

5 characteristics

3 hidden neurons

+ 5 input neurons + 1 output neuron

Duration: 5.711 s

APPLICATION

<?php

$ann = fann_create_from_file('model.out');

$comment = '//$gvars = $this->getGraphicVars();';

$input = makeVector($comment);
$results = fann_run($ann, $input);

if ($results[0] > 0.8) {
    print "\"$comment\" -> $results[0] \n";
}

?>

RESULTS > 0.8

Answer between 0 and 1

Values range from -14 to 0.999

The closer to 1, the more likely the comment is code. The closer to 0, the more likely it is not.

Is this a percentage? Is this a carrot count?

It's a mix of counts…

REAL CASES

Tested on 14093 comments

Duration: 367.01 ms

Found 1960 issues (14%)

0.99999893 // $cfg['Servers'][$i]['controlhost'] = '';

0.99999928 //$_SESSION['Import_message'] = $message->getDisplay();

0.99999928 /* if (defined('SESSIONUPLOAD')) {
    // write sessionupload back into the loaded PMA session
    $sessionupload = unserialize(SESSIONUPLOAD);
    foreach ($sessionupload as $key => $value) {
        $_SESSION[$key] = $value;
    }

    // remove session upload data that are not set anymore
    foreach ($_SESSION as $key => $value) {
        if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX))
            == UPLOAD_PREFIX
            && ! isset($sessionupload[$key])
        ) {
            unset($_SESSION[$key]);
        }
    }
}

0.98780382 //LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232

0.99361396 // We have server(s) => apply default configuration

0.98383027 // Duration = as configured

0.99999928 // original -> translation mapping

0.97590065 // = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in

Found by FANN versus target:

True positive

False positive

True negative

False negative

0.99999923 // $cfg['Servers'][$i]['table_coords'] = 'pma__table_coords';

0.73295981 //(isset($attribs['height'])?$attribs['height']: 1);

0.99999851 // if ($key != null) did not work for index "0"

0.2104115 // the PASSWORD() function
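Counting the four outcomes takes only a threshold and a human verdict for each comment. The sketch below uses the 0.8 threshold from the application code and the four scores shown above. The function name `confusionMatrix()` is hypothetical, and the true/false verdicts attached to each score are assumptions made for illustration (they also assume the scores are listed in the same order as the comments).

```php
<?php
/**
 * Counts true/false positives and negatives.
 *
 * @param array $reviewed  list of [score, isCode] pairs; isCode is the human verdict
 * @param float $threshold scores strictly above this value are reported as code
 * @return int[] counts keyed by outcome name
 */
function confusionMatrix(array $reviewed, float $threshold = 0.8): array
{
    $matrix = [
        'truePositive'  => 0,
        'falsePositive' => 0,
        'trueNegative'  => 0,
        'falseNegative' => 0,
    ];

    foreach ($reviewed as [$score, $isCode]) {
        $found = $score > $threshold;
        if ($found && $isCode) {
            ++$matrix['truePositive'];
        } elseif ($found) {
            ++$matrix['falsePositive'];
        } elseif ($isCode) {
            ++$matrix['falseNegative'];
        } else {
            ++$matrix['trueNegative'];
        }
    }

    return $matrix;
}

// Assumed verdicts: true means a human reviewer considers the comment to be code.
$reviewed = [
    [0.99999923, true],  // $cfg[...] assignment
    [0.73295981, true],  // isset() ternary, below the threshold
    [0.99999851, false], // prose that mentions $key
    [0.2104115,  false], // prose about PASSWORD()
];

print_r(confusionMatrix($reviewed));
```

Moving the threshold trades false positives against false negatives, which is why it is worth recounting after every change to the model.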

RESULTS

1960 issues

More than 50% false positives

After an easy cleanup, 822 issues are still reported

14k comments, analyzed in 367 ms

Total coding time: 27 min

// = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in

/* vim: set expandtab sw=4 ts=4 sts=4: */
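The slides do not say what the easy cleanup consisted of. As one possible illustration, the sketch below applies a cheap rule after the network: a reported comment is kept only if it contains a PHP variable or ends like a statement. The function name `looksLikeCode()` and both rules are assumptions, not the author's filter.

```php
<?php
/**
 * Hypothetical post-filter applied to comments reported by the network.
 * Keeps a comment if it contains a variable ($name) or if its text
 * ends with ';', '{' or '}' once the comment markers are stripped.
 */
function looksLikeCode(string $comment): bool
{
    // Remove a leading //, /* or # marker and a trailing */ marker
    $body = trim(preg_replace('~^\s*(//|/\*|#)|\*/\s*$~', '', $comment));

    if (preg_match('/\$[A-Za-z_]/', $body) === 1) {
        return true;
    }

    return preg_match('/[;{}]$/', $body) === 1;
}

$reported = [
    '//$_SESSION[\'Import_message\'] = $message->getDisplay();',
    '// = (   59 x 84   ) mm  = (  2.32 x 3.31  ) in',
    '/* vim: set expandtab sw=4 ts=4 sts=4: */',
];

// array_values() reindexes the surviving comments from 0
$kept = array_values(array_filter($reported, 'looksLikeCode'));
print count($kept) . " of " . count($reported) . " reports kept\n";
```

A rule like this will still let through prose that mentions a variable, so it reduces false positives without eliminating them.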

LEARN BETTER, NOT HARDER

Better training data

Improve characteristics

Configure the neural network

Change algorithm

Automate learning

Update constantly

[Diagram: history data → Training → Model; real data → Model → Results; results feed back into the history data (feedback loop)]

BETTER TRAINING DATA

More data, more data, more data

Varied situations, real case situations

Include specific cases

Experience is capital

https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

IMPROVE CHARACTERISTICS

Add new characteristics

Remove the ones that are less interesting

Find the right set of characteristics

NETWORK CONFIGURATION

Input vector

Intermediate neurons

Activation function

Output vector

[Chart: time of training (ms), axis from 0 to 20000, for x values 1 to 10, with one series each for 1, 2, 3 and 4 layers]

CHANGE ALGORITHM

First add more data before changing algorithm

Try cascade2 algorithm from FANN

0.6 => 0 found

0.5 => 2 found

Not found by the first algorithm

FINDING THE BEST

Test with 2-4 layers, 10 neurons

Measure results

[Chart: measured results, axis from 0 to 9000, for x values 1 to 13, with one series each for 1, 2, 3 and 4 layers]
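Trying several layer counts and neuron counts is easy to automate. The following sketch enumerates the layer layouts and times a training callback for each one. The function names are hypothetical, and the callback here is a stub that does no training; in a real run it would build and train a network with the FANN extension for the given layout.

```php
<?php
/**
 * Lists layer layouts such as [5, 3, 3, 1]:
 * input size, then 1..$maxHiddenLayers hidden layers of equal size, then output size.
 *
 * @return int[][]
 */
function layerConfigurations(int $numInput, int $numOutput, int $maxHiddenLayers, int $maxNeurons): array
{
    $configurations = [];
    for ($layers = 1; $layers <= $maxHiddenLayers; ++$layers) {
        for ($neurons = 1; $neurons <= $maxNeurons; ++$neurons) {
            $configurations[] = array_merge(
                [$numInput],
                array_fill(0, $layers, $neurons),
                [$numOutput]
            );
        }
    }
    return $configurations;
}

/** Runs the training callback for one layout and returns the elapsed time in ms. */
function timeTraining(array $layout, callable $train): float
{
    $start = microtime(true);
    $train($layout);
    return (microtime(true) - $start) * 1000;
}

// Stub trainer (assumption): replace with real network creation and training.
$train = function (array $layout): void {
    // no-op placeholder
};

$timings = [];
foreach (layerConfigurations(5, 1, 4, 10) as $layout) {
    $timings[implode('-', $layout)] = timeTraining($layout, $train);
}

print count($timings) . " configurations measured\n";
```

Recording accuracy next to the timing for each layout makes it possible to pick the smallest network that performs well enough.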

DEEP LEARNING

Chaining the neural networks

Auto-encoders

Unsupervised Learning

Genetic algorithms, ant colony algorithms

OTHER TOOLS

PHP ext/fann

R language

https://github.com/kachkaev/php-r

Scikit-learn

https://github.com/scikit-learn/scikit-learn

Mahout

https://mahout.apache.org/

@exakat

https://joind.in/talk/42120

THANK YOU

OTHER CONFIGURATIONS

Activation function

FANN_SIGMOID_SYMMETRIC

FANN_LINEAR

FANN_THRESHOLD

FANN_SIN_SYMMETRIC

Linear, Threshold

Tangent

Gaussian, Quadratic

Sigmoid

WHICH APPLICATIONS?

Non-deterministic problems

Rule out everything that can be found systematically

Access to expertise and to feature vectors

A final layer after the results

Classification, prioritization, fast approximation

REINFORCEMENT LEARNING

Software

Real world

Reward, Action, Reaction

BAYESIAN FILTERS

GENETIC ALGORITHMS

Population → Selection → Reproduction → Variations → Population
