Documentation

Edge-ML is an Auto-ML library based on the mathematical approach MODL. It constitutes an automated Machine Learning pipeline that allows training a classifier from numerical, categorical or sequential data (e.g. web sessions, texts, logs…). The features of Edge-ML are divided into three modules:

I - Starter module: Auto Data Preparation

This module automates the data preparation required before using standard Machine Learning algorithms - Random Forest, XGBoost, neural networks...

0. View Help Page

The help page of Edge ML can be displayed by typing the following command in a terminal:
$> edgeML -?

1. Preparation of the input file

The training set is a text file which contains separated values, where the separator is a single character (e.g. a CSV file is suitable). Each column of this file corresponds to a numerical or categorical variable. The target is the variable to predict and must be categorical. A header is required which contains the names of the variables. Edge ML is able to determine whether a variable is numerical or categorical by examining the first 100 lines, but it is recommended to indicate the nature of the variables by using the special character '#' and the keywords 'numerical' or 'categorical'. For instance, let's consider the variables 'age', 'weight' and 'gender'. The following header indicates the types of these variables:

age#numerical    weight#numerical    gender#categorical

The target variable can be identified by using the keyword 'target'. In our example, we need to process the variables 'age' and 'weight' conditionally to the target variable 'gender'. In this case, the corresponding header is:

age#numerical    weight#numerical    gender#categorical#target

Alternatively, it is possible to specify the target variable when edgeML is launched, thanks to the keyword '-target' (see Section 2 for an example).

The keywords 'numerical', 'categorical' and 'target' can be replaced by the shorter versions 'num', 'cat' and 'tar'. The previous header example is equivalent to:

age#num    weight#num    gender#cat#tar

The keyword 'ignore' (or 'ign') allows you to ignore variables. The following header ignores the variable 'weight':

age#num    weight#num#ign    gender#cat#tar
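As an illustration, the annotated header above can be written and parsed back with a few lines of Python (the file name and the data rows below are made up for the example):

```python
import csv
import os
import tempfile

# Build a tiny tab-separated training file whose header uses the Edge ML
# type annotations described above (made-up example data).
rows = [
    ["age#num", "weight#num#ign", "gender#cat#tar"],  # annotated header
    ["34", "70", "M"],
    ["29", "55", "F"],
]

path = os.path.join(tempfile.gettempdir(), "toy_train.txt")
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Parse the annotations back by splitting each header cell on '#':
# the first token is the variable name, the rest are its keywords.
with open(path) as f:
    header = next(csv.reader(f, delimiter="\t"))
annotations = {cell.split("#")[0]: cell.split("#")[1:] for cell in header}
```

Here `annotations` maps each variable name to its keywords, e.g. `'gender'` to `['cat', 'tar']`.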

2. Univariate preprocessing: discretization / grouping

The following command line executes the binary file 'edgeML' in order to launch the data preprocessing (i.e. the learning of supervised discretization and grouping models):

edgeML -prepare data.txt -sep '\t' -target 'gender' -output model.txt -nbCore 3 -verbose

  • The keyword '-prepare' indicates that Edge ML will launch the preprocessing, and this keyword must be followed by the path of the training set.
  • The keyword '-sep' is used to define the field separator, which must be a single character. The characters '.' and '#' cannot be used as field separator. The following special characters can be used:

    • the tab is coded by           \t
    • the backslash is coded by     \\
    • the single quote is coded by  \'
    • the double quote is coded by  \"

  • The keyword '-target' is optional and allows you to specify the target variable if the header of the input file has not been prepared (by using #target).
  • The keyword '-output' is used to indicate the path of the output file which contains the learned discretization model.
  • The keyword '-nbCore' is optional; it allows you to define the number of cores to use. By default, Edge ML uses a single core if this keyword is missing. The number of cores can be replaced by 'max', which enables the dynamic mode of OpenMP (-nbCore max). In this case, all the cores of your computer will be exploited.
  • The keyword '-verbose' is optional; it enables the verbose mode of Edge ML.
  • The keyword '-sampleSize' is optional; it allows you to specify the size of the train set obtained by the reservoir sampling algorithm. This keyword must be followed by a value written in scientific notation (ex: -sampleSize 10e+6).
  • The keyword '-overfit' is optional; it allows you to train over-fitted models which include more intervals and groups than the optimal models. This keyword must be followed by the over-fitting rate, which is related to the decrease of the compression gains. For instance, '-overfit 0.1' leads to models with 10% lower compression gains. In some cases, the over-fitting option may be a good way to train accurate classifiers on recoded data.
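The '-sampleSize' option relies on reservoir sampling. As a rough sketch of the idea (the classic Algorithm R, not Edge ML's actual implementation), a uniform random sample of k lines can be drawn from a stream of unknown length in a single pass:

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Algorithm R: keep a uniform random sample of k items from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

sample = reservoir_sample(range(1_000_000), k=10)
```

Each item of the stream ends up in the sample with the same probability, without ever holding more than k items in memory.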

3. Recoding the data by using discretization and grouping models

The following command line executes the binary file 'edgeML' in order to recode a data file by using a previously learned discretization model:

edgeML -recode data2.txt -sep ',' -with model.txt -using proba -inplaceMode -output recodedData.txt

  • The keyword '-recode' indicates that Edge ML will recode a dataset by using a model file. In this case, some variables are added into the dataset. This keyword must be followed by the path of the dataset to be recoded.
  • As previously, the keyword '-sep' is used to define the field separator.
    The same special characters are supported (i.e. \t, \\, \', \").
  • The keyword '-with' indicates the path of the model file which is used to recode the dataset.
  • The keyword '-using' specifies the way of recoding the dataset which can be:

    • 'proba' to encode the conditional probabilities P(class|variable)
    • 'QI' to encode the quantity of information -log(P(class|variable))
    • 'partition' to encode the group IDs and the interval IDs

  • The keyword '-inplaceMode' is optional and allows you to remove the original variables from the output file. In all cases, the order of the rows does not change between the input and output files.
  • The keyword '-output' is used to indicate the path of the output file which contains the recoded dataset.
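To illustrate the relation between the 'proba' and 'QI' recodings described above: the quantity of information is simply the negative logarithm of the conditional probability. The base of the logarithm is an assumption here (natural log), and the probabilities are made-up examples:

```python
import math

# 'proba' recoding: the conditional probabilities P(class|variable).
probas = [0.5, 0.25, 0.9]

# 'QI' recoding: the quantity of information -log(P(class|variable)).
qi = [-math.log(p) for p in probas]

# A low conditional probability yields a high quantity of information,
# and vice versa.
assert qi[1] > qi[0] > qi[2]
```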

4. Display useless variables from a model

The following command line displays the names of the useless variables, i.e. those whose discretization/grouping model includes a single interval/group. This 'filter' variable selection algorithm reliably detects the non-informative variables, and can be used whatever classifier is learned afterwards (with regular ML approaches).

edgeML -display-useless-var-from model.txt

The keyword '-lessVar' is an option which displays the names of the variables with a nonzero weight within the learned ensemble classifier (see Section 8):

edgeML -display-useless-var-from model.txt -lessVar

5. Display useful variables from a model

The following command line displays the names of the useful variables, i.e. those whose discretization/grouping model includes several intervals/groups:

edgeML -display-usefull-var-from model.txt

The keyword '-lessVar' is an option which displays the names of the variables with a nonzero weight within the learned ensemble classifier (see Section 8):

edgeML -display-usefull-var-from model.txt -lessVar

6. Split a dataset into a train-set and a test-set

The following command line splits a dataset into a train-set and a test-set:

edgeML -split-train-test data.txt -rate 0.3

  • The keyword '-split-train-test' is followed by the path of the input data file.
  • The keyword '-rate' indicates the sampling rate; note that the test-set is always smaller than the train-set.

 => Two files are generated with the prefixes 'TRAIN_' and 'TEST_'.
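As an illustration of what such a split does (a sketch, not necessarily Edge ML's exact sampling procedure), each row can be sent to the test set with probability 'rate':

```python
import random

def split_train_test(rows, rate, seed=0):
    """Sketch of a random split: 'rate' is the fraction of rows sent to the
    test set, so with rate < 0.5 the test set is smaller than the train set."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (test if rng.random() < rate else train).append(row)
    return train, test

train, test = split_train_test(list(range(10_000)), rate=0.3)
```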

7. Clean the target column: keep lines with a valid label

If the input data file is corrupted (e.g. with incomplete lines or missing separators), the following command line cleans the target column, keeping only the lines with a valid label:

edgeML -keep-true-labels A#B#C -on data.txt -sep ','

  • The keyword '-keep-true-labels' is followed by the list of the valid labels separated by '#'.
  • The keyword '-on' indicates the path of the dataset to be cleaned, which includes a target variable tagged with the keyword '#target' within the header of the file.
  • As previously, the keyword '-sep' is used to define the field separator.
    The same special characters are supported (i.e. \t, \\, \', \").

 => A new file is generated with the prefix 'TARGET_CLEANED_'.
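The cleaning step can be sketched in a few lines of Python (illustrative only; the column names and the data below are made up):

```python
import csv
import io

# Valid labels, as passed with '-keep-true-labels A#B#C'.
valid_labels = {"A", "B", "C"}

# Made-up corrupted data: row '2' has an invalid label, row '3' is incomplete.
raw = "id,label#cat#target\n1,A\n2,X\n3\n4,B\n"
reader = csv.reader(io.StringIO(raw))
header = next(reader)

# Locate the column tagged with '#target' (or '#tar') in the header.
target_idx = next(i for i, cell in enumerate(header) if "#tar" in cell)

# Keep only complete rows whose target field carries a valid label.
clean = [row for row in reader
         if len(row) == len(header) and row[target_idx] in valid_labels]
```

Rows with a missing field or an invalid label are dropped; the surviving rows keep their original order.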

II - Medium module: Auto Modeling

This module automates the next steps of the automated Machine Learning pipeline, in order to learn a classifier. An efficient variable selection mechanism allows one to handle large datasets with a great number of variables.

8. Training of an Ensemble Naive Bayes classifier


The following command line executes the binary file 'edgeML' in order to learn a Bayesian ensemble classifier from data:

edgeML -learn-from data.txt -sep '\t' -target 'gender' -output model.txt

  • This step is similar to Section 2: the keyword '-learn-from' indicates that Edge ML will learn an ensemble classifier, in addition to the discretization and grouping models. This keyword must be followed by the path of the training set.
  • As previously, the keyword '-sep' is used to define the field separator.
     The same special characters are supported (i.e. \t, \\, \', \").
  • The keyword '-target' is optional and allows you to specify the target variable if the header of the input file has not been prepared (by using #target - see Section 1).
  • The keyword '-output' is used to indicate the path of the output file which contains the learned ensemble classifier. By default, the classifier is evaluated on the train set (see the end of the output file).
  • The keyword '-lessVariables' (or '-lessVar') enables a post-optimization algorithm which aims at improving the classifier by removing redundant and non-informative variables (this option increases the computing time).
  • Similarly to Section 2, the keywords '-nbCore', '-verbose', '-sampleSize' and '-overfit' can be used.
  • The keyword '-selectVarForm' is optional, and must be followed by the path of an already learned model. This option is particularly useful when large datasets are processed, and allows you to detect uninformative variables from a model which is learned on a smaller sample of the data. For instance:

edgeML -learn-from data.txt -sep '\t' -output model.txt -selectVarForm smallModel.txt

9. Evaluation on a test set

The previously obtained classifier can be evaluated by using the following command line:

edgeML -evaluate model.txt -on data_test.txt -sep '\t' -output evaluationReport.txt

  • The keyword '-evaluate' indicates that Edge ML will evaluate a previously  learned classifier. This keyword is followed by the path of the file which describes the ensemble classifier.
  • The keyword '-on' indicates the path of the test set.
  • As previously, the keyword '-sep' is used to define the field separator.
    The same special characters are supported (i.e. \t, \\, \', \").
  • The keyword '-output' is used to indicate the path of the output file which contains the evaluation report of the classifier.

10. Prediction of the class values on a new dataset

The following command line executes the binary file 'edgeML' in order to deploy a classifier on a dataset and predict the class values:

edgeML -deploy model.txt -on data2.txt -sep ',' -inplaceMode -output data_with_predictions.txt

  • The keyword '-deploy' indicates that Edge ML will apply a classifier on a dataset. In this case, some variables are added into the dataset. This keyword must be followed by the path of the file which describes the classifier.
  • The keyword '-on' indicates the path of the dataset to be completed with the predictions.
  • As previously, the keyword '-sep' is used to define the field separator. The same special characters are supported (i.e. \t, \\, \', \").
  • The keyword '-inplaceMode' is optional and allows you to remove the original variables from the output file. In all cases, the order of the rows does not change between the input and output files.
  • The keyword '-output' is used to indicate the path of the output file which contains the new dataset provided by the prediction of the class values.

11. Evaluate drift between train and test sets

The following command line evaluates the drift (i.e. changes of distribution) between the train and test sets, for each variable:

edgeML -evaluate-drift-between -trainSet train.csv -testSet test.csv -sep ',' -output driftReport.txt

  • The keyword '-trainSet' is followed by the path of the train set.
  • The keyword '-testSet' is followed by the path of the test set.
  • The same separator is required for both files. As previously, the keyword '-sep' is used to define the field separator. The same special characters are supported (i.e. \t, \\, \', \").
  • The keyword '-output' is used to indicate the path of the output file which contains the drift level for each variable shared by both files.
  • As options, the keywords '-nbCore', '-verbose' and '-sampleSize' can be used.
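Edge ML's drift estimate is based on the MODL approach. As a rough, unrelated illustration of what univariate drift means, one can compare the histograms of a variable in both sets with the total variation distance (this is not Edge ML's method; the bins and data below are made up):

```python
def histogram(values, edges):
    """Normalized histogram of 'values' over the bins defined by 'edges'."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Made-up values of one variable in the train and test sets.
edges = [0, 10, 20, 30, 40]
train_vals = [5, 15, 15, 25, 35]
test_vals = [5, 5, 15, 25, 35]

drift = tv_distance(histogram(train_vals, edges), histogram(test_vals, edges))
```

A drift of 0 means identical histograms; a drift close to 1 means the distributions barely overlap.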

III - Premium module: Auto Feature Engineering

This module allows one to treat sequential variables (e.g. texts, web sessions, logs…). Very robust sequential rules are extracted from the data, in order to characterize the class values to be predicted.

12. Preparation of the input file

Similarly to Section 1, the input file which contains the dataset needs to be prepared. First of all, three kinds of vector variables can be processed:

  • The 'sequences'
  • The 'lists'
  • The 'sets'

The implemented learning algorithm is able to extract sub-sequences, sub-lists and sub-sets in order to reliably describe the distribution of the class values (i.e. the target variable).

  • A sub-sequence consists of ordered symbols which are not necessarily contiguous
  • A sub-list consists of ordered symbols which are necessarily contiguous
  • A sub-set consists of unordered symbols which are present at least once
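The three matching semantics above can be sketched in Python (illustrative only, not Edge ML's internal algorithm):

```python
def matches_subsequence(pattern, seq):
    """Ordered symbols, not necessarily contiguous."""
    it = iter(seq)
    return all(sym in it for sym in pattern)  # consumes 'it' left to right

def matches_sublist(pattern, seq):
    """Ordered symbols, contiguous (a sliding-window check)."""
    n = len(pattern)
    return any(list(seq[i:i + n]) == list(pattern)
               for i in range(len(seq) - n + 1))

def matches_subset(pattern, seq):
    """Unordered symbols, each present at least once."""
    return set(pattern) <= set(seq)

seq = ["a", "c", "b", "d", "c"]
assert matches_subsequence(["a", "b", "c"], seq)  # a..b..c appear in order
assert not matches_sublist(["a", "b", "c"], seq)  # but not contiguously
assert matches_subset(["d", "a"], seq)            # both symbols present
```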

Level 1: Auto-detection

Edge ML is able to detect the vector variables by examining the first 100 lines. The vector variables are detected thanks to the following special characters within the dataset:
 
<..>   for the sequences
[..]   for the lists
{..}   for the sets

Here is an example of input dataset:

var1,var2,var3,var4,var5
1,a,<a,c,b,d,c>,[A,B],{a,b}
2,b,<b,e,c>,[B,C,F],{e,h,e}
...

In this case, the variable 'var3' is automatically detected as a sequence due to the special characters <..>
within the rows. Similarly, the variable 'var4' is detected as a list and the variable 'var5' is detected as a set.
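The delimiter-based detection above can be sketched as follows (illustrative only; a real parser must also handle the commas inside the vector fields when splitting the line):

```python
def detect_vector_type(field):
    """Classify a field as a sequence, list or set from its delimiters."""
    if field.startswith("<") and field.endswith(">"):
        return "sequence"
    if field.startswith("[") and field.endswith("]"):
        return "list"
    if field.startswith("{") and field.endswith("}"):
        return "set"
    return "scalar"

# Fields taken from the example dataset above.
assert detect_vector_type("<a,c,b,d,c>") == "sequence"
assert detect_vector_type("[A,B]") == "list"
assert detect_vector_type("{a,b}") == "set"
assert detect_vector_type("1") == "scalar"
```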

Level 2: Header

Alternatively, it is possible to specify the type of the vector variables by using special keywords within
the header (#seq, #list and #set). For instance, the header of the following dataset exploits these keywords:

var1,var2,var3#seq,var4#list,var5#set
1,a,<a,c,b,d,c>,[A,B],{a,b}
2,b,<b,e,c>,[B,C,F],{e,h,e}
...

In order to easily change the type of a vector variable, priority is given to the header.
For instance, in the following dataset the variable 'var4' is processed as a sequence due to the keyword '#seq', even if the rows contain the special characters [..].

var1, var2, var3#seq, var4#seq, var5#set
1,a,<a,c,b,d,c>,[A,B],{a,b}
2,b,<b,e,c>,[B,C,F],{e,h,e}

13. Mining of sequential rules as preprocessing

Similarly to Section 2, the vector variables can be pre-processed at the same time as the numerical and categorical variables. At this step, the learning algorithm extracts relevant sequential rules in order to characterize the distribution of the class values. The command line remains the same and uses several additional keywords:

edgeML -prepare data.txt -sep '\t' -output model.txt -durationOfRuleMining 20 -nbRules 100

  • The keyword '-durationOfRuleMining' indicates the duration (in minutes) of the rule-mining step, which is based on an anytime algorithm. If this keyword is missing, a suitable default value is computed from the dataset.
  • The keyword '-nbRules' indicates the maximum number of extracted rules. If this keyword is missing, a suitable default value is computed from the dataset.
  • The keyword '-lessRule' is optional and allows you to keep only independent rules. The complexity of this selection algorithm is O(nbRule^2) and may increase the computing time.
  • The rule-mining algorithm exploits an index loaded in memory which is used to accelerate the processing of the data. In certain cases, particularly when the number of distinct symbols is very large (e.g. textual data), this index may grow excessively. The keyword '-indexMemoryLimit' allows you to set the maximum size of this index in GB (with a default value of 5 GB). The index is erased if it exceeds the limit, and the algorithm then runs slower.

14. Training of an Ensemble classifier which exploits the sequential rules

Similarly to Section 8, a Bayesian ensemble classifier can be automatically learned from data which contains vector variables:

edgeML -learn-from data.txt -sep '\t' -output model.txt -durationOfRuleMining 20 -nbRules 100

  • The command remains the same, all the keywords described in Section 8 can be used.
  • In addition, the keywords presented in the previous Section 13 can also be used ('-durationOfRuleMining', '-nbRules', '-lessRule', '-indexMemoryLimit').

15. Recoding the data by using the sequential rules

Similarly to Section 3, a dataset can be recoded by using the learned model. The corresponding command line remains the same; all keywords presented in Section 3 can be used:

edgeML -recode data2.txt -sep ',' -with model.txt -using proba -inplaceMode -output recodedData.txt

In the particular case of the vector variables, the data is recoded as binary variables which correspond to the learned sequential rules. Given a particular column (i.e. a rule) and a particular row, the recoded value is '1' if the current observation matches the current rule, and '0' otherwise.
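This binary recoding can be sketched in Python, modeling each learned rule as a predicate over a sequence (illustrative only; the rules and rows below are made up):

```python
def subsequence_rule(pattern):
    """Build a rule that matches when 'pattern' occurs as a sub-sequence
    (ordered, not necessarily contiguous symbols)."""
    def matches(seq):
        it = iter(seq)
        return all(sym in it for sym in pattern)
    return matches

# Two made-up rules and two made-up observations (vector variable values).
rules = [subsequence_rule(["a", "b"]), subsequence_rule(["e"])]
rows = [["a", "c", "b"], ["b", "e", "c"]]

# One binary column per rule: 1 if the row matches the rule, 0 otherwise.
recoded = [[1 if rule(seq) else 0 for rule in rules] for seq in rows]
# The first row matches <a..b> only; the second row matches <e> only.
```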

16. Evaluation and Predictions on new data

The models which are learned from vector variables can be evaluated on a test set (in the same way as explained in Section 9) and can be exploited to make predictions on new data (in the same way as in Section 10). In both cases, the command lines remain exactly the same.

Command Line examples

Edge ML is installed as a command line tool, thus it can be integrated within any production environment and any programming language. The syntax is very compact, and your projects can be carried out by writing very few lines.

Learn a multi-class classifier

Edge ML is able to automatically learn multi-class classifiers. The proposed approach is based on a Bayesian formalism and consists of an ensemble method. Each SNB (Selective Naïve Bayes) classifier exploits the univariate MODL discretization and grouping methods. The resulting classifiers are robust (no over-fitting) and fully automatically built (no data cleaning, no parameter to be tuned, no cross validation). Here is a script example which learns a classifier (for more details see Sections 6, 8 and 9 of this help page):

$> wget edge-ml.fr/download/data/adult.txt
$> edgeML -split-train-test adult.txt -rate 0.3
$> edgeML -learn-from TRAIN_adult.txt -sep '\t' -output model.txt
$> edgeML -evaluate model.txt -on TEST_adult.txt -sep '\t' -output evaluationReport.txt

Recode a dataset

Edge ML can be used to recode a dataset in order to learn another classifier (e.g. random forest, GBM, etc.). There are two advantages of recoding data: i) a categorical variable with a large number of levels can be recoded into numerical variables which estimate the distribution of the class values; ii) the discretization and grouping models are regularized and reduce the risk of over-fitting, regardless of the trained classifier. Here is a script example which learns discretization and grouping models, then recodes a dataset (for more details see Sections 1 and 3 of this help page):

$> edgeML -prepare TRAIN_adult.txt -sep '\t' -output model.txt
$> edgeML -recode TEST_adult.txt -sep '\t' -with model.txt -using proba -output recodedData.txt

Uninformative variables detection

The discretization and grouping models which include a single interval/group correspond to variables which must be removed from the dataset. These variables are absolutely uninformative, with uniformly distributed class values over the numerical domains. Edge ML provides a reliable way of removing uninformative variables from a dataset, even if you train another classifier (e.g. random forest, GBM, etc.). The useless variables can be displayed as follows (for more details see Sections 4 and 5 of this help page):

$> edgeML -display-useless-var-from model.txt

Drift evaluation

In practice, a concept drift may occur between the train set and the data on which the model is deployed. In other words, the distributions of the explanatory variables may change between both datasets, which represents a risk of wrong predictions. Simply remove the variables with a high drift level in order to learn a robust classifier. Edge ML provides a reliable estimate of the drift levels (for more details see Section 11 of this help page):

$> edgeML -evaluate-drift-between -trainSet train.csv -testSet test.csv -sep ',' -output driftReport.txt

Python wrapper

Edge ML can be used within Jupyter Notebook and any Python script thanks to the provided wrapper. The syntax is very simple and similar to that of scikit-learn. Here is an overview of the wrapper.

Automated Machine Learning

# Fit a classifier without any parameter:
classifier.fit(X_train, y_train)

# Score the learned classifier on the Test set:
classifier.score(X_test,y_test)

Data recoding

# Recode new data by using the learned preprocessing models (i.e. discretization and grouping models):
classifier.transform(X_test)

# Learn and recode at the same time:
classifier.fit_transform(X_train, y_train)

Uninformative variables detection

# Detect uninformative variables:
removeList = classifier.uselessVar()

Drift evaluation

# Evaluate univariate drift between Train and Test sets:
classifier.driftEval(X_train,X_test)

For more details see the tutorials: tuto 1 and tuto 2