# AutoML with FLAML Library


|  | | | |
|-----|--------|--------|--------|
| <img src="https://www.microsoft.com/en-us/research/uploads/prod/2020/02/flaml-1024x406.png" alt="drawing" width="200"/> 


<style>
td, th {
   border: none!important;
}
</style>
### Goal
In this notebook, we demonstrate how to use AutoML with FLAML to find the best model for our dataset.


## 1. Introduction

FLAML (https://github.com/microsoft/FLAML) is a Python library designed to automatically produce accurate machine learning models at low computational cost. Its simple, lightweight design makes it easy to use and to extend, for example by adding new learners. FLAML can
- serve as an economical AutoML engine,
- be used as a fast hyperparameter tuning tool, or
- be embedded in self-tuning software that requires low latency and low resource usage in repetitive tuning tasks.

In this notebook, we use one real-data example (binary classification) to show how to use the FLAML library.

FLAML requires `Python>=3.8`. To run this notebook example, please install the following packages.

In [None]:
%pip install "openml==0.14.2" "scikit-learn>=1.3.0" "rgf-python==3.12.0"


### Set the logging level

You can configure the logging level to suppress unnecessary outputs to keep the logs cleaner.

In [None]:
import logging
import warnings
 
logging.getLogger('synapse.ml').setLevel(logging.CRITICAL)
logging.getLogger('mlflow.utils').setLevel(logging.CRITICAL)
warnings.simplefilter('ignore', category=FutureWarning)
warnings.simplefilter('ignore', category=UserWarning)

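To see what raising a logger's level to `CRITICAL` does, here is a minimal standard-library sketch. The logger name `demo.noisy_library` and the `ListHandler` class are made up for this illustration; they are not part of FLAML, MLflow, or SynapseML.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# Hypothetical logger name, used only for this demonstration.
logger = logging.getLogger("demo.noisy_library")
handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep demo messages away from the root logger

# Same call pattern as in the cell above: only CRITICAL records pass.
logger.setLevel(logging.CRITICAL)
logger.info("progress update")       # dropped
logger.warning("deprecated option")  # dropped
logger.critical("cannot continue")   # kept

captured = handler.messages
print(captured)
```

Only the critical message reaches the handler, which is why the notebook output stays short once the two library loggers are set to this level.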

### Set up MLflow experiment tracking

MLflow is an open-source platform that is deeply integrated into the Data Science experience in Fabric. It lets you track and compare the performance of different models and experiments without manual bookkeeping. For more information, see [Autologging in Microsoft Fabric](https://aka.ms/fabric-autologging).

In [None]:
import mlflow

# Set the MLflow experiment to "automl-tutorial"; runs started below are logged under it
mlflow.set_experiment("automl-tutorial")


<Experiment: artifact_location='', creation_time=1725256490333, experiment_id='0499e3cc-c690-4b30-86b9-55838e7df486', last_update_time=None, lifecycle_stage='active', name='automl-tutorial', tags={}>

## 2. Classification Example
### Load data and preprocess

Download the [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given information about its scheduled departure.

In [None]:
from flaml.automl.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')


No permission to create OpenML directory at /home/trusted-service-user/.config/openml! This can result in OpenML-Python not working properly.
download dataset from openml
Dataset name: airlines
X_train.shape: (404537, 7), y_train.shape: (404537,);
X_test.shape: (134846, 7), y_test.shape: (134846,)


In [None]:
display(X_train.join(y_train))


### Run FLAML
In the FLAML AutoML run configuration, users can specify the task type, time budget, error metric, learner list, whether to subsample, the resampling strategy, and so on. Every argument has a default value that is used when the user does not provide one. For example, the log of the run below lists the default classification learners as `['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'catboost', 'lrl1']`.
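
As a rough illustration of how these options map to keyword arguments, the sketch below builds a settings dictionary of the kind passed to `automl.fit(**settings)`. The values are arbitrary examples, and `example_settings` is a name introduced here; the keys `estimator_list`, `eval_method`, and `sample` are FLAML argument names as we understand them, so check the FLAML documentation for your installed version.

```python
# Illustrative run configuration; every value here is an arbitrary example.
example_settings = {
    "task": "classification",               # task type
    "time_budget": 60,                      # total running time in seconds
    "metric": "roc_auc",                    # error metric to optimize
    "estimator_list": ["lgbm", "xgboost"],  # learner list
    "sample": True,                         # whether to subsample large data
    "eval_method": "holdout",               # resampling strategy: 'auto', 'cv' or 'holdout'
}

# Any key that is left out falls back to its default value.
for name, value in example_settings.items():
    print(f"{name} = {value!r}")
```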

In [None]:
''' import AutoML class from flaml package '''
from flaml import AutoML
automl = AutoML()


In [None]:
settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": 'accuracy',  # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "seed": 42,    # random seed
}


In [None]:
'''The main flaml automl API'''
with mlflow.start_run(run_name="flight_delays_baseline"):
    automl.fit(X_train=X_train, y_train=y_train, **settings)


[flaml.automl.logger: 09-03 04:46:00] {1787} INFO - task = classification
[flaml.automl.logger: 09-03 04:46:00] {1798} INFO - Evaluation method: holdout
[flaml.automl.logger: 09-03 04:46:00] {1901} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 09-03 04:46:01] {2019} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'catboost', 'lrl1']
[flaml.automl.logger: 09-03 04:46:01] {2329} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 09-03 04:46:02] {2464} INFO - Estimated sufficient time budget=158419s. Estimated necessary time budget=3905s.


[flaml.automl.logger: 09-03 04:46:27] {2513} INFO -  at 2.6s,	estimator lgbm's best error=0.3777,	best estimator lgbm's best error=0.3777
[flaml.automl.logger: 09-03 04:46:27] {2329} INFO - iteration 1, current learner lgbm


[flaml.automl.logger: 09-03 04:46:46] {2513} INFO -  at 27.9s,	estimator lgbm's best error=0.3777,	best estimator lgbm's best error=0.3777
[flaml.automl.logger: 09-03 04:46:46] {2329} INFO - iteration 2, current learner lgbm


[flaml.automl.logger: 09-03 04:47:07] {2513} INFO -  at 47.2s,	estimator lgbm's best error=0.3763,	best estimator lgbm's best error=0.3763
[flaml.automl.logger: 09-03 04:47:07] {2329} INFO - iteration 3, current learner lgbm


[flaml.automl.logger: 09-03 04:47:27] {2513} INFO -  at 67.6s,	estimator lgbm's best error=0.3635,	best estimator lgbm's best error=0.3635
[flaml.automl.logger: 09-03 04:47:27] {2329} INFO - iteration 4, current learner lgbm


[flaml.automl.logger: 09-03 04:47:45] {2513} INFO -  at 88.1s,	estimator lgbm's best error=0.3635,	best estimator lgbm's best error=0.3635
[flaml.automl.logger: 09-03 04:47:45] {2329} INFO - iteration 5, current learner lgbm


[flaml.automl.logger: 09-03 04:48:05] {2513} INFO -  at 106.2s,	estimator lgbm's best error=0.3611,	best estimator lgbm's best error=0.3611
[flaml.automl.logger: 09-03 04:48:06] {569} INFO - logging best model lgbm
[flaml.automl.logger: 09-03 04:48:09] {2756} INFO - retrain lgbm for 0.6s
[flaml.automl.logger: 09-03 04:48:09] {2759} INFO - retrained model: LGBMClassifier(colsample_bytree=0.8871559629536413,
               learning_rate=0.1292426830415275, max_bin=63,
               min_child_samples=14, n_estimators=1, n_jobs=-1, num_leaves=4,
               reg_alpha=0.02960826033957992, reg_lambda=0.023368135622249268,
               verbose=-1)
[flaml.automl.logger: 09-03 04:48:09] {2760} INFO - Auto Feature Engineering pipeline: None
[flaml.automl.logger: 09-03 04:48:09] {2762} INFO - Best MLflow run name: 
[flaml.automl.logger: 09-03 04:48:09] {2763} INFO - Best MLflow run id: 4d4f70aa-e916-4ed9-8209-b99191c4d5e1
[flaml.automl.logger: 09-03 04:48:27] {2055} INFO - fit succeeded
[fl

### Best model and metric

In [None]:
'''retrieve best config and best learner'''
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Best ML learner: lgbm
Best hyperparameter config: {'n_estimators': 66, 'num_leaves': 4, 'min_child_samples': 14, 'learning_rate': 0.1292426830415275, 'log_max_bin': 6, 'colsample_bytree': 0.8871559629536413, 'reg_alpha': 0.02960826033957992, 'reg_lambda': 0.023368135622249268}
Best accuracy on validation data: 0.6389
Training duration of best run: 0.5744 s
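
The reported accuracy follows directly from the training log, which states that FLAML minimizes `1-accuracy`. A small arithmetic check, using the final `best error` value printed in the log above:

```python
# Final "best error" from the training log (1 - accuracy on the holdout split).
best_error = 0.3611

# FLAML minimizes a loss, so accuracy is recovered as 1 - loss.
validation_accuracy = 1 - best_error
print(f"{validation_accuracy:.4g}")
```

This is the same conversion that the cell above applies to `automl.best_loss`.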


## 3. Model saving and prediction

### Save model


In [None]:
model_path = f"runs:/{automl.best_run_id}/model"

# Register the model to the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name="flight_delays_baseline")

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")


Registered model 'flight_delays_baseline' already exists. Creating a new version of this model...
2024/09/03 04:48:33 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: flight_delays_baseline, version 4
Created version '4' of model 'flight_delays_baseline'.


Model 'flight_delays_baseline' version 4 registered successfully.


### Predict with saved model

In [None]:
loaded_model = mlflow.sklearn.load_model(f"models:/{registered_model.name}/{registered_model.version}")

y_pred = loaded_model.predict(X_test)
print('Predicted labels', y_pred)
print('True labels', y_test)
y_pred_proba = automl.predict_proba(X_test)[:,1]


Predicted labels [1 0 1 ... 1 0 0]
True labels 118331    0
328182    0
335454    0
520591    1
344651    0
         ..
367080    0
203510    1
254894    0
296512    1
362444    0
Name: Delay, Length: 134846, dtype: category
Categories (2, object): ['0' < '1']



In [None]:
''' compute different metric values on testing dataset'''
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred, y_test.astype(float)))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba, y_test.astype(float)))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba, y_test.astype(float)))


accuracy = 0.6425997063316672
roc_auc = 0.6863937336290802
log_loss = 0.6294673392946836
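
`sklearn_metric_loss_score` returns a loss, which is why the cell above reports `1 - loss` for accuracy and ROC AUC but prints log loss unchanged. To make the three quantities concrete, here is a pure-Python sketch on four made-up samples. The helper functions below are illustrative re-implementations written for this note, not FLAML or scikit-learn code.

```python
import math


def accuracy(y_true, y_label):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_label)) / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, y_score, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for t, s in zip(y_true, y_score):
        s = min(max(s, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(s) + (1 - t) * math.log(1 - s)
    return total / len(y_true)


# Toy data: true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
y_label = [int(s >= 0.5) for s in y_score]  # threshold at 0.5

print("accuracy =", accuracy(y_true, y_label))
print("roc_auc  =", roc_auc(y_true, y_score))
print("log_loss =", round(log_loss(y_true, y_score), 4))
```

Accuracy and ROC AUC are better when higher, while log loss is better when lower, so only the first two need the `1 - loss` conversion.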


## 4. Customized Learner

Experienced AutoML users may have a preferred model to tune, or may already have a reasonably well hand-tuned model before launching the AutoML experiment. They need to search for good configurations of that customized model alongside the standard built-in learners.

FLAML can incorporate customized or new learners provided by users (preferably with a scikit-learn API) on the fly, as demonstrated below.

### Example of Regularized Greedy Forest

[Regularized Greedy Forest](https://arxiv.org/abs/1109.0887) (RGF) is a machine learning method currently not included in FLAML. RGF has many tuning parameters, the most critical of which are `[max_leaf, n_iter, n_tree_search, opt_interval, min_samples_leaf]`. To run a customized/new learner, the user needs to provide the following information:
* an implementation of the customized/new learner
* a list of hyperparameter names and types
* rough ranges of the hyperparameters (i.e., upper/lower bounds)
* initial values corresponding to low cost for cost-related hyperparameters (e.g., the initial values for `max_leaf` and `n_iter` should be small)

In this example, the above information for RGF is wrapped in a Python class called *MyRegularizedGreedyForest* that exposes the hyperparameters.

In [None]:
''' SKLearnEstimator is the super class for a sklearn learner '''
from flaml.automl.model import SKLearnEstimator
from flaml import tune
from flaml.automl.task.task import CLASSIFICATION


class MyRegularizedGreedyForest(SKLearnEstimator):
    def __init__(self, task='binary', **config):
        '''Constructor
        
        Args:
            task: A string of the task type, one of
                'binary', 'multiclass', 'regression'
            config: A dictionary containing the hyperparameter names
                and 'n_jobs' as keys. n_jobs is the number of parallel threads.
        '''

        super().__init__(task, **config)

        # task is 'binary' or 'multiclass' for a classification task
        if task in CLASSIFICATION:
            from rgf.sklearn import RGFClassifier

            self.estimator_class = RGFClassifier
        else:
            from rgf.sklearn import RGFRegressor
            
            self.estimator_class = RGFRegressor

    @classmethod
    def search_space(cls, data_size, task):
        '''[required method] search space

        Returns:
            A dictionary of the search space. 
            Each key is the name of a hyperparameter, and value is a dict with
                its domain (required) and low_cost_init_value, init_value,
                cat_hp_cost (if applicable).
                e.g.,
                {'domain': tune.randint(lower=1, upper=10), 'init_value': 1}.
        '''
        space = {        
            'max_leaf': {'domain': tune.lograndint(lower=4, upper=data_size[0]), 'init_value': 4, 'low_cost_init_value': 4},
            'n_iter': {'domain': tune.lograndint(lower=1, upper=data_size[0]), 'init_value': 1, 'low_cost_init_value': 1},
            'n_tree_search': {'domain': tune.lograndint(lower=1, upper=32768), 'init_value': 1, 'low_cost_init_value': 1},
            'opt_interval': {'domain': tune.lograndint(lower=1, upper=10000), 'init_value': 100},
            'learning_rate': {'domain': tune.loguniform(lower=0.01, upper=20.0)},
            'min_samples_leaf': {'domain': tune.lograndint(lower=1, upper=20), 'init_value': 20},
        }
        return space

    @classmethod
    def size(cls, config):
        '''[optional method] memory size of the estimator in bytes
        
        Args:
            config - the dict of the hyperparameter config

        Returns:
            A float of the memory size required by the estimator to train the
            given config
        '''
        max_leaves = int(round(config['max_leaf']))
        n_estimators = int(round(config['n_iter']))
        return (max_leaves * 3 + (max_leaves - 1) * 4 + 1.0) * n_estimators * 8

    @classmethod
    def cost_relative2lgbm(cls):
        '''[optional method] relative cost compared to lightgbm
        '''
        return 1.0


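To get a feel for the optional `size` method, the sketch below evaluates the same formula as a standalone function for two configurations: the low-cost initial values from the search space (`max_leaf=4`, `n_iter=1`) and a larger illustrative configuration. The function name `rgf_size_estimate` is introduced here for demonstration only.

```python
def rgf_size_estimate(config):
    """Same arithmetic as MyRegularizedGreedyForest.size: estimated bytes needed."""
    max_leaves = int(round(config["max_leaf"]))
    n_estimators = int(round(config["n_iter"]))
    return (max_leaves * 3 + (max_leaves - 1) * 4 + 1.0) * n_estimators * 8


# Low-cost starting point taken from the search space defaults.
small = rgf_size_estimate({"max_leaf": 4, "n_iter": 1})

# A larger, purely illustrative configuration.
larger = rgf_size_estimate({"max_leaf": 37, "n_iter": 222})

print(small)   # 200.0
print(larger)  # 454656.0
```

The estimate grows linearly in both `max_leaf` and `n_iter`, which is one reason to start both hyperparameters at small values.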

## 5. Customized Metric

It's also easy to customize the optimization metric. As an example, we demonstrate with a custom metric function which combines training loss and validation loss as the final loss to minimize.

In [None]:
def custom_metric(X_val, y_val, estimator, labels, X_train, y_train,
                  weight_val=None, weight_train=None, config=None,
                  groups_val=None, groups_train=None):
    from sklearn.metrics import log_loss
    import time
    start = time.time()
    y_pred = estimator.predict_proba(X_val)
    pred_time = (time.time() - start) / len(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels,
                         sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels,
                          sample_weight=weight_train)
    alpha = 0.5
    return val_loss * (1 + alpha) - alpha * train_loss, {
        "val_loss": val_loss, "train_loss": train_loss, "pred_time": pred_time
    }
    # two elements are returned:
    # the first element is the metric to minimize as a float number,
    # the second element is a dictionary of the metrics to log

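The value returned by `custom_metric` can be rewritten as `val_loss + alpha * (val_loss - train_loss)`, that is, the validation loss plus a penalty on the gap between validation and training loss. The sketch below checks this on made-up loss values; `combined_loss` is a helper name introduced only for this illustration.

```python
def combined_loss(val_loss, train_loss, alpha=0.5):
    """Same expression as the first return value of custom_metric."""
    return val_loss * (1 + alpha) - alpha * train_loss


# Two hypothetical models with the same validation loss.
well_generalized = combined_loss(val_loss=0.66, train_loss=0.64)
overfit = combined_loss(val_loss=0.66, train_loss=0.40)

print(round(well_generalized, 4))  # 0.67
print(round(overfit, 4))           # 0.79
```

With equal validation loss, the model whose training loss is much lower receives the larger combined loss, so the search is steered away from configurations that overfit.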

### Add Customized Learner and Metric

After adding RGF to the list of learners, we run AutoML to tune the hyperparameters of RGF as well as those of the built-in learners listed in `estimator_list`.

In [None]:
automl = AutoML()
automl.add_learner(learner_name='RGF', learner_class=MyRegularizedGreedyForest)


In [None]:
settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": custom_metric,  # pass the custom metric function here
    "estimator_list": ['RGF', 'lgbm', 'rf', 'xgboost'],  # list of ML learners
    "task": 'classification',  # task type
    "seed": 42,    # random seed
}


We can then pass this custom learner and metric function to automl's `fit` method.

In [None]:
with mlflow.start_run(run_name="flight_delays_rgf_metric"):
    automl.fit(X_train=X_train, y_train=y_train, **settings)


[flaml.automl.logger: 09-03 04:48:43] {1787} INFO - task = classification
[flaml.automl.logger: 09-03 04:48:43] {1798} INFO - Evaluation method: holdout
[flaml.automl.logger: 09-03 04:48:43] {1901} INFO - Minimizing error metric: customized metric
[flaml.automl.logger: 09-03 04:48:43] {2019} INFO - List of ML learners in AutoML Run: ['RGF', 'lgbm', 'rf', 'xgboost']
[flaml.automl.logger: 09-03 04:48:43] {2329} INFO - iteration 0, current learner RGF
[flaml.automl.logger: 09-03 04:48:44] {2464} INFO - Estimated sufficient time budget=329015s. Estimated necessary time budget=329s.


[flaml.automl.logger: 09-03 04:49:02] {2513} INFO -  at 1.8s,	estimator RGF's best error=0.6624,	best estimator RGF's best error=0.6624
[flaml.automl.logger: 09-03 04:49:02] {2329} INFO - iteration 1, current learner RGF


[flaml.automl.logger: 09-03 04:49:20] {2513} INFO -  at 19.8s,	estimator RGF's best error=0.6624,	best estimator RGF's best error=0.6624
[flaml.automl.logger: 09-03 04:49:20] {2329} INFO - iteration 2, current learner RGF


[flaml.automl.logger: 09-03 04:49:39] {2513} INFO -  at 38.4s,	estimator RGF's best error=0.6580,	best estimator RGF's best error=0.6580
[flaml.automl.logger: 09-03 04:49:39] {2329} INFO - iteration 3, current learner RGF


[flaml.automl.logger: 09-03 04:49:57] {2513} INFO -  at 57.1s,	estimator RGF's best error=0.6564,	best estimator RGF's best error=0.6564
[flaml.automl.logger: 09-03 04:49:57] {2329} INFO - iteration 4, current learner RGF


[flaml.automl.logger: 09-03 04:50:15] {2513} INFO -  at 75.1s,	estimator RGF's best error=0.6564,	best estimator RGF's best error=0.6564
[flaml.automl.logger: 09-03 04:50:15] {2329} INFO - iteration 5, current learner RGF


[flaml.automl.logger: 09-03 04:50:32] {2513} INFO -  at 92.9s,	estimator RGF's best error=0.6564,	best estimator RGF's best error=0.6564
[flaml.automl.logger: 09-03 04:50:32] {2329} INFO - iteration 6, current learner RGF


[flaml.automl.logger: 09-03 04:50:49] {2513} INFO -  at 110.6s,	estimator RGF's best error=0.6400,	best estimator RGF's best error=0.6400
[flaml.automl.logger: 09-03 04:50:50] {569} INFO - logging best model RGF
[flaml.automl.logger: 09-03 04:51:20] {2756} INFO - retrain RGF for 27.4s
[flaml.automl.logger: 09-03 04:51:20] {2759} INFO - retrained model: RGFClassifier(learning_rate=0.38416082968818005, max_leaf=37,
              min_samples_leaf=19, n_iter=222, n_tree_search=4,
              opt_interval=167)
[flaml.automl.logger: 09-03 04:51:20] {2760} INFO - Auto Feature Engineering pipeline: None
[flaml.automl.logger: 09-03 04:51:20] {2762} INFO - Best MLflow run name: 
[flaml.automl.logger: 09-03 04:51:20] {2763} INFO - Best MLflow run id: 4e7d64c8-227b-4596-a717-a160161778d3
[flaml.automl.logger: 09-03 04:51:35] {2055} INFO - fit succeeded
[flaml.automl.logger: 09-03 04:51:35] {2056} INFO - Time taken to find the best model: 110.61953806877136


In [None]:
'''retrieve best config and best learner'''
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
# best_loss holds the custom metric value here, so report it directly instead of 1 - loss
print('Best custom metric value on validation data: {0:.4g}'.format(automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Best ML learner: RGF
Best hyperparameter config: {'max_leaf': 37, 'n_iter': 222, 'n_tree_search': 4, 'opt_interval': 167, 'min_samples_leaf': 19, 'learning_rate': 0.38416082968818005}
Best custom metric value on validation data: 0.64
Training duration of best run: 27.35 s


## 6. Auto Featurization

Next, we introduce the new `featurization` module, which can automatically search for a feature engineering pipeline as part of the AutoML process.

This module uses HPO algorithms to select, transform, and construct features from raw data, with the aim of improving the model's predictive power.

Because the module is integrated with AutoML, feature engineering and model selection are optimized jointly in a single automated process.

Set the `featurization` parameter to `auto` to enable this module, to `force` to let FLAML choose a method for each stage, or to `off` to disable the module. On Fabric, it is set to `auto` by default.

Currently available feature engineering methods:
1. Stage `categorical`: Methods to encode categorical features. Available:
  - `ordinal`: Ordinal encoding for categorical features.
  

2. Stage `numerical`: Methods to transform numerical features. Available:
  - `null`: No transformation applied to numerical features.
  - `scaler_standard`: Standard scaling for numerical features, normalizing them to have zero mean and unit variance.
  - `scaler_minmax`: Min-Max scaling, transforming features by scaling each feature to a given range, typically [0, 1].
  - `scaler_maxabs`: MaxAbs scaling, scales each feature by its maximum absolute value. This is meant for data that is already centered at zero or sparse data.
  - `scaler_robust`: Robust scaling using statistics that are robust to outliers, particularly useful when dealing with features that contain many outliers.
  - `normalizer_sparse`: Normalization applied to sparse input, making each feature vector have unit norm.
    
  
3. Stage `selection`: Methods for feature selection. Available:
  - `null`: No feature selection is applied.
  - `cardinality`: Selecting features based on their cardinality.
  - `variance`: Selecting features based on variance threshold.
  

4. Stage `extraction`: Feature extraction methods, applicable based on task type. Available:
  - `null`: No feature extraction is applied.
  - `PCA`: Principal Component Analysis.
  - `LDA`: Linear Discriminant Analysis (for classification tasks only).
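
To make the `numerical` stage more concrete, the following pure-Python sketch applies three of the transforms named above to a single toy column. The functions reuse the method names for readability, but they are illustrative re-implementations of the standard definitions, not FLAML's code, and they assume a non-constant column.

```python
import statistics


def scaler_standard(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def scaler_minmax(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def scaler_maxabs(values):
    """Divide by the largest absolute value; zeros stay zero."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]


column = [-2.0, 0.0, 2.0, 4.0]  # made-up feature values

standard = scaler_standard(column)
minmax = scaler_minmax(column)
maxabs = scaler_maxabs(column)

print([round(v, 3) for v in standard])
print([round(v, 3) for v in minmax])
print(maxabs)  # [-0.5, 0.0, 0.5, 1.0]
```

Max-abs scaling does not shift the data, which is why it preserves zeros and suits sparse inputs, whereas standard and min-max scaling both move the origin.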

In [None]:
automl = AutoML()
automl.add_learner(learner_name='RGF', learner_class=MyRegularizedGreedyForest)


In [None]:
settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": custom_metric,  # pass the custom metric function here
    "estimator_list": ['RGF', 'lgbm', 'rf', 'xgboost'],  # list of ML learners
    "task": 'classification',  # task type
    "seed": 42,    # random seed
    "featurization": "auto",
}


In [None]:
with mlflow.start_run(run_name="flight_delays_autofe"):
    automl.fit(X_train=X_train, y_train=y_train, **settings)


[flaml.automl.logger: 09-03 04:51:43] {1787} INFO - task = classification
[flaml.automl.logger: 09-03 04:51:43] {1798} INFO - Evaluation method: holdout
[flaml.automl.logger: 09-03 04:51:44] {1901} INFO - Minimizing error metric: customized metric
Auto featurization is not supported for spark data. Featurization is turned off.
[flaml.automl.logger: 09-03 04:51:44] {2019} INFO - List of ML learners in AutoML Run: ['RGF', 'lgbm', 'rf', 'xgboost']
[flaml.automl.logger: 09-03 04:51:44] {2329} INFO - iteration 0, current learner RGF
[flaml.automl.logger: 09-03 04:51:44] {2464} INFO - Estimated sufficient time budget=108490s. Estimated necessary time budget=108s.


[flaml.automl.logger: 09-03 04:52:01] {2513} INFO -  at 1.1s,	estimator RGF's best error=0.6624,	best estimator RGF's best error=0.6624
[flaml.automl.logger: 09-03 04:52:01] {2329} INFO - iteration 1, current learner RGF


[flaml.automl.logger: 09-03 04:52:20] {2513} INFO -  at 18.6s,	estimator RGF's best error=0.6624,	best estimator RGF's best error=0.6624
[flaml.automl.logger: 09-03 04:52:20] {2329} INFO - iteration 2, current learner RGF


[flaml.automl.logger: 09-03 04:52:39] {2513} INFO -  at 37.7s,	estimator RGF's best error=0.6580,	best estimator RGF's best error=0.6580
[flaml.automl.logger: 09-03 04:52:39] {2329} INFO - iteration 3, current learner lgbm


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[categorical_features] = X[categorical_features].astype(str).astype("category")


[flaml.automl.logger: 09-03 04:52:59] {2513} INFO -  at 56.7s,	estimator lgbm's best error=0.6771,	best estimator RGF's best error=0.6580
[flaml.automl.logger: 09-03 04:52:59] {2329} INFO - iteration 4, current learner xgboost




[flaml.automl.logger: 09-03 04:53:19] {2513} INFO -  at 77.0s,	estimator xgboost's best error=0.6798,	best estimator RGF's best error=0.6580
[flaml.automl.logger: 09-03 04:53:19] {2329} INFO - iteration 5, current learner RGF


[flaml.automl.logger: 09-03 04:53:41] {2513} INFO -  at 96.8s,	estimator RGF's best error=0.6580,	best estimator RGF's best error=0.6580
[flaml.automl.logger: 09-03 04:53:41] {2329} INFO - iteration 6, current learner lgbm




[flaml.automl.logger: 09-03 04:54:00] {2513} INFO -  at 118.1s,	estimator lgbm's best error=0.6771,	best estimator RGF's best error=0.6580
[flaml.automl.logger: 09-03 04:54:01] {569} INFO - logging best model RGF
[flaml.automl.logger: 09-03 04:54:06] {2756} INFO - retrain RGF for 2.6s
[flaml.automl.logger: 09-03 04:54:06] {2759} INFO - retrained model: RGFClassifier(learning_rate=1.7841917324247605, max_leaf=4, min_samples_leaf=19,
              n_iter=7, n_tree_search=2, opt_interval=224)
[flaml.automl.logger: 09-03 04:54:06] {2760} INFO - Auto Feature Engineering pipeline: None
[flaml.automl.logger: 09-03 04:54:06] {2762} INFO - Best MLflow run name: 
[flaml.automl.logger: 09-03 04:54:06] {2763} INFO - Best MLflow run id: 3e40efd0-6742-433f-ac4f-0dd3e491a964
[flaml.automl.logger: 09-03 04:54:23] {2055} INFO - fit succeeded
[flaml.automl.logger: 09-03 04:54:23] {2056} INFO - Time taken to find the best model: 37.70584297180176


### Standalone Featurization Pipeline
Once the AutoML process completes, the featurization pipeline can be accessed and used independently of the AutoML process.

You can retrieve the feature engineering pipeline specifically through `automl.model.autofe`. 
Alternatively, for a comprehensive view of the preprocessing steps, including those from FLAML's existing preprocessors, use `automl.feature_transformer`.

- To view the configuration details of the entire pipeline, use `autofe.show_transformations()`.
- For a more interactive experience, the pipeline structure can be visualized by executing `display(autofe)` or simply `autofe`.

In [None]:
autofe = automl.model.autofe
display(autofe)
if autofe:
    # autofe could be None
    display(autofe.transform(X_test))


The full set of data preprocessors, including FLAML's existing preprocessing and featurization:

In [None]:
transformer = automl.feature_transformer
display(transformer)
display(transformer.transform(X_test))


<flaml.automl.data.DataTransformer at 0x7c5d2a52f070>


## 7. Visualization
The `flaml.visualization` module provides utility functions for plotting the optimization process using [plotly](https://plotly.com/python/).  Leveraging `plotly`, users can interactively explore experiment results. To use these plotting functions, simply provide your Hyperparameter Tuning & AutoML experiment results as input. Optional parameters can be added using keyword arguments.

### Available Plots
- plot_contour: Plot the parameter relationship as contour plot in the experiment.
- plot_edf: Plot the objective value EDF (empirical distribution function) of the experiment.
- plot_feature_importance: Plot importance for each feature in the dataset.
- plot_optimization_history: Plot optimization history of all trials in the experiment.
- plot_parallel_coordinate: Plot the high-dimensional parameter relationships in the experiment.
- plot_slice: Plot the parameter relationship as slice plot in the experiment.
- plot_timeline: Plot the timeline of the experiment.
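
As background for `plot_edf`, an empirical distribution function (EDF) gives, for each threshold `x`, the fraction of trials whose objective value is at most `x`. The sketch below computes it by hand for a few made-up trial losses. It only illustrates the concept and does not reproduce how `flaml.visualization` builds its figure.

```python
def edf(values, x):
    """Fraction of observed values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


# Hypothetical objective values from four trials (lower is better).
trial_losses = [0.68, 0.66, 0.66, 0.64]

# Evaluate the EDF at each distinct observed loss.
points = [(x, edf(trial_losses, x)) for x in sorted(set(trial_losses))]
print(points)  # [(0.64, 0.25), (0.66, 0.75), (0.68, 1.0)]
```

When minimizing, a curve that rises quickly at low loss values means many trials reached good objective values.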

In [None]:
import flaml.visualization as fviz
fig = fviz.plot_slice(automl)  # , params=['num_leaves', 'fe.extraction']
fig.show()

In [None]:
fig = fviz.plot_contour(automl, learner="RGF", params=["max_leaf", "learning_rate"])
fig.show()

In [None]:
fig = fviz.plot_edf(automl)
fig.show()


In [None]:
fig = fviz.plot_optimization_history(automl)
fig.show()

In [None]:
fig = fviz.plot_timeline(automl)
fig.show()

In [None]:
fig = fviz.plot_feature_importance(automl)
fig.show()

In [None]:
fig = fviz.plot_parallel_coordinate(automl)
fig.show()
