This notebook uses FLAML to fine-tune a transformer model from the Hugging Face `transformers` library.

**Requirements.** This notebook has additional requirements:

In [1]:
# %pip install torch transformers datasets ipywidgets flaml[blendsearch,ray]

## Tokenizer

In [1]:
from transformers import AutoTokenizer

In [2]:
MODEL_CHECKPOINT = "distilbert-base-uncased"

In [3]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT, use_fast=True)

In [4]:
tokenizer("this is a test")

{'input_ids': [101, 2023, 2003, 1037, 3231, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}

## Data

In [5]:
TASK = "cola"

In [6]:
import datasets

In [7]:
raw_dataset = datasets.load_dataset("glue", TASK, trust_remote_code=True)

Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


In [8]:
# define tokenization function used to process data
COLUMN_NAME = "sentence"
def tokenize(examples):
    return tokenizer(examples[COLUMN_NAME], truncation=True)

In [9]:
encoded_dataset = raw_dataset.map(tokenize, batched=True)

In [10]:
encoded_dataset["train"][0]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'idx': 0,
 'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102],
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
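
Note that `tokenize` truncates but does not pad, so encoded examples keep different lengths and padding has to happen later, batch by batch. The following is a minimal pure-Python sketch of that padding step, not the library's implementation. The helper name `pad_batch` is hypothetical, and the pad id of 0 is an assumption based on the `[PAD]` token of `distilbert-base-uncased`.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the two tokenizer outputs shown in this notebook.
short = [101, 2023, 2003, 1037, 3231, 102]
long = [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010,
        2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102]

batch = pad_batch([short, long])
print(batch["attention_mask"][0])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the tokenization step can leave sequences unpadded.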

## Model

In [11]:
from transformers import AutoModelForSequenceClassification

In [12]:
NUM_LABELS = 2
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=NUM_LABELS)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [13]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Metric

In [14]:
metric = datasets.load_metric("glue", TASK, trust_remote_code=True)

In [15]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

In [16]:
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
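
For CoLA, the value reported by the GLUE metric is the Matthews correlation. To make the two steps in `compute_metrics` concrete (argmax over logits, then scoring), here is a small dependency-free sketch. It is not the `datasets` implementation; the helper names and the toy logits and labels are made up for illustration.

```python
import math


def argmax_rows(logits):
    """Return the index of the largest value in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def matthews_corrcoef_binary(predictions, labels):
    """Matthews correlation for binary labels in {0, 1}."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when the score is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy values, not taken from the CoLA run in this notebook.
toy_logits = [[0.2, 0.8], [1.5, -0.3], [0.1, 0.4], [2.0, 0.5]]
toy_labels = [1, 0, 0, 0]

toy_predictions = argmax_rows(toy_logits)
mcc = matthews_corrcoef_binary(toy_predictions, toy_labels)
print(toy_predictions, round(mcc, 4))
```

The score ranges from -1 to 1, with 0 corresponding to chance-level predictions, which is why the tuning step later maximizes it.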

## Training (aka Finetuning)

In [17]:
from transformers import Trainer
from transformers import TrainingArguments

In [18]:
args = TrainingArguments(
    output_dir='output',
    do_eval=True,
)

In [19]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [20]:
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Step | Training Loss |
|------|---------------|
| 500  | 0.571         |
| 1000 | 0.5154        |
| 1500 | 0.3561        |


## Hyperparameter Optimization

`flaml.tune` is a module for economical hyperparameter tuning. It frees users from manually tuning the many hyperparameters of a piece of software, such as a machine learning training procedure.
Its API is compatible with Ray Tune.

### Step 1. Define training method

We define a function `train_distilbert(config: dict)` that accepts a hyperparameter configuration dict `config`. FLAML's search algorithm generates each concrete `config` from a given search space.


In [None]:
import flaml

def train_distilbert(config: dict):

    # Load CoLA dataset and apply tokenizer
    cola_raw = datasets.load_dataset("glue", TASK, trust_remote_code=True)
    cola_encoded = cola_raw.map(tokenize, batched=True)
    train_dataset, eval_dataset = cola_encoded["train"], cola_encoded["validation"]

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_CHECKPOINT, num_labels=NUM_LABELS
    )

    metric = datasets.load_metric("glue", TASK, trust_remote_code=True)
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return metric.compute(predictions=predictions, references=labels)

    training_args = TrainingArguments(
        output_dir='.',
        do_eval=False,
        disable_tqdm=True,
        logging_steps=20000,
        save_total_limit=0,
        **config,
    )

    trainer = Trainer(
        model,
        training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    # train model
    trainer.train()

    # evaluate model
    eval_output = trainer.evaluate()

    # report the metric to optimize
    flaml.tune.report(
        loss=eval_output["eval_loss"],
        matthews_correlation=eval_output["eval_matthews_correlation"],
    )

### Step 2. Define the search

We are now ready to define our search. This includes:

- The `search_space` for our hyperparameters
- The metric and the mode ('max' or 'min') for optimization
- The resources and constraints (`num_cpus`, `num_gpus`, `num_samples`, and `time_budget_s`)

In [None]:
max_num_epoch = 64
search_space = {
        # You can mix constants with search space objects.
        "num_train_epochs": flaml.tune.loguniform(1, max_num_epoch),
        "learning_rate": flaml.tune.loguniform(1e-6, 1e-4),
        "adam_epsilon": flaml.tune.loguniform(1e-9, 1e-7),
        "adam_beta1": flaml.tune.uniform(0.8, 0.99),
        "adam_beta2": flaml.tune.loguniform(98e-2, 9999e-4),
}
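
To see what a log-uniform range means in practice, the sketch below draws samples whose logarithm is uniformly distributed, using only the standard library. It illustrates the distribution and is not FLAML's sampler; `sample_loguniform` is a hypothetical helper, and the bounds are the `learning_rate` bounds from the search space above.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_loguniform(1e-6, 1e-4, rng) for _ in range(10_000)]

# About half of the draws fall below the geometric mean of the bounds (1e-5),
# i.e. each order of magnitude gets roughly the same share of samples.
below_geo_mean = sum(s < 1e-5 for s in samples) / len(samples)
print(f"fraction below 1e-5: {below_geo_mean:.2f}")
```

This is the usual reason for searching learning rates on a log scale: a plain uniform range over the same bounds would place almost all samples near the upper end.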

In [None]:
# optimization objective
HP_METRIC, MODE = "matthews_correlation", "max"

# resources
num_cpus = 4
num_gpus = 4

# constraints
num_samples = -1    # number of trials, -1 means unlimited
time_budget_s = 3600    # time budget in seconds
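
The two constraints combine as "stop when either one is hit". The sketch below spells out that logic conceptually; it is not FLAML's scheduler, and `budget_exhausted` is a hypothetical helper.

```python
import time


def budget_exhausted(n_trials_done, start_time, num_samples, time_budget_s, now=None):
    """Return True once the trial count or the time budget has been used up."""
    now = time.time() if now is None else now
    # A negative num_samples means there is no limit on the number of trials.
    out_of_trials = num_samples >= 0 and n_trials_done >= num_samples
    out_of_time = time_budget_s is not None and (now - start_time) >= time_budget_s
    return out_of_trials or out_of_time


# With num_samples = -1, only the clock can end the search.
print(budget_exhausted(22, start_time=0.0, num_samples=-1, time_budget_s=3600, now=100.0))
```

In a loop of this shape the check runs between trials, so a trial that is already in progress can carry the total run time past the budget.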

### Step 3. Launch with `flaml.tune.run`

We are now ready to launch the tuning using `flaml.tune.run`:

In [None]:
import time
import ray
start_time = time.time()
ray.shutdown()
ray.init(num_cpus=num_cpus, num_gpus=num_gpus)

print("Tuning started...")
analysis = flaml.tune.run(
    train_distilbert,
    search_alg=flaml.CFO(
        space=search_space,
        metric=HP_METRIC,
        mode=MODE,
        low_cost_partial_config={"num_train_epochs": 1}),
    # uncomment the following if scheduler = 'asha',
    # max_resource=max_num_epoch, min_resource=1,
    resources_per_trial={"gpu": num_gpus, "cpu": num_cpus},
    local_dir='logs/',
    num_samples=num_samples,
    time_budget_s=time_budget_s,
    use_ray=True,
)

ray.shutdown()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Tuning started...
(pid=11344) Reusing dataset glue (/home/ec2-user/.cache/huggingface/datasets/glue/cola/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)
(pid=11344) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
(pid=11344) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or wit


In [None]:
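# best trial according to HP_METRIC, compared across all reported results
best_trial = analysis.get_best_trial(HP_METRIC, MODE, "all")
# use a new name so the GLUE `metric` object defined earlier is not overwritten
best_metric = best_trial.metric_analysis[HP_METRIC][MODE]
print(f"n_trials={len(analysis.trials)}")
print(f"time={time.time()-start_time}")
print(f"Best model eval {HP_METRIC}: {best_metric:.4f}")
print(f"Best model parameters: {best_trial.config}")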


n_trials=22
time=3999.769361972809
Best model eval matthews_correlation: 0.5699
Best model parameters: {'num_train_epochs': 15.580684188655825, 'learning_rate': 1.2851507818900338e-05, 'adam_epsilon': 8.134982521948352e-08, 'adam_beta1': 0.99, 'adam_beta2': 0.9971094424784387}
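
The selection above uses the `"all"` scope. As we understand it, this takes each trial's best reported value and then compares trials on those values. The sketch below illustrates that idea with plain Python; it is not the Ray Tune or FLAML implementation, `best_trial_all_scope` is a hypothetical helper, and the numbers are made up rather than taken from this run.

```python
def best_trial_all_scope(histories, mode):
    """Pick the trial whose best reported value is best overall.

    `histories` maps a trial name to the list of metric values it reported.
    """
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    # Best value each trial ever reported.
    per_trial_best = {name: pick(values) for name, values in histories.items() if values}
    # Trial whose best value wins the comparison.
    best_name = pick(per_trial_best, key=per_trial_best.get)
    return best_name, per_trial_best[best_name]


# Made-up metric histories for three hypothetical trials.
toy_histories = {
    "trial_a": [0.10, 0.45, 0.40],
    "trial_b": [0.30, 0.35],
    "trial_c": [0.50],
}
print(best_trial_all_scope(toy_histories, "max"))
```

Because `train_distilbert` reports only once per trial, each history in this notebook has a single entry, so the choice of scope matters only once intermediate results are reported.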


## Next Steps

Notice that we only reported the metric with `flaml.tune.report` at the end of the full training loop. It is possible to report intermediate performance, which enables early stopping, as follows:

- Hugging Face provides _Callbacks_, which can be used to insert the `flaml.tune.report` call inside the training loop.
- Make sure to set `do_eval=True` in the `TrainingArguments` passed to `Trainer`, and adjust the evaluation frequency accordingly.
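
The two points above can be sketched as a callback that forwards evaluation metrics to a reporting function. To stay runnable without extra dependencies, the class below does not inherit from anything. In real use it would presumably subclass `transformers.TrainerCallback`, be passed to `Trainer` through its `callbacks` argument, and receive `flaml.tune.report` as `report_fn`. The class name `ReportOnEvaluate` and the fake metrics are assumptions for illustration only.

```python
class ReportOnEvaluate:
    """Forward evaluation metrics to a reporting function after each evaluation.

    Assumption: in real use this would subclass transformers.TrainerCallback,
    whose on_evaluate hook has the same signature as the method below.
    """

    def __init__(self, report_fn):
        self.report_fn = report_fn  # e.g. flaml.tune.report inside a trial

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if not metrics:
            return  # nothing to report
        self.report_fn(
            loss=metrics["eval_loss"],
            matthews_correlation=metrics["eval_matthews_correlation"],
        )


# Stand-in for the trainer: call the hook by hand with fake metrics.
reported = []
callback = ReportOnEvaluate(report_fn=lambda **kw: reported.append(kw))
callback.on_evaluate(
    None, None, None,
    metrics={"eval_loss": 0.5, "eval_matthews_correlation": 0.4},
)
print(reported)
```

With a hook like this in place, the final `flaml.tune.report` call in `train_distilbert` becomes redundant, since every evaluation already reports the same two values.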