This commit is contained in:
J-shang 2023-05-08 18:37:28 +08:00 committed by GitHub
Parent 9ef60b1480
Commit 108e73cb3f
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
119 changed files: 588 additions and 6276 deletions

View File

@ -193,9 +193,9 @@ To update NNI to the latest version, add `--upgrade` flag to the above commands.
</ul>
<li><b>Compression</b></li>
<ul>
<li><a href="https://nni.readthedocs.io/en/latest/tutorials/pruning_quick_start_mnist.html">Pruning</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/tutorials/pruning_quick_start.html">Pruning</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/tutorials/pruning_speed_up.html">Pruning Speedup</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/tutorials/quantization_quick_start_mnist.html">Quantization</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/tutorials/quantization_quick_start.html">Quantization</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/tutorials/quantization_speed_up.html">Quantization Speedup</a></li>
</ul>
</ul>

View File

@ -0,0 +1,9 @@
Advanced Usage
==============
.. toctree::
:maxdepth: 2
Customize Setting <setting>
Fusion Compression <fusion_compress>
Module Fusion <module_fusion>

View File

@ -1,9 +0,0 @@
Advanced Usage
==============
.. toctree::
:maxdepth: 2
Customize Quantizer <../tutorials/quantization_customize>
Customize Scheduled Pruning Process <pruning_scheduler>
Utilities <compression_utils>

View File

@ -5,4 +5,4 @@ Best Practices
:hidden:
:maxdepth: 2
Pruning Transformer </tutorials/pruning_bert_glue>
Pruning Transformer </tutorials/new_pruning_bert_glue>

View File

@ -100,10 +100,10 @@ Distillation
Two distillers are supported in NNI 3.0. Fusing distillation with pruning or quantization can achieve better compression results and higher accuracy.
Please refer to :doc:`Distiller <../reference/compression_preview/distiller>` for more details.
Please refer to :doc:`Distiller <../reference/compression/distiller>` for more details.
Fusion Compressoin
Fusion Compression
------------------
Thanks to the new unified compression framework, it is now possible to perform pruning, quantization, and distillation simultaneously,

View File

@ -1,399 +0,0 @@
Compression Config Specification
================================
Common Keys in Config
---------------------
op_names
^^^^^^^^
A list of fully-qualified module names (e.g., ``['backbone.layers.0.ffn', ...]``) that will be compressed.
If a referenced module does not exist in the model, it will be ignored.
op_names_re
^^^^^^^^^^^
A list of regular expressions for matching module names, using the Python standard library ``re``.
The matched modules will be selected for compression.
op_types
^^^^^^^^
A list of type names of classes that inherit from ``torch.nn.Module``.
Only modules of the types in this list can be selected for compression.
If this key is not set, all module types can be selected.
If neither ``op_names`` nor ``op_names_re`` is set, all modules satisfying ``op_types`` are selected.
exclude_op_names
^^^^^^^^^^^^^^^^
A list of fully-qualified module names that are excluded from compression.
exclude_op_names_re
^^^^^^^^^^^^^^^^^^^
A list of regular expressions for matching module names.
The matched modules will be removed from the set of modules to be compressed.
exclude_op_types
^^^^^^^^^^^^^^^^
A list of type names of classes that inherit from ``torch.nn.Module``.
The module types in this list are excluded from compression.
target_names
^^^^^^^^^^^^
A list of legal compression target names; usually ``_input_``, ``weight``, ``bias``, and ``_output_`` are supported as compression targets.
Two kinds of targets are supported by design: module inputs/outputs (which should be tensors) and module parameters/buffers (see the sketch after this list):
- Inputs/Outputs: If the module input or output is a single tensor, directly set ``_input_`` for the input and ``_output_`` for the output.
``_input_{position_index}`` or ``_input_{arg_name}`` can be used to specify the input target,
i.e., for a forward function ``def forward(self, x: Tensor, y: Tensor, z: Any): ...``, ``_input_0`` or ``_input_x`` can be used to specify ``x`` to be compressed;
note that ``self`` is ignored when counting the position index.
Similarly, ``_output_{position_index}`` can be used to specify the output target if the output is a ``list/tuple``,
and ``_output_{dict_key}`` can be used if the output is a ``dict``.
- Parameters/Buffers: Directly use the attribute name to specify the target, e.g., ``weight``, ``bias``.
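For illustration, a hypothetical config that selects both the ``weight`` parameter and the first positional input of every ``Linear`` module might look like the following (the ``sparse_ratio`` value is only an example):

.. code-block:: python

    config = {
        'op_types': ['Linear'],
        # compress the weight parameter and the first positional input of each matched module
        'target_names': ['weight', '_input_0'],
        'sparse_ratio': 0.5
    }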
target_settings
^^^^^^^^^^^^^^^
A ``dict`` of target settings in the format ``{target_name: setting}``. A target setting usually configures how to compress the target.
All other keys (except these eight common keys) in a config are treated as shortcuts for target setting keys, and will be applied to all targets selected in this config.
For example, for a model with two ``Linear`` modules (named ``'fc1'`` and ``'fc2'``), the following configs have the same effect for pruning.
.. code-block:: python
shorthand_config = {
'op_types': ['Linear'],
'sparse_ratio': 0.8
}
standard_config = {
'op_names': ['fc1', 'fc2'],
'target_names': ['weight', 'bias'],
'target_settings': {
'weight': {
'sparse_ratio': 0.8,
'max_sparse_ratio': None,
'min_sparse_ratio': None,
'sparse_threshold': None,
'global_group_id': None,
'dependency_group_id': None,
'granularity': 'default',
'internal_metric_block': None,
'apply_method': 'mul',
},
'bias': {
'align': {
'target_name': 'weight',
'dims': [0],
},
'apply_method': 'mul',
}
}
}
.. Note:: Each compression target can only be configured once; re-configuration will not take effect.
Pruning Specific Configuration Keys
-----------------------------------
sparse_ratio
^^^^^^^^^^^^
A float number between 0 and 1, the sparse ratio of the pruning target or the total sparse ratio of a group of pruning targets.
For example, if the sparse ratio is 0.8 and the pruning target is a Linear module weight, 80% of the weight values will be masked after pruning.
max_sparse_ratio
^^^^^^^^^^^^^^^^
This key is usually used in combination with ``sparse_threshold`` and ``global_group_id`` to limit the maximum sparse ratio of each target.
A float number between 0 and 1; for each single pruning target, the sparse ratio after pruning will not be larger than this number,
i.e., at most ``max_sparse_ratio`` of the pruning target values will be masked.
min_sparse_ratio
^^^^^^^^^^^^^^^^
This key is usually used in combination with ``sparse_threshold`` and ``global_group_id`` to limit the minimum sparse ratio of each target.
A float number between 0 and 1; for each single pruning target, the sparse ratio after pruning will not be lower than this number,
i.e., at least ``min_sparse_ratio`` of the pruning target values will be masked.
sparse_threshold
^^^^^^^^^^^^^^^^
A float number. Different from ``sparse_ratio``, which configures a specific sparsity, ``sparse_threshold`` is usually used in adaptive sparsity cases.
``sparse_threshold`` is directly compared to the pruning metric (which differs between algorithms), and the positions whose metric is smaller than the threshold are masked.
The value range differs between pruning algorithms; please refer to the pruner documentation to see how to configure it.
In general, the higher the threshold, the higher the final sparsity.
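As a sketch of how these keys can be combined (the threshold value is illustrative and depends on the metric of the chosen pruner):

.. code-block:: python

    config = {
        'op_types': ['Linear'],
        'sparse_threshold': 0.1,   # adaptive sparsity, compared against the pruner's metric
        'min_sparse_ratio': 0.2,   # mask at least 20% of each target
        'max_sparse_ratio': 0.9    # mask at most 90% of each target
    }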
global_group_id
^^^^^^^^^^^^^^^
``global_group_id`` should be used jointly with ``sparse_ratio``.
All pruning targets that have the same ``global_group_id`` will be treated as a whole, and the ``sparse_ratio`` will be distributed across these pruning targets.
That means each pruning target might have a different sparse ratio after pruning, but the group sparse ratio will be the configured ``sparse_ratio``.
Note that the ``sparse_ratio`` in the same global group should be the same.
For example, suppose a model has three ``Linear`` modules (``'fc1'``, ``'fc2'``, ``'fc3'``),
and the expected total sparse ratio of these three modules is 0.5; then the config can be:
.. code-block:: python
config_list = [{
'op_names': ['fc1', 'fc2'],
'sparse_ratio': 0.5,
'global_group_id': 'linear_group_1'
}, {
'op_names': ['fc3'],
'sparse_ratio': 0.5,
'global_group_id': 'linear_group_1'
}]
dependency_group_id
^^^^^^^^^^^^^^^^^^^
All pruning targets that have the same ``dependency_group_id`` will be treated as a whole, and these targets will be pruned at the same positions.
For example, if layer A and layer B have the same ``dependency_group_id`` and their output channels are to be pruned, then A and B will have the same channel indexes pruned.
Note that the ``sparse_ratio`` in the same dependency group should be the same, and the prunable positions (after reduction by ``granularity``) should be the same;
for example, pruning targets should have the same number of output channels when pruning output channels.
This key is usually used on modules joined by an add operation, i.e., skip connections.
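For example, if ``conv1`` and ``conv2`` feed the same add operation (a skip connection), a config sketch (module names are illustrative) that forces them to prune the same output channel indexes could be:

.. code-block:: python

    config_list = [{
        'op_names': ['conv1', 'conv2'],
        'sparse_ratio': 0.5,
        # both layers are pruned at the same channel positions
        'dependency_group_id': 'skip_connection_1'
    }]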
granularity
^^^^^^^^^^^
Controls the granularity of the generated masks.
``default``, ``in_channel``, ``out_channel``, ``per_channel``, and a list of integers are supported:
- default: The pruner will automatically determine which kind of granularity to use, usually consistent with the paper.
- in_channel: The pruner will prune along dimension 1 of the weight parameter.
- out_channel: The pruner will prune along dimension 0 of the weight parameter.
- per_channel: The pruner will prune along the last dimension (-1) of the input/output.
- list of integers: Block sparsity will be applied. For example, ``[4, 4]`` applies 4x4 block sparsity on the last two dimensions of the weight parameter.
Note that ``in_channel`` and ``out_channel`` are not supported for input/output targets; please use ``per_channel`` instead.
``torch.nn.Embedding`` is special: its output dimension is dimension 1 of its weight, so to prune the output channels of an Embedding, set its granularity to ``in_channel`` as a workaround.
The following is an example for output channel pruning:
.. code-block:: python
config = {
'op_types': ['Conv2d'],
'sparse_ratio': 0.5,
'granularity': 'out_channel' # same as [1, -1, -1, -1]
}
apply_method
^^^^^^^^^^^^
By default, ``mul``. ``mul`` and ``add`` are supported for applying masks to pruning targets.
``mul`` means the pruning target will be masked by multiplying with a mask matrix containing 0s and 1s, where 0 represents a masked position and 1 an unmasked position.
``add`` means the pruning target will be masked by adding a mask matrix containing -1000 and 0, where -1000 represents a masked position and 0 an unmasked position.
Note that the value -1000 may become configurable in the future. ``add`` is usually used to mask activation modules such as Softmax.
Quantization Specific Configuration Keys
----------------------------------------
quant_dtype
^^^^^^^^^^^
By default, ``int8``. ``int`` and ``uint`` followed by the number of quantization bits are supported (e.g., ``int8``).
quant_scheme
^^^^^^^^^^^^
``affine`` or ``symmetric``. If this key is not set, the quantization scheme will be chosen by the quantizer;
most quantizers apply ``symmetric`` quantization.
granularity
^^^^^^^^^^^
Used to control the granularity of the quantization targets; by default, the whole tensor uses the same scale and zero point.
``per_channel`` and a list of integers are supported:
- ``per_channel``: Each (output) channel has its own scale and zero point.
- list of integers: The integer list is the block size. Each block has its own scale and zero point.
Each sub-config in the config list is a dict, and the scope of each setting (key) is only internal to each sub-config.
If multiple sub-configs are configured for the same layer, the later ones will overwrite the previous ones.
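A minimal quantization config sketch that combines these keys (module types, targets, and values are illustrative):

.. code-block:: python

    config_list = [{
        'op_types': ['Conv2d', 'Linear'],
        'target_names': ['_input_', 'weight'],
        'quant_dtype': 'int8',
        'quant_scheme': 'affine',
        # each output channel gets its own scale and zero point
        'granularity': 'per_channel'
    }]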
Distillation Specific Configuration Keys
----------------------------------------
lambda
^^^^^^
A float number. The scale factor of the distillation loss.
link
^^^^
A teacher module name or a list of teacher module names that the student module links to.
apply_method
^^^^^^^^^^^^
``mse`` or ``kl``.
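Putting the distillation keys together, a config sketch might look like the following (the module name and the linked teacher module name are illustrative):

.. code-block:: python

    config_list = [{
        'op_names': ['fc1'],
        'lambda': 0.1,          # scale factor of the distillation loss
        'link': 'teacher_fc1',  # name of the corresponding teacher module
        'apply_method': 'mse'
    }]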
.. Note:: The following legacy config format is also supported in NNI v3.0, and will be deprecated in NNI v3.2.
Common Keys in Config (Legacy)
------------------------------
op_types
^^^^^^^^
The type of the layers targeted by this sub-config.
If ``op_names`` is not set in this sub-config, all layers in the model that satisfy the type will be selected.
If ``op_names`` is set in this sub-config, the selected layers should satisfy both type and name.
op_names
^^^^^^^^
The name of the layers targeted by this sub-config.
If ``op_types`` is set in this sub-config, the selected layer should satisfy both type and name.
exclude
^^^^^^^
The ``exclude`` and ``sparsity`` keywords are mutually exclusive and cannot exist in the same sub-config.
If ``exclude`` is set in a sub-config, the layers selected by this config will not be compressed.
Special Keys for Pruning (Legacy)
---------------------------------
op_partial_names
^^^^^^^^^^^^^^^^
This key will be shared with `Quantization Config` in the future.
This key selects layers to be pruned whose names contain a common sub-string. NNI will traverse all module names in the model,
find the names that contain one of the ``op_partial_names``, and append them to ``op_names``.
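For example, a legacy sketch that selects every module whose name contains the sub-string ``'conv'`` (the sub-string and sparsity are illustrative):

.. code-block:: python

    config_list = [{
        'op_partial_names': ['conv'],
        'sparsity_per_layer': 0.5
    }]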
sparsity_per_layer
^^^^^^^^^^^^^^^^^^
The sparsity ratio of each selected layer.
E.g., a ``sparsity_per_layer`` of 0.8 means each selected layer will have 80% of its weight values masked.
If ``layer_1`` (500 parameters) and ``layer_2`` (1000 parameters) are selected in this sub-config,
then ``layer_1`` will have 400 parameters masked and ``layer_2`` will have 800 parameters masked.
total_sparsity
^^^^^^^^^^^^^^
The total sparsity ratio of all selected layers, which means the sparsity ratio may no longer be evenly distributed between layers.
E.g., a ``total_sparsity`` of 0.8 means 80% of the parameters in this sub-config will be masked.
If ``layer_1`` (500 parameters) and ``layer_2`` (1000 parameters) are selected in this sub-config,
then ``layer_1`` and ``layer_2`` will have a total of 1200 parameters masked;
how these parameters are distributed between the two layers is determined by the pruning algorithm.
sparsity
^^^^^^^^
``sparsity`` is an old config key from pruning v1; it has the same meaning as ``sparsity_per_layer``.
You can still use ``sparsity`` right now, but it will be deprecated in the future.
max_sparsity_per_layer
^^^^^^^^^^^^^^^^^^^^^^
This key is usually used with ``total_sparsity``. It limits the maximum sparsity ratio of each layer.
In the ``total_sparsity`` example above, 1200 parameters need to be masked, and all parameters in ``layer_1`` may be completely masked.
To avoid this situation, ``max_sparsity_per_layer`` can be set to 0.9, which means at most 450 parameters can be masked in ``layer_1``
and at most 900 parameters in ``layer_2``.
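The scenario above could be expressed with the following sketch (layer names are illustrative):

.. code-block:: python

    config_list = [{
        'op_names': ['layer_1', 'layer_2'],
        'total_sparsity': 0.8,
        # no single layer may exceed 90% sparsity
        'max_sparsity_per_layer': 0.9
    }]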
Special Keys for Quantization (Legacy)
--------------------------------------
quant_types
^^^^^^^^^^^
Currently, NNI supports three kinds of quantization types: 'weight', 'input', 'output'.
It can be set as a ``str`` or a ``List[str]``.
Note that 'weight' and 'input' are always quantized together, e.g., ``['input', 'weight']``.
quant_bits
^^^^^^^^^^
Bit length of quantization. The key is a quantization type set in ``quant_types`` and the value is the bit length,
e.g., ``{'weight': 8}``; when the value is an int, all quantization types share the same bit length.
quant_start_step
^^^^^^^^^^^^^^^^
Specific key for the ``QAT Quantizer``. Disable quantization until the model has run for a certain number of steps.
This allows the network to reach a more stable state,
where the output quantization ranges do not exclude a significant fraction of values. The default value is 0.
Examples
--------
Suppose we want to compress the following model::
class Model(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.dropout1 = nn.Dropout2d(0.25)
self.dropout2 = nn.Dropout2d(0.5)
self.fc1 = nn.Linear(9216, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
...
First, we need to determine where to compress. Use the following config list to specify all ``Conv2d`` modules and the module named ``fc1``::
config_list = [{'op_types': ['Conv2d']}, {'op_names': ['fc1']}]
Sometimes we may need to compress all modules of a certain type, except for a few special ones.
Writing out all the module names is laborious in this case; instead, we can use ``exclude`` to quickly specify the compression target modules::
config_list = [{
'op_types': ['Conv2d', 'Linear']
}, {
'exclude': True,
'op_names': ['fc2']
}]
The above two config lists are equivalent for the model we want to compress; both select ``conv1``, ``conv2``, and ``fc1`` as compression targets.
Let's take a simple pruning config list example: pruning all ``Conv2d`` modules with 50% sparsity, and pruning ``fc1`` with 80% sparsity::
config_list = [{
'op_types': ['Conv2d'],
'total_sparsity': 0.5
}, {
'op_names': ['fc1'],
'total_sparsity': 0.8
}]
Then if you want to try model quantization, here is a simple config list example::
config_list = [{
'op_types': ['Conv2d'],
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8}
}, {
'op_names': ['fc1'],
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8}
}]

View File

@ -1,309 +0,0 @@
Compression Evaluator
=====================
The ``Evaluator`` is used to package the training and evaluation process for a targeted model.
To explain why NNI needs an ``Evaluator``, let's first look at the general process of model compression in NNI.
In model pruning, some algorithms need to prune according to some intermediate variables (gradients, activations, etc.) generated during the training process,
and some algorithms need to gradually increase or adjust the sparsity of different layers during the training process,
or adjust the pruning strategy according to the performance changes of the model during the pruning process.
In model quantization, NNI has a quantization-aware training algorithm;
it can adjust the scales and zero points required for model quantization from time to time during the training process,
and may achieve better performance compared to post-training quantization.
In order to better support the above algorithms' needs and maintain the consistency of the interface,
NNI introduces the ``Evaluator`` as the carrier of the training and evaluation process.
.. note::
For users prior to NNI v2.8: NNI previously provided APIs like ``trainer``, ``traced_optimizer``, ``criterion``, ``finetuner``.
These APIs could be tedious in terms of user experience: users needed to swap the corresponding APIs frequently when switching compression algorithms.
``Evaluator`` is an alternative to the above interfaces; users only need to create the evaluator once and it can be used with all compressors.
For users of native PyTorch, :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` requires the user to encapsulate the training process as a function that exposes the specified interface,
which brings some complexity. But don't worry: in most cases this does not require changing much code.
For users of `PyTorchLightning <https://www.pytorchlightning.ai/>`__, :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` can be created with only a few lines of code based on your original Lightning code.
Here we give two examples of how to create an ``Evaluator`` for both native PyTorch and PyTorchLightning users.
TorchEvaluator
--------------
:class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` is for users who work in a native PyTorch environment (if you are using PyTorchLightning, please refer to `LightningEvaluator`_).
:class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` has six initialization parameters ``training_func``, ``optimizers``, ``criterion``, ``lr_schedulers``,
``dummy_input``, ``evaluating_func``.
* ``training_func`` is the training loop to train the compressed model.
It is a callable function with six input parameters ``model``, ``optimizers``,
``criterion``, ``lr_schedulers``, ``max_steps``, ``max_epochs``.
Please make sure each input argument of ``training_func`` is actually used,
especially that ``max_steps`` and ``max_epochs`` correctly control the duration of training.
* ``optimizers`` is a single traced optimizer or a list of traced optimizers;
please make sure to wrap the ``Optimizer`` class with ``nni.trace`` before initializing it/them.
* ``criterion`` is a callable function to compute loss, it has two input parameters ``input`` and ``target``, and returns a tensor as loss.
* ``lr_schedulers`` is a single traced scheduler or a list of traced schedulers; same as ``optimizers``,
please make sure to wrap the ``_LRScheduler`` class with ``nni.trace`` before initializing it/them.
* ``dummy_input`` is used to trace the model, same as ``example_inputs``
in `torch.jit.trace <https://pytorch.org/docs/stable/generated/torch.jit.trace.html?highlight=torch%20jit%20trace#torch.jit.trace>`_.
* ``evaluating_func`` is a callable function to evaluate the compressed model's performance. Its input is a compressed model and its output is a metric.
The metric should be a float number or a dict with the key ``default``.
Please refer to :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` for more details.
Here is an example of how to initialize a :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>`.
.. code-block:: python
from __future__ import annotations
from typing import Callable, Any
import torch
from torch.optim.lr_scheduler import StepLR, _LRScheduler
from torch.utils.data import DataLoader
from torchvision import datasets, models
import nni
from nni.compression.pytorch import TorchEvaluator
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def training_func(model: torch.nn.Module, optimizers: torch.optim.Optimizer,
criterion: Callable[[Any, Any], torch.Tensor],
lr_schedulers: _LRScheduler | None = None, max_steps: int | None = None,
max_epochs: int | None = None, *args, **kwargs):
model.train()
# prepare data
imagenet_train_data = datasets.ImageNet(root='data/imagenet', split='train', download=True)
train_dataloader = DataLoader(imagenet_train_data, batch_size=4, shuffle=True)
#############################################################################
# NNI may change the training duration by setting max_steps or max_epochs.
# To ensure that NNI has the ability to control the training duration,
# please add max_steps and max_epochs as constraints to the training loop.
#############################################################################
total_epochs = max_epochs if max_epochs else 20
total_steps = max_steps if max_steps else 1000000
current_steps = 0
# training loop
for _ in range(total_epochs):
for inputs, labels in train_dataloader:
inputs, labels = inputs.to(device), labels.to(device)
optimizers.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizers.step()
######################################################################
# stop the training loop when reach the total_steps
######################################################################
current_steps += 1
if total_steps and current_steps == total_steps:
return
# step the lr scheduler once per epoch if one is provided
if lr_schedulers is not None: lr_schedulers.step()
def evaluating_func(model: torch.nn.Module):
model.eval()
# prepare data
imagenet_val_data = datasets.ImageNet(root='./data/imagenet', split='val', download=True)
val_dataloader = DataLoader(imagenet_val_data, batch_size=4, shuffle=False)
# testing loop
correct = 0
with torch.no_grad():
for inputs, labels in val_dataloader:
inputs, labels = inputs.to(device), labels.to(device)
logits = model(inputs)
preds = torch.argmax(logits, dim=1)
correct += preds.eq(labels.view_as(preds)).sum().item()
return correct / len(imagenet_val_data)
# initialize the optimizer, criterion, lr_scheduler, dummy_input
model = models.resnet18().to(device)
######################################################################
# please use nni.trace wrap the optimizer class,
# NNI will use the trace information to re-initialize the optimizer
######################################################################
optimizer = nni.trace(torch.optim.Adam)(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
######################################################################
# please use nni.trace wrap the lr_scheduler class,
# NNI will use the trace information to re-initialize the lr_scheduler
######################################################################
lr_scheduler = nni.trace(StepLR)(optimizer, step_size=5, gamma=0.1)
dummy_input = torch.rand(4, 3, 224, 224).to(device)
# TorchEvaluator initialization
evaluator = TorchEvaluator(training_func=training_func, optimizers=optimizer, criterion=criterion,
lr_schedulers=lr_scheduler, dummy_input=dummy_input, evaluating_func=evaluating_func)
.. note::
It is also worth noting that not all arguments of :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` must be provided.
Some compressors only require ``evaluating_func`` as they do not train the model, and some compressors only require ``training_func``.
Please refer to each compressor's documentation to check the required arguments.
It is fine to provide more arguments than a compressor needs.
A complete example of a pruner using :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` to compress a model can be found :githublink:`here <examples/model_compress/pruning/taylorfo_torch_evaluator.py>`.
LightningEvaluator
------------------
:class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` is for the users who work with PyTorchLightning.
Users only need to modify three parts compared with the original PyTorch Lightning code:
1. Wrap the ``Optimizer`` and ``_LRScheduler`` class with ``nni.trace``.
2. Wrap the ``LightningModule`` class with ``nni.trace``.
3. Wrap the ``LightningDataModule`` class with ``nni.trace``.
Please refer to :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` for more details.
Here is an example of how to initialize a :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>`.
.. code-block:: python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
import torch
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader
from torchmetrics.functional import accuracy
from torchvision import datasets, models
import nni
from nni.compression.pytorch import LightningEvaluator
class SimpleLightningModel(pl.LightningModule):
def __init__(self):
super().__init__()
self.model = models.resnet18()
self.criterion = torch.nn.CrossEntropyLoss()
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = self.criterion(logits, y)
self.log("train_loss", loss)
return loss
def evaluate(self, batch, stage=None):
x, y = batch
logits = self(x)
loss = self.criterion(logits, y)
preds = torch.argmax(logits, dim=1)
acc = accuracy(preds, y, 'multiclass', num_classes=10)
if stage:
self.log(f"default", loss, prog_bar=False)
self.log(f"{stage}_loss", loss, prog_bar=True)
self.log(f"{stage}_acc", acc, prog_bar=True)
def validation_step(self, batch, batch_idx):
self.evaluate(batch, "val")
def test_step(self, batch, batch_idx):
self.evaluate(batch, "test")
#####################################################################
# please pay attention to this function,
# using nni.trace trace the optimizer and lr_scheduler class.
#####################################################################
def configure_optimizers(self):
optimizer = nni.trace(torch.optim.SGD)(
self.parameters(),
lr=0.01,
momentum=0.9,
weight_decay=5e-4,
)
scheduler_dict = {
"scheduler": nni.trace(StepLR)(
optimizer,
step_size=5,
gamma=0.1
),
"interval": "epoch",
}
return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}
class ImageNetDataModule(pl.LightningDataModule):
def __init__(self, data_dir: str = "./data/imagenet"):
super().__init__()
self.data_dir = data_dir
def prepare_data(self):
# download
datasets.ImageNet(self.data_dir, split='train', download=True)
datasets.ImageNet(self.data_dir, split='val', download=True)
def setup(self, stage: str | None = None):
if stage == "fit" or stage is None:
self.imagenet_train_data = datasets.ImageNet(root='data/imagenet', split='train')
self.imagenet_val_data = datasets.ImageNet(root='./data/imagenet', split='val')
if stage == "test" or stage is None:
self.imagenet_test_data = datasets.ImageNet(root='./data/imagenet', split='val')
if stage == "predict" or stage is None:
self.imagenet_predict_data = datasets.ImageNet(root='./data/imagenet', split='val')
def train_dataloader(self):
return DataLoader(self.imagenet_train_data, batch_size=4)
def val_dataloader(self):
return DataLoader(self.imagenet_val_data, batch_size=4)
def test_dataloader(self):
return DataLoader(self.imagenet_test_data, batch_size=4)
def predict_dataloader(self):
return DataLoader(self.imagenet_predict_data, batch_size=4)
#####################################################################
# please use nni.trace wrap the pl.Trainer class,
# NNI will use the trace information to re-initialize the trainer
#####################################################################
pl_trainer = nni.trace(pl.Trainer)(
accelerator='auto',
devices=1,
max_epochs=1,
max_steps=50,
logger=TensorBoardLogger('./lightning_logs', name="resnet"),
)
#####################################################################
# please use nni.trace wrap the pl.LightningDataModule class,
# NNI will use the trace information to re-initialize the datamodule
#####################################################################
pl_data = nni.trace(ImageNetDataModule)(data_dir='./data/imagenet')
evaluator = LightningEvaluator(pl_trainer, pl_data)
.. note::
In ``LightningModule.configure_optimizers``, users should use traced ``torch.optim.Optimizer`` and traced ``torch.optim._LRScheduler``.
This is so that NNI can get the initialization parameters of the optimizers and lr_schedulers.
.. code-block:: python
class SimpleModel(pl.LightningModule):
...
def configure_optimizers(self):
optimizers = nni.trace(torch.optim.SGD)(self.parameters(), lr=0.001)
lr_schedulers = nni.trace(ExponentialLR)(optimizer=optimizers, gamma=0.1)
return optimizers, lr_schedulers
A complete example of a pruner using :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` to compress a model can be found :githublink:`here <examples/model_compress/pruning/taylorfo_lightning_evaluator.py>`.

View File

@ -1,113 +0,0 @@
Analysis Utils for Model Compression
====================================
We provide several easy-to-use tools for users to analyze their model during model compression.
.. _topology-analysis:
Topology Analysis
-----------------
We provide several tools for topology analysis during model compression. These tools help users compress their models better. Because of the complex topology of networks, users often need to spend a lot of effort checking whether a compression configuration is reasonable. These topology-analysis tools reduce that burden.
ChannelDependency
^^^^^^^^^^^^^^^^^
Complicated models may contain residual connections or concat operations. When pruning such models, users need to be careful about the channel-count dependencies between the convolutional layers. Take the following residual block in resnet18 as an example. The output features of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` are added together, so the number of output channels of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` should be the same, or there may be a tensor shape conflict.
.. image:: ../../img/channel_dependency_example.jpg
:target: ../../img/channel_dependency_example.jpg
:alt:
If layers that have a channel dependency are assigned different sparsities (here we only discuss structured pruning by L1FilterPruner/L2FilterPruner), there will be a shape conflict between these layers. Even if the masked pruned model works fine, it cannot be directly sped up into the final model that runs on devices, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies to help users better prune their model.
Usage
"""""
.. code-block:: python
import torch
from nni.compression.pytorch.utils.shape_dependency import ChannelDependency
# `net` is the model to be analyzed, e.g., torchvision.models.resnet18().cuda()
data = torch.ones(1, 3, 224, 224).cuda()
channel_depen = ChannelDependency(net, data)
channel_depen.export('dependency.csv')
Output Example
""""""""""""""
The following lines are example output for torchvision.models.resnet18 exported by ChannelDependency. The layers on the same line have output channel dependencies with each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 have output channel dependencies with each other, which means the numbers of output channels (filters) of these three layers should be the same; otherwise, the model may have a shape conflict.
.. code-block:: bash
Dependency Set,Convolutional Layers
Set 1,layer1.1.conv2,layer1.0.conv2,conv1
Set 2,layer1.0.conv1
Set 3,layer1.1.conv1
Set 4,layer2.0.conv1
Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
Set 6,layer2.1.conv1
Set 7,layer3.0.conv1
Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
Set 9,layer3.1.conv1
Set 10,layer4.0.conv1
Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
Set 12,layer4.1.conv1
MaskConflict
^^^^^^^^^^^^
When the masks of different layers in a model conflict (for example, when different sparsities are assigned to layers that have a channel dependency), we can fix the mask conflict with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.), checks whether there is a mask conflict, and if so, sets the conflicting masks to the same value.
.. code-block:: python
from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict
fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
not_safe_to_prune
^^^^^^^^^^^^^^^^^
If we try to prune a layer whose output tensor is taken as input by a shape-constrained op (for example, view or reshape), such pruning may not be safe. For example, suppose we have a convolutional layer followed by a view function.
.. code-block:: python
x = self.conv(x) # output shape is (batch, 1024, 3, 3)
x = x.view(-1, 1024)
If the output shape of the pruned conv layer is not divisible by 1024 (for example, (batch, 500, 3, 3)), we may hit a shape error. We cannot replace such a function that directly operates on the tensor. Therefore, we need to be careful when pruning such layers. The function not_safe_to_prune finds all the layers followed by a shape-constrained function. Here is a usage example. If you hit a shape error when running forward inference on the sped-up model, you can exclude the layers returned by not_safe_to_prune and try again.
.. code-block:: python
not_safe = not_safe_to_prune(model, dummy_input)
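One possible way to use the returned layers, assuming ``not_safe`` is a list of module names, is to exclude them via the ``exclude`` key described in the config specification (a sketch, not the only option):

.. code-block:: python

    config_list = [{
        'op_types': ['Conv2d'],
        'sparsity': 0.5
    }, {
        # skip the layers that feed shape-constrained ops such as view/reshape
        'exclude': True,
        'op_names': not_safe
    }]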
.. _flops-counter:
Model FLOPs/Parameters Counter
------------------------------
We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating the FLOPs/parameters of a normal model without masks; it can also calculate the FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression with NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask, without taking the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
We support two modes for collecting module information. The first mode is ``default``, which only collects the information of convolution and linear layers. The second mode is ``full``, which also collects the information of other operations. Users can easily use the collected ``results`` for further analysis.
Usage
^^^^^
.. code-block:: python
from nni.compression.pytorch.utils import count_flops_params
# Given input size (1, 1, 28, 28)
flops, params, results = count_flops_params(model, (1, 1, 28, 28))
# Given input tensor with size (1, 1, 28, 28) and switch to full mode
x = torch.randn(1, 1, 28, 28)
flops, params, results = count_flops_params(model, (x,), mode='full') # tuple of tensor as input
# Format output size to M (i.e., 10^6)
print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')
print(results)
{
'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
}

View File

@ -41,7 +41,7 @@ Example
Pruning + Distillation
^^^^^^^^^^^^^^^^^^^^^^
The full example can be found `here <https://github.com/microsoft/nni/tree/master/examples/compression/pqd_fuse.py>`__.
The full example can be found `here <https://github.com/microsoft/nni/tree/master/examples/compression/fusion/pqd_fuse.py>`__.
The following code is a common pipeline with pruning first and then distillation.

View File

@ -1,74 +1,20 @@
Overview of NNI Model Compression
=================================
Deep neural networks (DNNs) have achieved great success in many tasks like computer vision, natural language processing, and speech processing.
However, typical neural networks are both computationally expensive and energy-intensive,
which makes them difficult to deploy on devices with limited computational resources or strict latency requirements.
Therefore, a natural thought is to perform model compression to reduce model size and accelerate model training/inference without losing performance significantly.
Model compression techniques can be divided into two categories: pruning and quantization.
Pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
We elaborate further on these two methods, pruning and quantization, in the following chapters. The figure below also visualizes the difference between them.
The NNI model compression framework has undergone a complete redesign in version 3.0,
seamlessly integrating pruning, quantization, and distillation methods.
Additionally, it provides more granular model compression configuration,
including compression granularity configuration, input/output compression configuration, and custom module compression.
Furthermore, the model speedup part of pruning now uses a graph analysis scheme based on torch.fx,
which supports sparsity propagation for more op types,
as well as customized sparsity propagation methods and replacement logic for special ops,
further enhancing the generality and robustness of model acceleration.
.. image:: ../../img/prune_quant.jpg
:target: ../../img/prune_quant.jpg
:scale: 40%
:align: center
:alt:
The current documentation for the new version of compression may not be complete, but there is no need to worry.
The optimizations in the new version are mostly focused on the underlying framework and implementation,
and there are no significant changes to the user interface.
Instead, it extends and remains compatible with the configuration of the previous version.
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
To compress their models, users only need to add several lines to their code.
Several popular model compression algorithms are built into NNI.
On the other hand, users can easily customize new compression algorithms using NNI's interface.
If you want to view the old compression documents, please refer to the `nni 2.10 compression doc <https://nni.readthedocs.io/en/v2.10/compression/overview.html>`__.
There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate the model pruning and quantization process with state-of-the-art strategies and NNI's auto-tuning power.
* Speed up a compressed model to give it lower inference latency and also make it smaller.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
Compression Pipeline
--------------------
.. image:: ../../img/compression_pipeline.png
:target: ../../img/compression_pipeline.png
:alt:
:align: center
:scale: 30%
The overall compression pipeline in NNI is shown above. For compressing a pretrained model, pruning and quantization can be used alone or in combination.
If users want to apply both, a sequential mode is recommended as common practice.
.. note::
Note that NNI pruners and quantizers are not meant to physically compact the model but to simulate the compression effect, whereas the NNI speedup tool can truly compress the model by changing the network architecture and therefore reduce latency.
To obtain a truly compact model, users should conduct :doc:`pruning speedup <../tutorials/pruning_speedup>` or :doc:`quantization speedup <../tutorials/quantization_speedup>`.
The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model.
The following figure shows how NNI prunes and speeds up your models.
.. image:: ../../img/nni_prune_process.png
:target: ../../img/nni_prune_process.png
:scale: 30%
:align: center
:alt:
The detailed tutorial of Speedup Model with Mask can be found :doc:`here <../tutorials/pruning_speedup>`.
The detailed tutorial of Speedup Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speedup>`.
.. attention::
NNI's model pruning framework has been upgraded to a more powerful version (named pruning v2 before nni v2.6).
The old version (`named pruning before nni v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) will be out of maintenance. If for some reason you have to use the old pruning,
v2.6 is the last nni version to support the old pruning version.
See :doc:`the major enhancement of compression in NNI 3.0 <./changes>`.

View File

@ -1,81 +0,0 @@
.. b6bdf52910e2e2c72085d03482d45340
Model Compression
=================
Deep neural networks (DNNs) have achieved great success in fields such as computer vision, natural language processing, and speech processing.
However, typical neural networks are compute- and energy-intensive, making them hard to deploy on devices with scarce computational resources
or strict latency requirements. A natural idea is therefore to compress the model,
reducing its size and accelerating training/inference without significantly degrading its performance.
Model compression techniques fall into two categories: pruning and quantization. Pruning methods explore the redundancy in model weights
and try to remove/prune the redundant and non-critical weights. Quantization refers to compressing a model by reducing the number of bits needed to represent weights or activations.
In the following chapters, we elaborate further on these two methods: pruning and quantization.
In addition, the figure below visualizes the difference between them.
.. image:: ../../img/prune_quant.jpg
:target: ../../img/prune_quant.jpg
:scale: 40%
:alt:
NNI provides an easy-to-use toolkit to help users design and apply pruning and quantization algorithms.
It uses a unified interface to support both TensorFlow and PyTorch.
Users only need to add a few lines of code to compress a model.
NNI also ships with several mainstream model compression algorithms.
Users can further leverage NNI's auto-tuning capability to find the best compressed model,
which is described in detail in the automatic model compression section.
On the other hand, users can customize new compression algorithms through NNI's interface.
NNI has the following core features:
* Many popular pruning and quantization algorithms built in.
* Automation of the model pruning and quantization process with state-of-the-art strategies and NNI's auto-tuning capability.
* Model speedup for lower inference latency.
* Friendly and easy-to-use compression utilities that let users dive into the compression process and results.
* A concise interface for users to customize their own compression algorithms.
Compression Pipeline
--------------------
.. image:: ../../img/compression_pipeline.png
:target: ../../img/compression_pipeline.png
:alt:
:align: center
:scale: 30%
The overall flow of model compression in NNI is shown in the figure above.
To compress a pretrained model, pruning and quantization can be used separately or jointly.
If users want to apply both, a sequential mode is recommended.
.. note::
It is worth noting that NNI's pruners and quantizers cannot change the network structure; they only simulate the effect of compression.
What truly compresses the model, changes the network structure, and reduces inference latency is NNI's speedup tool.
To obtain a truly compressed model, users need to run :doc:`pruning speedup <../tutorials/pruning_speedup>` or :doc:`quantization speedup <../tutorials/quantization_speedup>`.
The interfaces for PyTorch and TensorFlow are unified. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly check the performance of the compressed model by simulation.
For example, pruning algorithms use masks, and quantization algorithms still store values in float32.
Given the output masks and quantization bits produced by these algorithms, NNI's speedup tool can truly compress the model.
The figure below shows how NNI prunes and speeds up your models.
.. image:: ../../img/nni_prune_process.png
:target: ../../img/nni_prune_process.png
:scale: 30%
:align: center
:alt:
The detailed documentation on speeding up a model with masks can be found :doc:`here <../tutorials/pruning_speedup>`.
The detailed documentation on speeding up a model with a calibration config can be found :doc:`here <../tutorials/quantization_speedup>`.
.. attention::
NNI's model pruning framework has been upgraded to a more powerful version (called pruning v2 before NNI v2.6).
The old version (`named pruning before nni v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) is no longer maintained.
If for some reason you have to use the old pruning, v2.6 is the last version that supports the old pruning algorithms.

View File

@ -1,10 +1,11 @@
Pruner in NNI
=============
Pruning Algorithm Supported in NNI
==================================
NNI implements the main part of each pruning algorithm as a pruner. All pruners are implemented as closely as possible to what is described in the paper (if one exists).
The following table provides a brief introduction to the pruners implemented in NNI; click the links in the table to view more detailed introductions and use cases.
Note that not all pruners from the previous version have been migrated to the new framework yet.
NNI plans to migrate all previously implemented pruners in NNI 3.2.
There are two kinds of pruners in NNI, please refer to :ref:`basic pruner <basic-pruner>` and :ref:`scheduled pruner <scheduled-pruner>` for details.
If you believe that a certain old pruner has not been implemented or that another pruning algorithm would be valuable,
please feel free to contact us. We will prioritize and expedite support accordingly.
.. list-table::
:header-rows: 1
@ -12,35 +13,21 @@ There are two kinds of pruners in NNI, please refer to :ref:`basic pruner <basic
* - Name
- Brief Introduction of Algorithm
* - :ref:`level-pruner`
* - :ref:`new-level-pruner`
- Pruning the specified ratio of weight elements based on the absolute value of each weight element
* - :ref:`l1-norm-pruner`
* - :ref:`new-l1-norm-pruner`
- Pruning output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
* - :ref:`l2-norm-pruner`
* - :ref:`new-l2-norm-pruner`
- Pruning output channels with the smallest L2 norm of weights
* - :ref:`fpgm-pruner`
* - :ref:`new-fpgm-pruner`
- Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/abs/1811.00250>`__
* - :ref:`slim-pruner`
* - :ref:`new-slim-pruner`
- Pruning output channels by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
* - :ref:`activation-apoz-rank-pruner`
- Pruning output channels based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
* - :ref:`activation-mean-rank-pruner`
- Pruning output channels based on the metric that calculates the smallest mean value of output activations
* - :ref:`taylor-fo-weight-pruner`
* - :ref:`new-taylor-pruner`
- Pruning filters based on the first-order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
* - :ref:`admm-pruner`
- Pruning based on ADMM optimization technique `Reference Paper <https://arxiv.org/abs/1804.03294>`__
* - :ref:`linear-pruner`
* - :ref:`new-linear-pruner`
- Sparsity ratio increases linearly during the pruning rounds; in each round, a basic pruner is used to prune the model.
* - :ref:`agp-pruner`
* - :ref:`new-agp-pruner`
- Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
* - :ref:`lottery-ticket-pruner`
- The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
* - :ref:`simulated-annealing-pruner`
- Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
* - :ref:`auto-compress-pruner`
- Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
* - :ref:`amc-pruner`
- AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/abs/1802.03494>`__
* - :ref:`movement-pruner`
* - :ref:`new-movement-pruner`
- Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__

View File

@ -6,105 +6,3 @@ The pruning methods explore the redundancy in the model weights(parameters) and
The redundant elements are pruned from the model, their values are zeroed and we make sure they don't take part in the back-propagation process.
The following concepts can help you understand pruning in NNI.
Pruning Target
--------------
Pruning target means where we apply the sparsity.
Most pruning methods prune the weights to reduce the model size and accelerate inference.
Other pruning methods also apply sparsity to activations (e.g., inputs, outputs, or feature maps) to reduce inference latency.
NNI supports pruning module weights right now, and will support other pruning targets in the future.
.. _basic-pruner:
Basic Pruner
------------
A basic pruner generates the masks for each pruning target (weights) for a determined sparsity ratio.
It usually takes a model and a config list as input arguments, then generates masks for each pruning target.
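As a minimal sketch of how a basic pruner is typically driven (assuming the pruning v2 interface, where ``compress`` returns the masked model and the generated masks):

.. code-block:: python

    from nni.compression.pytorch.pruning import L1NormPruner

    config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5}]
    pruner = L1NormPruner(model, config_list)
    # generate masks for every pruning target selected by the config list
    masked_model, masks = pruner.compress()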
.. _scheduled-pruner:
Scheduled Pruner
----------------
A scheduled pruner decides how to allocate the sparsity ratio to each pruning target;
it also handles the model speedup (after each pruning iteration) and finetuning logic.
In terms of implementation, a scheduled pruner is a combination of a pruning scheduler, a basic pruner, and a task generator.
The task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to prune.
The basic pruner is reset with the model and config list given by the task generator, and then generates the masks.
For a clearer structure vision, please refer to the figure below.
.. image:: ../../img/pruning_process.png
:target: ../../img/pruning_process.png
:scale: 30%
:align: center
:alt:
For more information about the scheduled pruning process, please refer to :doc:`Pruning Scheduler <pruning_scheduler>`.
Granularity
-----------
Fine-grained pruning or unstructured pruning refers to pruning each individual weight separately.
Coarse-grained pruning or structured pruning prunes a regular group of weights, such as a convolutional filter.
Only :ref:`level-pruner` and :ref:`admm-pruner` support fine-grained pruning; all other pruners perform some kind of structured pruning on weights.
.. _dependency-aware-mode-for-output-channel-pruning:
Dependency-aware Mode for Output Channel Pruning
------------------------------------------------
Currently, we support dependency-aware mode in several pruners: :ref:`l1-norm-pruner`, :ref:`l2-norm-pruner`, :ref:`fpgm-pruner`,
:ref:`activation-apoz-rank-pruner`, :ref:`activation-mean-rank-pruner`, :ref:`taylor-fo-weight-pruner`.
In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on some specific metric (such as the L1 norm), and prunes the less important output channels.
We use pruning convolutional layers as an example to explain dependency-aware mode.
As the :ref:`topology analysis utils <topology-analysis>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency on each other (for more details please see :ref:`ChannelDependency <topology-analysis>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:scale: 80%
:align: center
:alt:
Suppose we prune the first 50% of the output channels (filters) of conv1, and the last 50% of the output channels of conv2.
Although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we add a dependency-aware mode for the pruners that can prune output channels.
In dependency-aware mode, the pruner prunes the model not only based on the metric of each output channel, but also on the topology of the whole network architecture.
In dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:scale: 80%
:align: center
:alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, the pruner will calculate the L1 norm (for example) sum of all the layers in the dependency set for each channel.
Obviously, the number of channels that can actually be pruned of this dependency set in the end is determined by the minimum sparsity of layers in this dependency set (denoted by ``min_sparsity``).
According to the L1 norm sum of each channel, the pruner will prune the same ``min_sparsity`` channels for all the layers.
Next, the pruner will additionally prune ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own L1 norm of each channel.
For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2, respectively.
In this case, the ``dependency-aware pruner`` will
* First, prune the same 20% of channels for `conv1` and `conv2` according to the L1 norm sum of `conv1` and `conv2`.
* Second, additionally prune 10% of the channels for `conv1` according to the L1 norm of each channel of `conv1`.
In addition, for the convolutional layers that have more than one filter group,
the ``dependency-aware pruner`` will also try to prune the same number of channels for each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
.. Note:: Operations that will be recognized as having channel dependencies: add/sub/mul/div, addcmul/addcdiv, logical_and/or/xor
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
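Enabling dependency-aware mode only requires switching the pruner's mode and providing a dummy input for topology analysis; a sketch (the input shape is illustrative):

.. code-block:: python

    import torch
    from nni.compression.pytorch.pruning import L1NormPruner

    config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5}]
    # dummy_input lets the pruner trace the graph and discover channel dependencies
    pruner = L1NormPruner(model, config_list, mode='dependency_aware',
                          dummy_input=torch.rand(1, 3, 224, 224))
    masked_model, masks = pruner.compress()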

View File

@ -1,75 +0,0 @@
Pruning Scheduler
=================
The pruning scheduler is a new feature supported in pruning v2. It brings more flexibility for pruning the model iteratively.
All the built-in iterative pruners (e.g., AGPPruner, SimulatedAnnealingPruner) are based on three abstracted components: pruning scheduler, pruners and task generators.
In addition to using the NNI built-in iterative pruners,
users can directly use the pruning schedulers to customize their own iterative pruning logic.
Workflow of Pruning Scheduler
-----------------------------
In iterative pruning, the final goal is broken down into smaller goals, and one small goal is completed in each iteration.
For example, each iteration may increase the sparsity ratio a little, so that after several pruning iterations the continuously pruned model reaches the final overall sparsity;
or the overall sparsity is fixed, different ways of allocating sparsity between layers are tried in each iteration, and the best allocation is found.
We define a small goal as a ``Task``; it usually includes state inherited from previous iterations (e.g., the pruned model and masks) and a description of the current goal (e.g., a config list that describes how to allocate sparsity).
Details about ``Task`` can be found in this :githublink:`file <nni/compression/pytorch/base/scheduler.py>`.
The pruning scheduler handles two main components: a basic pruner and a task generator. The logic of generating each ``Task`` is encapsulated in the task generator.
In an iteration (one pruning step), the pruning scheduler parses the ``Task`` obtained from the task generator,
and resets the pruner with the ``model``, ``masks``, and ``config_list`` parsed from the ``Task``.
Then the pruning scheduler generates the new masks with the pruner. During an iteration, the newly masked model may also go through speedup, finetuning, and evaluation.
After one iteration is done, the pruning scheduler collects the compact model, new masks, and evaluation score, packages them into a ``TaskResult``, and passes it to the task generator.
The iteration process ends when the task generator has no more ``Task`` to produce.
How to Customize Iterative Pruning
-----------------------------------
We use AGP pruning as an example to explain how to implement iterative pruning with the scheduler in NNI.
.. code-block:: python

    import torch

    from nni.compression.pytorch.pruning import L1NormPruner, PruningScheduler
    from nni.compression.pytorch.pruning.tools import AGPTaskGenerator

    # `model`, `config_list`, `finetuner`, `dummy_input` and `device` are defined in the full script linked below.
    # The basic pruner used in each iteration; model and config_list are supplied later by the task generator.
    pruner = L1NormPruner(model=None, config_list=None, mode='dependency_aware', dummy_input=torch.rand(10, 3, 224, 224).to(device))
    # The task generator splits the overall pruning goal into 10 per-iteration tasks.
    task_generator = AGPTaskGenerator(total_iteration=10, origin_model=model, origin_config_list=config_list, log_dir='.', keep_intermediate_result=True)
    scheduler = PruningScheduler(pruner, task_generator, finetuner=finetuner, speedup=True, dummy_input=dummy_input, evaluator=None, reset_weight=False)
    scheduler.compress()
    _, model, masks, _, _ = scheduler.get_best_result()
The full script can be found :githublink:`here <examples/model_compress/pruning/scheduler_torch.py>`.
In this example, we use the L1 Norm Pruner in dependency-aware mode as the basic pruner in each iteration.
Note that we do not need to pass ``model`` and ``config_list`` to the pruner, because in each iteration the ``model`` and ``config_list`` used by the pruner are received from the task generator.
Then we can use ``scheduler`` as an iterative pruner directly. In fact, this is how ``AGPPruner`` is implemented in NNI.
More about Task Generator
-------------------------
The task generator provides the model that needs to be pruned in each iteration along with the corresponding ``config_list``.
For example, ``AGPTaskGenerator`` provides the model pruned in the previous iteration and computes the sparsity used in the current iteration.
The ``TaskGenerator`` puts all this pruning information into a ``Task``; the pruning scheduler gets the ``Task`` and then runs it.
The pruning result is returned to the ``TaskGenerator`` at the end of each iteration, and the ``TaskGenerator`` decides whether and how to generate the next ``Task``.
The information included in the ``Task`` and ``TaskResult`` can be found :githublink:`here <nni/compression/pytorch/base/scheduler.py>`.
A clearer iterative pruning flow chart can be found :doc:`here <pruning>`.
If you want to implement your own task generator, please follow the ``TaskGenerator`` :githublink:`interface <nni/compression/pytorch/pruning/tools/base.py>`.
Two main methods should be implemented: ``init_pending_tasks(self) -> List[Task]`` and ``generate_tasks(self, task_result: TaskResult) -> List[Task]``. A minimal skeleton is sketched below.
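
The following sketch shows the shape of such a subclass. The import paths follow the file locations mentioned above, but the class name, the stopping logic, and especially the ``Task`` construction are illustrative assumptions rather than the exact NNI interface; check the linked source files before relying on it.

.. code-block:: python

    from typing import List

    from nni.compression.pytorch.base.scheduler import Task, TaskResult
    from nni.compression.pytorch.pruning.tools import TaskGenerator


    class MyTaskGenerator(TaskGenerator):
        """A hypothetical task generator that stops after a fixed number of iterations."""

        def init_pending_tasks(self) -> List[Task]:
            # Build the first Task from the original model and config list kept by the base class.
            # NOTE: the Task constructor arguments are assumptions,
            # see nni/compression/pytorch/base/scheduler.py for the real signature.
            ...

        def generate_tasks(self, task_result: TaskResult) -> List[Task]:
            # Inspect the result of the previous iteration (compact model, masks, score)
            # and decide whether another Task is needed; returning an empty list stops the scheduler.
            ...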
Why Use Pruning Scheduler
-------------------------
One of the benefits of using a scheduler for iterative pruning is that users can access more functionality of the NNI pruning components.
For the sake of a simple interface and faithfulness to the original papers, NNI does not fully expose all the low-level interfaces to the upper layer.
For example, resetting the weights to the values of the original model in each iteration is a key point of the lottery ticket pruning algorithm, and this is implemented in ``LotteryTicketPruner``.
To reduce the complexity of the interface, we only support this function in ``LotteryTicketPruner``, not in other pruners.
If users want to reset weights in each iteration of AGP pruning, ``AGPPruner`` cannot do this, but users can simply set ``reset_weight=True`` in ``PruningScheduler`` to achieve it.
What's more, for a customized pruner or task generator, using the scheduler makes it easy to enhance the algorithm.
In addition, users can also customize the scheduling process to implement their own scheduler.

Просмотреть файл

@ -7,4 +7,4 @@ format for model weights is 32-bit float, or FP32. Many research works have demo
can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
are an active field of research.
A quantizer is a quantization algorithm implementation in NNI.
A quantizer is a quantization algorithm implementation in NNI.

Просмотреть файл

@ -1,8 +0,0 @@
Quantization Tutorials
======================
.. toctree::
:hidden:
:maxdepth: 2
Quantize Transformer </tutorials/quantization_bert_glue>

Просмотреть файл

@ -10,15 +10,13 @@ The following table provides a brief introduction to the quantizers implemented
* - Name
- Brief Introduction of Algorithm
* - :ref:`naive-quantizer`
- Quantize weights to default 8 bits
* - :ref:`qat-quantizer`
* - :ref:`NewQATQuantizer`
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
* - :ref:`dorefa-quantizer`
* - :ref:`NewDorefaQuantizer`
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
* - :ref:`bnn-quantizer`
* - :ref:`NewBNNQuantizer`
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
* - :ref:`lsq-quantizer`
* - :ref:`NewLsqQuantizer`
- Learned step size quantization. `Reference Paper <https://arxiv.org/pdf/1902.08153.pdf>`__
* - :ref:`observer-quantizer`
* - :ref:`NewPtqQuantizer`
- Post training quantization. Collects quantization information during calibration with observers.

Просмотреть файл

@ -6,8 +6,9 @@ Compression
:maxdepth: 2
Overview <overview>
Config Specification <config_list>
Pruning <toctree_pruning>
Quantization <toctree_quantization>
Config Specification <compression_config_list>
Evaluator <compression_evaluator>
Advanced Usage <advanced_usage>
Evaluator <evaluator>
Advanced Usage <advance>
Enhancement <changes>

Просмотреть файл

@ -6,7 +6,7 @@ Pruning
:maxdepth: 2
Overview <pruning>
Quickstart </tutorials/pruning_quick_start_mnist>
Quickstart </tutorials/pruning_quick_start>
Pruner <pruner>
Speedup </tutorials/pruning_speedup>
Best Practices <best_practices>

Просмотреть файл

@ -6,7 +6,7 @@ Quantization
:maxdepth: 2
Overview <quantization>
Quickstart </tutorials/quantization_quick_start_mnist>
Quickstart </tutorials/quantization_quick_start>
Quantizer <quantizer>
SpeedUp </tutorials/quantization_speedup>
Quantization Tutorials <quantization_tutorials>
Quantize Transformer </tutorials/quantization_bert_glue>

Просмотреть файл

@ -1,8 +0,0 @@
Best Practices
==============
.. toctree::
:hidden:
:maxdepth: 2
Pruning Transformer </tutorials/new_pruning_bert_glue>

Просмотреть файл

@ -1,16 +0,0 @@
Overview of NNI Model Compression (Preview)
===========================================
The NNI model compression has undergone a completely new framework design in version 3.0,
seamlessly integrating pruning, quantization, and distillation methods.
Additionally, it provides a more granular model compression configuration,
including compression granularity configuration, input/output compression configuration, and custom module compression.
Furthermore, the model speedup part of pruning uses the graph analysis scheme based on torch.fx,
which supports more op types of sparsity propagation,
as well as custom special op sparsity propagation methods and replacement logic,
further enhancing the generality and robustness of model acceleration.
The current documentation for the new version of compression may not be complete, but there is no need to worry.
The optimizations in the new version are mostly focused on the underlying framework and implementation,
and there are no significant changes to the user interface.
Instead, there are more extensions and better compatibility with the configuration of the previous version.

Просмотреть файл

@ -1,33 +0,0 @@
Pruning Algorithm Supported in NNI
==================================
Note that not all pruners from the previous version have been migrated to the new framework yet.
NNI plans to migrate all of the previously implemented pruners in NNI 3.2.
If you believe that a certain old pruner has not been implemented or that another pruning algorithm would be valuable,
please feel free to contact us. We will prioritize and expedite support accordingly.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - :ref:`new-level-pruner`
- Pruning the specified ratio of weight elements based on the absolute value of each weight element
* - :ref:`new-l1-norm-pruner`
- Pruning output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
* - :ref:`new-l2-norm-pruner`
- Pruning output channels with the smallest L2 norm of weights
* - :ref:`new-fpgm-pruner`
- Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/abs/1811.00250>`__
* - :ref:`new-slim-pruner`
- Pruning output channels by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
* - :ref:`new-taylor-pruner`
- Pruning filters based on the first order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
* - :ref:`new-linear-pruner`
- The sparsity ratio increases linearly across pruning rounds; in each round, a basic pruner is used to prune the model.
* - :ref:`new-agp-pruner`
- Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
* - :ref:`new-movement-pruner`
- Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__

Просмотреть файл

@ -1,11 +0,0 @@
Overview of NNI Model Quantization
==================================
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations,
which can reduce the computations and the inference time. In the context of deep neural networks, the major numerical
format for model weights is 32-bit float, or FP32. Many research works have demonstrated that weights and activations
can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
is an active field of research.
A quantizer is a quantization algorithm implementation in NNI.
You can also :doc:`create your own quantizer <../tutorials/quantization_customize>` using NNI model compression interface.

Просмотреть файл

@ -1,8 +0,0 @@
Quickstart
==========
.. toctree::
:hidden:
:maxdepth: 2
Quantization Quickstart </tutorials/quantization_quick_start>

Просмотреть файл

@ -1,22 +0,0 @@
Quantizer in NNI
================
NNI implements the main part of a quantization algorithm as a quantizer. All quantizers are implemented as closely as possible to what is described in the paper (if one exists).
The following table provides a brief introduction to the quantizers implemented in NNI. Click the links in the table to view a more detailed introduction and use cases.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - :ref:`NewQATQuantizer`
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
* - :ref:`NewDorefaQuantizer`
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
* - :ref:`NewBNNQuantizer`
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
* - :ref:`NewLsqQuantizer`
- Learned step size quantization. `Reference Paper <https://arxiv.org/pdf/1902.08153.pdf>`__
* - :ref:`NewPtqQuantizer`
- Post training quantization. Collects quantization information during calibration with observers.

Просмотреть файл

@ -1,17 +0,0 @@
Compression (Preview)
=====================
.. toctree::
:hidden:
:maxdepth: 2
Overview <overview>
Enhancement <changes>
Config Specification <config_list>
Pruning <toctree_pruning>
Quantization <toctree_quantization>
Evaluator <evaluator>
Customize Setting <setting>
Fusion Compression <fusion_compress>
Module Fusion <module_fusion>

Просмотреть файл

@ -1,9 +0,0 @@
Pruning
=======
.. toctree::
:hidden:
:maxdepth: 2
Pruner <pruner>
Best Practices <best_practices>

Просмотреть файл

@ -1,11 +0,0 @@
Quantization
============
.. toctree::
:hidden:
:maxdepth: 2
Overview <quantization>
Quickstart <quantization_quick_start>
Quantizer <quantizer>
SpeedUp </tutorials/quantization_speedup>

Просмотреть файл

@ -117,8 +117,8 @@ linkcheck_ignore = [
r'https://cla\.opensource\.microsoft\.com',
r'https://www\.docker\.com/',
# remove after #5491 merged
r'https://github\.com/microsoft/nni/tree/master/examples/compression/pqd_fuse\.py',
# remove after 3.0 release
r'https://nni\.readthedocs\.io/en/v2\.10/compression/overview\.html',
]
# Ignore all links located in release.rst

Просмотреть файл

@ -46,7 +46,7 @@ More examples can be found in our :githublink:`GitHub repository <examples>`.
.. cardlinkitem::
:header: Get Started with Model Pruning on MNIST
:description: Familiarize yourself with pruning to compress your model
:link: tutorials/pruning_quick_start_mnist
:link: tutorials/pruning_quick_start
:image: ../img/thumbnails/pruning-tutorial.svg
:background: blue
:tags: Compression
@ -54,7 +54,7 @@ More examples can be found in our :githublink:`GitHub repository <examples>`.
.. cardlinkitem::
:header: Get Started with Model Quantization on MNIST
:description: Familiarize yourself with quantization to compress your model
:link: tutorials/quantization_quick_start_mnist
:link: tutorials/quantization_quick_start
:image: ../img/thumbnails/quantization-tutorial.svg
:background: indigo
:tags: Compression
@ -78,7 +78,7 @@ More examples can be found in our :githublink:`GitHub repository <examples>`.
.. cardlinkitem::
:header: Pruning Bert on Task MNLI
:description: An end to end example for how to using NNI pruning transformer and show the real speedup number
:link: tutorials/pruning_bert_glue
:link: tutorials/new_pruning_bert_glue
:image: ../img/thumbnails/pruning-tutorial.svg
:background: indigo
:tags: Compression

Просмотреть файл

@ -17,7 +17,6 @@ NNI Documentation
hpo/toctree
nas/toctree
compression/toctree
compression_preview/toctree
feature_engineering/toctree
experiment/toctree
@ -107,7 +106,7 @@ NNI makes AutoML techniques plug-and-play
.. codesnippetcard::
:icon: ../img/thumbnails/pruning-small.svg
:title: Model Pruning
:link: tutorials/pruning_quick_start_mnist
:link: tutorials/pruning_quick_start
.. code-block::
@ -129,7 +128,7 @@ NNI makes AutoML techniques plug-and-play
.. codesnippetcard::
:icon: ../img/thumbnails/quantization-small.svg
:title: Quantization
:link: tutorials/quantization_quick_start_mnist
:link: tutorials/quantization_quick_start
.. code-block::

Просмотреть файл

@ -20,7 +20,6 @@ NNI 文档
超参调优 <hpo/toctree>
架构搜索 <nas/toctree>
模型压缩 <compression/toctree>
模型压缩(预览) <compression_preview/toctree>
特征工程 <feature_engineering/toctree>
实验管理 <experiment/toctree>
@ -111,7 +110,7 @@ NNI 使得自动机器学习技术即插即用
.. codesnippetcard::
:icon: ../img/thumbnails/pruning-small.svg
:title: 模型剪枝
:link: tutorials/pruning_quick_start_mnist
:link: tutorials/pruning_quick_start
:seemore: 点这里阅读完整教程
.. code-block::

Просмотреть файл

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: NNI \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-04-13 03:14+0000\n"
"POT-Creation-Date: 2023-05-08 16:52+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
@ -23,122 +23,187 @@ msgstr ""
#: ../../source/compression/overview.rst:4
msgid ""
"Deep neural networks (DNNs) have achieved great success in many tasks "
"like computer vision, nature launguage processing, speech processing. "
"However, typical neural networks are both computationally expensive and "
"energy-intensive, which can be difficult to be deployed on devices with "
"low computation resources or with strict latency requirements. Therefore,"
" a natural thought is to perform model compression to reduce model size "
"and accelerate model training/inference without losing performance "
"significantly. Model compression techniques can be divided into two "
"categories: pruning and quantization. The pruning methods explore the "
"redundancy in the model weights and try to remove/prune the redundant and"
" uncritical weights. Quantization refers to compress models by reducing "
"the number of bits required to represent weights or activations. We "
"further elaborate on the two methods, pruning and quantization, in the "
"following chapters. Besides, the figure below visualizes the difference "
"between these two methods."
"The NNI model compression has undergone a completely new framework design"
" in version 3.0, seamlessly integrating pruning, quantization, and "
"distillation methods. Additionally, it provides a more granular model "
"compression configuration, including compression granularity "
"configuration, input/output compression configuration, and custom module "
"compression. Furthermore, the model speedup part of pruning uses the "
"graph analysis scheme based on torch.fx, which supports more op types of "
"sparsity propagation, as well as custom special op sparsity propagation "
"methods and replacement logic, further enhancing the generality and "
"robustness of model acceleration."
msgstr ""
#: ../../source/compression/overview.rst:19
#: ../../source/compression/overview.rst:13
msgid ""
"NNI provides an easy-to-use toolkit to help users design and use model "
"pruning and quantization algorithms. For users to compress their models, "
"they only need to add several lines in their code. There are some popular"
" model compression algorithms built-in in NNI. On the other hand, users "
"could easily customize their new compression algorithms using NNIs "
"interface."
"The current documentation for the new version of compression may not be "
"complete, but there is no need to worry. The optimizations in the new "
"version are mostly focused on the underlying framework and "
"implementation, and there are not significant changes to the user "
"interface. Instead, there are more extensions and compatibility with the "
"configuration of the previous version."
msgstr ""
#: ../../source/compression/overview.rst:24
msgid "There are several core features supported by NNI model compression:"
msgstr ""
#: ../../source/compression/overview.rst:26
msgid "Support many popular pruning and quantization algorithms."
msgstr ""
#: ../../source/compression/overview.rst:27
#: ../../source/compression/overview.rst:18
msgid ""
"Automate model pruning and quantization process with state-of-the-art "
"strategies and NNI's auto tuning power."
"If you want to view the old compression documents, please refer `nni 2.10"
" compression doc "
"<https://nni.readthedocs.io/en/v2.10/compression/overview.html>`__."
msgstr ""
#: ../../source/compression/overview.rst:28
msgid ""
"Speedup a compressed model to make it have lower inference latency and "
"also make it smaller."
#: ../../source/compression/overview.rst:20
msgid "See :doc:`the major enhancement of compression in NNI 3.0 <./changes>`."
msgstr ""
#: ../../source/compression/overview.rst:29
msgid ""
"Provide friendly and easy-to-use compression utilities for users to dive "
"into the compression process and results."
msgstr ""
#~ msgid ""
#~ "Deep neural networks (DNNs) have "
#~ "achieved great success in many tasks "
#~ "like computer vision, nature launguage "
#~ "processing, speech processing. However, "
#~ "typical neural networks are both "
#~ "computationally expensive and energy-"
#~ "intensive, which can be difficult to "
#~ "be deployed on devices with low "
#~ "computation resources or with strict "
#~ "latency requirements. Therefore, a natural "
#~ "thought is to perform model compression"
#~ " to reduce model size and accelerate"
#~ " model training/inference without losing "
#~ "performance significantly. Model compression "
#~ "techniques can be divided into two "
#~ "categories: pruning and quantization. The "
#~ "pruning methods explore the redundancy "
#~ "in the model weights and try to"
#~ " remove/prune the redundant and uncritical"
#~ " weights. Quantization refers to compress"
#~ " models by reducing the number of "
#~ "bits required to represent weights or"
#~ " activations. We further elaborate on "
#~ "the two methods, pruning and "
#~ "quantization, in the following chapters. "
#~ "Besides, the figure below visualizes the"
#~ " difference between these two methods."
#~ msgstr ""
#: ../../source/compression/overview.rst:30
msgid "Concise interface for users to customize their own compression algorithms."
msgstr ""
#~ msgid ""
#~ "NNI provides an easy-to-use "
#~ "toolkit to help users design and "
#~ "use model pruning and quantization "
#~ "algorithms. For users to compress their"
#~ " models, they only need to add "
#~ "several lines in their code. There "
#~ "are some popular model compression "
#~ "algorithms built-in in NNI. On the"
#~ " other hand, users could easily "
#~ "customize their new compression algorithms "
#~ "using NNIs interface."
#~ msgstr ""
#: ../../source/compression/overview.rst:34
msgid "Compression Pipeline"
msgstr ""
#~ msgid "There are several core features supported by NNI model compression:"
#~ msgstr ""
#: ../../source/compression/overview.rst:42
msgid ""
"The overall compression pipeline in NNI is shown above. For compressing a"
" pretrained model, pruning and quantization can be used alone or in "
"combination. If users want to apply both, a sequential mode is "
"recommended as common practise."
msgstr ""
#~ msgid "Support many popular pruning and quantization algorithms."
#~ msgstr ""
#: ../../source/compression/overview.rst:46
msgid ""
"Note that NNI pruners or quantizers are not meant to physically compact "
"the model but for simulating the compression effect. Whereas NNI speedup "
"tool can truly compress model by changing the network architecture and "
"therefore reduce latency. To obtain a truly compact model, users should "
"conduct :doc:`pruning speedup <../tutorials/pruning_speedup>` or "
":doc:`quantizaiton speedup <../tutorials/quantization_speedup>`. The "
"interface and APIs are unified for both PyTorch and TensorFlow. Currently"
" only PyTorch version has been supported, and TensorFlow version will be "
"supported in future."
msgstr ""
#~ msgid ""
#~ "Automate model pruning and quantization "
#~ "process with state-of-the-art "
#~ "strategies and NNI's auto tuning power."
#~ msgstr ""
#: ../../source/compression/overview.rst:52
msgid "Model Speedup"
msgstr ""
#~ msgid ""
#~ "Speedup a compressed model to make "
#~ "it have lower inference latency and "
#~ "also make it smaller."
#~ msgstr ""
#: ../../source/compression/overview.rst:54
msgid ""
"The final goal of model compression is to reduce inference latency and "
"model size. However, existing model compression algorithms mainly use "
"simulation to check the performance (e.g., accuracy) of compressed model."
" For example, using masks for pruning algorithms, and storing quantized "
"values still in float32 for quantization algorithms. Given the output "
"masks and quantization bits produced by those algorithms, NNI can really "
"speedup the model."
msgstr ""
#~ msgid ""
#~ "Provide friendly and easy-to-use "
#~ "compression utilities for users to dive"
#~ " into the compression process and "
#~ "results."
#~ msgstr ""
#: ../../source/compression/overview.rst:59
msgid "The following figure shows how NNI prunes and speeds up your models."
msgstr ""
#~ msgid ""
#~ "Concise interface for users to customize"
#~ " their own compression algorithms."
#~ msgstr ""
#: ../../source/compression/overview.rst:67
msgid ""
"The detailed tutorial of Speedup Model with Mask can be found :doc:`here "
"<../tutorials/pruning_speedup>`. The detailed tutorial of Speedup Model "
"with Calibration Config can be found :doc:`here "
"<../tutorials/quantization_speedup>`."
msgstr ""
#~ msgid "Compression Pipeline"
#~ msgstr ""
#: ../../source/compression/overview.rst:72
msgid ""
"NNI's model pruning framework has been upgraded to a more powerful "
"version (named pruning v2 before nni v2.6). The old version (`named "
"pruning before nni v2.6 "
"<https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) will be "
"out of maintenance. If for some reason you have to use the old pruning, "
"v2.6 is the last nni version to support old pruning version."
msgstr ""
#~ msgid ""
#~ "The overall compression pipeline in NNI"
#~ " is shown above. For compressing a"
#~ " pretrained model, pruning and quantization"
#~ " can be used alone or in "
#~ "combination. If users want to apply "
#~ "both, a sequential mode is recommended"
#~ " as common practise."
#~ msgstr ""
#~ msgid ""
#~ "Note that NNI pruners or quantizers "
#~ "are not meant to physically compact "
#~ "the model but for simulating the "
#~ "compression effect. Whereas NNI speedup "
#~ "tool can truly compress model by "
#~ "changing the network architecture and "
#~ "therefore reduce latency. To obtain a"
#~ " truly compact model, users should "
#~ "conduct :doc:`pruning speedup "
#~ "<../tutorials/pruning_speedup>` or :doc:`quantizaiton "
#~ "speedup <../tutorials/quantization_speedup>`. The "
#~ "interface and APIs are unified for "
#~ "both PyTorch and TensorFlow. Currently "
#~ "only PyTorch version has been supported,"
#~ " and TensorFlow version will be "
#~ "supported in future."
#~ msgstr ""
#~ msgid "Model Speedup"
#~ msgstr ""
#~ msgid ""
#~ "The final goal of model compression "
#~ "is to reduce inference latency and "
#~ "model size. However, existing model "
#~ "compression algorithms mainly use simulation"
#~ " to check the performance (e.g., "
#~ "accuracy) of compressed model. For "
#~ "example, using masks for pruning "
#~ "algorithms, and storing quantized values "
#~ "still in float32 for quantization "
#~ "algorithms. Given the output masks and"
#~ " quantization bits produced by those "
#~ "algorithms, NNI can really speedup the"
#~ " model."
#~ msgstr ""
#~ msgid "The following figure shows how NNI prunes and speeds up your models."
#~ msgstr ""
#~ msgid ""
#~ "The detailed tutorial of Speedup Model"
#~ " with Mask can be found :doc:`here"
#~ " <../tutorials/pruning_speedup>`. The detailed "
#~ "tutorial of Speedup Model with "
#~ "Calibration Config can be found "
#~ ":doc:`here <../tutorials/quantization_speedup>`."
#~ msgstr ""
#~ msgid ""
#~ "NNI's model pruning framework has been"
#~ " upgraded to a more powerful version"
#~ " (named pruning v2 before nni v2.6)."
#~ " The old version (`named pruning "
#~ "before nni v2.6 "
#~ "<https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) "
#~ "will be out of maintenance. If for"
#~ " some reason you have to use "
#~ "the old pruning, v2.6 is the last"
#~ " nni version to support old pruning"
#~ " version."
#~ msgstr ""

Просмотреть файл

@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: NNI \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-04-12 10:42+0800\n"
"POT-Creation-Date: 2023-05-08 16:52+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
@ -308,128 +308,9 @@ msgid ":download:`Download Jupyter notebook: main.ipynb <main.ipynb>`"
msgstr ""
#: ../../source/tutorials/hpo_quickstart_pytorch/main.rst:335
#: ../../source/tutorials/pruning_quick_start_mnist.rst:343
msgid "`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:13
msgid ""
"Click :ref:`here "
"<sphx_glr_download_tutorials_pruning_quick_start_mnist.py>` to download "
"the full example code"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:22
msgid "Pruning Quickstart"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:24
msgid "Here is a three-minute video to get you started with model pruning."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:29
msgid ""
"Model pruning is a technique to reduce the model size and computation by "
"reducing model weight size or intermediate state size. There are three "
"common practices for pruning a DNN model:"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:32
msgid "Pre-training a model -> Pruning the model -> Fine-tuning the pruned model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:33
msgid ""
"Pruning a model during training (i.e., pruning aware training) -> Fine-"
"tuning the pruned model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:34
msgid "Pruning a model -> Training the pruned model from scratch"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:36
msgid ""
"NNI supports all of the above pruning practices by working on the key "
"pruning stage. Following this tutorial for a quick look at how to use NNI"
" to prune a model in a common practice."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:42
msgid "Preparation"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:44
msgid ""
"In this tutorial, we use a simple model and pre-trained on MNIST dataset."
" If you are familiar with defining a model and training in pytorch, you "
"can skip directly to `Pruning Model`_."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:122
msgid "Pruning Model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:124
msgid ""
"Using L1NormPruner to prune the model and generate the masks. Usually, a "
"pruner requires original model and ``config_list`` as its inputs. "
"Detailed about how to write ``config_list`` please refer "
":doc:`compression config specification "
"<../compression/compression_config_list>`."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:128
msgid ""
"The following `config_list` means all layers whose type is `Linear` or "
"`Conv2d` will be pruned, except the layer named `fc3`, because `fc3` is "
"`exclude`. The final sparsity ratio for each layer is 50%. The layer "
"named `fc3` will not be pruned."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:154
msgid "Pruners usually require `model` and `config_list` as input arguments."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:229
msgid ""
"Speedup the original model with masks, note that `ModelSpeedup` requires "
"an unwrapped model. The model becomes smaller after speedup, and reaches "
"a higher sparsity ratio because `ModelSpeedup` will propagate the masks "
"across layers."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:262
msgid "the model will become real smaller after speedup"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:298
msgid "Fine-tuning Compacted Model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:299
msgid ""
"Note that if the model has been sped up, you need to re-initialize a new "
"optimizer for fine-tuning. Because speedup will replace the masked big "
"layers with dense small ones."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:320
msgid "**Total running time of the script:** ( 1 minutes 0.810 seconds)"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:332
msgid ""
":download:`Download Python source code: pruning_quick_start_mnist.py "
"<pruning_quick_start_mnist.py>`"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:336
msgid ""
":download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb "
"<pruning_quick_start_mnist.ipynb>`"
msgstr ""
#~ msgid "**Total running time of the script:** ( 2 minutes 15.810 seconds)"
#~ msgstr ""
@ -899,3 +780,125 @@ msgstr ""
#~ "hello_nas.ipynb <hello_nas.ipynb>`"
#~ msgstr ""
#~ msgid ""
#~ "Click :ref:`here "
#~ "<sphx_glr_download_tutorials_pruning_quick_start_mnist.py>` to"
#~ " download the full example code"
#~ msgstr ""
#~ msgid "Pruning Quickstart"
#~ msgstr ""
#~ msgid "Here is a three-minute video to get you started with model pruning."
#~ msgstr ""
#~ msgid ""
#~ "Model pruning is a technique to "
#~ "reduce the model size and computation"
#~ " by reducing model weight size or "
#~ "intermediate state size. There are three"
#~ " common practices for pruning a DNN"
#~ " model:"
#~ msgstr ""
#~ msgid ""
#~ "Pre-training a model -> Pruning "
#~ "the model -> Fine-tuning the "
#~ "pruned model"
#~ msgstr ""
#~ msgid ""
#~ "Pruning a model during training (i.e.,"
#~ " pruning aware training) -> Fine-"
#~ "tuning the pruned model"
#~ msgstr ""
#~ msgid "Pruning a model -> Training the pruned model from scratch"
#~ msgstr ""
#~ msgid ""
#~ "NNI supports all of the above "
#~ "pruning practices by working on the "
#~ "key pruning stage. Following this "
#~ "tutorial for a quick look at how"
#~ " to use NNI to prune a model"
#~ " in a common practice."
#~ msgstr ""
#~ msgid "Preparation"
#~ msgstr ""
#~ msgid ""
#~ "In this tutorial, we use a simple"
#~ " model and pre-trained on MNIST "
#~ "dataset. If you are familiar with "
#~ "defining a model and training in "
#~ "pytorch, you can skip directly to "
#~ "`Pruning Model`_."
#~ msgstr ""
#~ msgid "Pruning Model"
#~ msgstr ""
#~ msgid ""
#~ "Using L1NormPruner to prune the model"
#~ " and generate the masks. Usually, a"
#~ " pruner requires original model and "
#~ "``config_list`` as its inputs. Detailed "
#~ "about how to write ``config_list`` "
#~ "please refer :doc:`compression config "
#~ "specification <../compression/compression_config_list>`."
#~ msgstr ""
#~ msgid ""
#~ "The following `config_list` means all "
#~ "layers whose type is `Linear` or "
#~ "`Conv2d` will be pruned, except the "
#~ "layer named `fc3`, because `fc3` is "
#~ "`exclude`. The final sparsity ratio for"
#~ " each layer is 50%. The layer "
#~ "named `fc3` will not be pruned."
#~ msgstr ""
#~ msgid "Pruners usually require `model` and `config_list` as input arguments."
#~ msgstr ""
#~ msgid ""
#~ "Speedup the original model with masks,"
#~ " note that `ModelSpeedup` requires an "
#~ "unwrapped model. The model becomes "
#~ "smaller after speedup, and reaches a "
#~ "higher sparsity ratio because `ModelSpeedup`"
#~ " will propagate the masks across "
#~ "layers."
#~ msgstr ""
#~ msgid "the model will become real smaller after speedup"
#~ msgstr ""
#~ msgid "Fine-tuning Compacted Model"
#~ msgstr ""
#~ msgid ""
#~ "Note that if the model has been"
#~ " sped up, you need to re-"
#~ "initialize a new optimizer for fine-"
#~ "tuning. Because speedup will replace the"
#~ " masked big layers with dense small"
#~ " ones."
#~ msgstr ""
#~ msgid "**Total running time of the script:** ( 1 minutes 0.810 seconds)"
#~ msgstr ""
#~ msgid ""
#~ ":download:`Download Python source code: "
#~ "pruning_quick_start_mnist.py <pruning_quick_start_mnist.py>`"
#~ msgstr ""
#~ msgid ""
#~ ":download:`Download Jupyter notebook: "
#~ "pruning_quick_start_mnist.ipynb "
#~ "<pruning_quick_start_mnist.ipynb>`"
#~ msgstr ""

Просмотреть файл

@ -18,6 +18,6 @@ Quickstart
.. cardlinkitem::
:header: Model Compression Quickstart
:description: Familiarize yourself with pruning to compress your model.
:link: tutorials/pruning_quick_start_mnist
:link: tutorials/pruning_quick_start
:image: ../img/thumbnails/pruning-tutorial.svg
:background: blue

Просмотреть файл

@ -20,6 +20,6 @@
.. cardlinkitem::
:header: 模型压缩快速入门
:description: 学习剪枝以压缩您的模型。
:link: tutorials/pruning_quick_start_mnist
:link: tutorials/pruning_quick_start
:image: ../img/thumbnails/pruning-tutorial.svg
:background: blue

Просмотреть файл

@ -1,5 +1,5 @@
Distiller (Preview)
===================
Distiller
=========
DynamicLayerwiseDistiller
-------------------------

Просмотреть файл

@ -1,17 +1,23 @@
Evaluator
=========
.. _new-torch-evaluator:
TorchEvaluator
--------------
.. autoclass:: nni.compression.pytorch.TorchEvaluator
.. autoclass:: nni.contrib.compression.TorchEvaluator
.. _new-lightning-evaluator:
LightningEvaluator
------------------
.. autoclass:: nni.compression.pytorch.LightningEvaluator
.. autoclass:: nni.contrib.compression.LightningEvaluator
.. _new-transformers-evaluator:
TransformersEvaluator
---------------------
.. autoclass:: nni.compression.pytorch.TransformersEvaluator
.. autoclass:: nni.contrib.compression.TransformersEvaluator

Просмотреть файл

@ -1,67 +0,0 @@
Framework Related
=================
Pruner
------
.. autoclass:: nni.compression.pytorch.base.Pruner
:members:
PrunerModuleWrapper
-------------------
.. autoclass:: nni.compression.pytorch.base.PrunerModuleWrapper
BasicPruner
-----------
.. autoclass:: nni.compression.pytorch.pruning.basic_pruner.BasicPruner
:members:
DataCollector
-------------
.. autoclass:: nni.compression.pytorch.pruning.tools.DataCollector
:members:
MetricsCalculator
-----------------
.. autoclass:: nni.compression.pytorch.pruning.tools.MetricsCalculator
:members:
SparsityAllocator
-----------------
.. autoclass:: nni.compression.pytorch.pruning.tools.SparsityAllocator
:members:
BasePruningScheduler
--------------------
.. autoclass:: nni.compression.pytorch.base.BasePruningScheduler
:members:
TaskGenerator
-------------
.. autoclass:: nni.compression.pytorch.pruning.tools.TaskGenerator
:members:
Quantizer
---------
.. autoclass:: nni.compression.pytorch.compressor.Quantizer
:members:
QuantizerModuleWrapper
----------------------
.. autoclass:: nni.compression.pytorch.compressor.QuantizerModuleWrapper
:members:
QuantGrad
---------
.. autoclass:: nni.compression.pytorch.compressor.QuantGrad
:members:

Просмотреть файл

@ -4,120 +4,71 @@ Pruner
Basic Pruner
------------
.. _level-pruner:
.. _new-level-pruner:
Level Pruner
^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.LevelPruner
.. autoclass:: nni.contrib.compression.pruning.LevelPruner
.. _l1-norm-pruner:
.. _new-l1-norm-pruner:
L1 Norm Pruner
^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.L1NormPruner
.. autoclass:: nni.contrib.compression.pruning.L1NormPruner
.. _l2-norm-pruner:
.. _new-l2-norm-pruner:
L2 Norm Pruner
^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.L2NormPruner
.. autoclass:: nni.contrib.compression.pruning.L2NormPruner
.. _fpgm-pruner:
.. _new-fpgm-pruner:
FPGM Pruner
^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.FPGMPruner
.. autoclass:: nni.contrib.compression.pruning.FPGMPruner
.. _slim-pruner:
.. _new-slim-pruner:
Slim Pruner
^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.SlimPruner
.. autoclass:: nni.contrib.compression.pruning.SlimPruner
.. _activation-apoz-rank-pruner:
Activation APoZ Rank Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.ActivationAPoZRankPruner
.. _activation-mean-rank-pruner:
Activation Mean Rank Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.ActivationMeanRankPruner
.. _taylor-fo-weight-pruner:
.. _new-taylor-pruner:
Taylor FO Weight Pruner
^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.TaylorFOWeightPruner
.. _admm-pruner:
ADMM Pruner
^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.ADMMPruner
.. autoclass:: nni.contrib.compression.pruning.TaylorPruner
Scheduled Pruners
-----------------
.. _linear-pruner:
.. _new-linear-pruner:
Linear Pruner
^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.LinearPruner
.. autoclass:: nni.contrib.compression.pruning.LinearPruner
.. _agp-pruner:
.. _new-agp-pruner:
AGP Pruner
^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.AGPPruner
.. _lottery-ticket-pruner:
Lottery Ticket Pruner
^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.LotteryTicketPruner
.. _simulated-annealing-pruner:
Simulated Annealing Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.SimulatedAnnealingPruner
.. _auto-compress-pruner:
Auto Compress Pruner
^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.AutoCompressPruner
.. _amc-pruner:
AMC Pruner
^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.AMCPruner
.. autoclass:: nni.contrib.compression.pruning.AGPPruner
Other Pruner
------------
.. _movement-pruner:
.. _new-movement-pruner:
Movement Pruner
^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.pruning.MovementPruner
.. autoclass:: nni.contrib.compression.pruning.MovementPruner

Просмотреть файл

@ -1,5 +1,5 @@
Pruning Speedup
===============
.. autoclass:: nni.compression.pytorch.speedup.ModelSpeedup
.. autoclass:: nni.compression.pytorch.speedup.v2.ModelSpeedup
:members:

Просмотреть файл

@ -1,5 +0,0 @@
Quantization Speedup
====================
.. autoclass:: nni.compression.pytorch.quantization_speedup.ModelSpeedupTensorRT
:members:

Просмотреть файл

@ -1,44 +1,37 @@
Quantizer
=========
.. _naive-quantizer:
Naive Quantizer
^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.quantization.NaiveQuantizer
.. _qat-quantizer:
.. _NewQATQuantizer:
QAT Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.quantization.QAT_Quantizer
.. autoclass:: nni.contrib.compression.quantization.QATQuantizer
.. _dorefa-quantizer:
.. _NewDorefaQuantizer:
DoReFa Quantizer
^^^^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.quantization.DoReFaQuantizer
.. autoclass:: nni.contrib.compression.quantization.DoReFaQuantizer
.. _bnn-quantizer:
.. _NewBNNQuantizer:
BNN Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.quantization.BNNQuantizer
.. autoclass:: nni.contrib.compression.quantization.BNNQuantizer
.. _lsq-quantizer:
.. _NewLsqQuantizer:
LSQ Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.quantization.LsqQuantizer
.. autoclass:: nni.contrib.compression.quantization.LsqQuantizer
.. _observer-quantizer:
.. _NewPtqQuantizer:
Observer Quantizer
^^^^^^^^^^^^^^^^^^
PTQ Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.compression.pytorch.quantization.ObserverQuantizer
.. autoclass:: nni.contrib.compression.quantization.PtqQuantizer

Просмотреть файл

@ -5,9 +5,8 @@ Compression API Reference
:maxdepth: 1
Pruner <pruner>
Quantizer <quantizer>
Pruning Speedup <pruning_speedup>
Quantization Speedup <quantization_speedup>
Distiller <distiller>
Evaluator <evaluator>
Compression Utilities <utils>
Framework Related <framework>
Quantizer <quantizer>

Просмотреть файл

@ -1,36 +1,10 @@
Compression Utilities
=====================
ChannelDependency
-----------------
.. _auto_set_denpendency_group_ids:
.. autoclass:: nni.compression.pytorch.utils.ChannelDependency
auto_set_denpendency_group_ids
------------------------------
.. autoclass:: nni.contrib.compression.utils.auto_set_denpendency_group_ids
:members:
GroupDependency
---------------
.. autoclass:: nni.compression.pytorch.utils.GroupDependency
:members:
ChannelMaskConflict
-------------------
.. autoclass:: nni.compression.pytorch.utils.ChannelMaskConflict
:members:
GroupMaskConflict
-----------------
.. autoclass:: nni.compression.pytorch.utils.GroupMaskConflict
:members:
count_flops_params
------------------
.. autofunction:: nni.compression.pytorch.utils.count_flops_params
compute_sparsity
----------------
.. autofunction:: nni.compression.pytorch.utils.pruning.compute_sparsity

Просмотреть файл

@ -1,23 +0,0 @@
Evaluator
=========
.. _new-torch-evaluator:
TorchEvaluator
--------------
.. autoclass:: nni.contrib.compression.TorchEvaluator
.. _new-lightning-evaluator:
LightningEvaluator
------------------
.. autoclass:: nni.contrib.compression.LightningEvaluator
.. _new-transformers-evaluator:
TransformersEvaluator
---------------------
.. autoclass:: nni.contrib.compression.TransformersEvaluator

Просмотреть файл

@ -1,74 +0,0 @@
Pruner (Preview)
================
Basic Pruner
------------
.. _new-level-pruner:
Level Pruner
^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.LevelPruner
.. _new-l1-norm-pruner:
L1 Norm Pruner
^^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.L1NormPruner
.. _new-l2-norm-pruner:
L2 Norm Pruner
^^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.L2NormPruner
.. _new-fpgm-pruner:
FPGM Pruner
^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.FPGMPruner
.. _new-slim-pruner:
Slim Pruner
^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.SlimPruner
.. _new-taylor-pruner:
Taylor FO Weight Pruner
^^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.TaylorPruner
Scheduled Pruners
-----------------
.. _new-linear-pruner:
Linear Pruner
^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.LinearPruner
.. _new-agp-pruner:
AGP Pruner
^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.AGPPruner
Other Pruner
------------
.. _new-movement-pruner:
Movement Pruner
^^^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.pruning.MovementPruner

Просмотреть файл

@ -1,5 +0,0 @@
Pruning Speedup
===============
.. autoclass:: nni.compression.pytorch.speedup.v2.ModelSpeedup
:members:

Просмотреть файл

@ -1,37 +0,0 @@
Quantizer
=========
.. _NewQATQuantizer:
QAT Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.quantization.QATQuantizer
.. _NewDorefaQuantizer:
DoReFa Quantizer
^^^^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.quantization.DoReFaQuantizer
.. _NewBNNQuantizer:
BNN Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.quantization.BNNQuantizer
.. _NewLsqQuantizer:
LSQ Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.quantization.LsqQuantizer
.. _NewPtqQuantizer:
PTQ Quantizer
^^^^^^^^^^^^^
.. autoclass:: nni.contrib.compression.quantization.PtqQuantizer

Просмотреть файл

@ -1,12 +0,0 @@
Compression API Reference (Preview)
===================================
.. toctree::
:maxdepth: 1
Pruner <pruner>
Pruning Speedup <pruning_speedup>
Distiller <distiller>
Evaluator <evaluator>
Compression Utilities <utils>
Quantizer <quantizer>

Просмотреть файл

@ -1,10 +0,0 @@
Compression Utilities
=====================
.. _auto_set_denpendency_group_ids:
auto_set_denpendency_group_ids
------------------------------
.. autoclass:: nni.contrib.compression.utils.auto_set_denpendency_group_ids
:members:

Просмотреть файл

@ -7,7 +7,6 @@ Python API Reference
Hyperparameter Optimization <hpo>
Neural Architecture Search <nas>
Model Compression <compression/toctree>
Model Compression (Preview) <compression_preview/toctree>
Experiment <experiment>
Mutable <mutable>
Others <others>

Просмотреть файл

@ -17,7 +17,7 @@
.. only:: html
.. image:: /tutorials/hpo_quickstart_pytorch/images/thumb/sphx_glr_main_thumb.png
:alt:
:alt: HPO Quickstart with PyTorch
:ref:`sphx_glr_tutorials_hpo_quickstart_pytorch_main.py`
@ -34,7 +34,7 @@
.. only:: html
.. image:: /tutorials/hpo_quickstart_pytorch/images/thumb/sphx_glr_model_thumb.png
:alt:
:alt: Port PyTorch Quickstart to NNI
:ref:`sphx_glr_tutorials_hpo_quickstart_pytorch_model.py`

Просмотреть файл

@ -17,7 +17,7 @@
.. only:: html
.. image:: /tutorials/hpo_quickstart_tensorflow/images/thumb/sphx_glr_main_thumb.png
:alt:
:alt: HPO Quickstart with TensorFlow
:ref:`sphx_glr_tutorials_hpo_quickstart_tensorflow_main.py`
@ -34,7 +34,7 @@
.. only:: html
.. image:: /tutorials/hpo_quickstart_tensorflow/images/thumb/sphx_glr_model_thumb.png
:alt:
:alt: Port TensorFlow Quickstart to NNI
:ref:`sphx_glr_tutorials_hpo_quickstart_tensorflow_model.py`

Двоичный файл не отображается.

После

Ширина:  |  Высота:  |  Размер: 35 KiB

93
docs/source/tutorials/index.rst сгенерированный
Просмотреть файл

@ -17,7 +17,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_pruning_speedup_thumb.png
:alt:
:alt: Speedup Model with Mask
:ref:`sphx_glr_tutorials_pruning_speedup.py`
@ -29,31 +29,14 @@ Tutorials
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Here is a four-minute video to get you started with model quantization.">
<div class="sphx-glr-thumbcontainer" tooltip="Model pruning is a technique to reduce the model size and computation by reducing model weight ...">
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_quantization_quick_start_mnist_thumb.png
:alt:
.. image:: /tutorials/images/thumb/sphx_glr_pruning_quick_start_thumb.png
:alt: Pruning Quickstart
:ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py`
.. raw:: html
<div class="sphx-glr-thumbnail-title">Quantization Quickstart</div>
</div>
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Here is a three-minute video to get you started with model pruning.">
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_pruning_quick_start_mnist_thumb.png
:alt:
:ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py`
:ref:`sphx_glr_tutorials_pruning_quick_start.py`
.. raw:: html
@ -61,23 +44,6 @@ Tutorials
</div>
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="To write a new quantization algorithm, you can write a class that inherits nni.compression.pyto...">
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_quantization_customize_thumb.png
:alt:
:ref:`sphx_glr_tutorials_quantization_customize.py`
.. raw:: html
<div class="sphx-glr-thumbnail-title">Customize a new quantization algorithm</div>
</div>
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="In this tutorial, we show how to use NAS Benchmarks as datasets. For research purposes we somet...">
@ -85,7 +51,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_nasbench_as_dataset_thumb.png
:alt:
:alt: Use NAS Benchmarks as Datasets
:ref:`sphx_glr_tutorials_nasbench_as_dataset.py`
@ -97,12 +63,12 @@ Tutorials
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Here is a four-minute video to get you started with model quantization.">
<div class="sphx-glr-thumbcontainer" tooltip="Quantization reduces model size and speeds up inference time by reducing the number of bits req...">
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_quantization_quick_start_thumb.png
:alt:
:alt: Quantization Quickstart
:ref:`sphx_glr_tutorials_quantization_quick_start.py`
@ -119,7 +85,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_quantization_speedup_thumb.png
:alt:
:alt: Speed Up Quantized Model with TensorRT
:ref:`sphx_glr_tutorials_quantization_speedup.py`
@ -136,7 +102,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_hello_nas_thumb.png
:alt:
:alt: Hello, NAS!
:ref:`sphx_glr_tutorials_hello_nas.py`
@ -153,7 +119,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_quantization_bert_glue_thumb.png
:alt:
:alt: Quantize BERT on Task GLUE
:ref:`sphx_glr_tutorials_quantization_bert_glue.py`
@ -170,7 +136,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_darts_thumb.png
:alt:
:alt: Searching in DARTS search space
:ref:`sphx_glr_tutorials_darts.py`
@ -182,12 +148,12 @@ Tutorials
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="This is a new tutorial on pruning transformer in nni v3.0 (old tutorial). The main difference b...">
<div class="sphx-glr-thumbcontainer" tooltip="This is a new tutorial on pruning transformer in nni v3.0 (`old tutorial &lt;https://nni.readthedo...">
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_new_pruning_bert_glue_thumb.png
:alt:
:alt: Pruning Bert on Task MNLI
:ref:`sphx_glr_tutorials_new_pruning_bert_glue.py`
@ -197,23 +163,6 @@ Tutorials
</div>
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Workable Pruning Process ------------------------">
.. only:: html
.. image:: /tutorials/images/thumb/sphx_glr_pruning_bert_glue_thumb.png
:alt:
:ref:`sphx_glr_tutorials_pruning_bert_glue.py`
.. raw:: html
<div class="sphx-glr-thumbnail-title">Pruning Bert on Task MNLI</div>
</div>
.. raw:: html
</div>
@ -223,9 +172,7 @@ Tutorials
:hidden:
/tutorials/pruning_speedup
/tutorials/quantization_quick_start_mnist
/tutorials/pruning_quick_start_mnist
/tutorials/quantization_customize
/tutorials/pruning_quick_start
/tutorials/nasbench_as_dataset
/tutorials/quantization_quick_start
/tutorials/quantization_speedup
@ -233,7 +180,6 @@ Tutorials
/tutorials/quantization_bert_glue
/tutorials/darts
/tutorials/new_pruning_bert_glue
/tutorials/pruning_bert_glue
@ -250,7 +196,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/hpo_quickstart_pytorch/images/thumb/sphx_glr_main_thumb.png
:alt:
:alt: HPO Quickstart with PyTorch
:ref:`sphx_glr_tutorials_hpo_quickstart_pytorch_main.py`
@ -267,7 +213,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/hpo_quickstart_pytorch/images/thumb/sphx_glr_model_thumb.png
:alt:
:alt: Port PyTorch Quickstart to NNI
:ref:`sphx_glr_tutorials_hpo_quickstart_pytorch_model.py`
@ -296,7 +242,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/hpo_quickstart_tensorflow/images/thumb/sphx_glr_main_thumb.png
:alt:
:alt: HPO Quickstart with TensorFlow
:ref:`sphx_glr_tutorials_hpo_quickstart_tensorflow_main.py`
@ -313,7 +259,7 @@ Tutorials
.. only:: html
.. image:: /tutorials/hpo_quickstart_tensorflow/images/thumb/sphx_glr_model_thumb.png
:alt:
:alt: Port TensorFlow Quickstart to NNI
:ref:`sphx_glr_tutorials_hpo_quickstart_tensorflow_model.py`
@ -332,7 +278,6 @@ Tutorials
:hidden:
:includehidden:
/tutorials/hpo_quickstart_pytorch/index.rst
/tutorials/hpo_quickstart_tensorflow/index.rst

205
docs/source/tutorials/pruning_bert_glue.ipynb сгенерированный
Просмотреть файл

@ -1,205 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Pruning Bert on Task MNLI\n\n## Workable Pruning Process\n\nHere we show an effective transformer pruning process that NNI team has tried, and users can use NNI to discover better processes.\n\nThe entire pruning process can be divided into the following steps:\n\n1. Finetune the pre-trained model on the downstream task. From our experience,\n the final performance of pruning on the finetuned model is better than pruning directly on the pre-trained model.\n At the same time, the finetuned model obtained in this step will also be used as the teacher model for the following\n distillation training.\n2. Pruning the attention layer at first. Here we apply block-sparse on attention layer weight,\n and directly prune the head (condense the weight) if the head was fully masked.\n If the head was partially masked, we will not prune it and recover its weight.\n3. Retrain the head-pruned model with distillation. Recover the model precision before pruning FFN layer.\n4. Pruning the FFN layer. Here we apply the output channels pruning on the 1st FFN layer,\n and the 2nd FFN layer input channels will be pruned due to the pruning of 1st layer output channels.\n5. Retrain the final pruned model with distillation.\n\nDuring the process of pruning transformer, we gained some of the following experiences:\n\n* We using `movement-pruner` in step 2 and `taylor-fo-weight-pruner` in step 4. `movement-pruner` has good performance on attention layers,\n and `taylor-fo-weight-pruner` method has good performance on FFN layers. These two pruners are all some kinds of gradient-based pruning algorithms,\n we also try weight-based pruning algorithms like `l1-norm-pruner`, but it doesn't seem to work well in this scenario.\n* Distillation is a good way to recover model precision. In terms of results, usually 1~2% improvement in accuracy can be achieved when we prune bert on mnli task.\n* It is necessary to gradually increase the sparsity rather than reaching a very high sparsity all at once.\n\n## Experiment\n\nThe complete pruning process will take about 8 hours on one A100.\n\n### Preparation\n\nThis section is mainly to get a finetuned model on the downstream task.\nIf you are familiar with how to finetune Bert on GLUE dataset, you can skip this section.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>Please set ``dev_mode`` to ``False`` to run this tutorial. Here ``dev_mode`` is ``True`` by default is for generating documents.</p></div>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dev_mode = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some basic setting.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pathlib import Path\nfrom typing import Callable, Dict\n\npretrained_model_name_or_path = 'bert-base-uncased'\ntask_name = 'mnli'\nexperiment_id = 'pruning_bert_mnli'\n\n# heads_num and layers_num should align with pretrained_model_name_or_path\nheads_num = 12\nlayers_num = 12\n\n# used to save the experiment log\nlog_dir = Path(f'./pruning_log/{pretrained_model_name_or_path}/{task_name}/{experiment_id}')\nlog_dir.mkdir(parents=True, exist_ok=True)\n\n# used to save the finetuned model and share between different experiemnts with same pretrained_model_name_or_path and task_name\nmodel_dir = Path(f'./models/{pretrained_model_name_or_path}/{task_name}')\nmodel_dir.mkdir(parents=True, exist_ok=True)\n\n# used to save GLUE data\ndata_dir = Path(f'./data')\ndata_dir.mkdir(parents=True, exist_ok=True)\n\n# set seed\nfrom transformers import set_seed\nset_seed(1024)\n\nimport torch\ndevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create dataloaders.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from torch.utils.data import DataLoader\n\nfrom datasets import load_dataset\nfrom transformers import BertTokenizerFast, DataCollatorWithPadding\n\ntask_to_keys = {\n 'cola': ('sentence', None),\n 'mnli': ('premise', 'hypothesis'),\n 'mrpc': ('sentence1', 'sentence2'),\n 'qnli': ('question', 'sentence'),\n 'qqp': ('question1', 'question2'),\n 'rte': ('sentence1', 'sentence2'),\n 'sst2': ('sentence', None),\n 'stsb': ('sentence1', 'sentence2'),\n 'wnli': ('sentence1', 'sentence2'),\n}\n\ndef prepare_dataloaders(cache_dir=data_dir, train_batch_size=32, eval_batch_size=32):\n tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path)\n sentence1_key, sentence2_key = task_to_keys[task_name]\n data_collator = DataCollatorWithPadding(tokenizer)\n\n # used to preprocess the raw data\n def preprocess_function(examples):\n # Tokenize the texts\n args = (\n (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])\n )\n result = tokenizer(*args, padding=False, max_length=128, truncation=True)\n\n if 'label' in examples:\n # In all cases, rename the column to labels because the model will expect that.\n result['labels'] = examples['label']\n return result\n\n raw_datasets = load_dataset('glue', task_name, cache_dir=cache_dir)\n for key in list(raw_datasets.keys()):\n if 'test' in key:\n raw_datasets.pop(key)\n\n processed_datasets = raw_datasets.map(preprocess_function, batched=True,\n remove_columns=raw_datasets['train'].column_names)\n\n train_dataset = processed_datasets['train']\n if task_name == 'mnli':\n validation_datasets = {\n 'validation_matched': processed_datasets['validation_matched'],\n 'validation_mismatched': processed_datasets['validation_mismatched']\n }\n else:\n validation_datasets = {\n 'validation': processed_datasets['validation']\n }\n\n train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator, batch_size=train_batch_size)\n validation_dataloaders = {\n val_name: DataLoader(val_dataset, collate_fn=data_collator, batch_size=eval_batch_size) \\\n for val_name, val_dataset in validation_datasets.items()\n }\n\n return train_dataloader, validation_dataloaders\n\n\ntrain_dataloader, validation_dataloaders = prepare_dataloaders()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training function & evaluation function.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import functools\nimport time\n\nimport torch.nn.functional as F\nfrom datasets import load_metric\nfrom transformers.modeling_outputs import SequenceClassifierOutput\n\nfrom nni.common.types import SCHEDULER\n\n\ndef training(model: torch.nn.Module,\n optimizer: torch.optim.Optimizer,\n criterion: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],\n lr_scheduler: SCHEDULER = None,\n max_steps: int = None,\n max_epochs: int = None,\n train_dataloader: DataLoader = None,\n distillation: bool = False,\n teacher_model: torch.nn.Module = None,\n distil_func: Callable = None,\n log_path: str = Path(log_dir) / 'training.log',\n save_best_model: bool = False,\n save_path: str = None,\n evaluation_func: Callable = None,\n eval_per_steps: int = 1000,\n device=None):\n\n assert train_dataloader is not None\n\n model.train()\n if teacher_model is not None:\n teacher_model.eval()\n current_step = 0\n best_result = 0\n\n total_epochs = max_steps // len(train_dataloader) + 1 if max_steps else max_epochs if max_epochs else 3\n total_steps = max_steps if max_steps else total_epochs * len(train_dataloader)\n\n print(f'Training {total_epochs} epochs, {total_steps} steps...')\n\n for current_epoch in range(total_epochs):\n for batch in train_dataloader:\n if current_step >= total_steps:\n return\n batch.to(device)\n outputs = model(**batch)\n loss = outputs.loss\n\n if distillation:\n assert teacher_model is not None\n with torch.no_grad():\n teacher_outputs = teacher_model(**batch)\n distil_loss = distil_func(outputs, teacher_outputs)\n loss = 0.1 * loss + 0.9 * distil_loss\n\n loss = criterion(loss, None)\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n # per step schedule\n if lr_scheduler:\n lr_scheduler.step()\n\n current_step += 1\n\n if current_step % eval_per_steps == 0 or current_step % len(train_dataloader) == 0:\n result = evaluation_func(model) if evaluation_func else None\n with (log_path).open('a+') as f:\n msg = '[{}] Epoch {}, Step {}: {}\\n'.format(time.asctime(time.localtime(time.time())), current_epoch, current_step, result)\n f.write(msg)\n # if it's the best model, save it.\n if save_best_model and (result is None or best_result < result['default']):\n assert save_path is not None\n torch.save(model.state_dict(), save_path)\n best_result = None if result is None else result['default']\n\n\ndef distil_loss_func(stu_outputs: SequenceClassifierOutput, tea_outputs: SequenceClassifierOutput, encoder_layer_idxs=[]):\n encoder_hidden_state_loss = []\n for i, idx in enumerate(encoder_layer_idxs[:-1]):\n encoder_hidden_state_loss.append(F.mse_loss(stu_outputs.hidden_states[i], tea_outputs.hidden_states[idx]))\n logits_loss = F.kl_div(F.log_softmax(stu_outputs.logits / 2, dim=-1), F.softmax(tea_outputs.logits / 2, dim=-1), reduction='batchmean') * (2 ** 2)\n\n distil_loss = 0\n for loss in encoder_hidden_state_loss:\n distil_loss += loss\n distil_loss += logits_loss\n return distil_loss\n\n\ndef evaluation(model: torch.nn.Module, validation_dataloaders: Dict[str, DataLoader] = None, device=None):\n assert validation_dataloaders is not None\n training = model.training\n model.eval()\n\n is_regression = task_name == 'stsb'\n metric = load_metric('glue', task_name)\n\n result = {}\n default_result = 0\n for val_name, validation_dataloader in validation_dataloaders.items():\n for batch in validation_dataloader:\n batch.to(device)\n outputs = model(**batch)\n predictions = outputs.logits.argmax(dim=-1) if not is_regression else outputs.logits.squeeze()\n metric.add_batch(\n 
predictions=predictions,\n references=batch['labels'],\n )\n result[val_name] = metric.compute()\n default_result += result[val_name].get('f1', result[val_name].get('accuracy', 0))\n result['default'] = default_result / len(result)\n\n model.train(training)\n return result\n\n\nevaluation_func = functools.partial(evaluation, validation_dataloaders=validation_dataloaders, device=device)\n\n\ndef fake_criterion(loss, _):\n return loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prepare pre-trained model and finetuning on downstream task.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from torch.optim import Adam\nfrom torch.optim.lr_scheduler import LambdaLR\nfrom transformers import BertForSequenceClassification\n\n\ndef create_pretrained_model():\n is_regression = task_name == 'stsb'\n num_labels = 1 if is_regression else (3 if task_name == 'mnli' else 2)\n model = BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, num_labels=num_labels)\n model.bert.config.output_hidden_states = True\n return model\n\n\ndef create_finetuned_model():\n finetuned_model = create_pretrained_model().to(device)\n finetuned_model_state_path = Path(model_dir) / 'finetuned_model_state.pth'\n\n if finetuned_model_state_path.exists():\n finetuned_model.load_state_dict(torch.load(finetuned_model_state_path, map_location=device))\n elif dev_mode:\n pass\n else:\n steps_per_epoch = len(train_dataloader)\n training_epochs = 3\n optimizer = Adam(finetuned_model.parameters(), lr=3e-5, eps=1e-8)\n\n def lr_lambda(current_step: int):\n return max(0.0, float(training_epochs * steps_per_epoch - current_step) / float(training_epochs * steps_per_epoch))\n\n lr_scheduler = LambdaLR(optimizer, lr_lambda)\n training(finetuned_model, optimizer, fake_criterion, lr_scheduler=lr_scheduler,\n max_epochs=training_epochs, train_dataloader=train_dataloader, log_path=log_dir / 'finetuning_on_downstream.log',\n save_best_model=True, save_path=finetuned_model_state_path, evaluation_func=evaluation_func, device=device)\n return finetuned_model\n\n\nfinetuned_model = create_finetuned_model()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pruning\nAccording to experience, it is easier to achieve good results by pruning the attention part and the FFN part in stages.\nOf course, pruning together can also achieve the similar effect, but more parameter adjustment attempts are required.\nSo in this section, we do pruning in stages.\n\nFirst, we prune the attention layer with MovementPruner.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"steps_per_epoch = len(train_dataloader)\n\n# Set training steps/epochs for pruning.\n\nif not dev_mode:\n total_epochs = 4\n total_steps = total_epochs * steps_per_epoch\n warmup_steps = 1 * steps_per_epoch\n cooldown_steps = 1 * steps_per_epoch\nelse:\n total_epochs = 1\n total_steps = 3\n warmup_steps = 1\n cooldown_steps = 1\n\n# Initialize evaluator used by MovementPruner.\n\nimport nni\nfrom nni.compression.pytorch import TorchEvaluator\n\nmovement_training = functools.partial(training, train_dataloader=train_dataloader,\n log_path=log_dir / 'movement_pruning.log',\n evaluation_func=evaluation_func, device=device)\ntraced_optimizer = nni.trace(Adam)(finetuned_model.parameters(), lr=3e-5, eps=1e-8)\n\ndef lr_lambda(current_step: int):\n if current_step < warmup_steps:\n return float(current_step) / warmup_steps\n return max(0.0, float(total_steps - current_step) / float(total_steps - warmup_steps))\n\ntraced_scheduler = nni.trace(LambdaLR)(traced_optimizer, lr_lambda)\nevaluator = TorchEvaluator(movement_training, traced_optimizer, fake_criterion, traced_scheduler)\n\n# Apply block-soft-movement pruning on attention layers.\n# Note that block sparse is introduced by `sparse_granularity='auto'`, and only support `bert`, `bart`, `t5` right now.\n\nfrom nni.compression.pytorch.pruning import MovementPruner\n\nconfig_list = [{\n 'op_types': ['Linear'],\n 'op_partial_names': ['bert.encoder.layer.{}.attention'.format(i) for i in range(layers_num)],\n 'sparsity': 0.1\n}]\n\npruner = MovementPruner(model=finetuned_model,\n config_list=config_list,\n evaluator=evaluator,\n training_epochs=total_epochs,\n training_steps=total_steps,\n warm_up_step=warmup_steps,\n cool_down_beginning_step=total_steps - cooldown_steps,\n regular_scale=10,\n movement_mode='soft',\n sparse_granularity='auto')\n_, attention_masks = pruner.compress()\npruner.show_pruned_weights()\n\ntorch.save(attention_masks, Path(log_dir) / 'attention_masks.pth')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load a new finetuned model to do speedup, you can think of this as using the finetuned state to initialize the pruned model weights.\nNote that nni speedup don't support replacing attention module, so here we manully replace the attention module.\n\nIf the head is entire masked, physically prune it and create config_list for FFN pruning.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"attention_pruned_model = create_finetuned_model().to(device)\nattention_masks = torch.load(Path(log_dir) / 'attention_masks.pth')\n\nffn_config_list = []\nlayer_remained_idxs = []\nmodule_list = []\nfor i in range(0, layers_num):\n prefix = f'bert.encoder.layer.{i}.'\n value_mask: torch.Tensor = attention_masks[prefix + 'attention.self.value']['weight']\n head_mask = (value_mask.reshape(heads_num, -1).sum(-1) == 0.).to(\"cpu\")\n head_idxs = torch.arange(len(head_mask))[head_mask].long().tolist()\n print(f'layer {i} prune {len(head_idxs)} head: {head_idxs}')\n if len(head_idxs) != heads_num:\n attention_pruned_model.bert.encoder.layer[i].attention.prune_heads(head_idxs)\n module_list.append(attention_pruned_model.bert.encoder.layer[i])\n # The final ffn weight remaining ratio is the half of the attention weight remaining ratio.\n # This is just an empirical configuration, you can use any other method to determine this sparsity.\n sparsity = 1 - (1 - len(head_idxs) / heads_num) * 0.5\n # here we use a simple sparsity schedule, we will prune ffn in 12 iterations, each iteration prune `sparsity_per_iter`.\n sparsity_per_iter = 1 - (1 - sparsity) ** (1 / 12)\n ffn_config_list.append({\n 'op_names': [f'bert.encoder.layer.{len(layer_remained_idxs)}.intermediate.dense'],\n 'sparsity': sparsity_per_iter\n })\n layer_remained_idxs.append(i)\n\nattention_pruned_model.bert.encoder.layer = torch.nn.ModuleList(module_list)\ndistil_func = functools.partial(distil_loss_func, encoder_layer_idxs=layer_remained_idxs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrain the attention pruned model with distillation.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if not dev_mode:\n total_epochs = 5\n total_steps = None\n distillation = True\nelse:\n total_epochs = 1\n total_steps = 1\n distillation = False\n\nteacher_model = create_finetuned_model()\noptimizer = Adam(attention_pruned_model.parameters(), lr=3e-5, eps=1e-8)\n\ndef lr_lambda(current_step: int):\n return max(0.0, float(total_epochs * steps_per_epoch - current_step) / float(total_epochs * steps_per_epoch))\n\nlr_scheduler = LambdaLR(optimizer, lr_lambda)\nat_model_save_path = log_dir / 'attention_pruned_model_state.pth'\ntraining(attention_pruned_model, optimizer, fake_criterion, lr_scheduler=lr_scheduler, max_epochs=total_epochs,\n max_steps=total_steps, train_dataloader=train_dataloader, distillation=distillation, teacher_model=teacher_model,\n distil_func=distil_func, log_path=log_dir / 'retraining.log', save_best_model=True, save_path=at_model_save_path,\n evaluation_func=evaluation_func, device=device)\n\nif not dev_mode:\n attention_pruned_model.load_state_dict(torch.load(at_model_save_path))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iterative pruning FFN with TaylorFOWeightPruner in 12 iterations.\nFinetuning 3000 steps after each pruning iteration, then finetuning 2 epochs after pruning finished.\n\nNNI will support per-step-pruning-schedule in the future, then can use an pruner to replace the following code.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if not dev_mode:\n total_epochs = 7\n total_steps = None\n taylor_pruner_steps = 1000\n steps_per_iteration = 3000\n total_pruning_steps = 36000\n distillation = True\nelse:\n total_epochs = 1\n total_steps = 6\n taylor_pruner_steps = 2\n steps_per_iteration = 2\n total_pruning_steps = 4\n distillation = False\n\nfrom nni.compression.pytorch.pruning import TaylorFOWeightPruner\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\ndistil_training = functools.partial(training, train_dataloader=train_dataloader, distillation=distillation,\n teacher_model=teacher_model, distil_func=distil_func, device=device)\ntraced_optimizer = nni.trace(Adam)(attention_pruned_model.parameters(), lr=3e-5, eps=1e-8)\nevaluator = TorchEvaluator(distil_training, traced_optimizer, fake_criterion)\n\ncurrent_step = 0\nbest_result = 0\ninit_lr = 3e-5\n\ndummy_input = torch.rand(8, 128, 768).to(device)\n\nattention_pruned_model.train()\nfor current_epoch in range(total_epochs):\n for batch in train_dataloader:\n if total_steps and current_step >= total_steps:\n break\n # pruning with TaylorFOWeightPruner & reinitialize optimizer\n if current_step % steps_per_iteration == 0 and current_step < total_pruning_steps:\n check_point = attention_pruned_model.state_dict()\n pruner = TaylorFOWeightPruner(attention_pruned_model, ffn_config_list, evaluator, taylor_pruner_steps)\n _, ffn_masks = pruner.compress()\n renamed_ffn_masks = {}\n # rename the masks keys, because we only speedup the bert.encoder\n for model_name, targets_mask in ffn_masks.items():\n renamed_ffn_masks[model_name.split('bert.encoder.')[1]] = targets_mask\n pruner._unwrap_model()\n attention_pruned_model.load_state_dict(check_point)\n ModelSpeedup(attention_pruned_model.bert.encoder, dummy_input, renamed_ffn_masks).speedup_model()\n optimizer = Adam(attention_pruned_model.parameters(), lr=init_lr)\n\n batch.to(device)\n # manually schedule lr\n for params_group in optimizer.param_groups:\n params_group['lr'] = (1 - current_step / (total_epochs * steps_per_epoch)) * init_lr\n\n outputs = attention_pruned_model(**batch)\n loss = outputs.loss\n\n # distillation\n if distillation:\n assert teacher_model is not None\n with torch.no_grad():\n teacher_outputs = teacher_model(**batch)\n distil_loss = distil_func(outputs, teacher_outputs)\n loss = 0.1 * loss + 0.9 * distil_loss\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n current_step += 1\n\n if current_step % 1000 == 0 or current_step % len(train_dataloader) == 0:\n result = evaluation_func(attention_pruned_model)\n with (log_dir / 'ffn_pruning.log').open('a+') as f:\n msg = '[{}] Epoch {}, Step {}: {}\\n'.format(time.asctime(time.localtime(time.time())),\n current_epoch, current_step, result)\n f.write(msg)\n if current_step >= total_pruning_steps and best_result < result['default']:\n torch.save(attention_pruned_model, log_dir / 'best_model.pth')\n best_result = result['default']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Result\nThe speedup is test on the entire validation dataset with batch size 128 on A100.\nWe test under two pytorch version and found the latency varying widely.\n\nSetting 1: pytorch 1.12.1\n\nSetting 2: pytorch 1.10.0\n\n.. list-table:: Prune Bert-base-uncased on MNLI\n :header-rows: 1\n :widths: auto\n\n * - Attention Pruning Method\n - FFN Pruning Method\n - Total Sparsity\n - Accuracy\n - Acc. Drop\n - Speedup (S1)\n - Speedup (S2)\n * -\n -\n - 85.1M (-0.0%)\n - 84.85 / 85.28\n - +0.0 / +0.0\n - 25.60s (x1.00)\n - 8.10s (x1.00)\n * - `movement-pruner` (soft, sparsity=0.1, regular_scale=1)\n - `taylor-fo-weight-pruner`\n - 54.1M (-36.43%)\n - 85.38 / 85.41\n - +0.53 / +0.13\n - 17.93s (x1.43)\n - 7.22s (x1.12)\n * - `movement-pruner` (soft, sparsity=0.1, regular_scale=5)\n - `taylor-fo-weight-pruner`\n - 37.1M (-56.40%)\n - 84.73 / 85.12\n - -0.12 / -0.16\n - 12.83s (x2.00)\n - 5.61s (x1.44)\n * - `movement-pruner` (soft, sparsity=0.1, regular_scale=10)\n - `taylor-fo-weight-pruner`\n - 24.1M (-71.68%)\n - 84.14 / 84.78\n - -0.71 / -0.50\n - 8.93s (x2.87)\n - 4.55s (x1.78)\n * - `movement-pruner` (soft, sparsity=0.1, regular_scale=20)\n - `taylor-fo-weight-pruner`\n - 14.3M (-83.20%)\n - 83.26 / 82.96\n - -1.59 / -2.32\n - 5.98s (x4.28)\n - 3.56s (x2.28)\n * - `movement-pruner` (soft, sparsity=0.1, regular_scale=30)\n - `taylor-fo-weight-pruner`\n - 9.9M (-88.37%)\n - 82.22 / 82.19\n - -2.63 / -3.09\n - 4.36s (x5.88)\n - 3.12s (x2.60)\n * - `movement-pruner` (soft, sparsity=0.1, regular_scale=40)\n - `taylor-fo-weight-pruner`\n - 8.8M (-89.66%)\n - 81.64 / 82.39\n - -3.21 / -2.89\n - 3.88s (x6.60)\n - 2.81s (x2.88)\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

607
docs/source/tutorials/pruning_bert_glue.py generated
View file

@ -1,607 +0,0 @@
"""
Pruning Bert on Task MNLI
=========================
Workable Pruning Process
------------------------
Here we show an effective transformer pruning process that the NNI team has tried; users can also use NNI to discover better processes.
The entire pruning process can be divided into the following steps:
1. Finetune the pre-trained model on the downstream task. In our experience,
   pruning a finetuned model gives better final performance than pruning the pre-trained model directly.
   The finetuned model obtained in this step is also used as the teacher model for the following
   distillation training.
2. Prune the attention layers first. Here we apply block sparsity to the attention layer weights,
   and directly prune a head (condense the weight) if that head is fully masked.
   If a head is only partially masked, we do not prune it and instead recover its weights.
3. Retrain the head-pruned model with distillation to recover the model precision before pruning the FFN layers.
4. Prune the FFN layers. Here we prune the output channels of the 1st FFN layer;
   the input channels of the 2nd FFN layer are pruned accordingly because they depend on the 1st layer's output channels.
5. Retrain the final pruned model with distillation.
While pruning transformers, we gained the following experience:
* We use :ref:`movement-pruner` in step 2 and :ref:`taylor-fo-weight-pruner` in step 4. :ref:`movement-pruner` performs well on attention layers,
  and :ref:`taylor-fo-weight-pruner` performs well on FFN layers. Both are gradient-based pruning algorithms;
  we also tried weight-based pruning algorithms such as :ref:`l1-norm-pruner`, but they did not work well in this scenario.
* Distillation is a good way to recover model precision. In our results, it usually brings a 1~2% accuracy improvement when pruning Bert on the MNLI task.
* It is necessary to increase the sparsity gradually rather than jumping to a very high sparsity all at once.
Experiment
----------
The complete pruning process will take about 8 hours on one A100.
Preparation
^^^^^^^^^^^
This section mainly obtains a finetuned model on the downstream task.
If you are familiar with finetuning Bert on GLUE datasets, you can skip this section.
.. note::
    Please set ``dev_mode`` to ``False`` to run this tutorial. ``dev_mode`` defaults to ``True`` only for documentation generation.
"""
dev_mode = True
# %%
# Some basic settings.
from pathlib import Path
from typing import Callable, Dict
pretrained_model_name_or_path = 'bert-base-uncased'
task_name = 'mnli'
experiment_id = 'pruning_bert_mnli'
# heads_num and layers_num should align with pretrained_model_name_or_path
heads_num = 12
layers_num = 12
# used to save the experiment log
log_dir = Path(f'./pruning_log/{pretrained_model_name_or_path}/{task_name}/{experiment_id}')
log_dir.mkdir(parents=True, exist_ok=True)
# used to save the finetuned model and share it between different experiments with the same pretrained_model_name_or_path and task_name
model_dir = Path(f'./models/{pretrained_model_name_or_path}/{task_name}')
model_dir.mkdir(parents=True, exist_ok=True)
# used to save GLUE data
data_dir = Path(f'./data')
data_dir.mkdir(parents=True, exist_ok=True)
# set seed
from transformers import set_seed
set_seed(1024)
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# %%
# Create dataloaders.
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import BertTokenizerFast, DataCollatorWithPadding
task_to_keys = {
'cola': ('sentence', None),
'mnli': ('premise', 'hypothesis'),
'mrpc': ('sentence1', 'sentence2'),
'qnli': ('question', 'sentence'),
'qqp': ('question1', 'question2'),
'rte': ('sentence1', 'sentence2'),
'sst2': ('sentence', None),
'stsb': ('sentence1', 'sentence2'),
'wnli': ('sentence1', 'sentence2'),
}
def prepare_dataloaders(cache_dir=data_dir, train_batch_size=32, eval_batch_size=32):
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path)
sentence1_key, sentence2_key = task_to_keys[task_name]
data_collator = DataCollatorWithPadding(tokenizer)
# used to preprocess the raw data
def preprocess_function(examples):
# Tokenize the texts
args = (
(examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
)
result = tokenizer(*args, padding=False, max_length=128, truncation=True)
if 'label' in examples:
# In all cases, rename the column to labels because the model will expect that.
result['labels'] = examples['label']
return result
raw_datasets = load_dataset('glue', task_name, cache_dir=cache_dir)
for key in list(raw_datasets.keys()):
if 'test' in key:
raw_datasets.pop(key)
processed_datasets = raw_datasets.map(preprocess_function, batched=True,
remove_columns=raw_datasets['train'].column_names)
train_dataset = processed_datasets['train']
if task_name == 'mnli':
validation_datasets = {
'validation_matched': processed_datasets['validation_matched'],
'validation_mismatched': processed_datasets['validation_mismatched']
}
else:
validation_datasets = {
'validation': processed_datasets['validation']
}
train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator, batch_size=train_batch_size)
validation_dataloaders = {
val_name: DataLoader(val_dataset, collate_fn=data_collator, batch_size=eval_batch_size) \
for val_name, val_dataset in validation_datasets.items()
}
return train_dataloader, validation_dataloaders
train_dataloader, validation_dataloaders = prepare_dataloaders()
# %%
# Training function & evaluation function.
import functools
import time
import torch.nn.functional as F
from datasets import load_metric
from transformers.modeling_outputs import SequenceClassifierOutput
from nni.common.types import SCHEDULER
def training(model: torch.nn.Module,
optimizer: torch.optim.Optimizer,
criterion: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
lr_scheduler: SCHEDULER = None,
max_steps: int = None,
max_epochs: int = None,
train_dataloader: DataLoader = None,
distillation: bool = False,
teacher_model: torch.nn.Module = None,
distil_func: Callable = None,
log_path: str = Path(log_dir) / 'training.log',
save_best_model: bool = False,
save_path: str = None,
evaluation_func: Callable = None,
eval_per_steps: int = 1000,
device=None):
assert train_dataloader is not None
model.train()
if teacher_model is not None:
teacher_model.eval()
current_step = 0
best_result = 0
total_epochs = max_steps // len(train_dataloader) + 1 if max_steps else max_epochs if max_epochs else 3
total_steps = max_steps if max_steps else total_epochs * len(train_dataloader)
print(f'Training {total_epochs} epochs, {total_steps} steps...')
for current_epoch in range(total_epochs):
for batch in train_dataloader:
if current_step >= total_steps:
return
batch.to(device)
outputs = model(**batch)
loss = outputs.loss
if distillation:
assert teacher_model is not None
with torch.no_grad():
teacher_outputs = teacher_model(**batch)
distil_loss = distil_func(outputs, teacher_outputs)
loss = 0.1 * loss + 0.9 * distil_loss
loss = criterion(loss, None)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# per step schedule
if lr_scheduler:
lr_scheduler.step()
current_step += 1
if current_step % eval_per_steps == 0 or current_step % len(train_dataloader) == 0:
result = evaluation_func(model) if evaluation_func else None
with (log_path).open('a+') as f:
msg = '[{}] Epoch {}, Step {}: {}\n'.format(time.asctime(time.localtime(time.time())), current_epoch, current_step, result)
f.write(msg)
# if it's the best model, save it.
if save_best_model and (result is None or best_result < result['default']):
assert save_path is not None
torch.save(model.state_dict(), save_path)
best_result = None if result is None else result['default']
def distil_loss_func(stu_outputs: SequenceClassifierOutput, tea_outputs: SequenceClassifierOutput, encoder_layer_idxs=[]):
encoder_hidden_state_loss = []
for i, idx in enumerate(encoder_layer_idxs[:-1]):
encoder_hidden_state_loss.append(F.mse_loss(stu_outputs.hidden_states[i], tea_outputs.hidden_states[idx]))
logits_loss = F.kl_div(F.log_softmax(stu_outputs.logits / 2, dim=-1), F.softmax(tea_outputs.logits / 2, dim=-1), reduction='batchmean') * (2 ** 2)
distil_loss = 0
for loss in encoder_hidden_state_loss:
distil_loss += loss
distil_loss += logits_loss
return distil_loss
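# A note on the logits term above: the student / teacher logits are softened with a
# temperature of 2 (divided by 2 before the softmax), and the KL divergence is rescaled
# by 2 ** 2. Scaling by the squared temperature is the standard correction that keeps
# the soft-target gradients on a magnitude comparable to the hard-label loss, so the
# 0.1 / 0.9 weighting used in ``training`` stays meaningful.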
def evaluation(model: torch.nn.Module, validation_dataloaders: Dict[str, DataLoader] = None, device=None):
assert validation_dataloaders is not None
training = model.training
model.eval()
is_regression = task_name == 'stsb'
metric = load_metric('glue', task_name)
result = {}
default_result = 0
for val_name, validation_dataloader in validation_dataloaders.items():
for batch in validation_dataloader:
batch.to(device)
outputs = model(**batch)
predictions = outputs.logits.argmax(dim=-1) if not is_regression else outputs.logits.squeeze()
metric.add_batch(
predictions=predictions,
references=batch['labels'],
)
result[val_name] = metric.compute()
default_result += result[val_name].get('f1', result[val_name].get('accuracy', 0))
result['default'] = default_result / len(result)
model.train(training)
return result
evaluation_func = functools.partial(evaluation, validation_dataloaders=validation_dataloaders, device=device)
def fake_criterion(loss, _):
return loss
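# ``fake_criterion`` simply passes the loss through: the ``criterion`` hook in the
# ``training`` function above is called as ``criterion(loss, None)`` on an
# already-computed loss, so no extra computation is needed here.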
# %%
# Prepare the pre-trained model and finetune it on the downstream task.
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from transformers import BertForSequenceClassification
def create_pretrained_model():
is_regression = task_name == 'stsb'
num_labels = 1 if is_regression else (3 if task_name == 'mnli' else 2)
model = BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path, num_labels=num_labels)
model.bert.config.output_hidden_states = True
return model
def create_finetuned_model():
finetuned_model = create_pretrained_model().to(device)
finetuned_model_state_path = Path(model_dir) / 'finetuned_model_state.pth'
if finetuned_model_state_path.exists():
finetuned_model.load_state_dict(torch.load(finetuned_model_state_path, map_location=device))
elif dev_mode:
pass
else:
steps_per_epoch = len(train_dataloader)
training_epochs = 3
optimizer = Adam(finetuned_model.parameters(), lr=3e-5, eps=1e-8)
def lr_lambda(current_step: int):
return max(0.0, float(training_epochs * steps_per_epoch - current_step) / float(training_epochs * steps_per_epoch))
lr_scheduler = LambdaLR(optimizer, lr_lambda)
training(finetuned_model, optimizer, fake_criterion, lr_scheduler=lr_scheduler,
max_epochs=training_epochs, train_dataloader=train_dataloader, log_path=log_dir / 'finetuning_on_downstream.log',
save_best_model=True, save_path=finetuned_model_state_path, evaluation_func=evaluation_func, device=device)
return finetuned_model
finetuned_model = create_finetuned_model()
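# Optional sanity check (a minimal sketch): evaluate the finetuned baseline before any
# pruning. With ``dev_mode=True`` and no cached checkpoint the classification head is
# untrained, so the reported score would not be meaningful.
if not dev_mode:
    print('finetuned baseline:', evaluation_func(finetuned_model))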
# %%
# Pruning
# ^^^^^^^
# In our experience, it is easier to achieve good results by pruning the attention part and the FFN part in stages.
# Pruning both together can achieve a similar effect, but it requires more effort to tune the parameters.
# So in this section, we prune in stages.
#
# First, we prune the attention layers with MovementPruner.
steps_per_epoch = len(train_dataloader)
# Set training steps/epochs for pruning.
if not dev_mode:
total_epochs = 4
total_steps = total_epochs * steps_per_epoch
warmup_steps = 1 * steps_per_epoch
cooldown_steps = 1 * steps_per_epoch
else:
total_epochs = 1
total_steps = 3
warmup_steps = 1
cooldown_steps = 1
# Initialize evaluator used by MovementPruner.
import nni
from nni.compression.pytorch import TorchEvaluator
movement_training = functools.partial(training, train_dataloader=train_dataloader,
log_path=log_dir / 'movement_pruning.log',
evaluation_func=evaluation_func, device=device)
traced_optimizer = nni.trace(Adam)(finetuned_model.parameters(), lr=3e-5, eps=1e-8)
def lr_lambda(current_step: int):
if current_step < warmup_steps:
return float(current_step) / warmup_steps
return max(0.0, float(total_steps - current_step) / float(total_steps - warmup_steps))
traced_scheduler = nni.trace(LambdaLR)(traced_optimizer, lr_lambda)
evaluator = TorchEvaluator(movement_training, traced_optimizer, fake_criterion, traced_scheduler)
# Apply block-soft-movement pruning on attention layers.
# Note that block sparsity is introduced by `sparse_granularity='auto'`, and it only supports `bert`, `bart`, and `t5` right now.
from nni.compression.pytorch.pruning import MovementPruner
config_list = [{
'op_types': ['Linear'],
'op_partial_names': ['bert.encoder.layer.{}.attention'.format(i) for i in range(layers_num)],
'sparsity': 0.1
}]
pruner = MovementPruner(model=finetuned_model,
config_list=config_list,
evaluator=evaluator,
training_epochs=total_epochs,
training_steps=total_steps,
warm_up_step=warmup_steps,
cool_down_beginning_step=total_steps - cooldown_steps,
regular_scale=10,
movement_mode='soft',
sparse_granularity='auto')
_, attention_masks = pruner.compress()
pruner.show_pruned_weights()
torch.save(attention_masks, Path(log_dir) / 'attention_masks.pth')
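# Optional inspection (a quick sketch): ``attention_masks`` maps each pruned module
# name to a dict of target masks, so we can count how many heads are fully masked
# per layer before doing the physical pruning in the next step.
for module_name, target_masks in attention_masks.items():
    if module_name.endswith('attention.self.value'):
        fully_masked = (target_masks['weight'].reshape(heads_num, -1).sum(-1) == 0.)
        print(f'{module_name}: {int(fully_masked.sum().item())} / {heads_num} heads fully masked')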
# %%
# Load a new finetuned model for speedup; you can think of this as using the finetuned state to initialize the pruned model weights.
# Note that NNI speedup does not support replacing the attention module, so here we replace the attention module manually.
#
# If a head is entirely masked, physically prune it, and create the config_list for FFN pruning.
attention_pruned_model = create_finetuned_model().to(device)
attention_masks = torch.load(Path(log_dir) / 'attention_masks.pth')
ffn_config_list = []
layer_remained_idxs = []
module_list = []
for i in range(0, layers_num):
prefix = f'bert.encoder.layer.{i}.'
value_mask: torch.Tensor = attention_masks[prefix + 'attention.self.value']['weight']
head_mask = (value_mask.reshape(heads_num, -1).sum(-1) == 0.).to("cpu")
head_idxs = torch.arange(len(head_mask))[head_mask].long().tolist()
print(f'layer {i} prune {len(head_idxs)} head: {head_idxs}')
if len(head_idxs) != heads_num:
attention_pruned_model.bert.encoder.layer[i].attention.prune_heads(head_idxs)
module_list.append(attention_pruned_model.bert.encoder.layer[i])
# The final FFN weight remaining ratio is half of the attention weight remaining ratio.
# This is just an empirical configuration; you can use any other method to determine this sparsity.
sparsity = 1 - (1 - len(head_idxs) / heads_num) * 0.5
# Here we use a simple sparsity schedule: we prune the FFN in 12 iterations, pruning `sparsity_per_iter` in each iteration.
sparsity_per_iter = 1 - (1 - sparsity) ** (1 / 12)
ffn_config_list.append({
'op_names': [f'bert.encoder.layer.{len(layer_remained_idxs)}.intermediate.dense'],
'sparsity': sparsity_per_iter
})
layer_remained_idxs.append(i)
attention_pruned_model.bert.encoder.layer = torch.nn.ModuleList(module_list)
distil_func = functools.partial(distil_loss_func, encoder_layer_idxs=layer_remained_idxs)
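# Worked example of the schedule above: if a layer keeps 8 of 12 heads, its attention
# remaining ratio is 8 / 12 ~= 0.667, so the target FFN remaining ratio is half of
# that, ~0.333 (i.e. an overall sparsity of ~0.667). Spreading this over 12 pruning
# iterations gives sparsity_per_iter = 1 - 0.333 ** (1 / 12) ~= 0.088, because the
# remaining ratio shrinks multiplicatively by (1 - sparsity_per_iter) each iteration.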
# %%
# Retrain the attention-pruned model with distillation.
if not dev_mode:
total_epochs = 5
total_steps = None
distillation = True
else:
total_epochs = 1
total_steps = 1
distillation = False
teacher_model = create_finetuned_model()
optimizer = Adam(attention_pruned_model.parameters(), lr=3e-5, eps=1e-8)
def lr_lambda(current_step: int):
return max(0.0, float(total_epochs * steps_per_epoch - current_step) / float(total_epochs * steps_per_epoch))
lr_scheduler = LambdaLR(optimizer, lr_lambda)
at_model_save_path = log_dir / 'attention_pruned_model_state.pth'
training(attention_pruned_model, optimizer, fake_criterion, lr_scheduler=lr_scheduler, max_epochs=total_epochs,
max_steps=total_steps, train_dataloader=train_dataloader, distillation=distillation, teacher_model=teacher_model,
distil_func=distil_func, log_path=log_dir / 'retraining.log', save_best_model=True, save_path=at_model_save_path,
evaluation_func=evaluation_func, device=device)
if not dev_mode:
attention_pruned_model.load_state_dict(torch.load(at_model_save_path))
# %%
# Iteratively prune the FFN layers with TaylorFOWeightPruner over 12 iterations.
# Finetune for 3000 steps after each pruning iteration, then finetune for 2 more epochs after pruning has finished.
#
# NNI will support a per-step pruning schedule in the future; a single pruner will then be able to replace the following code.
if not dev_mode:
total_epochs = 7
total_steps = None
taylor_pruner_steps = 1000
steps_per_iteration = 3000
total_pruning_steps = 36000
distillation = True
else:
total_epochs = 1
total_steps = 6
taylor_pruner_steps = 2
steps_per_iteration = 2
total_pruning_steps = 4
distillation = False
from nni.compression.pytorch.pruning import TaylorFOWeightPruner
from nni.compression.pytorch.speedup import ModelSpeedup
distil_training = functools.partial(training, train_dataloader=train_dataloader, distillation=distillation,
teacher_model=teacher_model, distil_func=distil_func, device=device)
traced_optimizer = nni.trace(Adam)(attention_pruned_model.parameters(), lr=3e-5, eps=1e-8)
evaluator = TorchEvaluator(distil_training, traced_optimizer, fake_criterion)
current_step = 0
best_result = 0
init_lr = 3e-5
dummy_input = torch.rand(8, 128, 768).to(device)
attention_pruned_model.train()
for current_epoch in range(total_epochs):
for batch in train_dataloader:
if total_steps and current_step >= total_steps:
break
# pruning with TaylorFOWeightPruner & reinitialize optimizer
if current_step % steps_per_iteration == 0 and current_step < total_pruning_steps:
check_point = attention_pruned_model.state_dict()
pruner = TaylorFOWeightPruner(attention_pruned_model, ffn_config_list, evaluator, taylor_pruner_steps)
_, ffn_masks = pruner.compress()
renamed_ffn_masks = {}
# rename the masks keys, because we only speedup the bert.encoder
for model_name, targets_mask in ffn_masks.items():
renamed_ffn_masks[model_name.split('bert.encoder.')[1]] = targets_mask
pruner._unwrap_model()
attention_pruned_model.load_state_dict(check_point)
ModelSpeedup(attention_pruned_model.bert.encoder, dummy_input, renamed_ffn_masks).speedup_model()
optimizer = Adam(attention_pruned_model.parameters(), lr=init_lr)
batch.to(device)
# manually schedule lr
for params_group in optimizer.param_groups:
params_group['lr'] = (1 - current_step / (total_epochs * steps_per_epoch)) * init_lr
outputs = attention_pruned_model(**batch)
loss = outputs.loss
# distillation
if distillation:
assert teacher_model is not None
with torch.no_grad():
teacher_outputs = teacher_model(**batch)
distil_loss = distil_func(outputs, teacher_outputs)
loss = 0.1 * loss + 0.9 * distil_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
current_step += 1
if current_step % 1000 == 0 or current_step % len(train_dataloader) == 0:
result = evaluation_func(attention_pruned_model)
with (log_dir / 'ffn_pruning.log').open('a+') as f:
msg = '[{}] Epoch {}, Step {}: {}\n'.format(time.asctime(time.localtime(time.time())),
current_epoch, current_step, result)
f.write(msg)
if current_step >= total_pruning_steps and best_result < result['default']:
torch.save(attention_pruned_model, log_dir / 'best_model.pth')
best_result = result['default']
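# A minimal follow-up sketch: the best checkpoint above is saved as a whole pickled
# module, so it can be reloaded and re-evaluated directly once pruning has finished.
if (log_dir / 'best_model.pth').exists():
    best_pruned_model = torch.load(log_dir / 'best_model.pth', map_location=device)
    print('best pruned model:', evaluation_func(best_pruned_model))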
# %%
# Result
# ------
# The speedup is tested on the entire validation dataset with batch size 128 on an A100.
# We tested under two PyTorch versions and found that the latency varies widely.
#
# Setting 1: pytorch 1.12.1
#
# Setting 2: pytorch 1.10.0
#
# .. list-table:: Prune Bert-base-uncased on MNLI
# :header-rows: 1
# :widths: auto
#
# * - Attention Pruning Method
# - FFN Pruning Method
# - Total Sparsity
# - Accuracy
# - Acc. Drop
# - Speedup (S1)
# - Speedup (S2)
# * -
# -
# - 85.1M (-0.0%)
# - 84.85 / 85.28
# - +0.0 / +0.0
# - 25.60s (x1.00)
# - 8.10s (x1.00)
# * - :ref:`movement-pruner` (soft, sparsity=0.1, regular_scale=1)
# - :ref:`taylor-fo-weight-pruner`
# - 54.1M (-36.43%)
# - 85.38 / 85.41
# - +0.53 / +0.13
# - 17.93s (x1.43)
# - 7.22s (x1.12)
# * - :ref:`movement-pruner` (soft, sparsity=0.1, regular_scale=5)
# - :ref:`taylor-fo-weight-pruner`
# - 37.1M (-56.40%)
# - 84.73 / 85.12
# - -0.12 / -0.16
# - 12.83s (x2.00)
# - 5.61s (x1.44)
# * - :ref:`movement-pruner` (soft, sparsity=0.1, regular_scale=10)
# - :ref:`taylor-fo-weight-pruner`
# - 24.1M (-71.68%)
# - 84.14 / 84.78
# - -0.71 / -0.50
# - 8.93s (x2.87)
# - 4.55s (x1.78)
# * - :ref:`movement-pruner` (soft, sparsity=0.1, regular_scale=20)
# - :ref:`taylor-fo-weight-pruner`
# - 14.3M (-83.20%)
# - 83.26 / 82.96
# - -1.59 / -2.32
# - 5.98s (x4.28)
# - 3.56s (x2.28)
# * - :ref:`movement-pruner` (soft, sparsity=0.1, regular_scale=30)
# - :ref:`taylor-fo-weight-pruner`
# - 9.9M (-88.37%)
# - 82.22 / 82.19
# - -2.63 / -3.09
# - 4.36s (x5.88)
# - 3.12s (x2.60)
# * - :ref:`movement-pruner` (soft, sparsity=0.1, regular_scale=40)
# - :ref:`taylor-fo-weight-pruner`
# - 8.8M (-89.66%)
# - 81.64 / 82.39
# - -3.21 / -2.89
# - 3.88s (x6.60)
# - 2.81s (x2.88)

1
docs/source/tutorials/pruning_bert_glue.py.md5 generated
View file

@ -1 +0,0 @@
822d1933bb3b99080589c0cdf89cf89e

1140
docs/source/tutorials/pruning_bert_glue.rst generated

File diff suppressed because it is too large. Load diff

Binary data
docs/source/tutorials/pruning_bert_glue_codeobj.pickle generated

Binary file not shown.

View file

@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Pruning Quickstart\n\nHere is a three-minute video to get you started with model pruning.\n\n.. youtube:: wKh51Jnr0a8\n :align: center\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nThere are three common practices for pruning a DNN model:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model\n#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model\n#. Pruning a model -> Training the pruned model from scratch\n\nNNI supports all of the above pruning practices by working on the key pruning stage.\nFollowing this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
"\n# Pruning Quickstart\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nThere are three common practices for pruning a DNN model:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model\n#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model\n#. Pruning a model -> Training the pruned model from scratch\n\nNNI supports all of the above pruning practices by working on the key pruning stage.\nFollowing this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
]
},
{
@ -51,7 +51,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pruning Model\n\nUsing L1NormPruner to prune the model and generate the masks.\nUsually, a pruner requires original model and ``config_list`` as its inputs.\nDetailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.\n\nThe following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,\nexcept the layer named `fc3`, because `fc3` is `exclude`.\nThe final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.\n\n"
"## Pruning Model\n\nUsing L1NormPruner to prune the model and generate the masks.\nUsually, a pruner requires original model and ``config_list`` as its inputs.\nDetailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/config_list>`.\n\nThe following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,\nexcept the layer named `fc3`, because `fc3` is `exclude`.\nThe final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.\n\n"
]
},
{
@ -62,7 +62,7 @@
},
"outputs": [],
"source": [
"config_list = [{\n 'sparsity_per_layer': 0.5,\n 'op_types': ['Linear', 'Conv2d']\n}, {\n 'exclude': True,\n 'op_names': ['fc3']\n}]"
"config_list = [{\n 'op_types': ['Linear', 'Conv2d'],\n 'exclude_op_names': ['fc3'],\n 'sparse_ratio': 0.5\n}]"
]
},
{
@ -80,7 +80,7 @@
},
"outputs": [],
"source": [
"from nni.compression.pytorch.pruning import L1NormPruner\npruner = L1NormPruner(model, config_list)\n\n# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.\nprint(model)"
"from nni.contrib.compression.pruning import L1NormPruner\npruner = L1NormPruner(model, config_list)\n\n# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.\nprint(model)"
]
},
{
@ -109,7 +109,7 @@
},
"outputs": [],
"source": [
"# need to unwrap the model, if the model is wrapped before speedup\npruner._unwrap_model()\n\n# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
"# need to unwrap the model, if the model is wrapped before speedup\npruner.unwrap_model()\n\n# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.\nfrom nni.compression.pytorch.speedup.v2 import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
]
},
{

View file

@ -2,11 +2,6 @@
Pruning Quickstart
==================
Here is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
:align: center
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
@ -55,24 +50,22 @@ for epoch in range(3):
#
# Using L1NormPruner to prune the model and generate the masks.
# Usually, a pruner requires original model and ``config_list`` as its inputs.
# Detailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.
# Detailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/config_list>`.
#
# The following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,
# except the layer named `fc3`, because `fc3` is `exclude`.
# The final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
'op_types': ['Linear', 'Conv2d'],
'exclude_op_names': ['fc3'],
'sparse_ratio': 0.5
}]
# %%
# Pruners usually require `model` and `config_list` as input arguments.
from nni.compression.pytorch.pruning import L1NormPruner
from nni.contrib.compression.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.
@ -92,10 +85,10 @@ for name, mask in masks.items():
# and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.
# need to unwrap the model, if the model is wrapped before speedup
pruner._unwrap_model()
pruner.unwrap_model()
# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.
from nni.compression.pytorch.speedup import ModelSpeedup
from nni.compression.pytorch.speedup.v2 import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()

1
docs/source/tutorials/pruning_quick_start.py.md5 generated Normal file
View file

@ -0,0 +1 @@
9feea465b118b0fa5da9379f4bb2d357

View file

@ -2,7 +2,7 @@
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/pruning_quick_start_mnist.py"
.. "tutorials/pruning_quick_start.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
@ -10,22 +10,17 @@
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_quick_start_mnist.py>`
Click :ref:`here <sphx_glr_download_tutorials_pruning_quick_start.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_quick_start_mnist.py:
.. _sphx_glr_tutorials_pruning_quick_start.py:
Pruning Quickstart
==================
Here is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
:align: center
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
@ -36,7 +31,7 @@ There are three common practices for pruning a DNN model:
NNI supports all of the above pruning practices by working on the key pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in a common practice.
.. GENERATED FROM PYTHON SOURCE LINES 22-27
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
@ -44,7 +39,7 @@ Preparation
In this tutorial, we use a simple model and pre-trained on MNIST dataset.
If you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 27-40
.. GENERATED FROM PYTHON SOURCE LINES 22-35
.. code-block:: default
@ -86,7 +81,7 @@ If you are familiar with defining a model and training in pytorch, you can skip
.. GENERATED FROM PYTHON SOURCE LINES 41-52
.. GENERATED FROM PYTHON SOURCE LINES 36-47
.. code-block:: default
@ -109,37 +104,35 @@ If you are familiar with defining a model and training in pytorch, you can skip
.. code-block:: none
Average test loss: 1.3409, Accuracy: 6494/10000 (65%)
Average test loss: 0.3263, Accuracy: 9003/10000 (90%)
Average test loss: 0.2029, Accuracy: 9388/10000 (94%)
Average test loss: 0.6140, Accuracy: 7985/10000 (80%)
Average test loss: 0.2676, Accuracy: 9209/10000 (92%)
Average test loss: 0.1946, Accuracy: 9424/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 53-63
.. GENERATED FROM PYTHON SOURCE LINES 48-58
Pruning Model
-------------
Using L1NormPruner to prune the model and generate the masks.
Usually, a pruner requires original model and ``config_list`` as its inputs.
Detailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.
Detailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/config_list>`.
The following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,
except the layer named `fc3`, because `fc3` is `exclude`.
The final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.
.. GENERATED FROM PYTHON SOURCE LINES 63-72
.. GENERATED FROM PYTHON SOURCE LINES 58-65
.. code-block:: default
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
'op_types': ['Linear', 'Conv2d'],
'exclude_op_names': ['fc3'],
'sparse_ratio': 0.5
}]
@ -149,16 +142,16 @@ The final sparsity ratio for each layer is 50%. The layer named `fc3` will not b
.. GENERATED FROM PYTHON SOURCE LINES 73-74
.. GENERATED FROM PYTHON SOURCE LINES 66-67
Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 74-81
.. GENERATED FROM PYTHON SOURCE LINES 67-74
.. code-block:: default
from nni.compression.pytorch.pruning import L1NormPruner
from nni.contrib.compression.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.
@ -173,17 +166,21 @@ Pruners usually require `model` and `config_list` as input arguments.
.. code-block:: none
TorchModel(
(conv1): PrunerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv1): Conv2d(
1, 6, kernel_size=(5, 5), stride=(1, 1)
(_nni_wrapper): ModuleWrapper(module=Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1)), module_name=conv1)
)
(conv2): PrunerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(
6, 16, kernel_size=(5, 5), stride=(1, 1)
(_nni_wrapper): ModuleWrapper(module=Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1)), module_name=conv2)
)
(fc1): PrunerModuleWrapper(
(module): Linear(in_features=256, out_features=120, bias=True)
(fc1): Linear(
in_features=256, out_features=120, bias=True
(_nni_wrapper): ModuleWrapper(module=Linear(in_features=256, out_features=120, bias=True), module_name=fc1)
)
(fc2): PrunerModuleWrapper(
(module): Linear(in_features=120, out_features=84, bias=True)
(fc2): Linear(
in_features=120, out_features=84, bias=True
(_nni_wrapper): ModuleWrapper(module=Linear(in_features=120, out_features=84, bias=True), module_name=fc2)
)
(fc3): Linear(in_features=84, out_features=10, bias=True)
(relu1): ReLU()
@ -197,7 +194,7 @@ Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 82-89
.. GENERATED FROM PYTHON SOURCE LINES 75-82
.. code-block:: default
@ -216,30 +213,30 @@ Pruners usually require `model` and `config_list` as input arguments.
.. code-block:: none
fc2 sparsity : 0.5
conv1 sparsity : 0.5
conv2 sparsity : 0.5
fc1 sparsity : 0.5
fc2 sparsity : 0.5
.. GENERATED FROM PYTHON SOURCE LINES 90-93
.. GENERATED FROM PYTHON SOURCE LINES 83-86
Speedup the original model with masks, note that `ModelSpeedup` requires an unwrapped model.
The model becomes smaller after speedup,
and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.
.. GENERATED FROM PYTHON SOURCE LINES 93-102
.. GENERATED FROM PYTHON SOURCE LINES 86-95
.. code-block:: default
# need to unwrap the model, if the model is wrapped before speedup
pruner._unwrap_model()
pruner.unwrap_model()
# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.
from nni.compression.pytorch.speedup import ModelSpeedup
from nni.compression.pytorch.speedup.v2 import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
@ -251,17 +248,29 @@ and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the ma
.. code-block:: none
/home/ningshang/anaconda3/envs/nni-dev/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad
both dim0 and dim1 masks found.
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=128, out_features=60, bias=True)
(fc2): Linear(in_features=60, out_features=42, bias=True)
(fc3): Linear(in_features=42, out_features=10, bias=True)
(relu1): ReLU()
(relu2): ReLU()
(relu3): ReLU()
(relu4): ReLU()
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 103-104
.. GENERATED FROM PYTHON SOURCE LINES 96-97
the model becomes physically smaller after speedup
.. GENERATED FROM PYTHON SOURCE LINES 104-106
.. GENERATED FROM PYTHON SOURCE LINES 97-99
.. code-block:: default
@ -292,14 +301,14 @@ the model will become real smaller after speedup
.. GENERATED FROM PYTHON SOURCE LINES 107-111
.. GENERATED FROM PYTHON SOURCE LINES 100-104
Fine-tuning Compacted Model
---------------------------
Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning.
Because speedup will replace the masked big layers with dense small ones.
.. GENERATED FROM PYTHON SOURCE LINES 111-115
.. GENERATED FROM PYTHON SOURCE LINES 104-108
.. code-block:: default
@ -317,10 +326,10 @@ Because speedup will replace the masked big layers with dense small ones.
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 0.810 seconds)
**Total running time of the script:** ( 1 minutes 20.740 seconds)
.. _sphx_glr_download_tutorials_pruning_quick_start_mnist.py:
.. _sphx_glr_download_tutorials_pruning_quick_start.py:
.. only:: html
@ -329,11 +338,11 @@ Because speedup will replace the masked big layers with dense small ones.
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_quick_start_mnist.py <pruning_quick_start_mnist.py>`
:download:`Download Python source code: pruning_quick_start.py <pruning_quick_start.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb <pruning_quick_start_mnist.ipynb>`
:download:`Download Jupyter notebook: pruning_quick_start.ipynb <pruning_quick_start.ipynb>`
.. only:: html

Двоичные данные
docs/source/tutorials/pruning_quick_start_codeobj.pickle сгенерированный Normal file

Двоичный файл не отображается.

Просмотреть файл

@ -1 +0,0 @@
e7c8d40b9d497d59db95ffcedfc1c450

Двоичные данные
docs/source/tutorials/pruning_quick_start_mnist_codeobj.pickle сгенерированный

Двоичный файл не отображается.

337
docs/source/tutorials/pruning_quick_start_mnist_zh.rst сгенерированный
Просмотреть файл

@ -1,337 +0,0 @@
.. f2006d635ba8b91cd9cd311c1bd844f3
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_quick_start_mnist.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_quick_start_mnist.py:
Model Pruning Quickstart
========================
Below is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
:align: center
Model pruning is a technique that reduces model size and computation by shrinking the size of model weights or intermediate states.
There are three common practices for pruning a DNN model:
#. Pre-train a model -> Prune the model -> Fine-tune the pruned model
#. Prune a model during training -> Fine-tune the pruned model
#. Prune a model -> Train the pruned model from scratch
NNI supports all of the above pruning processes, mainly by working in the pruning stage.
This tutorial gives a quick look at how to use NNI to prune a model in common practice.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
In this tutorial, we use a simple model pre-trained on the MNIST dataset.
If you are familiar with defining a model and training it in pytorch, you can skip directly to `Model Pruning`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-35
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# show the model structure, note that pruner will wrap the model layer.
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 36-47
.. code-block:: default
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.5266, Accuracy: 8345/10000 (83%)
Average test loss: 0.2713, Accuracy: 9209/10000 (92%)
Average test loss: 0.1919, Accuracy: 9356/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 48-58
Model Pruning
-------------
Use L1NormPruner to prune the model and generate the masks.
Usually, a pruner requires the original model and a ``config_list`` as input arguments.
For details on how to write a ``config_list``, please refer to :doc:`compression config specification <../compression/compression_config_list>`.
The following `config_list` means the pruner will prune all layers of type `Linear` or `Conv2d` except the layer named `fc3`, because `fc3` is set to `exclude`.
The final sparsity ratio of each layer is 50%, and the layer named `fc3` will not be pruned.
.. GENERATED FROM PYTHON SOURCE LINES 58-67
.. code-block:: default
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
.. GENERATED FROM PYTHON SOURCE LINES 68-69
Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 69-76
.. code-block:: default
from nni.compression.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure; `PrunerModuleWrapper` has wrapped the layers configured in the config_list.
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): PrunerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
)
(conv2): PrunerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
)
(fc1): PrunerModuleWrapper(
(module): Linear(in_features=256, out_features=120, bias=True)
)
(fc2): PrunerModuleWrapper(
(module): Linear(in_features=120, out_features=84, bias=True)
)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 77-84
.. code-block:: default
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
print(name, ' sparsity : ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
conv1 sparsity : 0.5
conv2 sparsity : 0.5
fc1 sparsity : 0.5
fc2 sparsity : 0.5
.. GENERATED FROM PYTHON SOURCE LINES 85-88
Speed up the original model with NNI's model speedup and the masks generated by the pruner; note that `ModelSpeedup` requires an unwrapped model.
The model truly becomes smaller after speedup, and it may reach a higher sparsity ratio than the masks indicate, because `ModelSpeedup` automatically propagates sparsity through the model,
identifying the redundant weights introduced by the masks.
.. GENERATED FROM PYTHON SOURCE LINES 88-97
.. code-block:: default
# need to unwrap the model, if the model is wrapped before speedup
pruner._unwrap_model()
# speedup the model
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
aten::log_softmax is not Supported! Please report an issue at https://github.com/microsoft/nni. Thanks~
Note: .aten::log_softmax.12 does not have corresponding mask inference object
/home/ningshang/anaconda3/envs/nni-dev/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad
.. GENERATED FROM PYTHON SOURCE LINES 98-99
The model becomes truly smaller after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 99-101
.. code-block:: default
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=128, out_features=60, bias=True)
(fc2): Linear(in_features=60, out_features=42, bias=True)
(fc3): Linear(in_features=42, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 102-106
Fine-tuning the Compacted Model
-------------------------------
Note that the model has already been sped up; if you need to fine-tune it, please re-create the optimizer,
because layers were replaced during speedup and the original optimizer no longer fits the new model.
.. GENERATED FROM PYTHON SOURCE LINES 106-110
.. code-block:: default
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
trainer(model, optimizer, criterion)
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 24.976 seconds)
.. _sphx_glr_download_tutorials_pruning_quick_start_mnist.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_quick_start_mnist.py <pruning_quick_start_mnist.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb <pruning_quick_start_mnist.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_

4
docs/source/tutorials/pruning_speedup.ipynb сгенерированный
Просмотреть файл

@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Speedup Model with Mask\n\n## Introduction\n\nPruning algorithms usually use weight masks to simulate the real pruning. Masks can be used\nto check model performance of a specific pruning (or sparsity), but there is no real speedup.\nSince model speedup is the ultimate goal of model pruning, we try to provide a tool to users\nto convert a model to a smaller one based on user provided masks (the masks come from the\npruning algorithms).\n\nThere are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,\nand input/output tensors. Sparse kernel is required to speedup a fine-grained pruned layer.\nThe other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.\nTo speedup this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.\nSince the support of sparse kernels in community is limited,\nwe only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.\n\n## Design and Implementation\n\nTo speedup a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,\nor replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,\nthus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.\nTherefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;\nsecond, replace the modules.\n\nThe first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.\nThe new shape of module is auto-inference by NNI, the unchanged parts of outputs during forward and inputs during backward are prepared for reduct.\nFor each type of module, we should prepare a function for module replacement.\nThe module replacement function returns a newly created module which is smaller.\n\n## Usage\n"
"\n# Speedup Model with Mask\n\n## Introduction\n\nPruning algorithms usually use weight masks to simulate the real pruning. Masks can be used\nto check model performance of a specific pruning (or sparsity), but there is no real speedup.\nSince model speedup is the ultimate goal of model pruning, we try to provide a tool to users\nto convert a model to a smaller one based on user provided masks (the masks come from the\npruning algorithms).\n\nThere are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,\nand input/output tensors. Sparse kernel is required to speedup a fine-grained pruned layer.\nThe other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.\nTo speedup this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.\nSince the support of sparse kernels in community is limited,\nwe only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.\n\n## Design and Implementation\n\nTo speedup a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,\nor replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,\nthus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.\nTherefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;\nsecond, replace the modules.\n\nThe first step requires topology (i.e., connections) of the model, we use a tracer based on ``torch.fx`` to obtain the model graph for PyTorch.\nThe new shape of module is auto-inference by NNI, the unchanged parts of outputs during forward and inputs during backward are prepared for reduct.\nFor each type of module, we should prepare a function for module replacement.\nThe module replacement function returns a newly created module which is smaller.\n\n## Usage\n"
]
},
{
@ -87,7 +87,7 @@
},
"outputs": [],
"source": [
"from nni.compression.pytorch import ModelSpeedup\nModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()\nprint(model)"
"from nni.compression.pytorch.speedup.v2 import ModelSpeedup\nModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()\nprint(model)"
]
},
{

4
docs/source/tutorials/pruning_speedup.py сгенерированный
Просмотреть файл

@ -27,7 +27,7 @@ thus, we should do shape inference to check are there other unpruned layers shou
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.
The first step requires topology (i.e., connections) of the model, we use a tracer based on ``torch.fx`` to obtain the model graph for PyTorch.
The new shape of each module is automatically inferred by NNI; the unchanged parts of outputs during forward and inputs during backward are prepared for the reduction.
For each type of module, we should prepare a function for module replacement.
The module replacement function returns a newly created module which is smaller.
@ -65,7 +65,7 @@ print('Original Model - Elapsed Time : ', time.time() - start)
# %%
# Speedup the model and show the model structure after speedup.
from nni.compression.pytorch import ModelSpeedup
from nni.compression.pytorch.speedup.v2 import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)

2
docs/source/tutorials/pruning_speedup.py.md5 сгенерированный
Просмотреть файл

@ -1 +1 @@
a2564a6391bdd7aae11a85757ba27ed8
60334840999c86b64ff889ee9909a797

14
docs/source/tutorials/pruning_speedup.rst сгенерированный
Просмотреть файл

@ -46,7 +46,7 @@ thus, we should do shape inference to check are there other unpruned layers shou
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.
The first step requires topology (i.e., connections) of the model, we use a tracer based on ``torch.fx`` to obtain the model graph for PyTorch.
The new shape of each module is automatically inferred by NNI; the unchanged parts of outputs during forward and inputs during backward are prepared for the reduction.
For each type of module, we should prepare a function for module replacement.
The module replacement function returns a newly created module which is smaller.
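As a rough illustration of this first step, a model graph can be obtained and inspected with a plain ``torch.fx`` trace (a sketch only, assuming ``model`` is the module being sped up; NNI's tracer is built on ``torch.fx`` but may differ in details):

.. code-block:: default

    import torch.fx

    # symbolically trace the model to obtain its graph, then inspect the nodes (topology)
    graph_module = torch.fx.symbolic_trace(model)
    for node in graph_module.graph.nodes:
        print(node.op, node.target)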
@ -138,7 +138,7 @@ Roughly test the original model inference speed.
.. code-block:: none
Original Model - Elapsed Time : 0.051694393157958984
Original Model - Elapsed Time : 0.16419386863708496
@ -151,7 +151,7 @@ Speedup the model and show the model structure after speedup.
.. code-block:: default
from nni.compression.pytorch import ModelSpeedup
from nni.compression.pytorch.speedup.v2 import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)
@ -163,8 +163,6 @@ Speedup the model and show the model structure after speedup.
.. code-block:: none
/home/ningshang/anaconda3/envs/nni-dev/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 16, kernel_size=(5, 5), stride=(1, 1))
@ -202,7 +200,7 @@ Roughly test the model after speedup inference speed.
.. code-block:: none
Speedup Model - Elapsed Time : 0.003111600875854492
Speedup Model - Elapsed Time : 0.0038301944732666016
@ -210,7 +208,7 @@ Roughly test the model after speedup inference speed.
.. GENERATED FROM PYTHON SOURCE LINES 79-239
For combining usage of ``Pruner`` masks generation with ``ModelSpeedup``,
please refer to :doc:`Pruning Quick Start <pruning_quick_start_mnist>`.
please refer to :doc:`Pruning Quick Start <pruning_quick_start>`.
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
@ -373,7 +371,7 @@ The latency is measured on one V100 GPU and the input tensor is ``torch.randn(1
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 10.747 seconds)
**Total running time of the script:** ( 0 minutes 16.241 seconds)
.. _sphx_glr_download_tutorials_pruning_speedup.py:

Двоичные данные
docs/source/tutorials/pruning_speedup_codeobj.pickle сгенерированный

Двоичный файл не отображается.

79
docs/source/tutorials/quantization_customize.ipynb сгенерированный
Просмотреть файл

@ -1,79 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Customize a new quantization algorithm\n\nTo write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``.\nThen, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``.\n``quantize_weight`` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch import Quantizer\n\nclass YourQuantizer(Quantizer):\n def __init__(self, model, config_list):\n \"\"\"\n Suggest you to use the NNI defined spec for config\n \"\"\"\n super().__init__(model, config_list)\n\n def quantize_weight(self, weight, config, **kwargs):\n \"\"\"\n quantize should overload this method to quantize weight tensors.\n This method is effectively hooked to :meth:`forward` of the model.\n\n Parameters\n ----------\n weight : Tensor\n weight that needs to be quantized\n config : dict\n the configuration for weight quantization\n \"\"\"\n\n # Put your code to generate `new_weight` here\n new_weight = ...\n return new_weight\n\n def quantize_output(self, output, config, **kwargs):\n \"\"\"\n quantize should overload this method to quantize output.\n This method is effectively hooked to `:meth:`forward` of the model.\n\n Parameters\n ----------\n output : Tensor\n output that needs to be quantized\n config : dict\n the configuration for output quantization\n \"\"\"\n\n # Put your code to generate `new_output` here\n new_output = ...\n return new_output\n\n def quantize_input(self, *inputs, config, **kwargs):\n \"\"\"\n quantize should overload this method to quantize input.\n This method is effectively hooked to :meth:`forward` of the model.\n\n Parameters\n ----------\n inputs : Tensor\n inputs that needs to be quantized\n config : dict\n the configuration for inputs quantization\n \"\"\"\n\n # Put your code to generate `new_input` here\n new_input = ...\n return new_input\n\n def update_epoch(self, epoch_num):\n pass\n\n def step(self):\n \"\"\"\n Can do some processing based on the model or weights binded\n in the func bind_model\n \"\"\"\n pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Customize backward function\n\nSometimes it's necessary for a quantization operation to have a customized backward function,\nsuch as `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\\ ,\nuser can customize a backward function as follow:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType\n\nclass ClipGrad(QuantGrad):\n @staticmethod\n def quant_backward(tensor, grad_output, quant_type):\n \"\"\"\n This method should be overrided by subclass to provide customized backward function,\n default implementation is Straight-Through Estimator\n Parameters\n ----------\n tensor : Tensor\n input of quantization operation\n grad_output : Tensor\n gradient of the output of quantization operation\n quant_type : QuantType\n the type of quantization, it can be `QuantType.INPUT`, `QuantType.WEIGHT`, `QuantType.OUTPUT`,\n you can define different behavior for different types.\n Returns\n -------\n tensor\n gradient of the input of quantization operation\n \"\"\"\n\n # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1\n if quant_type == QuantType.OUTPUT:\n grad_output[tensor.abs() > 1] = 0\n return grad_output\n\nclass _YourQuantizer(Quantizer):\n def __init__(self, model, config_list):\n super().__init__(model, config_list)\n # set your customized backward function to overwrite default backward function\n self.quant_grad = ClipGrad"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not customize ``QuantGrad``, the default backward is Straight-Through Estimator. \n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

123
docs/source/tutorials/quantization_customize.py сгенерированный
Просмотреть файл

@ -1,123 +0,0 @@
"""
Customize a new quantization algorithm
======================================
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``.
Then, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``.
``quantize_weight`` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask.
"""
from nni.compression.pytorch import Quantizer
class YourQuantizer(Quantizer):
def __init__(self, model, config_list):
"""
Suggest you to use the NNI defined spec for config
"""
super().__init__(model, config_list)
def quantize_weight(self, weight, config, **kwargs):
"""
quantize should overload this method to quantize weight tensors.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
weight : Tensor
weight that needs to be quantized
config : dict
the configuration for weight quantization
"""
# Put your code to generate `new_weight` here
new_weight = ...
return new_weight
def quantize_output(self, output, config, **kwargs):
"""
quantize should overload this method to quantize output.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
output : Tensor
output that needs to be quantized
config : dict
the configuration for output quantization
"""
# Put your code to generate `new_output` here
new_output = ...
return new_output
def quantize_input(self, *inputs, config, **kwargs):
"""
quantize should overload this method to quantize input.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
inputs : Tensor
inputs that need to be quantized
config : dict
the configuration for inputs quantization
"""
# Put your code to generate `new_input` here
new_input = ...
return new_input
def update_epoch(self, epoch_num):
pass
def step(self):
"""
Can do some processing based on the model or weights binded
in the func bind_model
"""
pass
# %%
# Customize backward function
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Sometimes it's necessary for a quantization operation to have a customized backward function,
# such as `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ ,
# user can customize a backward function as follow:
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
class ClipGrad(QuantGrad):
@staticmethod
def quant_backward(tensor, grad_output, quant_type):
"""
This method should be overridden by subclasses to provide a customized backward function;
the default implementation is Straight-Through Estimator
Parameters
----------
tensor : Tensor
input of quantization operation
grad_output : Tensor
gradient of the output of quantization operation
quant_type : QuantType
the type of quantization, it can be `QuantType.INPUT`, `QuantType.WEIGHT`, `QuantType.OUTPUT`,
you can define different behavior for different types.
Returns
-------
tensor
gradient of the input of quantization operation
"""
# for quant_output function, set grad to zero if the absolute value of tensor is larger than 1
if quant_type == QuantType.OUTPUT:
grad_output[tensor.abs() > 1] = 0
return grad_output
class _YourQuantizer(Quantizer):
def __init__(self, model, config_list):
super().__init__(model, config_list)
# set your customized backward function to overwrite default backward function
self.quant_grad = ClipGrad
# %%
# If you do not customize ``QuantGrad``, the default backward is Straight-Through Estimator.

1
docs/source/tutorials/quantization_customize.py.md5 сгенерированный
Просмотреть файл

@ -1 +0,0 @@
387ac974594fa239c25479453b808ec8

200
docs/source/tutorials/quantization_customize.rst сгенерированный
Просмотреть файл

@ -1,200 +0,0 @@
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/quantization_customize.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_quantization_customize.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_quantization_customize.py:
Customize a new quantization algorithm
======================================
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``.
Then, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``.
``quantize_weight`` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask.
.. GENERATED FROM PYTHON SOURCE LINES 9-80
.. code-block:: default
from nni.compression.pytorch import Quantizer
class YourQuantizer(Quantizer):
def __init__(self, model, config_list):
"""
Suggest you to use the NNI defined spec for config
"""
super().__init__(model, config_list)
def quantize_weight(self, weight, config, **kwargs):
"""
quantize should overload this method to quantize weight tensors.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
weight : Tensor
weight that needs to be quantized
config : dict
the configuration for weight quantization
"""
# Put your code to generate `new_weight` here
new_weight = ...
return new_weight
def quantize_output(self, output, config, **kwargs):
"""
quantize should overload this method to quantize output.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
output : Tensor
output that needs to be quantized
config : dict
the configuration for output quantization
"""
# Put your code to generate `new_output` here
new_output = ...
return new_output
def quantize_input(self, *inputs, config, **kwargs):
"""
quantize should overload this method to quantize input.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
inputs : Tensor
inputs that need to be quantized
config : dict
the configuration for inputs quantization
"""
# Put your code to generate `new_input` here
new_input = ...
return new_input
def update_epoch(self, epoch_num):
pass
def step(self):
"""
Can do some processing based on the model or weights binded
in the func bind_model
"""
pass
.. GENERATED FROM PYTHON SOURCE LINES 81-87
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function,
such as `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ ,
user can customize a backward function as follow:
.. GENERATED FROM PYTHON SOURCE LINES 87-122
.. code-block:: default
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
class ClipGrad(QuantGrad):
@staticmethod
def quant_backward(tensor, grad_output, quant_type):
"""
This method should be overridden by subclasses to provide a customized backward function;
the default implementation is Straight-Through Estimator
Parameters
----------
tensor : Tensor
input of quantization operation
grad_output : Tensor
gradient of the output of quantization operation
quant_type : QuantType
the type of quantization, it can be `QuantType.INPUT`, `QuantType.WEIGHT`, `QuantType.OUTPUT`,
you can define different behavior for different types.
Returns
-------
tensor
gradient of the input of quantization operation
"""
# for quant_output function, set grad to zero if the absolute value of tensor is larger than 1
if quant_type == QuantType.OUTPUT:
grad_output[tensor.abs() > 1] = 0
return grad_output
class _YourQuantizer(Quantizer):
def __init__(self, model, config_list):
super().__init__(model, config_list)
# set your customized backward function to overwrite default backward function
self.quant_grad = ClipGrad
.. GENERATED FROM PYTHON SOURCE LINES 123-124
If you do not customize ``QuantGrad``, the default backward is Straight-Through Estimator.
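For intuition, the Straight-Through Estimator simply passes the output gradient through the non-differentiable quantization op unchanged. A minimal, stand-alone sketch in plain PyTorch (independent of NNI's ``QuantGrad``) could be:

.. code-block:: default

    import torch

    class RoundSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            # non-differentiable rounding in the forward pass
            return torch.round(x)

        @staticmethod
        def backward(ctx, grad_output):
            # straight-through: pretend the forward pass was the identity
            return grad_output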
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 1.269 seconds)
.. _sphx_glr_download_tutorials_quantization_customize.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: quantization_customize.py <quantization_customize.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: quantization_customize.ipynb <quantization_customize.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_

Двоичные данные
docs/source/tutorials/quantization_customize_codeobj.pickle сгенерированный

Двоичный файл не отображается.

8
docs/source/tutorials/quantization_quick_start.ipynb сгенерированный
Просмотреть файл

@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Quantization Quickstart\n\nHere is a four-minute video to get you started with model quantization.\n\n.. youtube:: MSfV7AyfiA4\n :align: center\n\nQuantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.\n\nIn NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.\nHere we use `QATQuantizer` as an example to show the usage of quantization in NNI.\n"
"\n# Quantization Quickstart\n\nQuantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.\n\nIn NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.\nHere we use `QATQuantizer` as an example to show the usage of quantization in NNI.\n"
]
},
{
@ -33,7 +33,7 @@
},
"outputs": [],
"source": [
"import functools\nimport time\nfrom typing import Callable, Union, List, Dict, Tuple, Union\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.optim import Optimizer, SGD\nfrom torch.utils.data import DataLoader\nfrom torch import Tensor\n\nfrom nni.common.types import SCHEDULER"
"import time\nfrom typing import Callable, Union, Union\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.optim import Optimizer, SGD\nfrom torch.utils.data import DataLoader\nfrom torch import Tensor\n\nfrom nni.common.types import SCHEDULER"
]
},
{
@ -112,7 +112,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quantizing Model\n\nInitialize a `config_list`.\nDetailed about how to write ``config_list`` please refer :doc:`Config Specification <../compression_preview/config_list>`.\n\n"
"## Quantizing Model\n\nInitialize a `config_list`.\nDetailed about how to write ``config_list`` please refer :doc:`Config Specification <../compression/config_list>`.\n\n"
]
},
{
@ -143,7 +143,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.8.8"
}
},
"nbformat": 4,

10
docs/source/tutorials/quantization_quick_start.py сгенерированный
Просмотреть файл

@ -2,11 +2,6 @@
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
:align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
@ -20,9 +15,8 @@ Here we use `QATQuantizer` as an example to show the usage of quantization in NN
# In this tutorial, we use a simple model and pre-train on MNIST dataset.
# If you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.
import functools
import time
from typing import Callable, Union, List, Dict, Tuple, Union
from typing import Callable, Union
import torch
import torch.nn.functional as F
@ -139,7 +133,7 @@ print(f'pure evaluating: {time.time() - start}s Acc.: {acc}')
# ----------------
#
# Initialize a `config_list`.
# Detailed about how to write ``config_list`` please refer :doc:`Config Specification <../compression_preview/config_list>`.
# For details on how to write a ``config_list``, please refer to :doc:`Config Specification <../compression/config_list>`.
import nni
from nni.contrib.compression.quantization import QATQuantizer

Просмотреть файл

@ -1 +1 @@
f72305e67164ac9f28472df05bd8c53d
d3d1074e56626255e3e19ef2a2ff057f

50
docs/source/tutorials/quantization_quick_start.rst сгенерированный
Просмотреть файл

@ -10,7 +10,7 @@
.. note::
:class: sphx-glr-download-link-note
:ref:`Go to the end <sphx_glr_download_tutorials_quantization_quick_start.py>`
Click :ref:`here <sphx_glr_download_tutorials_quantization_quick_start.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
@ -21,17 +21,12 @@
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
:align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QATQuantizer` as an example to show the usage of quantization in NNI.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
.. GENERATED FROM PYTHON SOURCE LINES 12-17
Preparation
-----------
@ -39,14 +34,13 @@ Preparation
In this tutorial, we use a simple model and pre-train on MNIST dataset.
If you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-36
.. GENERATED FROM PYTHON SOURCE LINES 17-30
.. code-block:: default
import functools
import time
from typing import Callable, Union, List, Dict, Tuple, Union
from typing import Callable, Union
import torch
import torch.nn.functional as F
@ -64,11 +58,11 @@ If you are familiar with defining a model and training in pytorch, you can skip
.. GENERATED FROM PYTHON SOURCE LINES 37-38
.. GENERATED FROM PYTHON SOURCE LINES 31-32
Define the model
.. GENERATED FROM PYTHON SOURCE LINES 38-63
.. GENERATED FROM PYTHON SOURCE LINES 32-57
.. code-block:: default
@ -104,11 +98,11 @@ Define the model
.. GENERATED FROM PYTHON SOURCE LINES 64-65
.. GENERATED FROM PYTHON SOURCE LINES 58-59
Create training and evaluation dataloader
.. GENERATED FROM PYTHON SOURCE LINES 65-78
.. GENERATED FROM PYTHON SOURCE LINES 59-72
.. code-block:: default
@ -132,11 +126,11 @@ Create training and evaluation dataloader
.. GENERATED FROM PYTHON SOURCE LINES 79-80
.. GENERATED FROM PYTHON SOURCE LINES 73-74
Define training and evaluation functions
.. GENERATED FROM PYTHON SOURCE LINES 80-124
.. GENERATED FROM PYTHON SOURCE LINES 74-118
.. code-block:: default
@ -191,11 +185,11 @@ Define training and evaluation functions
.. GENERATED FROM PYTHON SOURCE LINES 125-126
.. GENERATED FROM PYTHON SOURCE LINES 119-120
Pre-train and evaluate the model on MNIST dataset
.. GENERATED FROM PYTHON SOURCE LINES 126-137
.. GENERATED FROM PYTHON SOURCE LINES 120-131
.. code-block:: default
@ -223,21 +217,21 @@ Pre-train and evaluate the model on MNIST dataset
Epoch 2 start!
Epoch 3 start!
Epoch 4 start!
pure training 5 epochs: 47.914021015167236s
pure evaluating: 1.2639274597167969s Acc.: 0.9897
pure training 5 epochs: 71.90893840789795s
pure evaluating: 1.6302893161773682s Acc.: 0.9908
.. GENERATED FROM PYTHON SOURCE LINES 138-143
.. GENERATED FROM PYTHON SOURCE LINES 132-137
Quantizing Model
----------------
Initialize a `config_list`.
Detailed about how to write ``config_list`` please refer :doc:`Config Specification <../compression_preview/config_list>`.
For details on how to write a ``config_list``, please refer to :doc:`Config Specification <../compression/config_list>`.
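As a rough, hypothetical sketch of such a ``config_list``, consistent with the int8 affine calibration output shown below (the exact keys and values are described in the Config Specification, so treat this only as an illustration):

.. code-block:: default

    config_list = [{
        # quantize weights, inputs and outputs of the conv/linear layers to int8
        'op_names': ['conv1', 'conv2', 'fc1', 'fc2'],
        'target_names': ['_input_', 'weight', '_output_'],
        'quant_dtype': 'int8',
        'quant_scheme': 'affine',
    }, {
        # also quantize the ReLU outputs
        'op_names': ['relu1', 'relu2', 'relu3'],
        'target_names': ['_output_'],
        'quant_dtype': 'int8',
        'quant_scheme': 'affine',
    }]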
.. GENERATED FROM PYTHON SOURCE LINES 143-177
.. GENERATED FROM PYTHON SOURCE LINES 137-171
.. code-block:: default
@ -288,9 +282,9 @@ Detailed about how to write ``config_list`` please refer :doc:`Config Specificat
Epoch 2 start!
Epoch 3 start!
Epoch 4 start!
pure training 5 epochs: 78.95339393615723s
defaultdict(<class 'dict'>, {'fc2': {'weight': {'scale': tensor(0.0017), 'zero_point': tensor(-5.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.2286), 'tracked_min': tensor(-0.2105)}, '_input_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}, '_output_0': {'scale': tensor(0.1543), 'zero_point': tensor(-35.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(25.0385), 'tracked_min': tensor(-14.1545)}}, 'conv2': {'weight': {'scale': tensor(0.0011), 'zero_point': tensor(-19.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.1659), 'tracked_min': tensor(-0.1226)}, '_input_0': {'scale': tensor(0.0230), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(5.8373), 'tracked_min': tensor(0.)}, '_output_0': {'scale': tensor(0.0971), 'zero_point': tensor(-6.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(12.9122), 'tracked_min': tensor(-11.7522)}}, 'fc1': {'weight': {'scale': tensor(0.0007), 'zero_point': tensor(-3.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.0885), 'tracked_min': tensor(-0.0844)}, '_input_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}, '_output_0': {'scale': tensor(0.0611), 'zero_point': tensor(-7.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(8.2104), 'tracked_min': tensor(-7.3205)}}, 'conv1': {'weight': {'scale': tensor(0.0021), 'zero_point': tensor(-19.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.3130), 'tracked_min': tensor(-0.2318)}, '_input_0': {'scale': tensor(0.0128), 'zero_point': tensor(-94.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(2.8215), 'tracked_min': tensor(-0.4242)}, '_output_0': {'scale': tensor(0.0311), 'zero_point': tensor(13.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(3.5516), 'tracked_min': tensor(-4.3537)}}, 'relu3': {'_output_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}}, 'relu1': {'_output_0': {'scale': tensor(0.0232), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(5.8952), 'tracked_min': tensor(0.)}}, 'relu2': {'_output_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}}})
quantization evaluating: 1.2496261596679688s Acc.: 0.9902
pure training 5 epochs: 117.75990748405457s
defaultdict(<class 'dict'>, {'fc2': {'weight': {'scale': tensor(0.0020), 'zero_point': tensor(-8.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.2640), 'tracked_min': tensor(-0.2319)}, '_input_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}, '_output_0': {'scale': tensor(0.1541), 'zero_point': tensor(-39.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(25.6346), 'tracked_min': tensor(-13.5170)}}, 'conv1': {'weight': {'scale': tensor(0.0023), 'zero_point': tensor(-12.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.3128), 'tracked_min': tensor(-0.2606)}, '_input_0': {'scale': tensor(0.0128), 'zero_point': tensor(-94.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(2.8215), 'tracked_min': tensor(-0.4242)}, '_output_0': {'scale': tensor(0.0265), 'zero_point': tensor(-5.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(3.4957), 'tracked_min': tensor(-3.2373)}}, 'fc1': {'weight': {'scale': tensor(0.0007), 'zero_point': tensor(3.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.0894), 'tracked_min': tensor(-0.0943)}, '_input_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}, '_output_0': {'scale': tensor(0.0678), 'zero_point': tensor(-8.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(9.1579), 'tracked_min': tensor(-8.0707)}}, 'conv2': {'weight': {'scale': tensor(0.0012), 'zero_point': tensor(-35.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(0.1927), 'tracked_min': tensor(-0.1097)}, '_input_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(5.9995), 'tracked_min': tensor(0.)}, '_output_0': {'scale': tensor(0.0893), 'zero_point': tensor(2.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(11.1702), 'tracked_min': tensor(-11.5212)}}, 'relu3': {'_output_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}}, 'relu2': {'_output_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(6.), 'tracked_min': tensor(0.)}}, 'relu1': {'_output_0': {'scale': tensor(0.0236), 'zero_point': tensor(-127.), 'quant_dtype': 'int8', 'quant_scheme': 'affine', 'quant_bits': 8, 'tracked_max': tensor(5.9996), 'tracked_min': tensor(0.)}}})
quantization evaluating: 1.6024222373962402s Acc.: 0.9915
@ -298,7 +292,7 @@ Detailed about how to write ``config_list`` please refer :doc:`Config Specificat
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 2 minutes 14.073 seconds)
**Total running time of the script:** ( 3 minutes 22.673 seconds)
.. _sphx_glr_download_tutorials_quantization_quick_start.py:
@ -308,8 +302,6 @@ Detailed about how to write ``config_list`` please refer :doc:`Config Specificat
.. container:: sphx-glr-footer sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: quantization_quick_start.py <quantization_quick_start.py>`

Двоичные данные
docs/source/tutorials/quantization_quick_start_codeobj.pickle сгенерированный

Двоичный файл не отображается.

Просмотреть файл

@ -1,151 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Quantization Quickstart\n\nHere is a four-minute video to get you started with model quantization.\n\n.. youtube:: MSfV7AyfiA4\n :align: center\n\nQuantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.\n\nIn NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.\nHere we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n\nIn this tutorial, we use a simple model and pre-train on MNIST dataset.\nIf you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torch.nn.functional as F\nfrom torch.optim import SGD\n\nfrom nni_assets.compression.mnist_model import TorchModel, trainer, evaluator, device, test_trt\n\n# define the model\nmodel = TorchModel().to(device)\n\n# define the optimizer and criterion for pre-training\n\noptimizer = SGD(model.parameters(), 1e-2)\ncriterion = F.nll_loss\n\n# pre-train and evaluate the model on MNIST dataset\nfor epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quantizing Model\n\nInitialize a `config_list`.\nDetailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"config_list = [{\n 'quant_types': ['input', 'weight'],\n 'quant_bits': {'input': 8, 'weight': 8},\n 'op_types': ['Conv2d']\n}, {\n 'quant_types': ['output'],\n 'quant_bits': {'output': 8},\n 'op_types': ['ReLU']\n}, {\n 'quant_types': ['input', 'weight'],\n 'quant_bits': {'input': 8, 'weight': 8},\n 'op_names': ['fc1', 'fc2']\n}]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"finetuning the model by using QAT\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch.quantization import QAT_Quantizer\ndummy_input = torch.rand(32, 1, 28, 28).to(device)\nquantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)\nquantizer.compress()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model has now been wrapped, and quantization targets ('quant_types' setting in `config_list`)\nwill be quantized & dequantized for simulated quantization in the wrapped layers.\nQAT is a training-aware quantizer, it will update scale and zero point during training.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"export model and get calibration_config\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model_path = \"./log/mnist_model.pth\"\ncalibration_path = \"./log/mnist_calibration.pth\"\ncalibration_config = quantizer.export_model(model_path, calibration_path)\n\nprint(\"calibration_config: \", calibration_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"build tensorRT engine to make a real speedup, for more information about speedup, please refer :doc:`quantization_speedup`.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT\ninput_shape = (32, 1, 28, 28)\nengine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)\nengine.compress()\ntest_trt(engine)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

Просмотреть файл

@ -1,94 +0,0 @@
"""
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
:align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
"""
# %%
# Preparation
# -----------
#
# In this tutorial, we use a simple model and pre-train on MNIST dataset.
# If you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.
import torch
import torch.nn.functional as F
from torch.optim import SGD
from nni_assets.compression.mnist_model import TorchModel, trainer, evaluator, device, test_trt
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
# %%
# Quantizing Model
# ----------------
#
# Initialize a `config_list`.
# Detailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_types': ['Conv2d']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_types': ['ReLU']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['fc1', 'fc2']
}]
# %%
# finetuning the model by using QAT
from nni.compression.pytorch.quantization import QAT_Quantizer
dummy_input = torch.rand(32, 1, 28, 28).to(device)
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
# %%
# The model has now been wrapped, and quantization targets ('quant_types' setting in `config_list`)
# will be quantized & dequantized for simulated quantization in the wrapped layers.
# QAT is a training-aware quantizer, it will update scale and zero point during training.
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
# %%
# export model and get calibration_config
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
# %%
# build tensorRT engine to make a real speedup, for more information about speedup, please refer :doc:`quantization_speedup`.
from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
input_shape = (32, 1, 28, 28)
engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
engine.compress()
test_trt(engine)

Просмотреть файл

@ -1 +0,0 @@
0039cfb7fbdb08b31568e04f7a4d4e6f

Просмотреть файл

@ -1,295 +0,0 @@
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/quantization_quick_start_mnist.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_quantization_quick_start_mnist.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_quantization_quick_start_mnist.py:
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
:align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
In this tutorial, we use a simple model and pre-train on MNIST dataset.
If you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-42
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from nni_assets.compression.mnist_model import TorchModel, trainer, evaluator, device, test_trt
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.6440, Accuracy: 8230/10000 (82%)
Average test loss: 0.2512, Accuracy: 9272/10000 (93%)
Average test loss: 0.1569, Accuracy: 9542/10000 (95%)
.. GENERATED FROM PYTHON SOURCE LINES 43-48
Quantizing Model
----------------
Initialize a `config_list`.
Detailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.
.. GENERATED FROM PYTHON SOURCE LINES 48-63
.. code-block:: default
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_types': ['Conv2d']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_types': ['ReLU']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['fc1', 'fc2']
}]
.. GENERATED FROM PYTHON SOURCE LINES 64-65
finetuning the model by using QAT
.. GENERATED FROM PYTHON SOURCE LINES 65-70
.. code-block:: default
from nni.compression.pytorch.quantization import QAT_Quantizer
dummy_input = torch.rand(32, 1, 28, 28).to(device)
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): QuantizerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
)
(conv2): QuantizerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
)
(fc1): QuantizerModuleWrapper(
(module): Linear(in_features=256, out_features=120, bias=True)
)
(fc2): QuantizerModuleWrapper(
(module): Linear(in_features=120, out_features=84, bias=True)
)
(fc3): Linear(in_features=84, out_features=10, bias=True)
(relu1): QuantizerModuleWrapper(
(module): ReLU()
)
(relu2): QuantizerModuleWrapper(
(module): ReLU()
)
(relu3): QuantizerModuleWrapper(
(module): ReLU()
)
(relu4): QuantizerModuleWrapper(
(module): ReLU()
)
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 71-74
The model has now been wrapped, and the quantization targets (the ``quant_types`` setting in `config_list`)
will be quantized and dequantized in the wrapped layers to simulate quantization.
QAT is a training-aware quantizer; it updates the scale and zero point during training.
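As an optional sanity check, you can list which submodules were wrapped.
The snippet below matches on the wrapper's class name so that it does not depend on any particular import path; it is illustrative and not required by the tutorial.

.. code-block:: python

    # List the layers that were wrapped for simulated quantization.
    wrapped = [name for name, module in model.named_modules()
               if type(module).__name__ == 'QuantizerModuleWrapper']
    print(wrapped)  # e.g. ['conv1', 'conv2', 'fc1', 'fc2', 'relu1', ...]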
.. GENERATED FROM PYTHON SOURCE LINES 74-79
.. code-block:: default
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.1209, Accuracy: 9629/10000 (96%)
Average test loss: 0.1032, Accuracy: 9696/10000 (97%)
Average test loss: 0.0909, Accuracy: 9736/10000 (97%)
.. GENERATED FROM PYTHON SOURCE LINES 80-81
Export the model and get the calibration config.
.. GENERATED FROM PYTHON SOURCE LINES 81-87
.. code-block:: default
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
calibration_config: {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0032], device='cuda:0'), 'weight_zero_point': tensor([92.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0022], device='cuda:0'), 'weight_zero_point': tensor([110.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 11.599255561828613}, 'fc1': {'weight_bits': 8, 'weight_scale': tensor([0.0010], device='cuda:0'), 'weight_zero_point': tensor([113.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 26.364503860473633}, 'fc2': {'weight_bits': 8, 'weight_scale': tensor([0.0013], device='cuda:0'), 'weight_zero_point': tensor([124.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 26.364498138427734}, 'relu1': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 11.658699989318848}, 'relu2': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 26.645591735839844}, 'relu3': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 26.877971649169922}, 'relu4': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 16.9318904876709}}
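If you want a quick overview of the exported settings, the calibration config is a plain dictionary keyed by layer name, so it can be summarized directly.
The snippet below is illustrative only; the exact keys per layer depend on which targets were quantized.

.. code-block:: python

    # Summarize the calibration config: bit width and recorded keys per layer.
    for layer_name, settings in calibration_config.items():
        bits = settings.get('weight_bits', settings.get('output_bits'))
        print(f'{layer_name}: {bits}-bit, keys={sorted(settings)}')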
.. GENERATED FROM PYTHON SOURCE LINES 88-89
Build a TensorRT engine to achieve a real speedup. For more information about speedup, please refer to :doc:`quantization_speedup`.
.. GENERATED FROM PYTHON SOURCE LINES 89-95
.. code-block:: default
from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
input_shape = (32, 1, 28, 28)
engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
engine.compress()
test_trt(engine)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Loss: 0.09197621383666992 Accuracy: 97.29%
Inference elapsed_time (whole dataset): 0.036701202392578125s
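For a rough point of comparison, you can also time the simulated-quantization PyTorch model on the same test set with the evaluator used earlier.
This is only a coarse wall-clock measurement, not a rigorous benchmark.

.. code-block:: python

    import time

    # Time the (simulated-quantization) PyTorch model for a rough comparison
    # with the TensorRT engine timing above.
    start = time.time()
    evaluator(model)
    print(f'PyTorch inference elapsed_time (whole dataset): {time.time() - start}s')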
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 46.013 seconds)
.. _sphx_glr_download_tutorials_quantization_quick_start_mnist.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: quantization_quick_start_mnist.py <quantization_quick_start_mnist.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: quantization_quick_start_mnist.ipynb <quantization_quick_start_mnist.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_

Binary data
docs/source/tutorials/quantization_quick_start_mnist_codeobj.pickle (generated)

Binary file not shown.

docs/source/tutorials/sg_execution_times.rst (generated)

@@ -3,39 +3,26 @@
.. _sphx_glr_tutorials_sg_execution_times:
Computation times
=================
**02:14.073** total execution time for **tutorials** files:
**03:22.673** total execution time for **tutorials** files:
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_quick_start.py` (``quantization_quick_start.py``) | 02:14.073 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_nasbench_as_dataset.py` (``nasbench_as_dataset.py``) | 01:51.444 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_bert_glue.py` (``quantization_bert_glue.py``) | 09:40.647 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_darts.py` (``darts.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_nasbench_as_dataset.py` (``nasbench_as_dataset.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_new_pruning_bert_glue.py` (``new_pruning_bert_glue.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_bert_glue.py` (``pruning_bert_glue.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_new_pruning_bert_glue.py` (``new_pruning_bert_glue.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_bert_glue.py` (``pruning_bert_glue.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py` (``pruning_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_speedup.py` (``pruning_speedup.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_customize.py` (``quantization_customize.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_speedup.py` (``quantization_speedup.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_quick_start.py` (``quantization_quick_start.py``) | 03:22.673 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_darts.py` (``darts.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_nasbench_as_dataset.py` (``nasbench_as_dataset.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_new_pruning_bert_glue.py` (``new_pruning_bert_glue.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_quick_start.py` (``pruning_quick_start.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_speedup.py` (``pruning_speedup.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_bert_glue.py` (``quantization_bert_glue.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_speedup.py` (``quantization_speedup.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------+-----------+--------+

Some files were not shown because too many files changed in this diff.