MoE for NLG tutorial (#1633)

2021-12-10 15:23:52 -08:00 · 2021-12-10 15:23:52 -08:00 · c6ace162c4
--- a/README.md
+++ b/README.md
@@ -11,6 +11,7 @@ Remove until pypi issue is resolved: https://status.python.org/incidents/2jj696s
 * [2021/12/09] [DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times](https://www.deepspeed.ai/news/2021/12/09/deepspeed-moe-nlg.html)
  * [2021/08/18] [DeepSpeed powers 8x larger MoE model training with high performance](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/)
    * [Mixture of Experts (MoE) tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts/).
+    * [Mixture of Experts (MoE) for NLG tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/).
 * [2021/11/15] [Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
 * [2021/10/11] [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
  * Read more on how to [train large models with DeepSpeed](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -44,6 +44,7 @@ collections:
      - lrrt.md
      - megatron.md
      - mixture-of-experts.md
+      - mixture-of-experts-nlg.md
      - one-cycle.md
      - onebit-adam.md
      - onebit-lamb.md
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -92,6 +92,8 @@ lnav:
        url: /tutorials/megatron/
      - title: "Mixture-of-Experts (MoE)"
        url: /tutorials/mixture-of-experts/
+      - title: "Mixture-of-Experts for NLG"
+        url: /tutorials/mixture-of-experts-nlg/
      - title: "Mixture-of-Quantization"
        url: /tutorials/MoQ-tutorial/
      - title: "One-Cycle Schedule"
--- a/docs/_posts/2021-12-09-deepspeed-moe-nlg.md
+++ b/docs/_posts/2021-12-09-deepspeed-moe-nlg.md
@@ -81,7 +81,7 @@ We pre-trained both the dense and MoE version of the above models using
 combination of data parallel and expert parallel training to effectively scale
 the [MoE model training](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/).

-We used the same training data as described in the MT-NLG blog. For a fair
+We used the same training data as described in the [MT-NLG blog](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/). For a fair
 comparison, we use 300B tokens to train both the dense model and the MoE model.

 ## MoE leads to better quality for NLG models
@@ -179,12 +179,12 @@ To this end we are releasing our [end-to-end pipeline for training MoE based
 NLG models](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training),
 along with [specific example
 scripts](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training/examples/MoE)
-to help get started with our pipeline.  We look forward to the application and
+and a [tutorial](/tutorials/mixture-of-experts-nlg) to help you get started with our pipeline. We look forward to the application and
 the innovations that this may bring to the deep learning community.

 ## Acknowledgement

-This work was done in collaboration with Brandon Norick and Xia Song from the
+This work was done in collaboration with Brandon Norick, Zhun Liu, and Xia Song from the
 Turing Team, and Young Jin Kim, Alex Muzio, Hany Hassan Awadalla from Z-Code
 Team. We also thank Luis Vargas, Umesh Madan, Gopi Kumar, Andrey Proskurin and
 Mikhail Parakhin for their continuous support and guidance.
--- a/docs/_tutorials/mixture-of-experts-nlg.md
+++ b/docs/_tutorials/mixture-of-experts-nlg.md
@@ -0,0 +1,31 @@
+---
+title: "Mixture of Experts for NLG models"
+---
+
+This tutorial shows how to apply DeepSpeed Mixture of Experts (MoE) to NLG models, which reduces the training cost by 5 times (details in our [newsletter](https://www.deepspeed.ai/news/2021/12/09/deepspeed-moe-nlg.html)). We use the GPT-3-like models in the Megatron-LM framework as the example. Before reading this tutorial, we recommend first reading the tutorials on [Mixture of Experts](/tutorials/mixture-of-experts/) and [Megatron-LM GPT pre-training](/tutorials/megatron/).
+
+## 1. Installation
+
+You need to install DeepSpeed v0.5.8 or higher to use the MoE feature. The MoE for NLG model examples are in the [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) repo (currently under [the moe-training branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training), which may later be merged into the main branch).
+
+## 2. Training NLG+MoE models
+
+### 2.1. Changes to the model
+To apply MoE to the GPT-style model, we made several changes to the Megatron framework, mostly in `megatron/model/`, where we add the MoE layers to the model. Details of the code changes are in [this commit](https://github.com/microsoft/Megatron-DeepSpeed/commit/3c666e85b46ab26ef2dfadfdf7a18d186887856b).
+
+### 2.2. Pre-training the model
+We provide example training scripts under [examples/MoE](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training/examples/MoE), which we used to run the experiments in our [newsletter](https://www.deepspeed.ai/news/2021/12/09/deepspeed-moe-nlg.html). MoE models introduce a few new hyperparameters:
+
+`--num-experts`: the number of experts per MoE layer. In our experiments we set it to 128. A larger number of experts tends to provide better convergence, but with diminishing returns.
+
+`--moe-expert-parallel-size`: the degree of MoE expert parallelism. In other words, there will be `num-experts/moe-expert-parallel-size` experts on each GPU. Thus `--moe-expert-parallel-size` should be no larger than either the number of GPUs or `--num-experts`.
+
+`--moe-loss-coeff`: the scaling coefficient applied to the MoE loss when it is added to the model loss. In our experiments we found that 0.01 is a good setting.
+
+`--moe-train-capacity-factor`, `--moe-eval-capacity-factor`, `--moe-min-capacity`: these settings determine how many tokens a single expert can handle. Larger values could lead to better convergence, but would also slow down training because the load would be less balanced across experts.
+
+`--disable-moe-token-dropping`: this completely removes the limit on how many tokens a single expert can handle. For the same reason as above, we only recommend using this during inference/eval.
+
+In addition to the new hyperparameters above, for NLG+MoE models we found that it's helpful to lower the learning rate and increase the learning rate decay duration compared to the base dense model. Details of our tuning can be found in the example training scripts.
+
+Regarding training data, we are not able to release our internal data, but any public data for Megatron-LM pre-training (e.g., [The Pile dataset](https://the-eye.eu/public/AI/pile_neox/)) can be used directly to train MoE models (with the caveat that it might not provide exactly the same model quality as in our experiments). We are evaluating public datasets for MoE pre-training and will post our findings here.
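
As a rough illustration of the hyperparameters described in section 2.2 of the new tutorial, the sketch below works through the arithmetic behind `--num-experts`, `--moe-expert-parallel-size`, and the capacity settings. The function names are hypothetical and are not part of DeepSpeed or Megatron-DeepSpeed. The divisibility check and the capacity formula (tokens per expert scaled by the capacity factor, floored at the minimum capacity) are assumptions modeled on common capacity-factor routing, not something the tutorial text confirms.

```python
import math


def experts_per_gpu(num_experts: int, expert_parallel_size: int, num_gpus: int) -> int:
    """Return how many experts each GPU holds (hypothetical helper).

    Mirrors the tutorial's constraint that the expert-parallel degree must
    not exceed the number of GPUs or the number of experts.
    """
    if expert_parallel_size > num_gpus:
        raise ValueError("expert-parallel size exceeds the number of GPUs")
    if expert_parallel_size > num_experts:
        raise ValueError("expert-parallel size exceeds the number of experts")
    # Assumption: experts are split evenly, so the division must be exact.
    if num_experts % expert_parallel_size != 0:
        raise ValueError("num_experts must be divisible by expert-parallel size")
    return num_experts // expert_parallel_size


def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float, min_capacity: int) -> int:
    """Estimate the per-expert token limit (assumed formula, see lead-in)."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    # The minimum capacity acts as a floor when few tokens are routed.
    return max(capacity, min_capacity)
```

For example, under these assumptions, 128 experts with an expert-parallel degree of 64 on 64 GPUs would place 2 experts on each GPU, and raising the capacity factor from 1.0 to 2.0 would double the number of tokens each expert accepts before tokens are dropped.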