This commit is contained in:
Edward Hu 2023-01-22 12:30:32 -08:00 committed by GitHub
Parent 2448e700e3
Commit 133ef61857
No key found matching this signature
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 1 addition and 0 deletions


@@ -128,6 +128,7 @@ optimizer = MuSGD(model.parameters(), lr=0.1)
Note the base and delta models *do not need to be trained* --- we are only extracting parameter shape information from them.
Therefore, optionally, we can avoid instantiating these potentially large models by using the `deferred_init` function in `torchdistx`.
After installing [`torchdistx`](https://github.com/pytorch/torchdistx), use `torchdistx.deferred_init.deferred_init(MyModel, **args)` instead of `MyModel(**args)`. See [this page](https://pytorch.org/torchdistx/latest/deferred_init.html) for more detail.
In the MLP and Transformer examples we provide (not `mutransformers`), you can activate this feature by passing `--deferred_init`.
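
For instance, a minimal sketch of this substitution (using an illustrative `MyModel` class and assuming `mup`'s `set_base_shapes` helper; a real `mup` model would use `mup.MuReadout` for its output layer) might look like:

```python
import torch.nn as nn
from torchdistx.deferred_init import deferred_init
from mup import set_base_shapes, MuSGD

# Illustrative model; replace with your own architecture.
class MyModel(nn.Module):
    def __init__(self, width=128):
        super().__init__()
        self.body = nn.Linear(100, width)
        self.readout = nn.Linear(width, 10)  # in practice, mup's MuReadout

    def forward(self, x):
        return self.readout(self.body(x).relu())

# The base and delta models are only needed for their parameter shapes, so
# deferred_init avoids allocating their (potentially large) weight tensors.
base_model = deferred_init(MyModel, width=8)    # instead of MyModel(width=8)
delta_model = deferred_init(MyModel, width=16)  # instead of MyModel(width=16)

# The target model is the one actually trained, so instantiate it normally.
model = MyModel(width=1024)

set_base_shapes(model, base_model, delta=delta_model)
optimizer = MuSGD(model.parameters(), lr=0.1)
```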
## How `mup` Works Under the Hood