This commit is contained in:
Edward Hu 2023-01-22 12:30:32 -08:00 committed by GitHub
Parent 2448e700e3
Commit 133ef61857
No key found matching this signature
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 1 addition and 0 deletions


@@ -128,6 +128,7 @@ optimizer = MuSGD(model.parameters(), lr=0.1)
Note the base and delta models *do not need to be trained* --- we are only extracting parameter shape information from them.
Therefore, optionally, we can avoid instantiating these potentially large models by using the `deferred_init` function in `torchdistx`.
After installing [`torchdistx`](https://github.com/pytorch/torchdistx), use `torchdistx.deferred_init.deferred_init(MyModel, **args)` instead of `MyModel(**args)`. See [this page](https://pytorch.org/torchdistx/latest/deferred_init.html) for more detail.
In the MLP and Transformer examples we provide (not `mutransformers`), you can activate this feature by passing `--deferred_init`.
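
For instance, a minimal sketch of this substitution (using an illustrative `MyModel` class and assuming `mup`'s `set_base_shapes` helper; a real `mup` model would use `mup.MuReadout` for its output layer) might look like:

```python
import torch.nn as nn
from torchdistx.deferred_init import deferred_init
from mup import set_base_shapes, MuSGD

# Illustrative model; replace with your own architecture.
class MyModel(nn.Module):
    def __init__(self, width=128):
        super().__init__()
        self.body = nn.Linear(100, width)
        self.readout = nn.Linear(width, 10)  # in practice, mup's MuReadout

    def forward(self, x):
        return self.readout(self.body(x).relu())

# The base and delta models are only needed for their parameter shapes, so
# deferred_init avoids allocating their (potentially large) weight tensors.
base_model = deferred_init(MyModel, width=8)    # instead of MyModel(width=8)
delta_model = deferred_init(MyModel, width=16)  # instead of MyModel(width=16)

# The target model is the one actually trained, so instantiate it normally.
model = MyModel(width=1024)

set_base_shapes(model, base_model, delta=delta_model)
optimizer = MuSGD(model.parameters(), lr=0.1)
```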
## How `mup` Works Under the Hood