Add instructions for choosing cudatoolkit version and upgrading cuda driver.
This commit is contained in:
Parent
bbc1287783
Commit
82e02b2ffc
87
SETUP.md
|
@@ -58,31 +58,84 @@ You can specify the environment name as well with the flag `-n`.
|
||||||
Click on the following menus to see how to install the Python GPU environment:
|
Click on the following menus to see how to install the Python GPU environment:
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary><strong><em>Python GPU environment on Linux, MacOS</em></strong></summary>
|
<summary><strong><em>Python GPU environment</em></strong></summary>
|
||||||
|
|
||||||
Assuming that you have a GPU machine, to install the Python GPU environment, which by default installs the CPU environment:
|
Assuming that you have a GPU machine, follow the steps below to install the Python GPU environment:
|
||||||
|
1. Check the CUDA **driver** version on your machine by running
|
||||||
|
|
||||||
cd nlp-recipes
|
nvidia-smi
|
||||||
python tools/generate_conda_file.py --gpu
|
The top of the output shows the CUDA **driver** version, which is 10.0 in the example below.
|
||||||
conda env create -n nlp_gpu -f nlp_gpu.yaml
|
+-----------------------------------------------------------------------------+
|
||||||
|
| NVIDIA-SMI 410.79 Driver Version: 410. CUDA Version: 10.0 |
|
||||||
|
|-------------------------------+----------------------+----------------------+
|
||||||
|
2. Decide which cuda **runtime** version you should install.
|
||||||
|
The cuda **runtime** version is the version of the cudatoolkit that will be installed in the conda environment in the next step, which should be <= the CUDA **driver** version found in step 1.
|
||||||
|
Currently, this repo uses PyTorch 1.4.0, which is compatible with CUDA 9.2 and CUDA 10.1. The conda environment file generated in step 3 installs cudatoolkit 10.1 by default. If your CUDA **driver** version is < 10.1, add the argument `--cuda_version 9.2` when calling generate_conda_file.py.
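The rule in steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not part of the repo: the function names `parse_driver_cuda_version` and `choose_runtime` are hypothetical, and the sample header is the `nvidia-smi` line shown in step 1.

```python
import re

# Versions taken from the text above: cudatoolkit 10.1 is the default,
# 9.2 is the fallback for older drivers.
DEFAULT_RUNTIME = (10, 1)
FALLBACK_RUNTIME = (9, 2)

# Header line as shown in step 1 (hypothetical constant, for illustration only).
SAMPLE_HEADER = "| NVIDIA-SMI 410.79 Driver Version: 410. CUDA Version: 10.0 |"


def parse_driver_cuda_version(smi_output):
    """Return the CUDA driver version from nvidia-smi text as (major, minor)."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def choose_runtime(driver_version):
    """Pick the newest supported runtime that is <= the driver version."""
    if driver_version >= DEFAULT_RUNTIME:
        return DEFAULT_RUNTIME
    if driver_version >= FALLBACK_RUNTIME:
        return FALLBACK_RUNTIME
    raise ValueError("driver is older than CUDA 9.2; upgrade the driver first")


driver = parse_driver_cuda_version(SAMPLE_HEADER)
runtime = choose_runtime(driver)
```

With the sample header, the driver reports CUDA 10.0, so the sketch selects the 9.2 runtime. On a real machine the input text would come from running `nvidia-smi`, for example through `subprocess.run`.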
|
||||||
|
|
||||||
</details>
|
3. Install the GPU environment:
|
||||||
|
If CUDA **driver** version >= 10.1
|
||||||
<details>
|
|
||||||
<summary><strong><em>Python GPU environment on Windows</em></strong></summary>
|
|
||||||
|
|
||||||
Assuming that you have an Azure GPU DSVM machine, here are the steps to setup the Python GPU environment:
|
|
||||||
1. Make sure you have CUDA Toolkit version 9.0 or above installed on your Windows machine. You can run the command below in your terminal to check.
|
|
||||||
|
|
||||||
nvcc --version
|
|
||||||
If you don't have CUDA Toolkit or don't have the right version, please download it from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit)
|
|
||||||
|
|
||||||
2. Install the GPU environment.
|
|
||||||
|
|
||||||
cd nlp-recipes
|
cd nlp-recipes
|
||||||
python tools/generate_conda_file.py --gpu
|
python tools/generate_conda_file.py --gpu
|
||||||
conda env create -n nlp_gpu -f nlp_gpu.yaml
|
conda env create -n nlp_gpu -f nlp_gpu.yaml
|
||||||
|
|
||||||
|
If CUDA **driver** version < 10.1
|
||||||
|
|
||||||
|
cd nlp-recipes
|
||||||
|
python tools/generate_conda_file.py --gpu --cuda_version 9.2
|
||||||
|
conda env create -n nlp_gpu -f nlp_gpu.yaml
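The two variants of step 3 differ only in the `--cuda_version` argument. A minimal sketch of that branching, assuming a hypothetical helper named `gpu_env_commands` (it is not part of the repo) and using only the commands and flags shown above:

```python
def gpu_env_commands(runtime_version, env_name="nlp_gpu"):
    """Build the step 3 commands as argument lists.

    runtime_version is the cudatoolkit version as a string, "10.1" or "9.2".
    The commands are meant to be run from inside the nlp-recipes directory.
    """
    generate = ["python", "tools/generate_conda_file.py", "--gpu"]
    if runtime_version != "10.1":
        # 10.1 is the default, so the flag is only needed for other versions.
        generate += ["--cuda_version", runtime_version]
    create = ["conda", "env", "create", "-n", env_name, "-f", "nlp_gpu.yaml"]
    return [generate, create]


commands = gpu_env_commands("9.2")
```

Each argument list can be passed to `subprocess.run(cmd, cwd="nlp-recipes", check=True)` so that a failing command stops the setup early.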
|
||||||
|
|
||||||
|
4. Enable mixed precision training (optional)
|
||||||
|
Mixed precision training is particularly useful if your model takes a long time to train. It typically cuts training time by about 50% with little or no loss in model quality, although the actual speedup depends on the GPU and the model. To enable mixed precision training, run the following commands:
|
||||||
|
|
||||||
|
conda activate nlp_gpu
|
||||||
|
git clone https://github.com/NVIDIA/apex.git
|
||||||
|
cd apex
|
||||||
|
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
|
||||||
|
|
||||||
|
**Troubleshooting**:
|
||||||
|
If you run into an error message "RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.", you need to make sure your NVIDIA Cuda compiler driver (nvcc) version and your cuda **runtime** version are exactly the same. To check the nvcc version, run
|
||||||
|
|
||||||
|
nvcc -V
|
||||||
|
|
||||||
|
If the nvcc version is 10.0, it's recommended to upgrade to 10.1 and re-create your conda environment with cudatoolkit=10.1.
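The version check described above can be scripted. The sketch below is illustrative only: `parse_nvcc_release` and `nvcc_matches_runtime` are hypothetical names, and the sample string assumes the usual `release X.Y` wording in the last line of `nvcc -V` output.

```python
import re

# Assumed last line of `nvcc -V` output for CUDA 10.0 (illustrative sample).
SAMPLE_NVCC_OUTPUT = "Cuda compilation tools, release 10.0, V10.0.130"


def parse_nvcc_release(nvcc_output):
    """Return the nvcc release as a 'major.minor' string, e.g. '10.0'."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return match.group(1)


def nvcc_matches_runtime(nvcc_output, runtime_version):
    """True when nvcc and the cudatoolkit runtime are exactly the same version."""
    return parse_nvcc_release(nvcc_output) == runtime_version


mismatch = not nvcc_matches_runtime(SAMPLE_NVCC_OUTPUT, "10.1")
```

Here `mismatch` is true, which is the situation that triggers the apex build error. Inside the activated conda environment, the runtime version that PyTorch was built against can be read from `torch.version.cuda`.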
|
||||||
|
|
||||||
|
**Steps to upgrade the CUDA driver version and nvcc version**
|
||||||
|
We have tested the following steps. Alternatively, you can follow the official instructions [here](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
|
||||||
|
a. Update the installed packages with apt-get and reboot your machine
|
||||||
|
|
||||||
|
sudo apt-get update
|
||||||
|
sudo apt-get upgrade --fix-missing
|
||||||
|
sudo reboot
|
||||||
|
b. Download the CUDA toolkit .run file from https://developer.nvidia.com/cuda-10.1-download-archive-base based on your target platform. For example, on a Linux machine with Ubuntu 16.04, run
|
||||||
|
|
||||||
|
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.105_418.39_linux.run
|
||||||
|
|
||||||
|
c. Upgrade the CUDA driver by running
|
||||||
|
|
||||||
|
sudo sh cuda_10.1.105_418.39_linux.run
|
||||||
|
First, accept the user agreement.
|
||||||
|
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/1agree_to_user_agreement.PNG)
|
||||||
|
Next, choose the components to install.
|
||||||
|
It's possible that you already have NVIDIA driver 418.39 and CUDA 10.1, but nvcc 10.0. In this case, you can uncheck the "DRIVER" box and upgrade nvcc by re-installing CUDA toolkit only.
|
||||||
|
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/2install_cuda_only.PNG)
|
||||||
|
|
||||||
|
If you choose to install all components, follow the instructions on the screen to uninstall existing NVIDIA driver and CUDA toolkit first.
|
||||||
|
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/3install_all.PNG)
|
||||||
|
Then re-run
|
||||||
|
|
||||||
|
sudo sh cuda_10.1.105_418.39_linux.run
|
||||||
|
Select "Yes" to update the cuda symlink.
|
||||||
|
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/4Upgrade_symlink.PNG)
|
||||||
|
|
||||||
|
d. Run the following commands again to make sure you have NVIDIA driver 418.39, CUDA driver 10.1 and nvcc 10.1
|
||||||
|
|
||||||
|
nvidia-smi
|
||||||
|
nvcc -V
|
||||||
|
|
||||||
|
e. Repeat steps 3 & 4 to recreate your conda environment with cudatoolkit **runtime** 10.1 and apex installed for mixed precision training.
|
||||||
|
|
||||||
|
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
### Register Conda Environment in DSVM JupyterHub
|
### Register Conda Environment in DSVM JupyterHub
|
||||||
|
|