Add instructions for choosing cudatoolkit version and upgrading cuda driver.

hlums 2020-02-28 00:00:49 +00:00
Parent bbc1287783
Commit 82e02b2ffc
1 changed file: 70 additions and 17 deletions


@@ -58,31 +58,84 @@ You can specify the environment name as well with the flag `-n`.
Click on the following menus to see how to install the Python GPU environment:
<details> <details>
<summary><strong><em>Python GPU environment on Linux, MacOS</em></strong></summary> <summary><strong><em>Python GPU environment</em></strong></summary>
Assuming that you have a GPU machine, follow these steps to install the Python GPU environment:
1. Check the CUDA **driver** version on your machine by running
nvidia-smi
The top of the output shows the CUDA **driver** version, which is 10.0 in the example below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Driver Version: 410. &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
2. Decide which CUDA **runtime** version you should install.
The CUDA **runtime** version is the version of the cudatoolkit package that will be installed in the conda environment in the next step. It should be <= the CUDA **driver** version found in step 1.
Currently, this repo uses PyTorch 1.4.0, which is compatible with CUDA 9.2 and CUDA 10.1. The conda environment file generated in step 3 installs cudatoolkit 10.1 by default. If your CUDA **driver** version is < 10.1, add the argument `--cuda_version 9.2` when calling `tools/generate_conda_file.py`.
</details> 3. Install the GPU environment:
If CUDA **driver** version >= 10.1
<details>
<summary><strong><em>Python GPU environment on Windows</em></strong></summary>
Assuming that you have an Azure GPU DSVM machine, here are the steps to set up the Python GPU environment:
1. Make sure you have CUDA Toolkit version 9.0 or above installed on your Windows machine. You can run the command below in your terminal to check.
nvcc --version
If you don't have CUDA Toolkit or don't have the right version, please download it from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit)
2. Install the GPU environment.
cd nlp-recipes
python tools/generate_conda_file.py --gpu
conda env create -n nlp_gpu -f nlp_gpu.yaml
If CUDA **driver** version < 10.1
cd nlp-recipes
python tools/generate_conda_file.py --gpu --cuda_version 9.2
conda env create -n nlp_gpu -f nlp_gpu.yaml
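The decision made in steps 1 to 3 can be sketched in Python. This is only an illustrative sketch: the function names are hypothetical and are not part of this repo, and the lower bound of 9.2 is an assumption derived from the rule that the runtime version must be <= the driver version.

```python
import re


def parse_cuda_driver_version(nvidia_smi_output):
    """Return the CUDA driver version as (major, minor), or None if not found."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def choose_generate_args(driver_version):
    """Pick the arguments for tools/generate_conda_file.py (hypothetical helper)."""
    if driver_version >= (10, 1):
        # The default environment file installs cudatoolkit 10.1.
        return ["--gpu"]
    if driver_version >= (9, 2):
        # Assumption: cudatoolkit 9.2 needs a CUDA driver version of at least 9.2.
        return ["--gpu", "--cuda_version", "9.2"]
    raise ValueError("CUDA driver version is below 9.2; upgrade the driver first")


# Sample header line, shortened from the nvidia-smi output in step 1.
header = "| NVIDIA-SMI 410.79   CUDA Version: 10.0 |"
version = parse_cuda_driver_version(header)
print(version, choose_generate_args(version))
```

With the sample header, the driver version is 10.0, so the sketch selects the `--cuda_version 9.2` variant of the command.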
4. Enable mixed precision training (optional)
Mixed precision training is particularly useful if your model takes a long time to train. It usually reduces the training time by about 50% while producing comparable model quality. To enable mixed precision training, install NVIDIA apex by running the following commands:
conda activate nlp_gpu
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
**Troubleshooting**:
If you run into the error message "RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.", make sure that the version of your NVIDIA CUDA compiler driver (nvcc) and your CUDA **runtime** version are exactly the same. To check the nvcc version, run
nvcc -V
If the nvcc version is 10.0, it's recommended to upgrade to 10.1 and re-create your conda environment with cudatoolkit=10.1.
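The version comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the sample text imitates the release line printed by `nvcc -V`, and the runtime version is passed in as a string (inside the conda environment it could be read from `torch.version.cuda`).

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the nvcc release as (major, minor), or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nvcc_matches_runtime(nvcc_output, runtime_version):
    """Check that nvcc and the CUDA runtime agree on major and minor version.

    runtime_version is a string such as "10.1".
    """
    major, minor = runtime_version.split(".")[:2]
    return parse_nvcc_release(nvcc_output) == (int(major), int(minor))


# Illustrative release line for an nvcc 10.0 installation.
sample = "Cuda compilation tools, release 10.0, V10.0.130"
print(nvcc_matches_runtime(sample, "10.1"))
print(nvcc_matches_runtime(sample, "10.0"))
```

A mismatch, as in the first call, corresponds to the situation in which the error message above is raised.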
**Steps to upgrade the CUDA driver version and nvcc version**
We have tested the following steps. Alternatively, you can follow the [official CUDA installation guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html).
a. Update your system packages with apt-get and reboot your machine
sudo apt-get update
sudo apt-get upgrade --fix-missing
sudo reboot
b. Download the CUDA toolkit .run file from https://developer.nvidia.com/cuda-10.1-download-archive-base based on your target platform. For example, on a Linux machine with Ubuntu 16.04, run
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.105_418.39_linux.run
c. Upgrade the CUDA driver by running
sudo sh cuda_10.1.105_418.39_linux.run
First, accept the user agreement.
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/1agree_to_user_agreement.PNG)
Next, choose the components to install.
It's possible that you already have NVIDIA driver 418.39 and CUDA 10.1, but nvcc 10.0. In this case, you can uncheck the "DRIVER" box and upgrade nvcc by re-installing CUDA toolkit only.
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/2install_cuda_only.PNG)
If you choose to install all components, follow the instructions on the screen to uninstall the existing NVIDIA driver and CUDA toolkit first.
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/3install_all.PNG)
Then re-run
sudo sh cuda_10.1.105_418.39_linux.run
Select "Yes" to update the cuda symlink.
![](https://nlpbp.blob.core.windows.net/images/upgrade_cuda_driver/4Upgrade_symlink.PNG)
d. Run the following commands again to make sure you have NVIDIA driver 418.39, CUDA driver 10.1 and nvcc 10.1
nvidia-smi
nvcc -V
e. Repeat steps 3 & 4 to recreate your conda environment with cudatoolkit **runtime** 10.1 and apex installed for mixed precision training.
</details> </details>
### Register Conda Environment in DSVM JupyterHub