Debugging GKE GPU configuration

2 minute read



Getting PyTorch workloads running on Google Kubernetes Engine (GKE) GPU nodepools doesn't have to take long if you know what to look for. The diagram below shows the high-level dependencies that have to align.

```mermaid
flowchart LR
    A(Application logic) --> B(PyTorch)
    B --> C(CUDA)
    C --> D(Nvidia drivers)
    D --> E(GKE GPU Nodepool)
```

I've primarily been working with PyTorch, which - handily - installs most of the necessary dependencies (CUDA) [^1] as part of the installation process. The Nvidia drivers are installed on the GPU nodes by a daemonset. The drivers can be set to DEFAULT or LATEST when the nodepool is created, or the daemonset can be installed manually. Finally, the GPU type (and the number of GPUs) is requested when the nodepool is created.
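As a rough sketch, creating a nodepool that installs the LATEST drivers for you looks something like the command below; the cluster, zone, pool name, machine type and accelerator type are placeholders, so swap in your own values.

```bash
# Create a GPU nodepool and ask GKE to install the latest Nvidia drivers.
# All names, the zone, machine type and accelerator type are placeholders.
gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --num-nodes=1 \
  --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=latest
```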

I work from the GPU configuration backwards when debugging what hasn't been configured correctly. You will see an error like the one below if there is a compatibility issue between the Nvidia drivers and the Python (PyTorch) code. Remember that the driver is the component that actually gets the code to run on the hardware. I have found that forcing the nodepool to install the LATEST drivers usually fixes this problem, but you might have to manually install the drivers you need via a daemonset.

```
main ERROR: Error processing dataset: The NVIDIA driver on your system is too old
(found version 11040). Please update your GPU driver by downloading and installing
a new version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install a PyTorch version that has
been compiled with your version of the CUDA driver.
```
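If setting the driver version on the nodepool isn't an option, the driver installer daemonset can be applied by hand. The manifest below is the one the GKE docs point at for Container-Optimized OS nodes at the time of writing - check the docs for the path that matches your node image, and use the `daemonset-preloaded-latest.yaml` variant if you need the latest drivers.

```bash
# Manually install the Nvidia driver installer daemonset (Container-Optimized OS nodes).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```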

If you've used Docker to build and run your own container then you'll need to have installed the Nvidia Container Toolkit on the host. The toolkit is what lets Docker expose Nvidia GPUs to the containers you run.
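A quick way to confirm the toolkit is working on a Docker host is to run `nvidia-smi` from a stock CUDA image - the image tag below is just an example, so pick one that matches your CUDA version.

```bash
# Smoke test: if the Nvidia Container Toolkit is set up, this prints the GPU table.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```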

Once the drivers are configured on the nodepool it's time to check whether the container you would like to run can "see" the GPU. On the command line run `nvidia-smi` and you should see the GPU(s) and associated information. If that command fails, something isn't configured correctly, so it's time to double-check container compatibility and whether any GPUs are actually assigned to that nodepool.
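A sketch of that check from outside the container, assuming your workload is already scheduled (the pod name is a placeholder):

```bash
# Run nvidia-smi inside the running container.
kubectl exec -it my-gpu-pod -- nvidia-smi

# Check that the GPU nodes actually expose GPUs to the scheduler.
kubectl describe nodes -l cloud.google.com/gke-accelerator | grep nvidia.com/gpu

# Note: the pod spec must also request the GPU, e.g. resources.limits: nvidia.com/gpu: 1
```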

After the interface between the container and the GPU(s) is confirmed, check whether PyTorch is installed and can connect to the GPUs. Exec in and run the commands below in a Python terminal. If a number greater than zero is returned then you should be able to run your workload as expected - so long as you were hoping to use the number of GPUs that were found! Otherwise, you'll need to check the Python version and the software dependencies PyTorch was installed with.

```python
>>> import torch
>>> torch.cuda.device_count()
```
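If the count comes back as zero, a quick way to see what you're working with is to compare the CUDA version PyTorch was built against with the maximum CUDA version the driver supports (shown in the `nvidia-smi` header). A minimal sketch, run inside the container:

```bash
# Which PyTorch build is installed and which CUDA version it was compiled against
# (None means a CPU-only build of PyTorch).
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# Compare against the driver version / max supported CUDA version in the nvidia-smi header.
nvidia-smi
```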