GPU in the DAIR OpenStack Cloud

Using a GPU in the cloud is quite different from using the one in your laptop or gaming rig. To make the most of a cloud GPU, you’ll need to learn how to manage it through the terminal. If you’re new to Linux and Linux administration, don’t worry; it’s not a complicated process. Keep in mind, though, that GPU resources can be costly once you move on to one of the public cloud providers. Try to think about how you can leverage other technologies to help lower the cost of your projects. Your wallet will thank you!

Do I even need a GPU?

When deciding if a GPU is required, consider the following:

  • most workloads don’t require one;
  • a multithreaded CPU can lead to more accurate results for some workloads (for example, fabric simulation);
  • a GPU will add to your costs when you migrate to your personal cloud account.

Which GPU should I choose?

So, you’ve decided that a GPU is required. You must now choose the right one for the job. If you’re not glued to the latest hardware news sites, you might not realise that enterprise GPUs have different naming conventions and different use cases. DAIR currently hosts two different NVIDIA cards for GPU workloads: the A100 (V2.large) and the P100 (V2.medium). Both cards support the latest versions of CUDA and the latest NVIDIA drivers (as of early 2023).

NVIDIA A100 (V2.Large)

The A100 is designed with machine learning and data analytics in mind, as shown by the 432 Tensor Cores on the card. Libraries like TensorFlow can take advantage of these cores, which can potentially cut hours off your training times. The A100 also has almost double the performance of the P100 when rendering graphics.

NVIDIA P100 (V2.medium)

The P100 is an older card designed for scientific and research workflows. It uses the older Pascal architecture, which does not include dedicated tensor cores. This means that if you’re using the GPU for a machine learning workflow, you can expect the model to run noticeably slower than it would on the A100. However, the P100’s strength is that it is typically a significantly less expensive cloud-available GPU. If you are building a platform that will require the dynamic allocation of GPU instances, or working on projects that aren’t extremely time sensitive, the P100 is an excellent choice.
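
If you’re not sure which card your instance has, one rough check is the GPU’s CUDA compute capability. The sketch below assumes PyTorch is installed for the live check. The helper name has_tensor_cores is our own, and the rule it applies (a major version of 7 or higher, meaning Volta and later) is only a heuristic, since some consumer cards at capability 7.5 ship without tensor cores. The P100 reports 6.0 and the A100 reports 8.0.

```python
def has_tensor_cores(compute_capability):
    """Heuristic: tensor cores first appeared at compute capability 7.0 (Volta).

    Assumption: good enough for data-centre cards such as the P100 (6.0)
    and A100 (8.0); a few consumer cards at 7.5 do not have tensor cores.
    """
    major, _minor = compute_capability
    return major >= 7


# The live check needs PyTorch; skip it quietly if PyTorch is not installed.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", capability)
    print("Tensor cores expected:", has_tensor_cores(capability))
else:
    print("No CUDA device visible from Python (or PyTorch is not installed).")
```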

How can I tell if I am utilising my GPU?

Linux

This is probably the number one question people should be asking themselves. We often see people running GPU instances without ever making use of the GPU. The quickest way to view your usage is to open a second terminal session on your GPU instance and run the command:

nvidia-smi

This command will show you information about the GPU attached to your instance, including its usage. Take note of which driver version you are using. Depending on your use case, you might require a newer or older driver version to work correctly with your applications. For instance, if you want to render via the Blender CLI, you will require modern packages (and drivers) for almost everything.
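
If you’d rather check the driver version from a script, for example before launching a job, here is a minimal sketch. It assumes nvidia-smi is on your PATH and uses its --query-gpu and --format=csv options; the function names are our own.

```python
import shutil
import subprocess


def parse_driver_rows(csv_text):
    """Turn 'name, driver_version' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma, since the driver version never contains one.
        name, driver_version = line.rsplit(",", 1)
        rows.append({"name": name.strip(), "driver_version": driver_version.strip()})
    return rows


def read_driver_info():
    """Ask nvidia-smi for each GPU's name and driver version."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_driver_rows(result.stdout)


if shutil.which("nvidia-smi"):
    try:
        for gpu in read_driver_info():
            print(f"{gpu['name']}: driver {gpu['driver_version']}")
    except subprocess.CalledProcessError as error:
        print("nvidia-smi failed:", error)
else:
    print("nvidia-smi not found; the driver may not be installed.")
```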

You can use the -l or --loop flag followed by a number indicating the interval (in seconds) at which the data should be refreshed. For example, the following command will display the GPU’s memory usage, temperature, and other metrics every second:

nvidia-smi -l 1

You can also use the -q or --query flag to print a detailed report, together with the -d or --display flag to limit that report to specific sections. For example, the following command will display the GPU’s temperature and utilization every second:

nvidia-smi -l 1 -q -d TEMPERATURE,UTILIZATION
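
To log these numbers from your own code rather than watching them in a terminal, something like the following sketch may help. It assumes nvidia-smi is on your PATH and polls it with the --query-gpu and --format=csv,noheader,nounits options. The function names and the choice of fields are our own, and values the driver does not report (shown as [N/A] or similar) are stored as None.

```python
import shutil
import subprocess
import time

# Fields requested from nvidia-smi, in the order they will come back.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def parse_sample(csv_line):
    """Convert one CSV line from nvidia-smi into a dict of ints (or None)."""
    values = []
    for field in csv_line.split(","):
        text = field.strip()
        # Unsupported metrics come back as text such as "[N/A]".
        values.append(int(text) if text.isdigit() else None)
    return dict(zip(QUERY_FIELDS, values))


def read_samples():
    """Return one sample dict per GPU attached to the instance."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_sample(line) for line in result.stdout.strip().splitlines()]


def watch(samples=3, interval_seconds=1.0):
    """Print a few readings, pausing between each one."""
    for _ in range(samples):
        for index, sample in enumerate(read_samples()):
            print(f"GPU {index}: {sample}")
        time.sleep(interval_seconds)


if shutil.which("nvidia-smi"):
    try:
        watch()
    except subprocess.CalledProcessError as error:
        print("nvidia-smi failed:", error)
else:
    print("nvidia-smi not found; nothing to monitor.")
```

If utilization.gpu stays at 0 while your job is running, your workload is most likely running on the CPU.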

Windows

On Windows, the process is simple. Open Task Manager and sort by GPU usage. If you’re running machine learning scripts, they will likely show up as a Python or PowerShell application.

GPU Setup

As with most things, keeping your GPU drivers up to date is a good way to ensure you are using every ounce of the GPU available. Follow the steps below to learn how to update your drivers on an Ubuntu instance.

Please note: Updated versions of these drivers will be made available several times a year. You will be notified on Slack or via email when newer drivers are available.

Ubuntu

  1. Run the nvidia-smi command to check whether a driver is already installed on your machine.
  2. If no driver is installed, you will use wget to download the NVIDIA drivers for your system and unzip to extract them. If either tool is not installed on your machine, run the following command: sudo apt install wget unzip
  3. Create a new file called dair-nvidia-installer.sh with the command: nano dair-nvidia-installer.sh
  4. In the text editor, paste the following into the script:

wget https://swift-yyc.cloud.cybera.ca:8080/v1/AUTH_8c4974ed39a44c2fabd9d75895f6e28b/cybera_public/NVIDIA-GRID-Linux-KVM-510.85.03-510.85.02-513.46.zip
unzip NVIDIA-GRID-Linux-KVM-*.zip
chmod 755 Guest_Drivers/NVIDIA-Linux-x86_64-*-grid.run

sudo apt-get install -y nvidia-modprobe

sudo ./Guest_Drivers/NVIDIA-Linux-x86_64-*-grid.run --dkms --skip-module-unload -as -k $(ls --sort=time /boot | grep vmlinuz- | head -n 1 | sed 's/vmlinuz-//')

  5. Once that’s done, save the file by pressing CTRL + O followed by Enter, and exit using CTRL + X.
  6. Next, make the script you just created executable by running the following command: sudo chmod +x dair-nvidia-installer.sh
  7. Now run the script as root using: sudo ./dair-nvidia-installer.sh
  8. Once the installation finishes, reboot the machine using the command: sudo reboot
  9. After the reboot, run the nvidia-smi command again to confirm the driver has been installed.
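
The last line of the installer script passes a kernel version to the installer’s -k option by picking the most recently modified vmlinuz- file in /boot. If you’re curious which kernel that would be on your instance, here is a rough Python equivalent of that shell pipeline. The function name is our own, and it is only a sketch for inspection, not part of the install process.

```python
import os


def newest_kernel_version(boot_dir="/boot"):
    """Mimic: ls --sort=time /boot | grep vmlinuz- | head -n 1 | sed 's/vmlinuz-//'

    Returns the version string of the most recently modified kernel image,
    or None if no matching file is found.
    """
    candidates = [
        name
        for name in os.listdir(boot_dir)
        # ls skips hidden files; grep keeps names containing "vmlinuz-".
        if not name.startswith(".") and "vmlinuz-" in name
    ]
    if not candidates:
        return None
    # lstat reads a symlink's own timestamp rather than its target's.
    newest = max(
        candidates,
        key=lambda name: os.lstat(os.path.join(boot_dir, name)).st_mtime,
    )
    # sed without the g flag strips only the first occurrence.
    return newest.replace("vmlinuz-", "", 1)


if os.path.isdir("/boot"):
    print("Newest kernel in /boot:", newest_kernel_version())
    print("Currently running kernel:", os.uname().release)
```

If the two values differ, the newest kernel in /boot is not the one currently running.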

Windows

  1. Download the new driver (version 513.46) here.
  2. Unzip the zip file.
  3. Run the command: 513.46_grid_win10_win11_server2019_server2022_64bit_international.exe
  4. Reboot your instance once complete.
  5. Reconnect to your instance and verify NVIDIA software shows the new version.

PyTorch CUDA test

If you are using PyTorch, here is a sample script you can use to check and see if your GPU is available:

import torch

# Check if a GPU is available
if torch.cuda.is_available():
  # Use the GPU
  device = torch.device("cuda")
else:
  # Use the CPU
  device = torch.device("cpu")

# Create a tensor on the specified device
x = torch.zeros(5, 5, device=device)

This script uses the torch.cuda.is_available() function, which returns a Boolean indicating whether PyTorch can detect a CUDA-capable GPU. If one is available, the script stores the CUDA device in the device variable; otherwise it falls back to the CPU. The device variable can then be used when creating tensors.
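
Creating a tensor on the GPU is only half the story, because your model has to live on the same device. Here is a minimal sketch, assuming PyTorch is installed, that moves a small placeholder model and a random batch to the selected device and runs one forward pass. The layer sizes are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tiny placeholder model (16 inputs, 2 outputs), moved to the chosen device.
model = nn.Linear(16, 2).to(device)

# A random batch of 8 samples, created directly on the same device.
batch = torch.randn(8, 16, device=device)

# Run a forward pass without tracking gradients.
with torch.no_grad():
    output = model(batch)

print("Output lives on:", output.device)
if device.type == "cuda":
    print("GPU name:", torch.cuda.get_device_name(0))
    print("Memory allocated (bytes):", torch.cuda.memory_allocated(0))
```

If the output device prints as cpu on a GPU instance, check your driver installation and confirm that your PyTorch build includes CUDA support.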

Additional information

The DAIR program also provides a catalogue of BoosterPacks that will help you learn new technologies and speed up your development. These BoosterPacks provide many great resources, including various machine learning models that you may wish to consider for your own application and that can be quickly deployed on a DAIR GPU instance for experimentation.

Each BoosterPack contains a Flight Plan and a Sample Solution.

Here are two AI/ML related BoosterPacks to get you started:

Time-Series Prediction with Machine Learning

Builder: BluWave-ai

This BoosterPack demonstrates the application of machine learning to develop models that provide good predictions for time-series data.

Read the Flight Plan to learn more.

See the Sample Solution to learn how.

Automatic Recommendation System Using Machine Learning

Builder: Carla Margalef Bentabol

This BoosterPack demonstrates how a Collaborative Filtering Deep Model is used to provide recommendations to users based on their past preferences and similarities to other users. This is very useful for software developers needing an Automatic Recommendation System.

Read the Flight Plan to learn more.

See the Sample Solution to learn how.