It’s been a while since my last post (a lot has been going on at work), but since I have just upgraded to Ubuntu 18.04 and had to reinstall a few things, I decided to share a trick I find useful.
I have a laptop with two GPUs: a discrete one (GeForce GTX 950M) and an integrated one (Intel HD Graphics 520). Even though the discrete one is not the most powerful, it’s still sometimes reasonable to train small neural networks on it. But its limited memory (2 GB) quickly becomes a bottleneck, especially given that even without any training 25-50% of it is already taken by gnome/xorg/etc. Meanwhile the other GPU typically sits completely unused - so why not use both of them at the same time? This way the smaller one is responsible for rendering the UI, and the discrete one is dedicated entirely to compute.
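As a quick sanity check (not part of the steps below), lspci should list both adapters - one line for the Intel VGA controller and one for the NVIDIA 3D controller:

alexey@laptop:~$ lspci | grep -E 'VGA|3D'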
Disclaimer: I take no responsibility for the consequences of performing the steps I describe below. Your computer might turn into a pumpkin, your data might get lost, and you might even have to reinstall the NVIDIA drivers afterwards. Try it out only if you’re feeling adventurous.
Turns out it’s not a very common setup - most search results for the “ubuntu dual gpu” query are about setting up two discrete GPUs, which is not our intention. Perhaps this use case is too narrow, but I was curious enough to find a working solution, so I want to duplicate it here to make it easier to find, and to preserve it in case the original page is ever removed. To make this post less redundant, I will focus on my own experience, including quirks and workarounds. This is especially important on Ubuntu 18.04, because the solution described above seems to no longer work there.
Differences start with the location of the libraries. They used to be placed in /usr/local/cuda/lib64 and /usr/lib/nvidia-xxx (where xxx stands for the driver version number) for the CUDA and driver libraries, respectively. But now there doesn’t seem to be an nvidia-xxx directory created under /usr/lib for each version of the driver. Let’s try to find out where they are now (I omitted most of the output for brevity):
alexey@laptop:~$ ldconfig -p | grep nvidia
...
libnvidia-opencl.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
libnvidia-opencl.so.1 (libc6) => /usr/lib/i386-linux-gnu/libnvidia-opencl.so.1
libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so.1 (libc6) => /usr/lib/i386-linux-gnu/libnvidia-ml.so.1
libnvidia-ml.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
libnvidia-ml.so (libc6) => /usr/lib/i386-linux-gnu/libnvidia-ml.so
...
Interesting - it seems that most of the stuff now goes to /usr/lib/x86_64-linux-gnu/ and /usr/lib/i386-linux-gnu/. What about the CUDA libs?
alexey@laptop:~$ ldconfig -p | grep cuda
...
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /usr/lib/i386-linux-gnu/libcuda.so
Same here. It doesn’t seem like we need to do anything special to make the libraries in those directories discoverable, but let’s follow the steps above anyway and add those paths to LD_LIBRARY_PATH (via .bashrc, for instance):

alexey@laptop:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
That sucks. It used to work before - what’s going on?
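A quick check that isn’t in the original post, but makes the problem obvious: see whether the kernel module is even loaded. Empty output here means the nvidia module never made it into the kernel:

alexey@laptop:~$ lsmod | grep nvidia

If it prints nothing, the driver isn’t broken - it’s simply being prevented from loading, which is exactly what turned out to be the case.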
It took me a while to find out, but it seems that PRIME profile switching (or at least some part of it) is now done by blacklisting the NVIDIA driver. That’s how /etc/modprobe.d/blacklist-nvidia.conf looks on my machine:
# Do not modify
# This file was generated by nvidia-prime
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
alias nvidia off
alias nvidia-drm off
alias nvidia-modeset off
After checking out “Chapter 5. Listing of Installed Components” of the NVIDIA 396.24 driver documentation, it seems that nvidia-modeset is responsible for programming the display engine of the GPU, and nvidia-drm exposes the driver through the kernel’s DRM (Direct Rendering Manager) subsystem. This means we don’t really need to turn those two on, as opposed to nvidia itself, which sounds pretty critical to us :)
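By the way, you can inspect these modules yourself even while they are blacklisted - modinfo reads the metadata straight from the .ko files on disk:

alexey@laptop:~$ modinfo nvidia-drm | head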
Let’s try commenting out the lines that mention plain nvidia (blacklist nvidia and alias nvidia off) and rebooting.
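If you prefer a one-liner, something like this should do the same edit (a sketch that assumes your file matches the listing above; editing it by hand works just as well). The ^ and $ anchors make sure only the two plain nvidia lines get commented out, leaving nvidia-drm and nvidia-modeset blacklisted:

alexey@laptop:~$ sudo sed -i -E 's/^(blacklist nvidia|alias nvidia off)$/#&/' /etc/modprobe.d/blacklist-nvidia.conf

After a reboot: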
alexey@laptop:~$ nvidia-smi
Sun Jun 17 17:02:44 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24.02 Driver Version: 396.24.02 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 950M Off | 00000000:01:00.0 Off | N/A |
| N/A 57C P0 N/A / N/A | 0MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Wow, now we’re talking! nvidia-smi is functioning properly and no memory is spent on rendering - that’s exactly what we need. Most importantly, training networks also works (at least with PyTorch). We get the same result even if we don’t add the paths to the NVIDIA and CUDA libs to LD_LIBRARY_PATH.
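To double-check from the training side, a quick PyTorch one-liner (assuming PyTorch is installed; torch.cuda.is_available and torch.cuda.get_device_name are standard calls) confirms that the discrete GPU is visible to CUDA - it should print True followed by the device name:

alexey@laptop:~$ python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"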
Following the original guide, I tried to run glmark2, and it seems to work properly without any additional steps.
A small downside is:
alexey@laptop:~$ nvidia-settings
ERROR: Unable to load info from any available system
Unfortunately I haven’t found a workaround for this yet, but on the other hand I don’t need to switch PRIME profiles often, and if I do, I can always unblacklist the remaining components in /etc/modprobe.d/blacklist-nvidia.conf, reboot and have nvidia-settings working (though if you plan to switch the PRIME profile, this takes two reboots instead of one).
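Commenting out those remaining entries is just as mechanical - this one-liner (again a sketch that assumes the file matches the listing above) comments out every blacklist/alias line that is still active:

alexey@laptop:~$ sudo sed -i -E 's/^(blacklist|alias) /#&/' /etc/modprobe.d/blacklist-nvidia.conf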
Another nice thing is that waking the laptop from suspend no longer leaves you with a black screen - I’m quite surprised that this bug is otherwise still not fixed. But there’s no free lunch: nvidia-smi still won’t work after a suspend/wake cycle, so you’ll have to reboot anyway if you plan to use the GPU.
That’s it, I guess. Hope it helped someone, and here’s a TL;DR section (the bottom of a post is a perfect place for it) to wrap it up:

1. Check that in /etc/modprobe.d/ you have blacklist-nvidia.conf or something similar.
2. Comment out (with # in the beginning of the line) blacklist nvidia and alias nvidia off in this file and save it.
3. Reboot.
4. Check that nvidia-smi works and that nothing is using the GPU (No running processes found shown under Processes).

I’ll be using this setup from now on, and in case I discover something new, I’ll update this post.