OutOfMemory when there is still enough memory on the GPU

sdonn asked 8 years ago

During some random testing, I stumbled upon this error message:

[Quasar CUDA Engine] – OUT OF MEMORY detected (request size 536870912 bytes)!
Starting memory diagnostics subprogram…
Amount of pinned memory: 67897344 bytes
Freelist size: 2 memory blocks
Largest free block: 67108864 bytes
Process total: 201326592, Inuse: 67897344 bytes, Free: 133429248 bytes; Device total: 2147352576, Free: 1655570432
Chunk 0 size 67108864 bytes: Fragmentation: 0.0%, free: 67108864 bytes
Chunk 1 size 134217728 bytes: Fragmentation: 0.0%, free: 66320384 bytes
Info: CUDA memory failure arises when too many large memory blocks are used by the same kernel function. Please split the input data into blocks and let the program process these blocks individually, to avoid the CUDA memory failure.

Basically, I request 500MB of video memory. Okay, the process can’t serve this because it only gets 200MB to start with. However, the GPU itself still has 1.6GB of free memory! Why can’t the Quasar process access this memory?
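To make the figures in the diagnostics output easier to compare, here is a small Python sketch that converts them to MiB and checks them against each other. The variable names are only illustrative labels for the numbers in the log; they are not Quasar identifiers.

```python
MIB = 1024 * 1024

# Values copied from the diagnostics output (all in bytes).
request = 536870912
largest_free_block = 67108864
process_total = 201326592
process_in_use = 67897344
process_free = 133429248
device_total = 2147352576
device_free = 1655570432
chunk_free = [67108864, 66320384]


def mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB


# The log is internally consistent: pool total minus in-use equals pool free,
# and the per-chunk free amounts add up to the same value.
pool_is_consistent = (
    process_total - process_in_use == process_free
    and sum(chunk_free) == process_free
)

# The request is larger than any free block in the process pool ...
fits_in_pool = request <= largest_free_block
# ... but smaller than the free memory the device reports in total.
fits_in_device_free = request <= device_free

print(f"request:            {mib(request):.1f} MiB")
print(f"process pool total: {mib(process_total):.1f} MiB")
print(f"largest free block: {mib(largest_free_block):.1f} MiB")
print(f"device free:        {mib(device_free):.1f} MiB")
print(f"pool consistent: {pool_is_consistent}, "
      f"fits in pool: {fits_in_pool}, fits in device free: {fits_in_device_free}")
```

So the request is exactly 512 MiB, the process pool holds 192 MiB, and the largest free block inside it is 64 MiB, while the device as a whole reports roughly 1.5 GiB free.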

2 Answers
bgoossen answered 8 years ago

The Quasar process tries to allocate a memory block large enough to hold the requested 536870912 bytes (512 MiB) using cudaMalloc, but this fails. There might be 1.6 GB free on the device, but due to memory fragmentation (especially if other processes also take GPU memory; this could also be OpenGL) and other issues, a contiguous block of that size might not be available, unfortunately…
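As a rough illustration of how fragmentation can cause this, here is a toy Python model of a 2048 MiB address range with a few scattered allocations. The layout is entirely hypothetical (it is not taken from the log, and the real driver's allocation behaviour is not known from this thread); it only shows that the total free memory can far exceed a request while no single contiguous gap is large enough.

```python
MIB = 1024 * 1024


def free_gaps(total, allocations):
    """Return the sizes of the free gaps, given (offset, size) allocations."""
    gaps = []
    cursor = 0
    for offset, size in sorted(allocations):
        if offset > cursor:
            gaps.append(offset - cursor)
        cursor = max(cursor, offset + size)
    if cursor < total:
        gaps.append(total - cursor)
    return gaps


def can_allocate(total, allocations, request):
    """A contiguous request succeeds only if one single gap can hold it."""
    return any(gap >= request for gap in free_gaps(total, allocations))


# Hypothetical layout: (offset, size) pairs, chosen only for illustration.
DEVICE_TOTAL = 2048 * MIB
ALLOCATIONS = [
    (400 * MIB, 200 * MIB),   # e.g. a display server
    (1000 * MIB, 10 * MIB),   # e.g. a window manager
    (1400 * MIB, 192 * MIB),  # e.g. an existing memory pool
]
REQUEST = 512 * MIB

gaps = free_gaps(DEVICE_TOTAL, ALLOCATIONS)
total_free_mib = sum(gaps) // MIB
largest_gap_mib = max(gaps) // MIB
request_fits = can_allocate(DEVICE_TOTAL, ALLOCATIONS, REQUEST)

print(f"total free: {total_free_mib} MiB, largest gap: {largest_gap_mib} MiB, "
      f"512 MiB request fits: {request_fits}")
```

In this made-up layout 1646 MiB are free in total, yet the largest contiguous gap is 456 MiB, so a 512 MiB request fails.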
I will update the error message so that it is clearer what exactly goes wrong.
Something worth testing would be to set the GPU memory model (program settings/runtime) to “large footprint” from the beginning. Note that this will allocate a lot of GPU memory, so little remains available for other users/processes.
Check if other (dead) Quasar / Redshift processes are resident (ps x). It happened once that this was the cause of the issue.
Also useful links with some explanation on the issue:

sdonn answered 8 years ago

I used nvidia-smi to check for other GPU memory users. There was just 200MB allocated to X11, and about 10MB for kwin. So it is possible that no contiguous 550MB block was free, but that would have required some pretty bad memory allocation on the GPU’s side. I have now set the GPU memory footprint to ‘large’ by default. When I am running Quasar I’m at work anyhow, and nothing GPU-intensive should be running aside from X11.
Thanks for the info!
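For anyone who wants to script the nvidia-smi check, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and uses its `--query-gpu` CSV output; the function names are made up for this example. It reports only per-GPU totals, not the per-process breakdown (X11, kwin) that the default nvidia-smi table shows, and it does not handle fields that a GPU reports as unavailable.

```python
import subprocess

# Ask nvidia-smi for total, used and free memory per GPU, in MiB, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse one 'total, used, free' line per GPU (values in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, used, free = (int(field) for field in line.split(","))
        rows.append({"total_mib": total, "used_mib": used, "free_mib": free})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return the parsed memory figures for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` before starting Quasar gives a quick view of how much memory is already taken by other users of the GPU.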