RuntimeError: Resource exhausted: Out of memory while trying to allocate

import jax
import jax.numpy as np
import numpy as onp

def E_fn(conf):
    # Scalar "energy": sum of all pairwise distances between the N points.
    ri = np.expand_dims(conf, 0)            # (1, N, 3)
    rj = np.expand_dims(conf, 1)            # (N, 1, 3)
    dxdydz = np.power(ri - rj, 2)           # (N, N, 3) squared coordinate differences
    dij = np.sqrt(np.sum(dxdydz, axis=-1))  # (N, N) pairwise distances
    return np.sum(dij)

# Hessian as forward-over-reverse: jacfwd of the jacrev gradient.
dE_dx_fn = jax.jacrev(E_fn, argnums=(0,))
d2E_dx2_fn = jax.jacfwd(dE_dx_fn, argnums=(0,))

d2E_dx2_fn(onp.random.rand(2483, 3))

Results in:

RuntimeError: Resource exhausted: Out of memory while trying to allocate 551102853132 bytes.

This happens on both CPU and GPU.

There’s no reason this calculation should require 551 GB of RAM. The explicit Hessian is “only” (2483 * 3)^2 * 4 bytes ≈ 221 MB.
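Not part of the original report, but one way to sidestep the huge intermediates is to compute Hessian-vector products with forward-over-reverse differentiation and assemble the Hessian one slice at a time. A minimal sketch, reusing E_fn from the reproduction above:

import jax
import jax.numpy as np
import numpy as onp

def hvp(f, x, v):
    # Forward-over-reverse Hessian-vector product: returns H(x) @ v
    # without ever materializing the full Hessian in memory.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

conf = np.asarray(onp.random.rand(2483, 3))
v = np.zeros_like(conf).at[0, 0].set(1.0)   # one basis direction
row = hvp(E_fn, conf, v)                    # one (2483, 3) slice of the Hessian

A Python loop or jax.lax.map over (chunks of) basis vectors keeps peak memory bounded while building the full matrix. As in the original reproduction, the i == j terms differentiate sqrt at zero, so the result contains NaNs unless the diagonal is masked; the sketch only illustrates the memory behavior.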

sdonn asked 6 years ago

During some random testing, I stumbled upon this error message:

[Quasar CUDA Engine] – OUT OF MEMORY detected (request size 536870912 bytes)!
Starting memory diagnostics subprogram…
Amount of pinned memory: 67897344 bytes
Freelist size: 2 memory blocks
Largest free block: 67108864 bytes
Process total: 201326592, Inuse: 67897344 bytes, Free: 133429248 bytes; Device total: 2147352576, Free: 1655570432
Chunk 0 size 67108864 bytes: Fragmentation: 0.0%, free: 67108864 bytes
Chunk 1 size 134217728 bytes: Fragmentation: 0.0%, free: 66320384 bytes
Info: CUDA memory failure arises when too many large memory blocks are used by the same kernel function. Please split the input data into blocks and let the program process these blocks individually, to avoid the CUDA memory failure.

Basically, I request 500 MB of video memory. Okay, the process can't serve this because it only gets 200 MB to start with. However, the GPU itself still has 1.6 GB of free memory! Why can't the Quasar process access this memory?

2 Answers

The Quasar process tries to allocate a memory block large enough to hold the 536 MB using cudaMalloc, but this fails. There may be 1.6 GB available in total, but due to memory fragmentation (especially if other processes are also taking GPU memory; it could also be OpenGL) and other issues, a contiguous block of 536 MB might not be available, unfortunately.
I will update the error message so that it is clearer what exactly goes wrong.
Something worth testing would be to set the GPU memory model (program settings/runtime) to “large footprint” from the beginning. Note that this will allocate a lot of GPU memory, so little remains available for other users/processes.
Check whether other (dead) Quasar / Redshift processes are resident (ps x). It happened once that this was the cause of the issue.
Some useful links with more explanation of the issue:

  • http://stackoverflow.com/questions/8684770/how-is-cuda-memory-managed
  • http://stackoverflow.com/questions/8905949/why-is-cudamalloc-giving-me-an-error-when-i-know-there-is-sufficient-memory-spac

sdonn answered 6 years ago

I used nvidia-smi to check for other GPU memory users. There was just 200 MB allocated to X11 and about 10 MB for kwin, so it is possible that there was no contiguous 550 MB block free, but that would have required some pretty bad memory allocation on the GPU’s side. I have now set the GPU memory footprint to ‘large’ by default. When I am running Quasar I’m at work anyhow, and nothing GPU-intensive should be running aside from X11.
Thanks for the info!


GPU memory allocation

JAX will preallocate 90% of currently-available GPU memory when the first JAX operation is run. Preallocating minimizes allocation overhead and memory fragmentation, but can sometimes cause out-of-memory (OOM) errors. If your JAX process fails with OOM, the following environment variables can be used to override the default behavior:

XLA_PYTHON_CLIENT_PREALLOCATE=false

This disables the preallocation behavior. JAX will instead allocate GPU memory as needed, potentially decreasing the overall memory usage. However, this behavior is more prone to GPU memory fragmentation, meaning a JAX program that uses most of the available GPU memory may OOM with preallocation disabled.

XLA_PYTHON_CLIENT_MEM_FRACTION=.XX

If preallocation is enabled, this makes JAX preallocate XX% of currently-available GPU memory, instead of the default 90%. Lowering the amount preallocated can fix OOMs that occur when the JAX program starts.

XLA_PYTHON_CLIENT_ALLOCATOR=platform

This makes JAX allocate exactly what is needed on demand, and deallocate memory that is no longer needed (note that this is the only configuration that will deallocate GPU memory, instead of reusing it). This is very slow, so is not recommended for general use, but may be useful for running with the minimal possible GPU memory footprint or debugging OOM failures.
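A minimal sketch of how these variables might be set from Python, assuming they are set before JAX initializes its backend (setting them before the first import of jax is the safe choice; exporting them in the shell before launching the process works equally well):

import os

# Pick one strategy; these must take effect before the first JAX operation.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"     # allocate on demand
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".50"    # or: preallocate 50% instead of 90%
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"  # or: exact alloc/dealloc (slow, for debugging)

import jax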

Common causes of OOM failures

Running multiple JAX processes concurrently.

Either use XLA_PYTHON_CLIENT_MEM_FRACTION to give each process an appropriate amount of memory, or set XLA_PYTHON_CLIENT_PREALLOCATE=false.
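For example, two JAX processes sharing one GPU might each be capped at a bit under half the device memory; the .45 split below is an illustrative assumption, not a value from the docs:

import os

# Each of the two processes runs this before importing jax, so together
# they preallocate ~90% of the device and leave some headroom.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".45"

import jax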

Running JAX and GPU TensorFlow concurrently.

TensorFlow also preallocates by default, so this is similar to running multiple JAX processes concurrently.

One solution is to use CPU-only TensorFlow (e.g. if you’re only doing data loading with TF). You can prevent TensorFlow from using the GPU with the command tf.config.experimental.set_visible_devices([], "GPU").
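A minimal sketch of that approach, assuming TF2 and that the call runs before TensorFlow has touched the GPU:

import tensorflow as tf

# Hide all GPUs from TensorFlow so it is only used for CPU-side work
# (e.g. tf.data input pipelines) and does not preallocate GPU memory.
tf.config.experimental.set_visible_devices([], "GPU")

import jax  # JAX now has the GPU to itself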

Alternatively, use XLA_PYTHON_CLIENT_MEM_FRACTION or XLA_PYTHON_CLIENT_PREALLOCATE. There are also similar options to configure TensorFlow’s GPU memory allocation (gpu_memory_fraction and allow_growth in TF1, which should be set in a tf.ConfigProto passed to tf.Session; for TF2, see Using GPUs: Limiting GPU memory growth).

Running JAX on the display GPU.

Use XLA_PYTHON_CLIENT_MEM_FRACTION or XLA_PYTHON_CLIENT_PREALLOCATE.