Open
Description
Some of the validators, including the test validator, intermittently hit CUDA out-of-memory (OOM) errors.
https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro
My initial hypothesis is that allocations are accumulating in GPU memory over time until they reach the limit. Since a validator is expected to run for days, it would be nice to identify potential points of improvement in GPU memory management so that we never reach the OOM point.
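One way to check the accumulation hypothesis is to sample allocated GPU memory once per validator step and see whether it trends upward. The snippet below is a minimal sketch, not code from the validator: the helper names (`cuda_allocated_mib`, `growth_per_step`, `looks_like_leak`) and the 1 MiB per step threshold are assumptions made for illustration. It relies only on `torch.cuda.is_available()` and `torch.cuda.memory_allocated()`, and returns `None` when PyTorch or CUDA is not present.

```python
from typing import List, Optional


def cuda_allocated_mib() -> Optional[float]:
    """Return memory currently allocated by tensors on the default CUDA
    device, in MiB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / (1024 ** 2)


def growth_per_step(samples: List[float]) -> float:
    """Least-squares slope of the samples against their step index,
    i.e. the average change in MiB per step."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(samples: List[float], min_slope_mib: float = 1.0) -> bool:
    """Flag a window of samples whose average growth per step meets the
    threshold. The 1 MiB default is an arbitrary, hypothetical choice."""
    return growth_per_step(samples) >= min_slope_mib
```

In a validator loop, one would append `cuda_allocated_mib()` to a list after each step (skipping `None` values) and call `looks_like_leak` on the most recent window. A steady positive slope would support the hypothesis that something keeps references to GPU tensors alive; a flat or noisy series would point elsewhere, for example to a single oversized batch.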