The Wayback Machine - https://web.archive.org/web/20201027033109/https://github.com/allenai/allennlp/issues/2167
Change "unable to check gpu_memory_mb()" error to a warning, remove uninformative stack trace #2167

Open
magic282 opened this issue Dec 11, 2018 · 4 comments

@magic282 magic282 commented Dec 11, 2018

Describe the bug

2018-12-10 23:59:23,547 - ERROR - allennlp.common.util - unable to check gpu_memory_mb(), continuing
Traceback (most recent call last):
  File "/home/v-qizhou/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/common/util.py", line 343, in gpu_memory_mb
    encoding='utf-8')
  File "/home/v-qizhou/anaconda3/envs/allennlp/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/home/v-qizhou/anaconda3/envs/allennlp/lib/python3.6/subprocess.py", line 403, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/v-qizhou/anaconda3/envs/allennlp/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/v-qizhou/anaconda3/envs/allennlp/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
2018-12-10 23:59:23,547 - INFO - allennlp.training.trainer - Training

To Reproduce
No idea how to reproduce.

Expected behavior
A clear and concise description of what you expected to happen.

System (please complete the following information):

  • OS: Linux
  • Python version: 3.6.7
  • AllenNLP version: 0.7.2
  • PyTorch version: (if you installed it yourself)

Additional context
Running under `screen` on Linux.

Collaborator

@DeNeutoy DeNeutoy commented Dec 22, 2018

This isn't critical functionality and we actually expect it to fail occasionally, which is why we just continue when it happens. We probably won't be able to fix this, sorry!

@DeNeutoy DeNeutoy closed this Dec 22, 2018

@WindChimeRan WindChimeRan commented Jun 1, 2019

Same error.
Thanks.
I think it would be better to say in the log itself that this is not an error, rather than only here.

Contributor

@guoquan guoquan commented Jan 15, 2020

It may be better to log a warning instead of an error at L461, and to document the exact exception that can occur there.

def gpu_memory_mb() -> Dict[int, int]:
    """
    Get the current GPU memory usage.
    Based on https://discuss.pytorch.org/t/access-gpu-memory-usage-in-pytorch/3192/4

    # Returns

    ``Dict[int, int]``
        Keys are device ids as integers.
        Values are memory usage as integers in MB.
        Returns an empty ``dict`` if GPUs are not available.
    """
    try:
        result = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,nounits,noheader"],
            encoding="utf-8",
        )
        gpu_memory = [int(x) for x in result.strip().split("\n")]
        return {gpu: memory for gpu, memory in enumerate(gpu_memory)}
    except FileNotFoundError:
        # `nvidia-smi` doesn't exist, assume that means no GPU.
        return {}
    except:  # noqa
        # Catch *all* exceptions, because this memory check is a nice-to-have
        # and we'd never want a training run to fail because of it.
        logger.exception("unable to check gpu_memory_mb(), continuing")
        return {}
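
For illustration only, here is a minimal sketch of what the suggested change might look like. It is not the code that was merged into AllenNLP. It assumes that a one-line warning carrying the exception's repr is informative enough, and it narrows the bare `except` to `except Exception`, which is a design choice of this sketch and not something the thread asks for.

```python
import logging
import subprocess
from typing import Dict

logger = logging.getLogger(__name__)


def gpu_memory_mb() -> Dict[int, int]:
    """Return GPU memory usage in MB keyed by device id, or {} on any failure."""
    try:
        result = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,nounits,noheader"],
            encoding="utf-8",
        )
        return {gpu: int(memory) for gpu, memory in enumerate(result.strip().split("\n"))}
    except FileNotFoundError:
        # `nvidia-smi` doesn't exist, assume that means no GPU.
        return {}
    except Exception as exc:
        # Known to fail occasionally, e.g. OSError: [Errno 12] Cannot allocate
        # memory when the subprocess cannot be spawned. The check is a
        # nice-to-have, so log a short warning (no stack trace) and continue.
        logger.warning("unable to check gpu_memory_mb(), continuing: %r", exc)
        return {}
```

If the stack trace is still wanted at warning level, passing `exc_info=True` to `logger.warning` keeps it while changing only the severity.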
Member

@matt-gardner matt-gardner commented Jan 15, 2020

@guoquan (and @WindChimeRan), I agree with this. I'm re-opening this issue, and renaming it. Should be an easy fix; PR welcome!

@matt-gardner matt-gardner reopened this Jan 15, 2020
@matt-gardner matt-gardner changed the title unable to check gpu_memory_mb() Change "unable to check gpu_memory_mb()" error to a warning, remove uninformative stack trace Jan 15, 2020
guoquan added a commit to guoquan/allennlp that referenced this issue Feb 5, 2020
According to the discussion in allenai#2167 , there are known occasional failures in this non-critical function.
Log as a warning with the stack trace and error information.