Duplication of steps in step_graph #541

Closed
ryan6310 opened this issue Mar 24, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@ryan6310
Contributor

🐛 Describe the bug

During construction of the step graph, upstream steps in the DAG are duplicated in memory. This leads to a large memory footprint and often to OOM errors. I tested this by modifying StepGraph.from_params to print each step's size and the memory address (id) of each of its config values. The following code was inserted:

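# Inserted right after each step is constructed.
step_dict[step_name] = Step.from_params(step_params, step_name=step_name)
# getsize is a size helper whose definition is not shown in this report.
print(f"Size of step {step_name}: {getsize(step_dict[step_name])} mb")
step_data = step_dict[step_name]
# Print the identity of every config value so that shared objects can be spotted.
for key, value in step_data.config.items():
    print(f"{key}: {id(value)}")
print()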
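
The getsize helper used above is not included in the report. For anyone wanting to reproduce the measurement, here is a minimal sketch of what such a helper could look like. It is an assumption, not the reporter's code: the name _deep_bytes, the traversal rules, and the rounding are all hypothetical. It walks containers and instance attributes, counts each object once, and returns megabytes.

```python
import sys


def _deep_bytes(obj, seen):
    """Sum sys.getsizeof over obj and everything reachable from it, once per object."""
    if id(obj) in seen:
        return 0  # already counted, so shared objects are not double counted
    seen.add(id(obj))
    total = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            total += _deep_bytes(key, seen) + _deep_bytes(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            total += _deep_bytes(item, seen)
    if hasattr(obj, "__dict__"):
        # Follow instance attributes (hypothetical choice; other helpers may differ).
        total += _deep_bytes(vars(obj), seen)
    return total


def getsize(obj):
    """Approximate deep size of obj in megabytes (hypothetical stand-in)."""
    return round(_deep_bytes(obj, set()) / 1e6, 6)


shared = list(range(1000))
one_copy = {"x": shared, "y": shared}          # both keys point at the same list
two_copies = {"x": shared, "y": list(shared)}  # second key holds a separate list
print(getsize(one_copy) < getsize(two_copies))
```

Because the helper counts each object once, a structure that holds the same upstream object twice measures smaller than one holding two copies, which is the distinction this issue is about.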

This behavior can be reproduced with this simple example.

config.jsonnet

{
  steps: {
    build_data_step: {
      type: 'build_data',
      size: 1000,
    },
    compute_mean_step: {
      type: 'compute_mean',
      data: { type: 'ref', ref: 'build_data_step' },
    },
    compute_std_step: {
      type: 'compute_std',
      data: { type: 'ref', ref: 'build_data_step' },
    },
  },
}

ex_steps.py

from tango import Step
import numpy as np


@Step.register("build_data")
class BuildData(Step):
    def run(self, size: int = 1000) -> np.ndarray:
        return np.random.rand(size)


@Step.register("compute_mean")
class ComputeMean(Step):
    def run(self, data: np.ndarray) -> float:
        print(f"mean: {np.mean(data)}")
        return np.mean(data)


@Step.register("compute_std")
class ComputeStd(Step):
    def run(self, data: np.ndarray) -> float:
        print(f"mean: {np.mean(data)}")
        return np.std(data)

The output is

build_data_step
Size of step build_data_step: 0.001422 mb
size: 140357577810480
type: 140357568998896

compute_mean_step
Size of step compute_mean_step: 0.00384 mb
data: 140356623235728
type: 140357568998512

compute_std_step
Size of step compute_std_step: 0.003838 mb
data: 140356622670768
type: 140357568998192

done StepGraph.from_params
Starting new run stable-gnat
● Starting step "build_data_step"...
✓ Finished step "build_data_step"
✓ Found output for step "build_data_step" in cache (needed by "compute_mean_step")...
● Starting step "compute_mean_step"...
mean: 0.5024763834565448
✓ Finished step "compute_mean_step"
✓ Found output for step "build_data_step" in cache (needed by "compute_std_step")...
● Starting step "compute_std_step"...
mean: 0.5024763834565448
✓ Finished step "compute_std_step"
✓ Finished run stable-gnat
                                                                                                                                 
 ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓                                                                                   
 ┃ Step Name         ┃ Status      ┃ Results ┃                                                                                   
 ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩                                                                                   
 │ build_data_step   │ ✓ succeeded │ N/A     │                                                                                   
 │ compute_mean_step │ ✓ succeeded │ N/A     │                                                                                   
 │ compute_std_step  │ ✓ succeeded │ N/A     │                                                                                   
 └───────────────────┴─────────────┴─────────┘                                                                                   
                 ✓ 3 succeeded   
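
In the output above, the data entries of compute_mean_step and compute_std_step have different ids (140356623235728 and 140356622670768), which means the two steps hold two distinct objects for the same upstream build_data_step. The sketch below illustrates that symptom, and one way to avoid it, using plain Python only. FakeStep, build_duplicating, and build_shared are hypothetical names; this is not tango's implementation and not necessarily the fix that landed on main.

```python
class FakeStep:
    """Hypothetical stand-in for a step object; not tango's Step class."""

    def __init__(self, name, **config):
        self.name = name
        self.config = config


def is_ref(value):
    """True for config values shaped like {type: 'ref', ref: '<step name>'}."""
    return isinstance(value, dict) and value.get("type") == "ref"


def build_duplicating(params):
    """Resolve every ref by constructing a fresh upstream object each time."""
    def build(name):
        config = {k: build(v["ref"]) if is_ref(v) else v for k, v in params[name].items()}
        return FakeStep(name, **config)

    return {name: build(name) for name in params}


def build_shared(params):
    """Resolve refs through a cache so each step is constructed exactly once."""
    cache = {}

    def build(name):
        if name not in cache:
            config = {k: build(v["ref"]) if is_ref(v) else v for k, v in params[name].items()}
            cache[name] = FakeStep(name, **config)
        return cache[name]

    return {name: build(name) for name in params}


# Mirrors the structure of config.jsonnet from this report.
params = {
    "build_data_step": {"type": "build_data", "size": 1000},
    "compute_mean_step": {"type": "compute_mean", "data": {"type": "ref", "ref": "build_data_step"}},
    "compute_std_step": {"type": "compute_std", "data": {"type": "ref", "ref": "build_data_step"}},
}

dup = build_duplicating(params)
shared = build_shared(params)

# Same check as the id() printout: do both downstream steps hold one upstream object?
print(dup["compute_mean_step"].config["data"] is dup["compute_std_step"].config["data"])        # False
print(shared["compute_mean_step"].config["data"] is shared["compute_std_step"].config["data"])  # True
```

With the cached variant, every reference to build_data_step resolves to the same object, so the memory cost of an upstream step does not grow with the number of downstream steps that depend on it.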

Versions

Python 3.9.5

absl-py==1.4.0
ai2-tango==1.2.0
aiohttp==3.8.4
aiosignal==1.3.1
alembic==1.8.0
appdirs==1.4.4
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.2
attrs==22.2.0
backcall==0.2.0
base58==2.1.1
beaker-py==1.18.1
boto3==1.24.26
botocore==1.27.96
cached-path==1.1.6
cached-property==1.5.2
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==2.1.1
chex==0.1.6
click==8.1.3
click-help-colors==0.9.1
cloudpickle==2.2.1
comm==0.1.3
commonmark==0.9.1
cycler==0.11.0
dask==2022.8.1
datasets==2.10.1
debugpy==1.6.6
decorator==5.1.1
dill==0.3.6
distributed==2022.8.1
dm-tree==0.1.8
docker==6.0.1
docker-pycreds==0.4.0
etils==1.1.0
exceptiongroup==1.1.1
executing==1.2.0
fairscale==0.4.9
filelock==3.8.2
flatbuffers==23.3.3
flax==0.6.7
fonttools==4.39.2
frozenlist==1.3.3
fsspec==2023.3.0
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.31
glob2==0.7
glog==0.3.1
google-api-core==2.8.2
google-auth==2.16.2
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.2
google-cloud-storage==2.7.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.4.1
googleapis-common-protos==1.56.4
gprof2dot==2022.7.29
greenlet==1.1.3
grpcio==1.51.3
h5py==3.8.0
HeapDict==1.0.1
heartpy==1.2.7
huggingface-hub==0.10.1
idna==3.4
importlib-metadata==6.0.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipdb==0.13.13
ipykernel==6.22.0
ipython==8.11.0
jax==0.4.6
jaxlib==0.4.6
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
jsonnet-binary==0.17.0
jupyter_client==8.1.0
jupyter_core==5.3.0
keras==2.11.0
kiwisolver==1.4.4
libclang==15.0.6.1
locket==1.0.0
Mako==1.2.4
Markdown==3.4.1
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.5.2
matplotlib-inline==0.1.6
mdurl==0.1.2
more-itertools==8.14.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.14
nest-asyncio==1.5.6
neurokit2==0.2.0
numexpr==2.8.4
numpy==1.23.0
oauthlib==3.2.2
opencv-python==4.6.0.66
opt-einsum==3.3.0
optax==0.1.4
orbax==0.1.4
packaging==23.0
pandas==1.4.3
parso==0.8.3
partd==1.3.0
pathtools==0.1.2
patsy==0.5.3
pendulum==2.1.2
petname==2.6
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==3.1.1
pluggy==1.0.0
prompt-toolkit==3.0.38
protobuf==3.19.4
psutil==5.9.1
psycopg2-binary==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyarrow==11.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydantic==1.10.6
pyDeprecate==0.3.2
Pygments==2.14.0
pyparsing==3.0.9
pytest==7.2.2
python-dateutil==2.8.2
python-gflags==3.1.2
pytorch-lightning==1.7.7
pytz==2022.7.1
pytzdata==2020.1
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.2
regex==2022.10.31
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
retry==0.9.2
rich==12.6.0
rjsonnet==0.5.2
rsa==4.9
s3transfer==0.6.0
sacremoses==0.0.53
scikit-learn==1.2.2
scipy==1.8.1
seaborn==0.12.0
sentencepiece==0.1.97
sentry-sdk==1.17.0
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
snakeviz==2.1.1
sortedcontainers==2.4.0
SQLAlchemy==1.4.39
sqlitedict==2.1.0
stack-data==0.6.2
statsmodels==0.13.2
tables==3.7.0
tblib==1.7.0
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-cpu==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.31.0
tensorstore==0.1.33
termcolor==2.2.0
threadpoolctl==3.1.0
tokenizers==0.13.2
tomli==2.0.1
toolz==0.12.0
torch==1.12.1
torchaudio==0.12.1
torchmetrics==0.11.4
torchvision==0.13.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
transformers==4.25.1
typing_extensions==4.5.0
urllib3==1.26.15
validators==0.20.0
wandb==0.13.11
wcwidth==0.2.6
websocket-client==1.5.1
Werkzeug==2.2.3
wrapt==1.15.0
xarray==2022.3.0
xgboost==1.6.2
xxhash==3.2.0
yarl==1.8.2
zict==2.2.0
zipp==3.15.0

ryan6310 added the bug label Mar 24, 2023
@dirkgr
Member

dirkgr commented Jun 6, 2023

We know of this issue, and there is a fix in the latest main branch. We just haven't been able to make a release with it because of a problem with Torch 2 that I'm tracking in #560.

@ryan6310
Contributor Author

ryan6310 commented Jun 6, 2023

Right on. Thanks. I just forgot to close this when that fix went in.

ryan6310 closed this as completed Jun 6, 2023