tf_vae.json empty after running vae_train.py #31

Open
asolano opened this issue Aug 26, 2019 · 4 comments

@asolano

asolano commented Aug 26, 2019

Greetings,

I am trying to reproduce the experiment on a DGX station I currently have access to. The first two steps look fine, but the result of the command:

$ python vae_train.py
...
step 298000 35.82913 3.7688284 32.0603
step 298500 34.947067 2.9355032 32.011562
step 299000 35.83263 3.8249977 32.007633
step 299500 36.45114 4.418231 32.03291
step 300000 35.098816 3.0974069 32.001408
step 300500 35.483387 3.4664068 32.01698
step 301000 35.43274 3.4285662 32.004173

is an empty array:

$ cat tf_vae/vae.json 
[]

According to the documentation, the model should be saved to that file, so any hints about where to look for the problem would be appreciated.
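A small check along these lines could be run right after training to catch an empty save early. This is only a sketch: `saved_model_is_empty` is a hypothetical helper, not part of the repository, and the demo writes its own temporary file rather than reading the real tf_vae/vae.json.

```python
import json
import os
import tempfile


def saved_model_is_empty(path):
    """Return True if the JSON file at `path` holds an empty list or dict."""
    with open(path) as f:
        data = json.load(f)
    return len(data) == 0


# Self-contained demo: one temporary file mimics the empty output from the
# report above, the other holds some made-up parameter values.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "vae.json")
    with open(empty_path, "w") as f:
        f.write("[]")

    filled_path = os.path.join(tmp, "vae_filled.json")
    with open(filled_path, "w") as f:
        json.dump([[1, 2, 3]], f)

    empty_result = saved_model_is_empty(empty_path)
    filled_result = saved_model_is_empty(filled_path)

print(empty_result, filled_result)
```

In practice the path argument would be the file written by vae_train.py.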

Thanks,

Alfredo

PS: I am using the following Dockerfile to recreate the environment in the paper, in case it might be relevant:

FROM tensorflow/tensorflow:1.8.0-gpu-py3

# gym-doom requirements
RUN apt-get update && apt-get install -y --no-install-recommends \
        cmake \
        zlib1g-dev \
        libjpeg-dev \
        libboost-all-dev \
        gcc \
        libsdl2-dev \
        wget \
        unzip \
        python3-tk \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# make python3 the default
RUN update-alternatives --remove python /usr/bin/python2 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 10

# NOTE overriding numpy version to match the paper's
# NOTE numpy==1.13.3 gives an error importing vizdoom
RUN pip install --upgrade pip && \
    pip install --no-cache-dir --user --upgrade \
        gym==0.9.4 \
        ppaquette-gym-doom==0.0.6 \
        cma==2.2.0  \
        mpi4py==2.0.0

ENTRYPOINT ["/bin/bash"]
@leekwoon

leekwoon commented Sep 4, 2019

Hi,

I think the problem comes from the location of

with tf.variable_scope('conv_vae', reuse=self.reuse):

in the __init__ function.

I addressed this problem by moving it into the _build_graph function:

def _build_graph(self):
    self.g = tf.Graph()
    with self.g.as_default():
      with tf.variable_scope('conv_vae', reuse=self.reuse):
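As a rough illustration of how a misplaced scope could end in an empty file: suppose the save routine keeps only variables whose names start with 'conv_vae'. That filter is an assumption here, not something confirmed in this thread. If the scope is not in effect inside the new graph, the variables are created without the prefix and nothing matches. The sketch below is pure Python with made-up variable names; it does not use TensorFlow or the repository's code.

```python
import json


def collect_params(variables, prefix="conv_vae"):
    """Keep only the values of variables whose name starts with `prefix`."""
    return [value for name, value in variables.items() if name.startswith(prefix)]


# Hypothetical names when the scope is entered inside the graph.
scoped = {
    "conv_vae/enc_conv1/kernel:0": [1, 2],
    "conv_vae/enc_conv1/bias:0": [3],
}

# Hypothetical names when the scope has no effect inside the graph,
# so the 'conv_vae/' prefix is missing.
unscoped = {
    "enc_conv1/kernel:0": [1, 2],
    "enc_conv1/bias:0": [3],
}

scoped_json = json.dumps(collect_params(scoped))
unscoped_json = json.dumps(collect_params(unscoped))

print(scoped_json)    # [[1, 2], [3]]
print(unscoped_json)  # []
```

The second result matches the empty array reported above, which is why moving the scope inside the graph is a plausible fix.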

@asolano
Author

asolano commented Sep 19, 2019

Thanks for your suggestion, @leekwoon.

I no longer have access to the DGX station, but I tried the change on an AWS instance. So far it looks good: the vae.json file was generated.

Do you have a fork or pull request to check out for other necessary changes before continuing the training? I suspect there may be more troubles ahead.

@asolano
Author

asolano commented Sep 27, 2019

FWIW, I found that step 3 of the GPU jobs showed some problems with the patched code, so after some unsuccessful troubleshooting I went back to a commit from around the time the paper was published (c0cb2de) and tried again. Everything worked as expected without changing any code. 👍

@hardmaru
Owner

hardmaru commented Mar 5, 2020

Thanks for the testing, @asolano. Maybe I should just roll back the code to that time...
