Step-by-step How To #58

Open · abettati opened this issue Apr 11, 2023 · 18 comments
abettati commented Apr 11, 2023

Hi everyone,
is a step-by-step how to guide on how to set up the environment and run a simple example available anywhere?
If not, I would like to try and build one.

abettati changed the title from "Step by step how-to" to "Step-by-step How To" on Apr 11, 2023
sakundu (Collaborator) commented Apr 12, 2023

Hi @abettati,

Are you planning to run CircuitTraining or the SP&R flow scripts we provide? For our SP&R flow scripts, you will need commercial tools such as Innovus and Genus. If you already have access to these tools and plan to run SP&R, you can follow these steps to generate SP&R data for Ariane on NG45:

cd ./Flows/NanGate45/ariane133/run/
cp ../scripts/cadence flow1
cd flow1
./run.sh
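
In case it helps, here is a quick sanity check to run before launching: a minimal sketch assuming the Cadence binaries are invoked as innovus and genus (adjust the names to your installation).

# Verify the Cadence tools are on PATH before starting the flow.
for tool in innovus genus; do
  command -v "$tool" >/dev/null 2>&1 || { echo "ERROR: $tool not found on PATH" >&2; exit 1; }
done
echo "Cadence tools found; OK to run ./run.sh"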

Thanks,
Sayak

abettati (Author) commented:

As a first experiment I would like to try both CT and your open implementation of the closed CT API (e.g., the cost function).
Later I may try the Cadence flow as well.

sakundu (Collaborator) commented Apr 13, 2023

Hi @abettati

We use the August 17 version of CT for our experiments; we are working on bringing up the latest version. In the MacroPlacement branch of the CT-Fork we have also added a working version of DREAMPlace.

Use the following steps to run CT:

  • Build the docker image using the script available here; this is supposed to generate a docker image with the tag circuit_training:corepy39.

  • Use the following command to start the docker container (add the --gpus all option if you want to use a GPU):
    docker run -it -v $(pwd):/workspace --workdir /workspace circuit_training:corepy39 bash

  • You are now inside the container, in the /workspace directory.

  • Since you are trying to test Circuit Training, my suggestion is to update these variables in ./run_scripts/ct_setup.sh (for a complete run you can use the default values):

export NUM_ITERATION=2
export NUM_EPISODE_PER_ITERATION=16
export BATCH_SIZE=64
  • Use this command to start the Reverb server, the training job, and the eval job (please update python3 to python3.9):
    ./run_scripts/start_ct_training.sh
  • The above command starts the jobs in a tmux session, so press Ctrl+b and then d to detach (see the tmux note below).
    Now use the following command to start the collect job (again, update python3 to python3.9):
    ./run_scripts/start_ct_training_client.sh
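
A note on tmux, since detaching trips people up the first time. These are standard tmux commands; the exact session name depends on what the script creates.

# List the sessions that the script started.
tmux ls
# Re-attach to one of them (substitute a name from `tmux ls`).
tmux attach -t <session-name>
# From inside a session, detach again with Ctrl+b, then d.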

abettati (Author) commented:

Hi @sakundu,
Thanks for your support!
I am working through your guide, but I am stuck: I can't find the run_scripts folder, even with `find . -type d -name "run_scripts"`.

I also noticed that in my case:

root@14f7e63f9334:/workspace# echo $NUM_ITERATION

root@14f7e63f9334:/workspace# echo $NUM_EPISODE_PER_ITERATION

root@14f7e63f9334:/workspace# echo $BATCH_SIZE

The variables are not defined.
Moreover (though I don't think it is problematic), this command

docker build --pull --no-cache --tag circuit_training:core \
    --build-arg tf_agents_version="${TF_AGENTS_PIP_VERSION}" \
    -f "${REPO_ROOT}"/tools/docker/ubuntu_circuit_training ${REPO_ROOT}/tools/docker/

creates a circuit_training:core image, not corepy39. But my understanding is that this is just a tag naming the image.
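
In case the scripts do expect the corepy39 tag, re-tagging is a simple workaround; a sketch, assuming the core image contents are what the corepy39 tag should point at:

# Add the corepy39 tag to the existing core image (both tags then refer to the same image).
docker tag circuit_training:core circuit_training:corepy39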

sakundu (Collaborator) commented Apr 14, 2023

Hi @abettati,
Please make sure that you have completed the following steps:

  1. Please clone the MacroPlacement branch from this CT-Fork.
    git clone --branch MacroPlacement https://github.com/sakundu/circuit_training.git
  2. cd ./circuit_training
  3. sed -i 's@^\spython3 @python3.9 @' ./run_scripts/*.sh  # This updates python3 to python3.9
  4. cd ./tools
  5. sh ./bootstrap_dreamplace_build.sh  # This will create a docker image with the tag circuit_training:corepy39
  6. cd ..
  7. docker run -it -v $(pwd):/workspace --workdir /workspace circuit_training:corepy39 bash
  8. If you just want to test the setup, update the following variables in ./run_scripts/ct_setup.sh:
export NUM_ITERATION=2
export NUM_EPISODE_PER_ITERATION=16
export BATCH_SIZE=64
  9. Use this command to start the Reverb server, the training job, and the eval job (this script first sources ./run_scripts/ct_setup.sh and then launches the three jobs):
    ./run_scripts/start_ct_training.sh
  10. The above command starts the jobs in a tmux session, so press Ctrl+b and then d to detach.
    Now use the following command to start the collect job (steps 1 through 7 are also collected into a single sketch below):
    ./run_scripts/start_ct_training_client.sh
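
For convenience, here is the same sequence (steps 1 through 7) as a single script; a sketch that assumes git and docker are installed and that the defaults above are unchanged:

#!/bin/bash
# Clone the MacroPlacement branch of the CT fork.
git clone --branch MacroPlacement https://github.com/sakundu/circuit_training.git
cd ./circuit_training
# Switch the run scripts from python3 to python3.9.
sed -i 's@^\spython3 @python3.9 @' ./run_scripts/*.sh
# Build the docker image tagged circuit_training:corepy39.
(cd ./tools && sh ./bootstrap_dreamplace_build.sh)
# Enter the container with the repo mounted at /workspace; run steps 8-10 inside it.
docker run -it -v $(pwd):/workspace --workdir /workspace circuit_training:corepy39 bash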

If this does not work please let me know.

Thanks,
Sayak

abettati (Author) commented:

Hi @sakundu
thanks for the instructions. I am trying them, but I think I need to ask for a more powerful machine.
In the meantime, do you have an estimate of how much space the docker image can occupy?

After following your instructions (up to the point where the VM crashes), I got:

REPOSITORY         TAG                IMAGE ID       CREATED             SIZE
circuit_training   dreamplace_build   2c08fe667b86   48 minutes ago      15.9GB
circuit_training   ci                 1eca9db9be9f   About an hour ago   1.38GB

sakundu (Collaborator) commented Apr 17, 2023

Hi @abettati,

The final docker image circuit_training:corepy39 is supposed to be ~18GB.

Remove the existing docker images and try again. You may try adding the commands below here:

docker rmi circuit_training:dreamplace_build
docker rmi circuit_training:ci

And rerun the build script.
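
To double-check what is left before rebuilding (plain docker CLI, nothing project-specific):

# List any remaining circuit_training images; ideally none before the rebuild.
docker images circuit_training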

Thanks,
Sayak

sharkoo7 commented Jun 3, 2023

Hi @sakundu,
I followed your instructions step by step here.
However, I came across the following error when running the train job (the other jobs work fine).
Could you help me with this issue? It would be very helpful.
Thanks a lot!

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/circuit_training/learning/train_ppo.py", line 133, in <module>
    multiprocessing.handle_main(functools.partial(app.run, main))
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/system/default/multiprocessing_core.py", line 77, in handle_main
    return app.run(parent_main_fn, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/workspace/circuit_training/learning/train_ppo.py", line 111, in main
    train_ppo_lib.train(
  File "/workspace/circuit_training/learning/train_ppo_lib.py", line 123, in train
    save_model_trigger = triggers.PolicySavedModelTrigger(
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/train/triggers.py", line 127, in __init__
    self._raw_policy_saver = self._build_saver(raw_policy, batch_size,
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/train/triggers.py", line 168, in _build_saver
    saver = policy_saver.PolicySaver(
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/policies/policy_saver.py", line 383, in __init__
    polymorphic_action_fn.get_concrete_function(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 1239, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 1219, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 785, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 2523, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 2725, in _maybe_define_function
    func_cache_key, _ = function_context.make_cache_key(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function_context.py", line 131, in make_cache_key
    args_signature = trace_type.from_object(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in from_object
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in <genexpr>
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in from_object
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in <genexpr>
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 109, in from_object
    named_tuple_type, tuple(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 109, in <genexpr>
    named_tuple_type, tuple(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 96, in from_object
    if isinstance(obj, trace.SupportsTracingProtocol):
  File "/usr/local/lib/python3.9/dist-packages/typing_extensions.py", line 612, in __instancecheck__
    val = inspect.getattr_static(instance, attr)
  File "/usr/lib/python3.9/inspect.py", line 1624, in getattr_static
    instance_result = _check_instance(obj, attr)
  File "/usr/lib/python3.9/inspect.py", line 1571, in _check_instance
    instance_dict = object.__getattribute__(obj, "__dict__")
TypeError: this __dict__ descriptor does not support '_DictWrapper' objects

sakundu (Collaborator) commented Jun 3, 2023

Hi @sharkoo7
This appears to be a type error that I haven't encountered before. From the error message, it seems like you're using Python 3.9. I experienced a similar TypeError when I was using Python 3.7. Could you please confirm that you're using Python 3.9 in your script? Could you also share the train.log file found in the log directory?

sharkoo7 commented Jun 3, 2023

Thank you for your reply!
Yes, I am using Python 3.9 in the script. Here is the train.log file:
train.log

sakundu (Collaborator) commented Jun 3, 2023

From the log file, it appears that you're attempting to run the train job on a CPU. If that's the case, could you please remove the referenced line from your code and try running it again? If the problem persists, please share the updated log file for further troubleshooting.

sharkoo7 commented Jun 4, 2023

Actually, I wanted to run the train job on a GPU, so I kept --use_gpu.
This time I used the command docker run -it -v $(pwd):/workspace --workdir /workspace --gpus all circuit_training:corepy39 bash to start docker, but I still got this error: train.log

sakundu (Collaborator) commented Jun 4, 2023

I'm curious whether you've tried running it on the CPU instead. If you haven't, you can do so by launching the Docker container without the --gpus all option.

Could you also please check if nvidia-smi is functioning within the Docker container? Additionally, make sure your driver version is up-to-date with the CUDA version.
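
One quick way to test this, assuming the NVIDIA container toolkit is installed on the host (in most setups it injects nvidia-smi into the container):

# If this prints the GPU table, the container can see the GPU.
docker run --rm --gpus all circuit_training:corepy39 nvidia-smi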

sakundu (Collaborator) commented Jun 4, 2023

It seems that your driver and other versions are up to date. I'd still like you to try running it on the CPU first. If that works, we can at least ascertain that there's an issue with the GPU script. I've tested it on my end, and it's working perfectly.

sharkoo7 commented Jun 5, 2023

Here is the train.log when I launch the docker container without the --gpus all option and remove the --use_gpu option:
train.log
When I launched the docker container with the --gpus all option, nvidia-smi showed:
[screenshot of nvidia-smi output]
Also, I searched for this problem, and some people say it is related to the wrapt version. Could you please tell me your wrapt version? Mine is 1.15.0.

sakundu (Collaborator) commented Jun 5, 2023

My wrapt version is 1.14.1.
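
If the wrapt version difference turns out to be the culprit, pinning it inside the container is a cheap experiment; a sketch, not a confirmed fix:

# Downgrade wrapt to the version that works on my side, then rerun the train job.
python3.9 -m pip install wrapt==1.14.1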

luarss commented Jun 6, 2023

Hello @sakundu,

May I check what the minimum hardware configuration is for running the scripts? I have an RTX A4000 card for both the collect and train jobs, and it seems to be failing. I wonder if you have any recommendations.

sakundu (Collaborator) commented Jun 6, 2023

Hi @luarss
The training job is the only one that requires a GPU; all the other jobs can run on the CPU. If your training job is failing because it runs out of GPU memory, I would recommend reducing both the number of iterations and the batch size. This should fix the OOM problem.
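
As a sketch, reusing the variables from ./run_scripts/ct_setup.sh mentioned earlier (the values are illustrative, not tuned):

# Smaller test configuration to reduce GPU memory pressure.
export NUM_ITERATION=2
export NUM_EPISODE_PER_ITERATION=16
export BATCH_SIZE=32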
