Step-by-step How To #58

Open · abettati opened this issue Apr 11, 2023 · 18 comments
abettati commented Apr 11, 2023

Hi everyone,
is a step-by-step how to guide on how to set up the environment and run a simple example available anywhere?
If not, I would like to try and build one.

abettati changed the title from "Step by step how-to" to "Step-by-step How To" on Apr 11, 2023
sakundu (Collaborator) commented Apr 12, 2023

Hi @abettati,

Are you planning to run CircuitTraining or the SP&R flow scripts we provide? For our SP&R flow scripts, you will need commercial tools such as Innovus and Genus. If you already have access to these tools and plan to run SP&R, you can follow these steps to generate SP&R data for Ariane on NG45:

cd ./Flows/NanGate45/ariane133/run/
cp ../scripts/cadence flow1
cd flow1
./run.sh
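
In case it helps, here is a quick sanity check to run before launching: a minimal sketch assuming the Cadence binaries are invoked as innovus and genus (adjust the names to your installation).

# Verify the Cadence tools are on PATH before starting the flow.
for tool in innovus genus; do
  command -v "$tool" >/dev/null 2>&1 || { echo "ERROR: $tool not found on PATH" >&2; exit 1; }
done
echo "Cadence tools found; OK to run ./run.sh"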

Thanks,
Sayak

abettati (Author) commented:

As a first experiment I would like to try both CT and your open implementation of the closed CT API (e.g., the cost function).
Later I may try the Cadence flow as well.

sakundu (Collaborator) commented Apr 13, 2023

Hi @abettati

We use the August 17 version of CT for our experiments; we are working on bringing up the latest version. In the MacroPlacement branch of the CT-Fork we have also added a working version of DREAMPlace.

Use the following steps to run CT:

  • Build the docker image using the script available here; this is supposed to generate a docker image with the tag circuit_training:corepy39.

  • Use the following command to start the docker container (add the --gpus all option if you want to use a GPU):
    docker run -it -v $(pwd):/workspace --workdir /workspace circuit_training:corepy39 bash

  • You are now inside the container, in the /workspace directory.

  • Since you are trying to test Circuit Training, my suggestion is to update these variables in ./run_scripts/ct_setup.sh (for a complete run you can use the default values):

export NUM_ITERATION=2
export NUM_EPISODE_PER_ITERATION=16
export BATCH_SIZE=64
  • Use this command to start the Reverb server, the training job, and the eval job (please update python3 to python3.9):
    ./run_scripts/start_ct_training.sh
  • The above command starts the jobs in a tmux session, so press Ctrl+b and then d to detach (see the tmux note below).
    Now use the following command to start the collect job (again, update python3 to python3.9):
    ./run_scripts/start_ct_training_client.sh
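
A note on tmux, since detaching trips people up the first time. These are standard tmux commands; the exact session name depends on what the script creates.

# List the sessions that the script started.
tmux ls
# Re-attach to one of them (substitute a name from `tmux ls`).
tmux attach -t <session-name>
# From inside a session, detach again with Ctrl+b, then d.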

abettati (Author) commented:

Hi @sakundu,
Thanks for your support!
I am working through your guide, but I am stuck: I can't find the run_scripts folder, even with `find . -type d -name "run_scripts"`.

I also noticed that in my case:

root@14f7e63f9334:/workspace# echo $NUM_ITERATION

root@14f7e63f9334:/workspace# echo $NUM_EPISODE_PER_ITERATION

root@14f7e63f9334:/workspace# echo $BATCH_SIZE

The variables are not defined.
Moreover (though I don't think it is problematic), this command

docker build --pull --no-cache --tag circuit_training:core \
    --build-arg tf_agents_version="${TF_AGENTS_PIP_VERSION}" \
    -f "${REPO_ROOT}"/tools/docker/ubuntu_circuit_training ${REPO_ROOT}/tools/docker/

creates a circuit_training:core image, not corepy39. But my understanding is that this is just a tag naming the image.
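
In case the scripts do expect the corepy39 tag, re-tagging is a simple workaround; a sketch, assuming the core image contents are what the corepy39 tag should point at:

# Add the corepy39 tag to the existing core image (both tags then refer to the same image).
docker tag circuit_training:core circuit_training:corepy39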

sakundu (Collaborator) commented Apr 14, 2023

Hi @abettati,
Please make sure that you have completed the following steps:

  1. Please clone the MacroPlacement branch from this CT-Fork.
    git clone --branch MacroPlacement https://github.com/sakundu/circuit_training.git
  2. cd ./circuit_training
  3. sed -i 's@^\spython3 @python3.9 @' ./run_scripts/*.sh  # This updates python3 to python3.9
  4. cd ./tools
  5. sh ./bootstrap_dreamplace_build.sh  # This will create a docker image with the tag circuit_training:corepy39
  6. cd ..
  7. docker run -it -v $(pwd):/workspace --workdir /workspace circuit_training:corepy39 bash
  8. If you just want to test the setup, update the following variables in ./run_scripts/ct_setup.sh:
export NUM_ITERATION=2
export NUM_EPISODE_PER_ITERATION=16
export BATCH_SIZE=64
  9. Use this command to start the Reverb server, the training job, and the eval job (this script first sources ./run_scripts/ct_setup.sh and then launches the three jobs):
    ./run_scripts/start_ct_training.sh
  10. The above command starts the jobs in a tmux session, so press Ctrl+b and then d to detach.
    Now use the following command to start the collect job (steps 1 through 7 are also collected into a single sketch below):
    ./run_scripts/start_ct_training_client.sh
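
For convenience, here is the same sequence (steps 1 through 7) as a single script; a sketch that assumes git and docker are installed and that the defaults above are unchanged:

#!/bin/bash
# Clone the MacroPlacement branch of the CT fork.
git clone --branch MacroPlacement https://github.com/sakundu/circuit_training.git
cd ./circuit_training
# Switch the run scripts from python3 to python3.9.
sed -i 's@^\spython3 @python3.9 @' ./run_scripts/*.sh
# Build the docker image tagged circuit_training:corepy39.
(cd ./tools && sh ./bootstrap_dreamplace_build.sh)
# Enter the container with the repo mounted at /workspace; run steps 8-10 inside it.
docker run -it -v $(pwd):/workspace --workdir /workspace circuit_training:corepy39 bash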

If this does not work please let me know.

Thanks,
Sayak

abettati (Author) commented:

Hi @sakundu
thanks for the instructions. I am trying them, but I think I need to ask for a more powerful machine.
In the meantime, do you have an estimate of how much space the docker image can occupy?

After following your instructions (up to the point where the VM crashes), I got:

REPOSITORY         TAG                IMAGE ID       CREATED             SIZE
circuit_training   dreamplace_build   2c08fe667b86   48 minutes ago      15.9GB
circuit_training   ci                 1eca9db9be9f   About an hour ago   1.38GB

sakundu (Collaborator) commented Apr 17, 2023

Hi @abettati,

The final docker image circuit_training:corepy39 is supposed to be ~18GB.

Remove the existing docker images and try again. You may try adding the commands below here:

docker rmi circuit_training:dreamplace_build
docker rmi circuit_training:ci

And rerun the build script.
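
To double-check what is left before rebuilding (plain docker CLI, nothing project-specific):

# List any remaining circuit_training images; ideally none before the rebuild.
docker images circuit_training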

Thanks,
Sayak

sharkoo7 commented Jun 3, 2023

Hi @sakundu,
I followed your instructions step by step here.
However, I came across the following error when running the train job (the other jobs work fine).
Could you help me with this issue? It would be very helpful.
Thanks a lot!

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/circuit_training/learning/train_ppo.py", line 133, in <module>
    multiprocessing.handle_main(functools.partial(app.run, main))
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/system/default/multiprocessing_core.py", line 77, in handle_main
    return app.run(parent_main_fn, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/workspace/circuit_training/learning/train_ppo.py", line 111, in main
    train_ppo_lib.train(
  File "/workspace/circuit_training/learning/train_ppo_lib.py", line 123, in train
    save_model_trigger = triggers.PolicySavedModelTrigger(
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/train/triggers.py", line 127, in __init__
    self._raw_policy_saver = self._build_saver(raw_policy, batch_size,
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/train/triggers.py", line 168, in _build_saver
    saver = policy_saver.PolicySaver(
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/policies/policy_saver.py", line 383, in __init__
    polymorphic_action_fn.get_concrete_function(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 1239, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 1219, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 785, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 2523, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 2725, in _maybe_define_function
    func_cache_key, _ = function_context.make_cache_key(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function_context.py", line 131, in make_cache_key
    args_signature = trace_type.from_object(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in from_object
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in <genexpr>
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in from_object
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 111, in <genexpr>
    return default_types.Tuple(*(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 109, in from_object
    named_tuple_type, tuple(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 109, in <genexpr>
    named_tuple_type, tuple(from_object(c, context) for c in obj))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 96, in from_object
    if isinstance(obj, trace.SupportsTracingProtocol):
  File "/usr/local/lib/python3.9/dist-packages/typing_extensions.py", line 612, in __instancecheck__
    val = inspect.getattr_static(instance, attr)
  File "/usr/lib/python3.9/inspect.py", line 1624, in getattr_static
    instance_result = _check_instance(obj, attr)
  File "/usr/lib/python3.9/inspect.py", line 1571, in _check_instance
    instance_dict = object.__getattribute__(obj, "__dict__")
TypeError: this __dict__ descriptor does not support '_DictWrapper' objects

sakundu (Collaborator) commented Jun 3, 2023

Hi @sharkoo7
This appears to be a type error that I haven't encountered before. From the error message, it seems like you're using Python 3.9. I experienced a similar TypeError when I was using Python 3.7. Could you please confirm that you're using Python 3.9 in your script? Could you also share the train.log file found in the log directory?

sharkoo7 commented Jun 3, 2023

Thank you for your reply!
Yes, I am using Python 3.9 in the script. Here is the train.log file:
train.log

sakundu (Collaborator) commented Jun 3, 2023

From the log file, it appears that you're attempting to run the train job on a CPU. If that's the case, could you please remove the referenced line from your code and try running it again? If the problem persists, please share the updated log file for further troubleshooting.

sharkoo7 commented Jun 4, 2023

Actually, I wanted to run the train job on a GPU, so I kept --use_gpu.
This time I used the command docker run -it -v $(pwd):/workspace --workdir /workspace --gpus all circuit_training:corepy39 bash to start docker, but I still got this error: train.log

sakundu (Collaborator) commented Jun 4, 2023

I'm curious whether you've tried running it on the CPU instead. If you haven't, you can do so by launching the Docker container without the --gpus all option.

Could you also please check if nvidia-smi is functioning within the Docker container? Additionally, make sure your driver version is up-to-date with the CUDA version.
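
One quick way to test this, assuming the NVIDIA container toolkit is installed on the host (in most setups it injects nvidia-smi into the container):

# If this prints the GPU table, the container can see the GPU.
docker run --rm --gpus all circuit_training:corepy39 nvidia-smi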

sakundu (Collaborator) commented Jun 4, 2023

It seems that your driver and other versions are up to date. I'd still like you to try running it on the CPU first. If that works, we can at least ascertain that there's an issue with the GPU script. I've tested it on my end, and it's working perfectly.

sharkoo7 commented Jun 5, 2023

Here is the train.log when I launch the docker container without the --gpus all option and remove the --use_gpu option:
train.log
When I launched the docker container with the --gpus all option, nvidia-smi showed:
[screenshot of nvidia-smi output]
Also, I searched for this problem, and some people say it is related to the wrapt version. Could you please tell me your wrapt version? Mine is 1.15.0.

sakundu (Collaborator) commented Jun 5, 2023

My wrapt version is 1.14.1.
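
If the wrapt version difference turns out to be the culprit, pinning it inside the container is a cheap experiment; a sketch, not a confirmed fix:

# Downgrade wrapt to the version that works on my side, then rerun the train job.
python3.9 -m pip install wrapt==1.14.1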

luarss commented Jun 6, 2023

Hello @sakundu,

May I check what the minimum hardware configuration is for running the scripts? I have an RTX A4000 card for both the collect and train jobs, and it seems to be failing. I wonder if you have any recommendations.

sakundu (Collaborator) commented Jun 6, 2023

Hi @luarss
The training job is the only one that requires a GPU; all the other jobs can run on the CPU. If your training job is failing because it runs out of GPU memory, I would recommend reducing both the number of iterations and the batch size. This should fix the OOM problem.
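
As a sketch, reusing the variables from ./run_scripts/ct_setup.sh mentioned earlier (the values are illustrative, not tuned):

# Smaller test configuration to reduce GPU memory pressure.
export NUM_ITERATION=2
export NUM_EPISODE_PER_ITERATION=16
export BATCH_SIZE=32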
