
check total cpu cores for the cpu_num var run_baseline_parallel.py #52

Closed
wants to merge 2 commits

Conversation

techmore

The code now checks how many CPUs the dev machine has and sets cpu_num to that amount.
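A minimal sketch of what the change amounts to, assuming the variable is the num_cpu used elsewhere in run_baseline_parallel.py:

```python
import os

# Size the number of parallel environments to the machine instead of a
# hard-coded value. os.cpu_count() can return None on some platforms,
# so fall back to 1.
num_cpu = os.cpu_count() or 1
```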
@techmore techmore changed the title Update run_baseline_parallel.py check total cpu cores for the cpu_num var run_baseline_parallel.py Oct 16, 2023
@PWhiddy
Owner

PWhiddy commented Oct 16, 2023

Hm, I'm torn on this. This is much better for booting up and running on a typical machine, but it also means the original results won't be reproducible without changing the code. Perhaps if I have the time (or someone else does) to verify that good results can still be obtained when running with far fewer environments, then I'd happily make this the default.

@techmore
Author

That's a great point. I changed it to an --auto flag. I'll note that I get about a 5x improvement on an M1 Pro with this change, measured by comparing the "Steps" count at the 1, 5, and 10 minute marks. Here are the 5 minute marks. I don't know whether that's an accurate method of measuring speed improvements, though.

[Screenshots: "Steps" count at the 5 minute mark, standard vs. modified]
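For reference, a rough sketch of what the --auto flag could look like; the actual flag handling in the PR may differ, and the fallback default shown is a placeholder:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Opt in to machine-sized num_cpu; without the flag, keep the repo's
# original hard-coded value so published results stay reproducible.
parser.add_argument("--auto", action="store_true",
                    help="set num_cpu from os.cpu_count() instead of the default")
args = parser.parse_args()

DEFAULT_NUM_CPU = 44  # placeholder for the repo's original hard-coded value
num_cpu = (os.cpu_count() or 1) if args.auto else DEFAULT_NUM_CPU
```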

@RussellMaggs

RussellMaggs commented Oct 17, 2023

I am happy to run an experiment to see how much a reduced number of environments impacts the quality of the results.

I did some quick performance tests by drastically reducing the episode length etc. so the test could run in a reasonable timeframe. Using time.perf_counter() I got the following results:

num_cpu = 8 Finished in 72.54381019999983 seconds
num_cpu = 16 Finished in 90.48242070000015 seconds
num_cpu = 32 Finished in 201.33944800000063 seconds

This is on a 16-core / 32-thread CPU (os.cpu_count() seems to count threads, not physical cores).
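Roughly the kind of timing harness this describes (the shortened training run itself is elided):

```python
import time

start = time.perf_counter()
# ... build the num_cpu environments and run the shortened training here ...
elapsed = time.perf_counter() - start
print(f"Finished in {elapsed} seconds")
```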

@PWhiddy Do you have a preference on how many agents I use for the experiment?

@PWhiddy
Owner

PWhiddy commented Oct 19, 2023

@RussellMaggs
16 envs is probably a reasonable number. I'm also testing something similar.

@RussellMaggs

Well, I started a run with 16 agents a couple of days ago and am currently 111 iterations in. So far the performance is not great.

However, it makes sense to me that learning would be much slower with such a reduced agent count, so I will give it more time before giving up.

[Screenshot: training progress plot at 111 iterations]

@RussellMaggs

It seems like waiting another couple of hours was worth it, as improvement is finally happening:

[Screenshot: training progress plot showing improvement]

@techmore
Author

techmore commented Oct 20, 2023

When I apply patch #1 I am now able to use num_cpu = os.cpu_count() * 6 and still have a little headroom. *7 doesn't work well for me. This is on an M1 Pro.
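A sketch of the setting described above; the multiplier that works will depend on the machine and on patch #1:

```python
import os

# Oversubscribe: run several emulator environments per logical CPU, since
# each one spends part of its time waiting rather than saturating a core.
num_cpu = (os.cpu_count() or 1) * 6
```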

@RussellMaggs

Wow, if I got similar results that would be a huge improvement! I will give that a try when I get time.

So far with 16 agents the results are not impressive, and I think I will stop running it for now unless someone wants me to run it for longer:

[Screenshot: training progress plot with 16 agents]

@techmore
Author

@RussellMaggs Dude... how did you get visualizations working....
I really think you should consider stopping and merging the #1 patch into your code. There is also some checkpoint-restore code if you want it, or you can go into run_baseline_parallel.py and manually write in your checkpoint, or just merge the code.

The speed improvements mean you will catch up very quickly, and you can of course train many more simultaneous agents to give everyone better training metrics.

@RussellMaggs

I pretty much hacked at it until it worked and still have a poor understanding of how it works. If I have some free time, I will see if I can give the notebook some polish so that it works better.

I think the biggest reason plot_runs throws errors most of the time is the group_runs=44 argument. My current theory is that group_runs must divide evenly into the total number of iterations you have done, so changing the number up or down to a different grouping should work.
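A small illustration of that theory, assuming plot_runs groups per-iteration values by reshaping (the real notebook code may work differently; this helper is hypothetical):

```python
import numpy as np

def group_iterations(values, group_runs):
    """Average per-iteration values in blocks of group_runs (hypothetical helper).

    Reshaping only works when group_runs divides the iteration count evenly,
    which would explain the errors when the numbers don't line up.
    """
    values = np.asarray(values)
    if len(values) % group_runs != 0:
        raise ValueError(
            f"{len(values)} iterations is not divisible by group_runs={group_runs}"
        )
    return values.reshape(-1, group_runs).mean(axis=1)
```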

On to your second suggestion: I am happy to merge that in and spend some time running it, but I am pretty busy for the next couple of days so it will have to wait until later.

@techmore
Author

I believe this is no longer an issue with the current optimizations in #98.

@techmore techmore closed this Oct 21, 2023
@techmore techmore deleted the patch-4 branch October 23, 2023 11:52
@techmore techmore restored the patch-4 branch October 23, 2023 11:52