
check total cpu cores for the cpu_num var run_baseline_parallel.py #52

Closed
wants to merge 2 commits

Conversation

techmore

The code now checks how many CPUs the dev machine has and sets cpu_num to that amount.
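A minimal sketch of what the change amounts to, assuming the variable is the num_cpu used elsewhere in run_baseline_parallel.py:

```python
import os

# Size the number of parallel environments to the machine instead of a
# hard-coded value. os.cpu_count() can return None on some platforms,
# so fall back to 1.
num_cpu = os.cpu_count() or 1
```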
@techmore techmore changed the title Update run_baseline_parallel.py check total cpu cores for the cpu_num var run_baseline_parallel.py Oct 16, 2023
@PWhiddy
Owner

PWhiddy commented Oct 16, 2023

Hm, I'm torn on this. This is much better for booting up and running on a typical machine, but it also means the original results won't be reproducible without changing the code. Perhaps if I have the time (or someone else does) to verify that good results can still be obtained when running with far fewer environments, then I'd happily make this the default.

@techmore
Author

That's a great point. I changed it to an --auto flag. I'll note that I get about a 5x improvement on an M1 Pro with this change, measured by comparing the "Steps" count at the 1, 5, and 10 minute marks. Here are the 5 minute marks. I don't know whether that's an accurate method of measuring speed improvements, though.

[Screenshots: "Steps" count at the 5 minute mark, standard vs. modified]
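For reference, a rough sketch of what the --auto flag could look like; the actual flag handling in the PR may differ, and the fallback default shown is a placeholder:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Opt in to machine-sized num_cpu; without the flag, keep the repo's
# original hard-coded value so published results stay reproducible.
parser.add_argument("--auto", action="store_true",
                    help="set num_cpu from os.cpu_count() instead of the default")
args = parser.parse_args()

DEFAULT_NUM_CPU = 44  # placeholder for the repo's original hard-coded value
num_cpu = (os.cpu_count() or 1) if args.auto else DEFAULT_NUM_CPU
```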

@RussellMaggs

RussellMaggs commented Oct 17, 2023

I am happy to run an experiment to see how much a reduced number of environments impacts the quality of the results.

I did some quick performance tests by drastically reducing the episode length etc. so the test could run in a reasonable timeframe. Using time.perf_counter() I got the following results:

num_cpu = 8 Finished in 72.54381019999983 seconds
num_cpu = 16 Finished in 90.48242070000015 seconds
num_cpu = 32 Finished in 201.33944800000063 seconds

This is on a 16-core / 32-thread CPU (os.cpu_count() seems to count threads, not physical cores).
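Roughly the kind of timing harness this describes (the shortened training run itself is elided):

```python
import time

start = time.perf_counter()
# ... build the num_cpu environments and run the shortened training here ...
elapsed = time.perf_counter() - start
print(f"Finished in {elapsed} seconds")
```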

@PWhiddy Do you have a preference on how many agents I use for the experiment?

@PWhiddy
Owner

PWhiddy commented Oct 19, 2023

@RussellMaggs
16 envs is probably a reasonable number. I'm also testing something similar.

@RussellMaggs

Well, I started a run with 16 agents a couple of days ago and am currently 111 iterations in. So far the performance is not great.

However, it makes sense to me that learning would be much slower with such a reduced agent count, so I will give it more time before giving up.

[Screenshot: training progress plot at 111 iterations]

@RussellMaggs

It seems like waiting another couple of hours was worth it, as improvement is finally happening:

[Screenshot: training progress plot showing improvement]

@techmore
Author

techmore commented Oct 20, 2023

When I apply patch #1 I am now able to use num_cpu = os.cpu_count() * 6 and still have a little headroom. *7 doesn't work well for me. This is on an M1 Pro.
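A sketch of the setting described above; the multiplier that works will depend on the machine and on patch #1:

```python
import os

# Oversubscribe: run several emulator environments per logical CPU, since
# each one spends part of its time waiting rather than saturating a core.
num_cpu = (os.cpu_count() or 1) * 6
```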

@RussellMaggs

Wow, if I got similar results that would be a huge improvement! I will give that a try when I get time.

So far with 16 agents the results are not impressive, and I think I will stop running it for now unless someone wants me to run it for longer:

[Screenshot: training progress plot with 16 agents]

@techmore
Author

@RussellMaggs Dude... how did you get visualizations working....
I really think you should consider stopping and merging the #1 patch into your code. There is also some checkpoint-restore code if you want it, or you can go into run_baseline_parallel.py and manually write in your checkpoint, or just merge the code.

The speed improvements mean you will catch up very quickly, and you can of course train many more simultaneous agents to give everyone better training metrics.

@RussellMaggs

I pretty much hacked at it until it worked and still have a poor understanding of how it works. If I have some free time, I will see if I can give the notebook some polish so that it works better.

I think the biggest reason plot_runs throws errors most of the time is the group_runs=44 argument. My current theory is that group_runs must divide evenly into the total number of iterations you have done, so changing the number up or down to a different grouping should work.
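A small illustration of that theory, assuming plot_runs groups per-iteration values by reshaping (the real notebook code may work differently; this helper is hypothetical):

```python
import numpy as np

def group_iterations(values, group_runs):
    """Average per-iteration values in blocks of group_runs (hypothetical helper).

    Reshaping only works when group_runs divides the iteration count evenly,
    which would explain the errors when the numbers don't line up.
    """
    values = np.asarray(values)
    if len(values) % group_runs != 0:
        raise ValueError(
            f"{len(values)} iterations is not divisible by group_runs={group_runs}"
        )
    return values.reshape(-1, group_runs).mean(axis=1)
```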

On to your second suggestion: I am happy to merge that in and spend some time running it, but I am pretty busy for the next couple of days so it will have to wait until later.

@techmore
Author

I believe this is no longer an issue with the current optimizations in #98.

@techmore techmore closed this Oct 21, 2023
@techmore techmore deleted the patch-4 branch October 23, 2023 11:52
@techmore techmore restored the patch-4 branch October 23, 2023 11:52