
Create a new README and patch bugs from rewrite #90


Open: wants to merge 39 commits into base: dev
Conversation

HyperCodec
Owner

No description provided.

@HyperCodec HyperCodec added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 4, 2025
@HyperCodec HyperCodec self-assigned this Feb 4, 2025
@HyperCodec HyperCodec linked an issue Feb 4, 2025 that may be closed by this pull request
@HyperCodec
Owner Author

I still need to test the example and maybe polish documentation a bit before merging.

@HyperCodec HyperCodec marked this pull request as ready for review February 4, 2025 17:23
@HyperCodec
Owner Author

Weird, I'm encountering a lot of bugs while testing with an example that don't appear in the actual unit tests. I'm not sure why this is.

@HyperCodec
Owner Author

HyperCodec commented Feb 4, 2025

it seems to always hang after the first 2-3 generations. it happens in both crossover and division reproduction.

@HyperCodec
Owner Author

printing from the yield_now, it says Idle, meaning there are no more tasks and this is probably an issue with input_count. if it is an input_count issue, it's most likely related to RandomlyMutable, since that's used in both types of reproduction.
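If input_count is the culprit, a quick consistency check after each mutation would confirm it: every neuron's cached input_count should equal the number of connections actually pointing at it. This is a sketch with illustrative types (`ToyNeuron` and `input_counts_consistent` are made up for the example, not the crate's API):

```rust
// Toy stand-in for a neuron: a cached input count plus outgoing connections.
struct ToyNeuron {
    input_count: usize,
    outputs: Vec<usize>, // indices of downstream neurons
}

// Recomputes the real in-degree of every neuron from the outgoing edges and
// compares it against the cached input_count.
fn input_counts_consistent(neurons: &[ToyNeuron]) -> bool {
    let mut actual = vec![0usize; neurons.len()];
    for n in neurons {
        for &dst in &n.outputs {
            actual[dst] += 1;
        }
    }
    neurons
        .iter()
        .zip(&actual)
        .all(|(n, &a)| n.input_count == a)
}

fn main() {
    // 0 -> 1, 0 -> 2, 1 -> 2; neuron 2 claims only 1 input, so this fails.
    let net = vec![
        ToyNeuron { input_count: 0, outputs: vec![1, 2] },
        ToyNeuron { input_count: 1, outputs: vec![2] },
        ToyNeuron { input_count: 1, outputs: vec![] }, // should be 2
    ];
    assert!(!input_counts_consistent(&net));
}
```

Running a check like this in a debug_assert! after every mutation would pinpoint exactly which mutation desyncs the count.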

@viw-ty

viw-ty commented Feb 4, 2025

So I've just stumbled onto this crate, and I'm not even sure how you're getting past the first generation; I just get errors coming from here. I'll try to get it to work.

stack backtrace:
   // unwind
   3: rand::rng::Rng::gen_range
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/rand-0.8.5/src/rng.rs:134:9
   4: neat::neuralnet::NeuralNetwork<_,_>::random_location
             at /home/virtio/.cargo/git/checkouts/neat-ea8e987889abd0bb/1a8896b/src/neuralnet.rs:253:41
   5: neat::neuralnet::NeuralNetwork<_,_>::random_location_in_scope
             at /home/virtio/.cargo/git/checkouts/neat-ea8e987889abd0bb/1a8896b/src/neuralnet.rs:265:19
   6: <neat::neuralnet::NeuralNetwork<_,_> as genetic_rs_common::builtin::RandomlyMutable>::mutate
             at /home/virtio/.cargo/git/checkouts/neat-ea8e987889abd0bb/1a8896b/src/neuralnet.rs:381:22
   7: <neat::neuralnet::NeuralNetwork<_,_> as genetic_rs_common::builtin::DivisionReproduction>::divide
             at /home/virtio/.cargo/git/checkouts/neat-ea8e987889abd0bb/1a8896b/src/neuralnet.rs:417:13
   8: <n::Agent as genetic_rs_common::builtin::DivisionReproduction>::divide
             at ./src/main.rs:9:45
   9: genetic_rs_common::builtin::next_gen::division_pruning_nextgen
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/genetic-rs-common-0.5.4/src/builtin.rs:111:27
  10: core::ops::function::Fn::call
             at /home/virtio/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:79:5
  11: <F as genetic_rs_common::NextgenFn<G>>::next_gen
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/genetic-rs-common-0.5.4/src/lib.rs:41:9
  12: genetic_rs_common::GeneticSim<F,NG,G>::next_generation::{{closure}}
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/genetic-rs-common-0.5.4/src/lib.rs:198:13
  13: replace_with::replace_with::{{closure}}
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/replace_with-0.1.7/src/lib.rs:156:31
  14: replace_with::on_unwind
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/replace_with-0.1.7/src/lib.rs:105:10
  15: replace_with::replace_with
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/replace_with-0.1.7/src/lib.rs:156:13
  16: replace_with::replace_with_or_abort
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/replace_with-0.1.7/src/lib.rs:244:2
  17: genetic_rs_common::GeneticSim<F,NG,G>::next_generation
             at /home/virtio/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/genetic-rs-common-0.5.4/src/lib.rs:189:9
  18: n::main
             at ./src/main.rs:65:5
  19: core::ops::function::FnOnce::call_once
             at /home/virtio/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
use std::ops::Deref;

use neat::*;

const TESTS: [(usize, usize, usize); 4] = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)];
const INPUTS: usize = 2;
const OUTPUTS: usize = 2;

#[derive(Clone, PartialEq, RandomlyMutable, DivisionReproduction)]
struct Agent {
    net: NeuralNetwork<INPUTS, OUTPUTS>,
}

impl Prunable for Agent {}

impl GenerateRandom for Agent {
    fn gen_random(rng: &mut impl rand::Rng) -> Self {
        Self {
            net: NeuralNetwork::new(MutationSettings::default(), rng),
        }
    }
}

impl Deref for Agent {
    type Target = NeuralNetwork<INPUTS, OUTPUTS>;

    fn deref(&self) -> &Self::Target {
        &self.net
    }
}

fn fitness(agent: &Agent) -> f32 {
    let mut fit = 0.0;
    for (a, b, c) in TESTS {
        let prediction = agent.predict([a as f32, b as f32]);
        let correct = max_index(prediction) == c;

        fit += match correct {
            true => 1.0,
            false => -0.1,
        }
    }
    fit
}

fn max_index<T>(v: T) -> usize
where
    T: IntoIterator,
    <T as IntoIterator>::Item: PartialOrd,
{
    v.into_iter()
        .enumerate()
        .reduce(|(ai, av), (bi, bv)| match bv > av {
            true => (bi, bv),
            false => (ai, av),
        })
        .unwrap()
        .0
}

fn main() {
    let mut sim = GeneticSim::new(Vec::gen_random(1000), fitness, division_pruning_nextgen);

    sim.next_generation();
}

@HyperCodec
Owner Author

So I've just stumbled into this crate, and I'm not even sure how you're getting past the first generation, I just get errors coming from here. I'll try to get it to work.


I have it patched already on a codespace, I just discovered more bugs so I didn't end up pushing.

@viw-ty

viw-ty commented Feb 4, 2025

Alright, I'll just figure out the easiest way and stare at it more KEKW. Good job on what is (as far as I know) somehow the best-documented ML crate, by the way.

@HyperCodec
Owner Author

HyperCodec commented Feb 4, 2025

I did end up discovering a bunch of cases within stuff like add_connection where I forgot to update input_count (I had added random mutation before adding the field to neuron).

After patching like 3 of them and printing whenever a mutation occurs in a generation, it passes a lot more generations but eventually still hangs, even if there were no mutations occurring in the generation directly prior to the one where it hangs.

I’ll probably push these patches either tonight or tomorrow morning (my weekdays are super busy and I mainly just work on this in study hall and downtime between classes) so you can take a look.

@viw-ty

viw-ty commented Feb 4, 2025

I have no clue how far you are, but I'm going off and will just abuse this as a sticky note:
After having to learn how to use gdb on the fly, and being an idiot, this is the lock.
[screenshot attached]

@viw-ty

viw-ty commented Feb 4, 2025

oh wait I'm stupid this is literally the lock and I spent god knows how long doing nothing

@HyperCodec
Owner Author

HyperCodec commented Feb 4, 2025

oh wait I'm stupid this is literally the lock and I spent god knows how long doing nothing

yeah spamming println everywhere typically works better than gdb in deadlock/livelock situations. i had to implement a rather complex task management system to handle joins properly without wasting a ton of cpu or spamming locks (which fully deadlock on all threads because they put the entire rayon thread to sleep while locked), and if i really wanted to make it more efficient i'd have to fork rayon, which sounds awful to do. and i have no idea if the system actually works because my unit tests apparently just completely lied to me.

i still have no clue how the unit tests manage to pass but not an example that does something pretty similar.

@viw-ty

viw-ty commented Feb 5, 2025

For now, this seems to fix the lock without any immediately apparent downsides, I'm not sure if I'm missing something important here, but I'll roll with it.

diff --git a/src/neuralnet.rs b/src/neuralnet.rs
index cce0d61..9e9ee3a 100644
--- a/src/neuralnet.rs
+++ b/src/neuralnet.rs
@@ -137,19 +139,17 @@ impl<const I: usize, const O: usize> NeuralNetwork<I, O> {
     fn eval(&self, loc: impl AsRef<NeuronLocation>, cache: Arc<NeuralNetCache<I, O>>) {
         let loc = loc.as_ref();
 
-        if !cache.claim(loc) {
+        if !cache.is_ready(loc) && !cache.claim(loc) {
+            // two things can happen here:
+            // we're trying to get a neuron that's not ready, which
+            // seems to cause deadlocks every time
+            // or
             // some other thread is already
             // waiting to do this task, currently doing it, or done.
             // no need to do it again.
             return;
         }
 
-        while !cache.is_ready(loc) {
-            // essentially spinlocks until the dependency tasks are complete,
-            // while letting this thread do some work on random tasks.
-            rayon::yield_now();
-        }
-
         let val = cache.get(loc);
         let n = self.get_neuron(loc);

@HyperCodec
Owner Author

HyperCodec commented Feb 5, 2025

For now, this seems to fix the lock without any immediately apparent downsides, I'm not sure if I'm missing something important here, but I'll roll with it.

The cache by default is not ready, meaning that this code essentially just returns when it reaches the first hidden layer neuron. The cache.is_ready method only returns true when finished_inputs >= expected_inputs (i.e. all neurons that affect that neuron’s value have finished computing). The cache.claim is used to ensure that only one worker can sit there and spinlock on the value, and kills off all tasks that reach a neuron that’s already being processed. A task that has claimed a neuron needs to wait until that neuron’s value is ready before it tries to operate on it.

Btw the reason this whole system is in place is that rayon::yield_now lets the thread work on other tasks while this task waits for a dependency task to complete (which could be behind it in that thread's queue). If I had simply used an std-provided lock type such as RwLock, a task that reaches a neuron before the rest of the tasks can push their data to it would steal the lock and block those tasks from being able to push the data. Additionally, rayon allocates multiple tasks to each thread, and std lock types force the entire thread to sleep, which freezes all subsequent tasks on that thread. #58 has a decent explanation if you're still confused.
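The claim/ready semantics described above can be sketched with atomics. The field and method names mirror the description (finished_inputs, expected_inputs, claim, is_ready), but the type itself is a stand-in for illustration, not the real NeuralNetCache:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Hypothetical per-neuron bookkeeping, loosely modeling the described cache.
struct NeuronState {
    claimed: AtomicBool,
    finished_inputs: AtomicUsize,
    expected_inputs: usize,
}

impl NeuronState {
    fn new(expected_inputs: usize) -> Self {
        Self {
            claimed: AtomicBool::new(false),
            finished_inputs: AtomicUsize::new(0),
            expected_inputs,
        }
    }

    // `claim` succeeds for exactly one task; every later task sees `false`
    // and bails, so only one worker spin-waits on this neuron.
    fn claim(&self) -> bool {
        self.claimed
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    // Ready once every upstream neuron has pushed its value.
    fn is_ready(&self) -> bool {
        self.finished_inputs.load(Ordering::Acquire) >= self.expected_inputs
    }

    // Called by each upstream task after it finishes computing.
    fn push_input(&self) {
        self.finished_inputs.fetch_add(1, Ordering::AcqRel);
    }
}

fn main() {
    let n = NeuronState::new(2);
    assert!(n.claim());     // first task wins the claim
    assert!(!n.claim());    // every later task is cancelled
    assert!(!n.is_ready()); // no inputs have arrived yet
    n.push_input();
    n.push_input();
    assert!(n.is_ready());  // all expected inputs finished
}
```

With this shape, the deadlock described earlier corresponds to a claimed neuron whose expected_inputs is never reached because some mutation left the count wrong.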

@HyperCodec
Owner Author

HyperCodec commented Feb 5, 2025

I pushed some of my patches if you want to take a look. I drastically reduced the size of the simulation in readme_ex as well, just to get a better view of what's going on without 100 concurrent simulations blasting stdout with text.

@HyperCodec
Owner Author

HyperCodec commented Feb 5, 2025

[screenshot of the run attached]
In this run, it gets through 6 generations before a hang occurs, but there was not a single mutation aside from weights (which shouldn't ever cause a deadlock since it's just multiplication). This means there's probably something wrong with crossover.

@HyperCodec
Owner Author

For now, this seems to fix the lock without any immediately apparent downsides, I'm not sure if I'm missing something important here, but I'll roll with it.


Oh wait maybe you are right. There's no point in a complex claiming and waiting system if I can just make it so the last task to reach the neuron runs on it and then cancel any other ones. Not sure how I didn't think of that.
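A minimal sketch of that last-arriver idea, assuming each upstream task bumps a counter when it pushes its value (illustrative code, not the crate's implementation): whichever task completes the count evaluates the neuron, and every earlier arrival simply returns, so no claiming or spin-waiting is needed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Per-neuron arrival counter for the "last task to reach the neuron runs it"
// scheme. Names are made up for the example.
struct Arrivals {
    finished: AtomicUsize,
    expected: usize,
}

impl Arrivals {
    fn new(expected: usize) -> Self {
        Self {
            finished: AtomicUsize::new(0),
            expected,
        }
    }

    // Returns true for exactly the task whose push completes the set;
    // all earlier arrivers get false and just return.
    fn push_and_check(&self) -> bool {
        self.finished.fetch_add(1, Ordering::AcqRel) + 1 == self.expected
    }
}

fn main() {
    let a = Arrivals::new(3);
    assert!(!a.push_and_check()); // first input: not yet
    assert!(!a.push_and_check()); // second: still waiting
    assert!(a.push_and_check());  // third arriver runs the neuron
}
```

Because fetch_add returns the previous value atomically, exactly one task sees the completing count, even under contention, which is what makes the claim/spin machinery unnecessary.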

@HyperCodec
Owner Author

might want to add the tracing crate to this as a feature or something to help with debugging. i'll create a separate issue for it

@HyperCodec
Owner Author

Without the filter_map in handle_removed, it can sometimes overflow the stack because of cyclic connections. This means the input_count is probably 0 despite there being connections in the network. I'm not sure whether to keep the filter_map, which would hide some random bugs instead of letting them crash (without it, a user calling remove_neuron on a linked neuron would just break things), or drop it.

Perhaps I should have two remove_neuron methods, one public and one private, so that I still get that panic reassurance when I know there shouldn't be any inputs, but the public one is still user-friendly?
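The public/private split could look roughly like this on a toy adjacency-list network (everything here is illustrative, including the _unchecked name; it is not the crate's real API):

```rust
// Toy network: incoming[i] lists the neurons feeding neuron i.
struct ToyNet {
    incoming: Vec<Vec<usize>>,
}

impl ToyNet {
    /// User-facing: detaches every connection touching `target` first,
    /// so it is safe to call on a neuron that still has inputs.
    pub fn remove_neuron(&mut self, target: usize) {
        // drop every connection *from* target
        for inputs in &mut self.incoming {
            inputs.retain(|&src| src != target);
        }
        // drop every connection *into* target
        self.incoming[target].clear();
        self.remove_neuron_unchecked(target);
    }

    /// Internal: assumes the caller already guaranteed there are no inputs,
    /// and panics loudly (the "panic reassurance") if that invariant breaks.
    fn remove_neuron_unchecked(&mut self, target: usize) {
        assert!(
            self.incoming[target].is_empty(),
            "neuron still has inputs"
        );
        self.incoming.remove(target);
        // note: a real implementation would also reindex any stored
        // indices greater than `target` after the removal.
    }
}

fn main() {
    // 0 -> 1, 0 -> 2, 1 -> 2; removing neuron 1 detaches its edges first.
    let mut net = ToyNet {
        incoming: vec![vec![], vec![0], vec![0, 1]],
    };
    net.remove_neuron(1);
    assert_eq!(net.incoming.len(), 2);
    assert_eq!(net.incoming[1], vec![0]); // old neuron 2 keeps its 0-input
}
```

The internal method keeps the crash-early behavior for call sites where no inputs are expected, while the public wrapper does the detach work for users.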

@HyperCodec
Owner Author

Also probably going to branch out and implement #94 just so it's easier to debug this PR.

@HyperCodec
Owner Author

oops for whatever reason my git is on a different branch but still pushed to here smh

@HyperCodec
Owner Author

HyperCodec commented Apr 10, 2025

Considering #83 to help even further with debugging. Tracing adds a lot of context but it's still pretty hard to visualize what's happening. Also it spams a ton of output. Would be better if I could dump all that info somewhere and then render/visualize it.

@HyperCodec HyperCodec changed the title Create a new README and patch some bugs from rewrite Create a new README and patch bugs from rewrite Apr 10, 2025
@HyperCodec
Owner Author

The problem with doing any rendering stuff is that I do most of the work for this crate on a codespace, which can't render windows. Maybe I could output it to an image file, but then it's hard to really browse the spans and such.

@HyperCodec
Owner Author

Working on a pretty major update for genetic-rs so going to delay this again for that. Don't think it should be a major delay bc the overall API is relatively the same. Helps solve issues like #92.

@HyperCodec
Owner Author

HyperCodec commented May 12, 2025

Actually since the changes are relatively small, I could probably just make another PR once this is merged. I'm kind of tired of doing everything in this one branch. It's been open for way too long.

Labels
bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Finish new README
2 participants