About the problems of the buffer #15

Open
pipiPdesu opened this issue Sep 11, 2024 · 6 comments

Comments

@pipiPdesu
Contributor

I found some issues with the buffer that prevent it from working correctly.

  1. The buffer doesn't sort until it's full. This means the buffer ignores init_buffer_losses: it directly selects the first candidate, optimizes it, and puts it back, and it also replaces the last candidate while ignoring the loss. However, this only occurs in the first step. I managed to fix it by sorting every time something is added (see the sketch after this list).

  2. I tried to fix the condense problem for the init buffer. I directly select 20 letters and concatenate them with a space. This approach works for internlm but still fails on the Llama3 tokenizer. I noticed that '!' '?' '.' are more likely to be condensed, so I removed these three characters, and then it works properly with the Llama3 tokenizer XD. Note that this may decrease attack performance, and I have only tested on these two models' tokenizers (see the sketch after this list).

  3. I'm quite confused about the buffer strategy. As mentioned here, it is no different from simply recording the best loss and reverting to it when the suffix fails to obtain a better loss. I've noticed that neither ACG nor QCG mentions the pop operation, so I'm quite curious about how it works.
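For 1, the fix I have in mind looks roughly like this (just a sketch; the class and method names are illustrative and don't exactly match nanoGCG's actual buffer):

```python
# Keep the buffer sorted by loss on every add, so init_buffer_losses are
# respected even before the buffer fills up.
class SortedAttackBuffer:
    def __init__(self, size: int):
        self.size = size
        self.buffer = []  # list of (loss, optim_ids) pairs, kept sorted by loss

    def add(self, loss: float, optim_ids) -> None:
        if len(self.buffer) < self.size:
            self.buffer.append((loss, optim_ids))
        else:
            # replace the current worst candidate instead of a fixed slot
            self.buffer[-1] = (loss, optim_ids)
        # sort on every insertion, not only once the buffer is full
        self.buffer.sort(key=lambda x: x[0])

    def get_best_ids(self):
        return self.buffer[0][1]

    def get_highest_loss(self) -> float:
        return self.buffer[-1][0]
```

For 2, the init-string construction is roughly the following (the exact character pool and the round-trip check are my guesses at a fix, not existing nanoGCG code):

```python
import random
import string

from transformers import AutoTokenizer

def make_init_string(tokenizer, n_chars: int = 20, seed: int = 0) -> str:
    rng = random.Random(seed)
    # '!', '?' and '.' are the characters I found most likely to get condensed
    # by the Llama3 tokenizer, so drop them from the candidate pool
    pool = [c for c in string.ascii_letters + string.punctuation if c not in "!?."]
    init = " ".join(rng.choice(pool) for _ in range(n_chars))
    # round-trip check: decoding and re-encoding should reproduce the same ids,
    # otherwise the string gets condensed once it re-enters the tokenizer
    ids = tokenizer(init, add_special_tokens=False).input_ids
    reencoded = tokenizer(tokenizer.decode(ids), add_special_tokens=False).input_ids
    if reencoded != ids:
        print("warning: init string does not round-trip cleanly")
    return init

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(make_init_string(tokenizer))
```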

Do you have any ideas about 2 and 3? Thanks!

@justinwangx
Collaborator

sorry for the delay on my end - thank you for digging into this!

  1. do you mean that removing '!' '?' '.' from the initialization fixes the issue without concatenating letters with a space for Llama3? if so, we can just do this. even if these characters allow for slightly better optimizations, everything is lost if the optimized sequence gets condensed to something entirely different.

  2. do you mean I-GCG? i'm not aware of a QCG. it looks like ACG does mention the pop operation. the ACG blog mentions that their algorithm tends to explore the same candidate in the buffer, which is the same problem we're having. but interestingly, they report that buffer_size=16 still allows for exploration. do you want to try experimenting with randomly selecting an attack from the buffer at each iteration rather than choosing the one with the lowest loss? this will definitely allow for exploration, and if we limit batch size i'm curious if we can retain performance.
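
something like this is what i have in mind for the selection step (purely illustrative -- these names don't match the actual nanoGCG internals):

```python
import random

def select_candidate(buffer, explore: bool = True):
    """buffer: list of (loss, optim_ids) pairs kept sorted by loss."""
    if explore:
        # uniform sampling over the buffer -- a larger buffer now actually
        # changes what gets optimized next
        return random.choice(buffer)
    # greedy: always the lowest-loss candidate (current behavior)
    return buffer[0]
```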

would be great if you could open a PR for 1, either now or after we've resolved 2 and 3

@pipiPdesu
Contributor Author

I've made PRs for 1 and 2. For 3, my bad for the confusion on QCG: I was referring to GCQ, which appears in the references (I'll check out I-GCG later).

As for the buffer: if we only optimize the minimum-loss candidate and do not care about the other candidates, there is no difference between a min-heap and simply recording the minimum value, so the buffer size has no impact. Randomly selecting a candidate is a good idea, and I'd like to try experimenting with it.
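
To make that concrete, here is a toy illustration (not the real optimization loop) of why greedy selection makes the buffer equivalent to tracking a single best candidate:

```python
import math

def run_with_buffer(step_fn, num_steps, buffer_size):
    buffer = [(math.inf, "init")]
    for _ in range(num_steps):
        loss, cand = buffer[0]            # always optimize the lowest-loss candidate
        buffer.append(step_fn(cand))      # one GCG step returns (new_loss, new_cand)
        buffer.sort(key=lambda x: x[0])
        buffer = buffer[:buffer_size]     # drop the worst
    return buffer[0]

def run_best_only(step_fn, num_steps):
    best = (math.inf, "init")
    for _ in range(num_steps):
        new = step_fn(best[1])
        best = min(best, new, key=lambda x: x[0])  # revert if the new loss is worse
    return best

# For any deterministic step_fn, both loops hand step_fn the same lowest-loss
# candidate at every iteration, so buffer_size never changes the trajectory.
```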

By the way, do you have any baseline data for the original GCG and nanoGCG? I've run some tests on AdvBench using nanoGCG's default parameters and noticed that its performance isn't as strong as I anticipated. I believe that even with num_steps limited to 250, nanoGCG should perform better.

@justinwangx
Collaborator

justinwangx commented Sep 20, 2024

ah, i see -- curious to see how those experiments go. the buffer update in GCQ would be different, i don't think we want to introduce a proxy model for computing proxy losses...

what metric are you using to judge performance? there shouldn't be a drastic difference in ASR. on llama-3, nanoGCG with the default settings does an iteration in ~2 seconds, while the HarmBench implementation takes ~3 seconds with the same settings (on an 80GB A100). the llm-attacks implementation is close to the HarmBench implementation in terms of speed, iirc

@pipiPdesu
Contributor Author

Thanks for your reply!

Apologies for the ambiguity. Regarding the buffer, I was referring to its implementation: as described above, the current buffer mechanism renders the buffer size somewhat meaningless.

In terms of performance, I mean ASR. I evaluated the first 14 harmful behaviors from AdvBench on llama2-7b-chat-hf with a system prompt added, and found that only 2 reached early stopping within 250 steps. You can refer to my experimental code, logs, and results here. I don't think this matches GCG's original performance, so I'm curious whether there is an issue with my experimental setup or something else went wrong.
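
For reference, my runs look roughly like this (the csv path and system prompt are placeholders from my script, and I'm assuming the nanogcg.run / GCGConfig interface shown in the README, plus that run accepts a chat-style message list):

```python
import csv

import torch
import nanogcg
from nanogcg import GCGConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

config = GCGConfig(num_steps=250, seed=42)  # otherwise default parameters

SYSTEM_PROMPT = "..."  # placeholder: the default llama-2 system prompt

with open("harmful_behaviors.csv") as f:   # AdvBench behaviors (goal, target columns)
    behaviors = list(csv.DictReader(f))[:14]

for row in behaviors:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row["goal"]},
    ]
    result = nanogcg.run(model, tokenizer, messages, row["target"], config)
    print(row["goal"], result.best_loss)
```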

After resolving this issue, we can use the early stopping steps as the metric to evaluate the effectiveness of various new features.

@justinwangx
Collaborator

yes, random selection from the buffer should make buffer size meaningful.

i don't think this is a good sanity-check, actually. llama is more difficult to jailbreak -- i recommend running 500 steps by default, with early stopping. note that, since nanoGCG just runs GCG by default, there shouldn't be a difference in jailbreaking capability between default nanoGCG and the original llm-attacks implementation (the relevant performance metric that differs is speed, e.g. seconds per GCG iteration). i wouldn't be surprised if the llm-attacks implementation also gets 2 / 14 to the same early stop loss when running for 250 steps.

@pipiPdesu
Contributor Author

pipiPdesu commented Sep 28, 2024

I reran the experiment with 1000 steps and found that only 3 out of 15 reached early stopping; the log is here.

I think there might be an issue somewhere, maybe related to #20 or #22. I also found some issues with HarmBench, so I think segment embedding might not be a reasonable approach. I plan to run some ablation experiments to identify the issue.
