Add dan, ucar, translation, base64, self-refine attacks #2

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

NickoJo
Collaborator

@NickoJo NickoJo commented Sep 8, 2024

Added/localized the DAN, UCAR, and Translation jailbreak attacks.


github-actions bot commented Sep 8, 2024

🧪 Test coverage: 0.00%

Code Coverage Summary

Filename                                          Stmts    Miss  Cover    Missing
----------------------------------------------  -------  ------  -------  ---------
llamator/hello.py                                     2       2  0.00%    4-6
llamator/initial_validation.py                       30      30  0.00%    1-92
llamator/main.py                                     34      34  0.00%    2-104
llamator/ps_logging.py                               15      15  0.00%    1-37
llamator/attack_provider/attack_loader.py             1       1  0.00%    1
llamator/attack_provider/attack_registry.py          25      25  0.00%    1-76
llamator/attack_provider/run_tests.py                91      91  0.00%    1-363
llamator/attack_provider/test_base.py                52      52  0.00%    1-111
llamator/attack_provider/util.py                     27      27  0.00%    1-82
llamator/attack_provider/work_progress_pool.py       80      80  0.00%    1-135
llamator/attacks/aim.py                              40      40  0.00%    1-120
llamator/attacks/base64_injection.py                 64      64  0.00%    1-144
llamator/attacks/complimentary_transition.py         39      39  0.00%    1-109
llamator/attacks/dan.py                              56      56  0.00%    1-138
llamator/attacks/dynamic_test.py                     83      83  0.00%    1-244
llamator/attacks/ethical_compliance.py               40      40  0.00%    1-108
llamator/attacks/harmful_behavior.py                 43      43  0.00%    1-98
llamator/attacks/ru_dan.py                           56      56  0.00%    1-133
llamator/attacks/ru_self_refine.py                   57      57  0.00%    1-141
llamator/attacks/ru_ucar.py                          51      51  0.00%    1-121
llamator/attacks/self_refine.py                      57      57  0.00%    1-141
llamator/attacks/sycophancy.py                       77      77  0.00%    1-255
llamator/attacks/translation.py                      39      39  0.00%    1-98
llamator/attacks/typoglycemia.py                     28      28  0.00%    1-50
llamator/attacks/ucar.py                             55      55  0.00%    1-139
llamator/attacks/utils.py                            10      10  0.00%    1-17
llamator/client/attack_config.py                      5       5  0.00%    1-7
llamator/client/chat_client.py                       33      33  0.00%    1-131
llamator/client/client_config.py                     21      21  0.00%    1-39
llamator/client/langchain_integration.py             79      79  0.00%    1-131
llamator/client/specific_chat_clients.py             63      63  0.00%    1-220
llamator/format_output/logo.py                        8       8  0.00%    1-17
llamator/format_output/results_table.py              23      23  0.00%    1-30
TOTAL                                              1384    1384  0.00%

Diff against main

Filename                                Stmts    Miss  Cover
------------------------------------  -------  ------  --------
llamator/attacks/base64_injection.py      +20     +20  +100.00%
llamator/attacks/dan.py                   +16     +16  +100.00%
llamator/attacks/ru_dan.py                +56     +56  +100.00%
llamator/attacks/ru_self_refine.py        +57     +57  +100.00%
llamator/attacks/ru_ucar.py               +51     +51  +100.00%
llamator/attacks/self_refine.py           +17     +17  +100.00%
llamator/attacks/translation.py            -2      -2  +100.00%
llamator/attacks/ucar.py                  +15     +15  +100.00%
TOTAL                                    +230    +230  +100.00%

Results for commit: 3d93486

Minimum allowed coverage is 60%

♻️ This comment has been updated with latest results

Collaborator

@nizamovtimur nizamovtimur left a comment

LGTM

@NickoJo
Collaborator Author

NickoJo commented Sep 9, 2024

  • docstrings

@NickoJo NickoJo requested a review from nizamovtimur September 9, 2024 22:30
@NickoJo NickoJo changed the title Add dan, ucar, translation attacks Add dan, ucar, translation, base64 attack Sep 10, 2024
@NickoJo
Collaborator Author

NickoJo commented Sep 10, 2024

  1. Added the logging functionality from the test_artifacts branch to my existing attacks.
  2. Besides renaming the localized attacks, restored the English-language attacks (dan/ru_dan, ucar/ru_ucar).
  3. Added the ru_self_refined and self_refined attacks.
  4. Added English-language prompts to the base64 dataset.
  5. Extending utils.py with English refusal_words.
  6. Updated attack_loader.py (added the new ru* attacks to the list).
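
For readers unfamiliar with items 4 and 5, here is a minimal sketch of the general idea: encoding a prompt as base64, and checking a model response against refusal markers in both English and Russian. It is not the code from this PR. The function names, the word lists, and the matching rule are all hypothetical, since the actual contents of utils.py and base64_injection.py are not shown in this thread.

```python
import base64

# Hypothetical refusal markers for illustration only; the real
# refusal_words list in utils.py is not shown in this conversation.
REFUSAL_WORDS_EN = ["sorry", "cannot", "can't", "unable"]
REFUSAL_WORDS_RU = ["извините", "не могу", "не буду"]


def encode_prompt(prompt: str) -> str:
    """Return the prompt as a base64 string (UTF-8 bytes encoded to ASCII)."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def contains_refusal(response: str) -> bool:
    """Return True if the response contains any English or Russian marker."""
    lowered = response.lower()
    return any(word in lowered for word in REFUSAL_WORDS_EN + REFUSAL_WORDS_RU)
```

Plain substring matching like this is easy to read but coarse: a response that merely quotes the word "sorry" would also count as a refusal, so a real check may need something stricter.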

@NickoJo NickoJo changed the title Add dan, ucar, translation, base64 attack Add dan, ucar, translation, base64, self-refine attack Sep 10, 2024
@NickoJo
Collaborator Author

NickoJo commented Sep 11, 2024

Merged the up-to-date main branch (after @RomiconEZ's changes) with the current feature/jailbreak_attacks branch.

Collaborator

@nizamovtimur nizamovtimur left a comment

!!!

Owner

@RomiconEZ RomiconEZ left a comment

Verified

@nizamovtimur nizamovtimur merged commit b6096ad into main Sep 11, 2024
0 of 3 checks passed
@NickoJo NickoJo removed the request for review from maksiam September 11, 2024 11:21
@NickoJo NickoJo removed the request for review from bulatovv September 11, 2024 11:21
@nizamovtimur nizamovtimur deleted the feature/jailbreak_attacks branch September 11, 2024 11:30
@NickoJo NickoJo changed the title Add dan, ucar, translation, base64, self-refine attack Add dan, ucar, translation, base64, self-refine attacks Sep 11, 2024
3 participants