PlusLabNLP/SafeWorld
🌍 SafeWorld: Geo-Diverse Safety Alignment (NeurIPS 2024)


University of California, Los Angeles
Salesforce AI Research

*Equal contribution, listed in alphabetical order by first name.

Outline

🔍 Motivation for Geo-Diverse Safety in LLMs


Figure 1: Examples of geo-diverse safety standards and the overall introduction of SafeWorld benchmark and its multi-dimensional evaluation.

  • Large Language Models (LLMs), like LLaMA and GPT, are vital to many AI applications, serving millions globally.
  • As LLMs become more widespread, concerns about their safety grow, with many studies now focusing on reducing their harmful impact. However, geo-diversity remains an overlooked aspect.
  • Addressing geographical variations in safety principles is crucial, as cultural norms and legal standards shape different definitions of safe and acceptable behavior.
  • If a model overlooks cultural norms and local policies, it risks causing conflicts among individuals or nations and may lead to legal issues for local services.

➡️ To be equitable and effective, LLMs must align with diverse cultural and legal standards globally!

🧩 SafeWorld Benchmark


Figure 2: The comparison between SafeWorld and other existing benchmarks.

  • We introduce SafeWorld, the first geo-diverse safety alignment evaluation benchmark, focusing on cultural and legal safety.
  • It evaluates an LLM's ability to generate helpful, safe, and appropriate responses in a global context.
  • Built from a global user survey, it includes 2.3k diverse queries, simulating geo-diverse scenarios validated to align with cultural-legal guidelines across 50 countries and 439 regions/races.


Figure 3: Overview of queries generation pipeline.

Based on the survey insights, we have developed four distinct query types, each designed to elicit a specific type of response. Every query within SafeWorld presents a scenario that highlights a culturally or legally sensitive (or insensitive) situation, followed by a relevant question. The table below defines each query type:

| Type | Description |
| --- | --- |
| SpecificAnswer | Identifies the specific policy or cultural norm that was violated in the context of the queried country, race, or region. |
| CompreAnswer | Provides a comprehensive explanation of the violated policies or cultural norms related to one or more countries, races, or regions. |
| RefuseToAnswer | Avoids addressing the query directly due to cultural or legal insensitivity. |
| DoAnswer | Directly addresses the query because the query does not violate or show insensitivity towards any norm or policy. |


Figure 4: SafeWorld query examples across four types. Some are paired with their corresponding reference (i.e., ground-truth) cultural-legal guidelines.

You can find our evaluation benchmark under /data:

data/
├── culture/
│   ├── type1.json
│   ├── type2.json
│   ├── type3.json
│   └── type4.json
│
└── policy/
    ├── type1.json
    ├── type2.json
    ├── type3.json
    └── type4.json

We divided the data into two folders, culture/ and policy/; the table below gives the number of examples of each type. For brevity, Type 1 = SpecificAnswer, Type 2 = CompreAnswer, Type 3 = RefuseToAnswer, and Type 4 = DoAnswer.

| Categories | Type 1 | Type 2 | Type 3 | Type 4 |
| --- | --- | --- | --- | --- |
| Cultural Norms | 322 | 286 | 218 | 357 |
| Legal Policies | 319 | 291 | 190 | 367 |
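The layout above can be loaded with a short helper. This is a sketch, not code from the repository; it assumes each type{n}.json holds a JSON list of example objects:

```python
import json
from pathlib import Path

# Map numeric file labels to the query-type names used in the paper.
TYPE_NAMES = {1: "SpecificAnswer", 2: "CompreAnswer", 3: "RefuseToAnswer", 4: "DoAnswer"}

def load_safeworld(data_dir="data"):
    """Load all eight SafeWorld files into a dict keyed by (category, type name).

    Assumes each type{n}.json file contains a JSON list of examples.
    """
    examples = {}
    for category in ("culture", "policy"):
        for type_id, type_name in TYPE_NAMES.items():
            path = Path(data_dir) / category / f"type{type_id}.json"
            with open(path, encoding="utf-8") as f:
                examples[(category, type_name)] = json.load(f)
    return examples
```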

Each evaluation example in type1.json or type4.json consists of:

| Name | Description |
| --- | --- |
| index | A unique identifier formatted as {country}_{id1}_{id2}, where id1 is specific to the country and id2 identifies the region within that country if available; otherwise id2 is empty. |
| country | The name of the country related to the evaluation example. |
| region | The specific region within the mentioned country. |
| topic | The subject of the cultural norm or legal policy being evaluated. |
| root_{norm/policy} | The established norm or policy at the country level that applies nationwide. |
| aug_{norm/policy} | A region-specific norm or policy, if available; otherwise null. |
| scene | A descriptive scenario illustrating where or how the norm or policy applies. |
| specific_{norm/policy} | A detailed explanation of how the norm or policy is relevant in the given scenario. |
| query | A composite of the scenario and a question, designed to probe understanding or application of the norm or policy. |
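A small helper can split the index field back into its parts. This is a sketch, not repository code; it assumes underscores inside multi-word country names are preserved, so the two trailing id fields are split off from the right:

```python
def parse_index(index: str):
    """Split a SafeWorld index of the form {country}_{id1}_{id2}.

    id2 may be empty when no region applies. Splitting from the right
    keeps underscores inside multi-word country names intact (an
    assumption about how indices are encoded).
    """
    country, id1, id2 = index.rsplit("_", 2)
    return {"country": country, "id1": id1, "id2": id2 or None}
```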

Each evaluation example in type2.json consists of:

| Name | Description |
| --- | --- |
| country | The name of the country related to the evaluation example. |
| violated_speicic_{norm/policy} | The specific norm or policy that is allegedly violated in the example. |
| scene | A descriptive scenario illustrating where or how the norm or policy applies. |
| query | A composite of the scenario and a question, designed to probe understanding or application of the norm or policy. |

Each evaluation example in type3.json consists of:

| Name | Description |
| --- | --- |
| country | The name of the country related to the evaluation example. |
| query | A composite of the scenario and a question, designed to probe understanding or application of the norm or policy. |

⚖️ Automatic Evaluation Framework

  • We assess LLM responses to geo-diverse safety queries with three automated protocols, measuring contextual appropriateness, accuracy, and comprehensiveness.
  • Our evaluation shows that LLaMA- and Mistral-series models can perform similarly to GPT-3.5 and GPT-4-turbo across several metrics.
  • Although SafeWorld's reference guidelines were derived from GPT-4-turbo, that model struggles with implicit queries and often underperforms some open-source LLMs in response appropriateness.

➡️ This suggests that additional alignment methods may be necessary to effectively elicit and apply its learned knowledge in model responses!
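As a toy illustration of the comprehensiveness dimension only: the actual protocol is automated LLM-based judging, not string matching, but a coverage-style score has roughly this shape (function and arguments are hypothetical):

```python
def comprehensiveness(reference_guidelines, response_text):
    """Hypothetical coverage score: fraction of reference cultural-legal
    guidelines mentioned (verbatim, case-insensitively) in a response.
    Illustrative only; SafeWorld's real evaluation uses an LLM judge.
    """
    if not reference_guidelines:
        return 1.0
    lowered = response_text.lower()
    hits = sum(1 for g in reference_guidelines if g.lower() in lowered)
    return hits / len(reference_guidelines)
```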


Figure 5: Overall framework of geo-diverse safety alignment training.

🧨 Geo-Diverse Safety Alignment Training

  • Focusing on the widely used alignment method Direct Preference Optimization (DPO), we investigate how to synthesize preference-pair training data that helps LLMs behave appropriately and elicit factual knowledge accurately.
  • We synthesize training queries from human-verified cultural-legal guidelines. Positive responses align with these queries and guidelines. Negative responses are divided into:
    • Category 1: responses that reference guidelines correctly but inappropriately.
    • Category 2: responses that are behaviorally appropriate but incorrectly reference guidelines.
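The preference pairs above plug into the standard DPO objective: a logistic loss on the policy-versus-reference log-probability margin between the positive and negative response. A minimal numeric sketch (not the paper's training code; argument names are illustrative):

```python
import math

def dpo_loss(pi_pos, pi_neg, ref_pos, ref_neg, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are summed log-probabilities of the positive (chosen) and
    negative (rejected) responses under the policy (pi_*) and the frozen
    reference model (ref_*). Loss falls as the policy prefers the
    positive response more strongly than the reference does.
    """
    margin = beta * ((pi_pos - ref_pos) - (pi_neg - ref_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```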


Figure 6: Overview of our multi-dimensional evaluation framework.

  • Our SafeWorldLM model outperforms all competitors, including GPT-4o, across all three evaluated dimensions, and achieves a nearly 20% higher win rate in helpfulness and harmfulness assessments by human evaluators from 9 countries.
  • In addition, our SafeWorldAlign training data helps maintain performance on general NLP and safety evaluation tasks while enhancing geo-diverse safety alignment.

We provide model checkpoints on 🤗 Huggingface:

| Model Names | 🤗 Huggingface Links |
| --- | --- |
| SafeWorldLM w/o Neg. Category 1 | |
| SafeWorldLM w/o Neg. Category 2 | |
| SafeWorldLM (50% Data) | |
| SafeWorldLM | |


Table 1: Performance of different models on the SAFEWORLD benchmark.

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yin-etal-2024-safeworld,
    title = "SafeWorld: Geo-Diverse Safety Alignment",
    author = "Yin, Da and
              Qiu, Haoyi and
              Huang, Kung-Hsiang and
              Chang, Kai-Wei and
              Peng, Nanyun",
    booktitle = "38th Conference on Neural Information Processing Systems (NeurIPS 2024)",
    year = "2024",
}
