PlusLabNLP/SafeWorld
🌍 SafeWorld: Geo-Diverse Safety Alignment (NeurIPS 2024)


University of California, Los Angeles
Salesforce AI Research

*Equal contribution, listed in alphabetical order by first name.

Outline

🔍 Motivation for Geo-Diverse Safety in LLMs


Figure 1: Examples of geo-diverse safety standards and the overall introduction of SafeWorld benchmark and its multi-dimensional evaluation.

  • Large Language Models (LLMs), like LLaMA and GPT, are vital to many AI applications, serving millions globally.
  • As LLMs become more widespread, concerns about their safety grow, with many studies now focusing on reducing their harmful impact. However, geo-diversity remains an overlooked aspect.
  • Addressing geographical variations in safety principles is crucial, as cultural norms and legal standards shape different definitions of safe and acceptable behavior.
  • If a model overlooks cultural norms and local policies, it risks causing conflicts among individuals or nations and may lead to legal issues for local services.

➡️ To be equitable and effective, LLMs must align with diverse cultural and legal standards globally!

🧩 SafeWorld Benchmark


Figure 2: The comparison between SafeWorld and other existing benchmarks.

  • We introduce SafeWorld, the first geo-diverse safety alignment evaluation benchmark, focusing on cultural and legal safety.
  • It evaluates an LLM's ability to generate helpful, safe, and appropriate responses in a global context.
  • Built from a global user survey, it includes 2.3k diverse queries, simulating geo-diverse scenarios validated to align with cultural-legal guidelines across 50 countries and 439 regions/races.


Figure 3: Overview of queries generation pipeline.

Based on the survey insights, we have developed four distinct query types, each designed to elicit a specific type of response. Every query within SafeWorld presents a scenario that highlights a culturally or legally sensitive (or insensitive) situation, followed by a relevant question. The table below defines each query type:

| Type | Description |
| --- | --- |
| SpecificAnswer | Identifies the specific policy or cultural norm that was violated in the context of the queried country, race, or region. |
| CompreAnswer | Provides a comprehensive explanation of the violated policies or cultural norms related to one or more countries, races, or regions. |
| RefuseToAnswer | Avoids addressing the query directly due to cultural or legal insensitivity. |
| DoAnswer | Directly addresses the query because the query does not violate or show insensitivity towards any norm or policy. |


Figure 4: SafeWorld query examples across four types. Some are paired with their corresponding reference (i.e., ground-truth) cultural-legal guidelines.

You can find our evaluation benchmark under /data:

data/
├── culture/
│   ├── type1.json
│   ├── type2.json
│   ├── type3.json
│   └── type4.json
│
└── policy/
    ├── type1.json
    ├── type2.json
    ├── type3.json
    └── type4.json

We divided the data into two folders, culture/ and policy/; the table below gives the number of examples of each type. For brevity, Type 1 = SpecificAnswer, Type 2 = CompreAnswer, Type 3 = RefuseToAnswer, and Type 4 = DoAnswer.

| Categories | Type 1 | Type 2 | Type 3 | Type 4 |
| --- | --- | --- | --- | --- |
| Cultural Norms | 322 | 286 | 218 | 357 |
| Legal Policies | 319 | 291 | 190 | 367 |
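The layout above can be loaded with a short helper. This is a sketch, not code from the repository; it assumes each type{n}.json holds a JSON list of example objects:

```python
import json
from pathlib import Path

# Map numeric file labels to the query-type names used in the paper.
TYPE_NAMES = {1: "SpecificAnswer", 2: "CompreAnswer", 3: "RefuseToAnswer", 4: "DoAnswer"}

def load_safeworld(data_dir="data"):
    """Load all eight SafeWorld files into a dict keyed by (category, type name).

    Assumes each type{n}.json file contains a JSON list of examples.
    """
    examples = {}
    for category in ("culture", "policy"):
        for type_id, type_name in TYPE_NAMES.items():
            path = Path(data_dir) / category / f"type{type_id}.json"
            with open(path, encoding="utf-8") as f:
                examples[(category, type_name)] = json.load(f)
    return examples
```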

Each evaluation example in type1.json or type4.json consists of:

| Name | Description |
| --- | --- |
| index | A unique identifier formatted as {country}_{id1}_{id2}, where id1 is specific to the country and id2 identifies the region within that country if available; otherwise id2 is empty. |
| country | The name of the country related to the evaluation example. |
| region | The specific region within the mentioned country. |
| topic | The subject of the cultural norm or legal policy being evaluated. |
| root_{norm/policy} | The established norm or policy at the country level that applies nationwide. |
| aug_{norm/policy} | A region-specific norm or policy, if available; otherwise null. |
| scene | A descriptive scenario illustrating where or how the norm or policy applies. |
| specific_{norm/policy} | A detailed explanation of how the norm or policy is relevant in the given scenario. |
| query | A composite of the scenario and a question, designed to probe understanding or application of the norm or policy. |
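A small helper can split the index field back into its parts. This is a sketch, not repository code; it assumes underscores inside multi-word country names are preserved, so the two trailing id fields are split off from the right:

```python
def parse_index(index: str):
    """Split a SafeWorld index of the form {country}_{id1}_{id2}.

    id2 may be empty when no region applies. Splitting from the right
    keeps underscores inside multi-word country names intact (an
    assumption about how indices are encoded).
    """
    country, id1, id2 = index.rsplit("_", 2)
    return {"country": country, "id1": id1, "id2": id2 or None}
```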

Each evaluation example in type2.json consists of:

| Name | Description |
| --- | --- |
| country | The name of the country related to the evaluation example. |
| violated_speicic_{norm/policy} | The specific norm or policy that is allegedly violated in the example. |
| scene | A descriptive scenario illustrating where or how the norm or policy applies. |
| query | A composite of the scenario and a question, designed to probe understanding or application of the norm or policy. |

Each evaluation example in type3.json consists of:

| Name | Description |
| --- | --- |
| country | The name of the country related to the evaluation example. |
| query | A composite of the scenario and a question, designed to probe understanding or application of the norm or policy. |

⚖️ Automatic Evaluation Framework

  • We assess LLM responses to geo-diverse safety queries with three automated protocols, measuring contextual appropriateness, accuracy, and comprehensiveness.
  • Our evaluation shows that LLaMA- and Mistral-series models can perform similarly to GPT-3.5 and GPT-4-turbo across several metrics.
  • Although SafeWorld's reference guidelines were derived from GPT-4-turbo, that model struggles with implicit queries and often underperforms some open-source LLMs in response appropriateness.

➡️ This suggests that additional alignment methods may be necessary to effectively elicit and apply its learned knowledge in model responses!
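As a toy illustration of the comprehensiveness dimension only: the actual protocol is automated LLM-based judging, not string matching, but a coverage-style score has roughly this shape (function and arguments are hypothetical):

```python
def comprehensiveness(reference_guidelines, response_text):
    """Hypothetical coverage score: fraction of reference cultural-legal
    guidelines mentioned (verbatim, case-insensitively) in a response.
    Illustrative only; SafeWorld's real evaluation uses an LLM judge.
    """
    if not reference_guidelines:
        return 1.0
    lowered = response_text.lower()
    hits = sum(1 for g in reference_guidelines if g.lower() in lowered)
    return hits / len(reference_guidelines)
```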


Figure 5: Overall framework of geo-diverse safety alignment training.

🧨 Geo-Diverse Safety Alignment Training

  • Focusing on the widely used alignment method Direct Preference Optimization (DPO), we investigate how to synthesize preference-pair training data that helps LLMs behave appropriately and elicit factual knowledge accurately.
  • We synthesize training queries from human-verified cultural-legal guidelines. Positive responses align with these queries and guidelines. Negative responses are divided into:
    • Category 1: responses that reference guidelines correctly but inappropriately.
    • Category 2: responses that are behaviorally appropriate but incorrectly reference guidelines.
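The preference pairs above plug into the standard DPO objective: a logistic loss on the policy-versus-reference log-probability margin between the positive and negative response. A minimal numeric sketch (not the paper's training code; argument names are illustrative):

```python
import math

def dpo_loss(pi_pos, pi_neg, ref_pos, ref_neg, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are summed log-probabilities of the positive (chosen) and
    negative (rejected) responses under the policy (pi_*) and the frozen
    reference model (ref_*). Loss falls as the policy prefers the
    positive response more strongly than the reference does.
    """
    margin = beta * ((pi_pos - ref_pos) - (pi_neg - ref_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```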


Figure 6: Overview of our multi-dimensional evaluation framework.

  • Our SafeWorldLM model outperforms all competitors, including GPT-4o, across all three evaluated dimensions, and achieves a nearly 20% higher win rate in helpfulness and harmfulness assessments by human evaluators from 9 countries.
  • In addition, our SafeWorldAlign training data helps maintain performance on general NLP and safety evaluation tasks while enhancing geo-diverse safety alignment.

We provide model checkpoints on 🤗 Huggingface:

| Model Names | 🤗 Huggingface Links |
| --- | --- |
| SafeWorldLM w/o Neg. Category 1 | |
| SafeWorldLM w/o Neg. Category 2 | |
| SafeWorldLM (50% Data) | |
| SafeWorldLM | |


Table 1: Performance of different models on the SAFEWORLD benchmark.

Citation

If you find this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yin-etal-2024-safeworld,
    title = "SafeWorld: Geo-Diverse Safety Alignment",
    author = "Yin, Da and
              Qiu, Haoyi and
              Huang, Kung-Hsiang and
              Chang, Kai-Wei and
              Peng, Nanyun",
    booktitle = "38th Conference on Neural Information Processing Systems (NeurIPS 2024)",
    year = "2024",
}
