ACL 2025 Main Conference

A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Authors Đorđe Klisura, Astrid R. Bernaga Torres et al.
Institutions UTSA · Tecnológico de Monterrey
Venue ACL 2025 Main Conference
Task Privacy Policy QA · Fairness in NLP
50
English dialects evaluated
82%
reduction in performance gap
2
collaborative AI agents, no fine-tuning
+52%
F1 improvement vs. baseline
01 The Problem

Language shapes who gets protected

Large language models perform significantly worse when users ask privacy-related questions in non-standard English dialects, disproportionately harming communities that rely on these systems to understand how their data is used.

0.394
F1 score
Standard American English
"Do you sell my data?"
LLM response
Correct answer retrieved from privacy policy
0.344
F1 score
Rural AAVE
"Does y'all sell my datums?"
LLM response
Same question, degraded or incorrect response

The same information request, two different dialects. The standard dialect works. The marginalized dialect fails.

Baseline disparity (Max F1 Diff): SAE = 0.394 vs. Welsh SWE = 0.301, a gap of 0.093 (GPT-4o-mini / PrivacyQA)
02 The Framework

Two agents, one goal

A sequential multi-agent pipeline that normalizes dialect without losing intent, then answers from the actual policy text. No retraining. No fine-tuning.

User Input — any dialect
↓ query
Dialect Agent — translates the query into SAE
↓ normalized query
Privacy Policy Agent — answers from the policy text
↓ verify
Agreement Check — the Dialect Agent reviews the answer
↓ output
Final Answer — verified response
If the Dialect Agent disagrees with the Privacy Policy Agent's response, the pipeline iterates up to two times, refining both the translation and the answer until agreement is reached or the iteration limit is met.
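The loop above can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, and the toy dictionary lookup and keyword matching stand in for the LLM calls that the real agents make (the actual prompts are given in the paper).

```python
MAX_ITERS = 2  # iteration cap described above

# Toy stand-in for the Dialect Agent's translation step.
SLANG_TO_SAE = {"y'all": "you", "datums": "data"}

def dialect_agent(query):
    """Normalize a dialectal query toward Standard American English."""
    return " ".join(SLANG_TO_SAE.get(w, w) for w in query.lower().split())

def policy_agent(sae_query, policy_text):
    """Answer strictly from the policy text (toy keyword match)."""
    if "sell" in sae_query and "sell" in policy_text.lower():
        return "The policy states: " + policy_text
    return "Not addressed in the policy."

def agreement_check(original_query, answer):
    """Dialect Agent reviews whether the answer fits the original question."""
    return answer != "Not addressed in the policy."

def answer_privacy_question(query, policy_text):
    answer = None
    for _ in range(MAX_ITERS):
        sae_query = dialect_agent(query)
        answer = policy_agent(sae_query, policy_text)
        if agreement_check(query, answer):
            break  # agents agree: return the verified answer
    return answer  # last attempt if the iteration limit is hit
```

Because the pipeline only wraps queries and responses, it sits in front of any chat-completion API without touching model weights.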
03 Results

From 0.093 to 0.019 on PrivacyQA

The multi-agent framework boosts F1 across all dialects simultaneously, compressing the gap between the highest- and lowest-performing dialect by 82%.

Zero-Shot Baseline
Multi-Agent Zero-Shot
Multi-Agent Few-Shot
0.093
Baseline max disparity
0.019
Our method (few-shot)
82%
Reduction in dialect gap

† GPT-4o-mini results. Max Diff = difference between highest and lowest F1 across all evaluated dialects. PrivacyQA reduction: 0.093 → 0.019 (82%). PolicyQA reduction: 0.029 → 0.024 (17%).
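For concreteness, the Max Diff disparity metric used throughout is simply the spread of per-dialect F1 scores. A small sketch (my own helper, not code from the paper), using the baseline figures quoted on this page:

```python
def max_f1_diff(f1_by_dialect):
    """Gap between the best- and worst-served dialects."""
    scores = f1_by_dialect.values()
    return max(scores) - min(scores)

# Baseline numbers quoted above (GPT-4o-mini / PrivacyQA)
baseline = {"SAE": 0.394, "Welsh SWE": 0.301}
print(round(max_f1_diff(baseline), 3))  # 0.093
```

A smaller Max Diff means the model serves its best- and worst-case dialects more equally; driving it toward zero, not just raising average F1, is the fairness goal.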

04 50 Dialects

English is not one language

The framework was evaluated on 50 varieties of English, from Kenyan to Appalachian, Sri Lankan to Scottish. Each bubble in the chart below represents one dialect's F1 before and after.

[Bubble chart: Baseline F1 vs. Multi-Agent F1 per dialect · bubble size = improvement magnitude · color = baseline tier (high >0.37, mid, low <0.33)]
05 Why It Matters

Fairness is not a feature; it's a foundation

Privacy policies govern how billions of people's data is handled. If AI-powered interfaces to these policies systematically fail non-standard dialect speakers, the right to understand and contest data practices becomes a privilege of the linguistically mainstream.

01

Algorithmic Fairness

Performance disparities across dialects encode and amplify existing inequalities. A system that works better for SAE speakers than AAVE or Aboriginal English speakers reproduces at scale the very marginalizations it should be neutral to.

02

Data Rights Access

Privacy policies are legal documents. Understanding them is a right under GDPR, CCPA, and similar frameworks. Dialect-biased QA systems create an invisible barrier between marginalized communities and their own legal protections.

03

No Retraining Required

The framework operates as a plug-in pipeline over any existing LLM. This means organizations can reduce dialect bias in deployed systems without costly retraining or access to proprietary model internals.

04

Scalable to 50+ Varieties

Evaluated across 50 English dialects spanning Africa, Asia, the Americas, and Oceania, the framework closes performance gaps across the board, not just for the dialects it was tuned on.

"Every dialect deserves equal access to the policies that govern their data."

