ACL 2025 Main Conference

A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Authors Đorđe Klisura, Astrid R. Bernaga Torres et al.
Institutions UTSA · Tecnológico de Monterrey
Venue ACL 2025 Main Conference
Task Privacy Policy QA · Fairness in NLP
50
English dialects evaluated
82%
reduction in performance gap
2
collaborative AI agents, no fine-tuning
+52%
F1 improvement vs. baseline
01 The Problem

Language shapes who gets protected

Large language models perform significantly worse when users ask privacy-related questions in non-standard English dialects, disproportionately harming communities that rely on these systems to understand how their data is used.

0.394
F1 score
Standard American English
"Do you sell my data?"
LLM response
Correct answer retrieved from privacy policy
0.344
F1 score
Rural AAVE
"Does y'all sell my datums?"
LLM response
Same question, degraded or incorrect response

The same information request, two different dialects. The standard dialect works. The marginalized dialect fails.

Baseline disparity (Max F1 Diff): SAE = 0.394 vs. Welsh SWE = 0.301, a gap of 0.093 (GPT-4o-mini / PrivacyQA)
02 The Framework

Two agents, one goal

A sequential multi-agent pipeline that normalizes dialect without losing intent, then answers from the actual policy text. No retraining. No fine-tuning.

User Input — any dialect
↓ query
Dialect Agent — translates the query into SAE
↓ normalized query
Privacy Policy Agent — answers from the policy text
↓ verify
Agreement Check — the Dialect Agent reviews the answer
↓ output
Final Answer — verified response
If the Dialect Agent disagrees with the Privacy Policy Agent's response, the pipeline iterates up to two times, refining both the translation and the answer until agreement is reached or the iteration limit is met.
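The loop above can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, and the toy dictionary lookup and keyword matching stand in for the LLM calls that the real agents make (the actual prompts are given in the paper).

```python
MAX_ITERS = 2  # iteration cap described above

# Toy stand-in for the Dialect Agent's translation step.
SLANG_TO_SAE = {"y'all": "you", "datums": "data"}

def dialect_agent(query):
    """Normalize a dialectal query toward Standard American English."""
    return " ".join(SLANG_TO_SAE.get(w, w) for w in query.lower().split())

def policy_agent(sae_query, policy_text):
    """Answer strictly from the policy text (toy keyword match)."""
    if "sell" in sae_query and "sell" in policy_text.lower():
        return "The policy states: " + policy_text
    return "Not addressed in the policy."

def agreement_check(original_query, answer):
    """Dialect Agent reviews whether the answer fits the original question."""
    return answer != "Not addressed in the policy."

def answer_privacy_question(query, policy_text):
    answer = None
    for _ in range(MAX_ITERS):
        sae_query = dialect_agent(query)
        answer = policy_agent(sae_query, policy_text)
        if agreement_check(query, answer):
            break  # agents agree: return the verified answer
    return answer  # last attempt if the iteration limit is hit
```

Because the pipeline only wraps queries and responses, it sits in front of any chat-completion API without touching model weights.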
03 Results

From 0.093 to 0.019 on PrivacyQA

The multi-agent framework boosts F1 across all dialects simultaneously, compressing the gap between the highest- and lowest-performing dialect by 82%.

Zero-Shot Baseline
Multi-Agent Zero-Shot
Multi-Agent Few-Shot
0.093
Baseline max disparity
0.019
Our method (few-shot)
82%
Reduction in dialect gap

† GPT-4o-mini results. Max Diff = difference between highest and lowest F1 across all evaluated dialects. PrivacyQA reduction: 0.093 → 0.019 (82%). PolicyQA reduction: 0.029 → 0.024 (17%).
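For concreteness, the Max Diff disparity metric used throughout is simply the spread of per-dialect F1 scores. A small sketch (my own helper, not code from the paper), using the baseline figures quoted on this page:

```python
def max_f1_diff(f1_by_dialect):
    """Gap between the best- and worst-served dialects."""
    scores = f1_by_dialect.values()
    return max(scores) - min(scores)

# Baseline numbers quoted above (GPT-4o-mini / PrivacyQA)
baseline = {"SAE": 0.394, "Welsh SWE": 0.301}
print(round(max_f1_diff(baseline), 3))  # 0.093
```

A smaller Max Diff means the model serves its best- and worst-case dialects more equally; driving it toward zero, not just raising average F1, is the fairness goal.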

04 50 Dialects

English is not one language

The framework was evaluated on 50 varieties of English, from Kenyan to Appalachian, Sri Lankan to Scottish. Each bubble in the chart below represents one dialect's F1 before and after.

[Bubble chart: Baseline F1 vs. Multi-Agent F1 per dialect · bubble size = improvement magnitude · color = baseline tier (high >0.37, mid, low <0.33)]
05 Why It Matters

Fairness is not a feature; it's a foundation

Privacy policies govern how billions of people's data is handled. If AI-powered interfaces to these policies systematically fail non-standard dialect speakers, the right to understand and contest data practices becomes a privilege of the linguistically mainstream.

01

Algorithmic Fairness

Performance disparities across dialects encode and amplify existing inequalities. A system that works better for SAE speakers than AAVE or Aboriginal English speakers reproduces at scale the very marginalizations it should be neutral to.

02

Data Rights Access

Privacy policies are legal documents. Understanding them is a right under GDPR, CCPA, and similar frameworks. Dialect-biased QA systems create an invisible barrier between marginalized communities and their own legal protections.

03

No Retraining Required

The framework operates as a plug-in pipeline over any existing LLM. This means organizations can reduce dialect bias in deployed systems without costly retraining or access to proprietary model internals.

04

Scalable to 50+ Varieties

Evaluated across 50 English dialects spanning Africa, Asia, the Americas, and Oceania, the framework closes performance gaps across the board, not just for the dialects it was tuned on.

"Every dialect deserves equal access to the policies that govern their data."

