🔄 Under ACL Rolling Review (ARR) May 2026 · EMNLP 2026 Submission

CAREF

Calibration-Aware Regularization
for Explanation Faithfulness
Without Rationale Supervision

A parameter-efficient fine-tuning framework that jointly optimises predictive accuracy and explanation faithfulness through a single unified loss, the Sparsity-Calibrated Entropic Divergence (ℒ_SCED), while training only 6.43% of the model's parameters.

89.04

Best Avg. Accuracy

CAREF-AQ across COS-E, ECQA, ComVE, e-SNLI benchmarks.

81.00

nBERT Alignment

Best explanation faithfulness score without rationale labels.

6.43%

Trainable Parameters

Higher average accuracy and nBERT than LoRA (15.74% of parameters) and AdaLoRA (4.46%), at well under half of LoRA's parameter budget.

Overview

The Problem with Faithful Explanations

Language models excel at NLU yet fail to produce faithful explanations—ones that causally reflect the model's decision process. Standard fine-tuning is agnostic to how a model allocates probability mass, leading to a structural gap between accuracy and explanation quality.

Figure 1. CAREF at a glance: (a) Accuracy vs. trainable parameter budget across variants; (b) nBERT explanation quality per dataset (w/ vs. w/o CAREF); (c) Human evaluation faithfulness scores; (d) Sensitivity of nBERT to α and β on e-SNLI.

❌ Existing Approaches Fail Because…

• Rationale-supervised methods require expensive token-level annotations
• Post-hoc attribution (attention rollout, IG) operates after training—cannot influence representations
• Entropy penalties flatten distributions indiscriminately, harming interpretability
• Label smoothing redistributes mass uniformly, actively hurting explanation coherence
• Sparsemax/Entmax introduces hard sparsity with discontinuous gradients

✅ CAREF's Key Insight

• Single unified loss ℒ_SCED jointly regulates entropy and sparsity
• No rationale supervision—only task labels required
• Fully differentiable, plug-in compatible with any PEFT method
• Token-adaptive: penalty concentrates on decision-relevant vocabulary
• Architecture-agnostic: works with T5, BART, LLaMA, GPT families

Method

The CAREF Framework

CAREF couples entropy-based calibration with token-level sparsity control through a single loss term, without any architectural changes.

Unified Training Objective

\(\mathcal{L}_{\text{CAREF}} = \mathcal{L}_{\text{CE}} + \lambda_{\text{SCED}} \cdot \mathcal{L}_{\text{SCED}} + \lambda_{\text{KL}} \cdot \mathcal{L}_{\text{KL}}\)

The cross-entropy term preserves task accuracy; the KL term provides global calibration pressure; ℒ_SCED provides local, adaptive token-level regularization.
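A minimal sketch of how the three terms might be combined, in plain Python on toy distributions. The page does not define ℒ_KL, so `kl_to_uniform` (mean KL from a uniform reference) is an assumed form, and the default weights `lambda_sced` and `lambda_kl` are hypothetical values chosen only for illustration. A real training setup would use an autodiff framework rather than Python lists.

```python
import math

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the gold token at each time step."""
    return -sum(math.log(row[y]) for row, y in zip(probs, targets)) / len(targets)

def kl_to_uniform(probs):
    """Mean KL(P_t || U) with U uniform. ASSUMED form of L_KL, not from the source."""
    total = 0.0
    for row in probs:
        uniform = 1.0 / len(row)
        total += sum(p * math.log(p / uniform) for p in row if p > 0.0)
    return total / len(probs)

def caref_objective(ce, sced, kl, lambda_sced=0.1, lambda_kl=0.01):
    """L_CAREF = L_CE + lambda_SCED * L_SCED + lambda_KL * L_KL.
    The default weights are hypothetical placeholders."""
    return ce + lambda_sced * sced + lambda_kl * kl

# Toy example: two time steps over a four-token vocabulary.
probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
targets = [0, 2]
sced_value = 0.5  # placeholder for L_SCED, computed separately
loss = caref_objective(cross_entropy(probs, targets), sced_value, kl_to_uniform(probs))
print(loss)
```

Keeping the three terms as separate functions makes it easy to log each one during training and to see which of them the weights are actually trading off.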

Sparsity-Calibrated Entropic Divergence (ℒ_SCED)

\(\mathcal{L}_{\text{SCED}} = \sum_{t}\sum_{v=1}^{|\mathcal{V}|} \left|P_{t,v}\log\frac{P_{t,v}}{U_v}\right|^{\alpha} \cdot (1-P_{t,v})^{\beta}\)

α — Entropic Curvature

With α = 1, each (t, v) term is the standard per-token KL contribution. With α > 1, the penalty becomes super-linear: large deviations from uniform incur disproportionately larger gradients, which combats overconfidence on spurious tokens that harm faithfulness.

β — Adaptive Token Sparsity

The factor (1−P_t,v)^β attenuates the penalty for high-probability tokens, concentrating regularization on the tail of the distribution. This discourages spurious vocabulary-wide mass diffusion without disturbing confident, decision-relevant predictions.
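The ℒ_SCED formula can be sketched directly in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the reference U_v is taken to be uniform (1/|𝒱|), terms with zero probability are skipped because P log P tends to 0, and the default `alpha` and `beta` values are hypothetical.

```python
import math

def sced_loss(probs, alpha=2.0, beta=1.0):
    """Sum over time steps t and vocabulary entries v of
    |P[t][v] * log(P[t][v] / U_v)| ** alpha * (1 - P[t][v]) ** beta,
    with U_v = 1 / |V| (ASSUMED uniform reference)."""
    total = 0.0
    for row in probs:                  # one distribution per time step t
        uniform = 1.0 / len(row)       # U_v under the uniform assumption
        for p in row:
            if p <= 0.0:               # p * log(p / u) -> 0 as p -> 0
                continue
            entropic = abs(p * math.log(p / uniform)) ** alpha
            sparsity = (1.0 - p) ** beta
            total += entropic * sparsity
    return total

peaked = [[0.7, 0.1, 0.1, 0.1]]
print(sced_loss(peaked, alpha=1.0, beta=0.0))  # no sparsity weighting
print(sced_loss(peaked, alpha=1.0, beta=1.0))  # high-probability term attenuated
```

Under these assumptions a uniform distribution gives zero loss, and with β > 0 a fully confident token (P = 1) also contributes zero, which matches the description of β as sparing confident predictions.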

Special Cases — ℒ_SCED Subsumes Classical Regularizers

Parameters	Recovered Behavior
α=1, β=0	Standard per-token KL divergence from uniform; no sparsity weighting
α>1, β=0	Power-law entropic penalty; large deviations incur super-linearly growing cost
α=1, β>0	Sparsity-weighted KL; focuses pressure on low-probability tail vocabulary
α>1, β>0	Full CAREF regime: adaptive sparsity + non-linear entropic curvature ✅
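As a worked check on the first row of the table, assume the reference is uniform, \(U_v = 1/|\mathcal{V}|\) (the page does not spell this out). Setting α = 1 and β = 0 in ℒ_SCED gives, for a single time step t:

\(\sum_{v} \left|P_{t,v}\log\left(|\mathcal{V}|\,P_{t,v}\right)\right| \;\ge\; \left|\sum_{v} P_{t,v}\log\left(|\mathcal{V}|\,P_{t,v}\right)\right| = \mathrm{KL}(P_t \,\|\, U) = \log|\mathcal{V}| - H(P_t)\)

The inequality is the triangle inequality, and the absolute value on the right can be dropped because KL divergence is non-negative. A summand is negative only when \(0 < P_{t,v} < 1/|\mathcal{V}|\), so the two sides coincide exactly when no token sits strictly below the uniform level. Under this reading, the α = 1, β = 0 case agrees with KL from uniform up to the sign of the below-uniform terms and upper-bounds it in general.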

Results

State-of-the-Art with 6.43% Parameters

Evaluated on four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI) with Flan-T5 across 60 FEB splits. CAREF-AQ achieves the best average accuracy (89.04) and nBERT score (81.00), ahead of LoRA R=128 (88.57 accuracy, 80.58 nBERT) and AdaLoRA (23.59 accuracy, 19.75 nBERT).

Figure 2. Accuracy vs. parameter budget (left) and nBERT explanation quality per dataset (right). CAREF-AQ (star ★) dominates both axes simultaneously.

Unified Results Table (Accuracy A / nBERT E)

Section	Model	M	COS-E	ECQA	ComVE	e-SNLI	Avg	Param%
Ablation (RQ1)	CAREF-BASE	A	85.19	86.29	94.76	88.32	88.64	100%
		E	74.00	77.51	86.60	79.76	79.47
	CAREF-DEC	A	81.56	85.76	94.73	74.73	84.19	52.23%
		E	71.53	77.40	87.54	67.89	76.09
	CAREF-AQKV	A	84.30	88.51	94.37	88.26	88.86	19.28%
		E	73.43	79.29	86.73	81.80	80.31
	CAREF-LAQ	A	84.73	88.93	94.23	88.07	88.99	6.44%
		E	74.57	79.96	86.57	80.96	80.52
	CAREF-AQ ★	A	84.59	88.93	94.16	88.48	89.04	6.43%
		E	74.42	80.31	87.29	81.97	81.00
vs. PEFT (RQ3)	AdaLoRA 4.46%	A	43.32	15.16	35.32	0.56	23.59	4.46%
		E	35.48	12.33	30.74	0.47	19.75
	LoRA R=128	A	84.58	91.77	94.57	83.37	88.57	15.74%
		E	74.52	82.89	87.73	77.17	80.58
	CAREF-AQ ★	A	84.59	88.93	94.16	88.48	89.04	6.43%
		E	74.42	80.31	87.29	81.97	81.00

M: metric (A = accuracy, E = nBERT). Rows whose model name begins with CAREF are CAREF variants; ★ marks CAREF-AQ. Results are averaged over 60 splits.
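The Avg column can be re-derived from the per-dataset scores. The short script below is only a sanity check, with the numbers copied from the PEFT comparison rows of the table; the variable and function names are illustrative.

```python
# Per-dataset scores in the order COS-E, ECQA, ComVE, e-SNLI, copied from the table.
accuracy = {
    "CAREF-AQ": [84.59, 88.93, 94.16, 88.48],
    "LoRA R=128": [84.58, 91.77, 94.57, 83.37],
    "AdaLoRA 4.46%": [43.32, 15.16, 35.32, 0.56],
}
nbert = {
    "CAREF-AQ": [74.42, 80.31, 87.29, 81.97],
    "LoRA R=128": [74.52, 82.89, 87.73, 77.17],
    "AdaLoRA 4.46%": [35.48, 12.33, 30.74, 0.47],
}
param_pct = {"CAREF-AQ": 6.43, "LoRA R=128": 15.74, "AdaLoRA 4.46%": 4.46}

def mean(values):
    """Unweighted mean across the four benchmarks."""
    return sum(values) / len(values)

for name in accuracy:
    print(name, mean(accuracy[name]), mean(nbert[name]), param_pct[name])

# Share of LoRA's trainable-parameter budget that CAREF-AQ uses.
budget_ratio = param_pct["CAREF-AQ"] / param_pct["LoRA R=128"]
print(budget_ratio)
```

The averages are unweighted, so each benchmark counts equally. On these numbers CAREF-AQ trains roughly 41% as many parameters as LoRA R=128, while AdaLoRA's budget is smaller than CAREF-AQ's.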

Qualitative Examples

Faithful Explanations, No Rationale Labels

CAREF-AQ consistently produces concise, causally grounded explanations across datasets. Baseline models generate verbose or semantically incoherent justifications.

CAREF qualitative comparison — COS-E predictions

Architecture & Model Overview

CAREF fine-tunes only attention query projections (6.43% of parameters), encouraging sparse and calibrated attention patterns that concentrate on causally relevant input spans.
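Selecting only the query projections for training amounts to a filter over parameter names. The sketch below assumes T5-style names in which query weights end in `.q.weight`; the parameter inventory and its sizes are hypothetical, so the resulting fraction is a toy value and does not reproduce the reported 6.43%.

```python
def is_query_projection(param_name):
    """True for attention query weights under the ASSUMED T5-style naming."""
    return param_name.endswith(".q.weight")

def trainable_fraction(param_sizes):
    """Fraction of all parameters that would remain trainable."""
    trainable = sum(n for name, n in param_sizes.items() if is_query_projection(name))
    return trainable / sum(param_sizes.values())

# Toy single-block inventory: parameter name -> element count (hypothetical sizes).
toy_block = {
    "encoder.block.0.layer.0.SelfAttention.q.weight": 512 * 512,
    "encoder.block.0.layer.0.SelfAttention.k.weight": 512 * 512,
    "encoder.block.0.layer.0.SelfAttention.v.weight": 512 * 512,
    "encoder.block.0.layer.0.SelfAttention.o.weight": 512 * 512,
    "encoder.block.0.layer.1.DenseReluDense.wi.weight": 512 * 2048,
    "encoder.block.0.layer.1.DenseReluDense.wo.weight": 2048 * 512,
}

trainable_names = [name for name in toy_block if is_query_projection(name)]
print(trainable_names)
print(trainable_fraction(toy_block))  # 1/12 for this toy block
```

In a framework such as PyTorch, the same predicate would typically decide which parameters keep gradient tracking enabled. The real percentage depends on embeddings, layer norms, and the decoder, none of which this toy inventory includes.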

CAREF qualitative predictions — COS-E and ECQA

Sample Predictions Across Benchmarks

CAREF-AQ (bottom rows) produces concise, factually precise answers—e.g., correctly identifying "Atlantic Ocean" and "air conditioning" with tighter, more faithful explanations than BASE or DEC variants.

Human Evaluation Results

0.69

ECQA

Fact-grounded commonsense — highest faithfulness score

0.58

e-SNLI

Well-defined entailment structure aids legibility

0.53

ComVE

World knowledge plausibility discrimination

Strong Yes

CAREF yields 44 "Strong Yes" labels on ECQA without any rationale supervision

Existing methods require rationale annotations or full fine-tuning.

CAREF delivers both accuracy and faithfulness from task labels alone, training only 6.43% of the parameters.

Citation

BibTeX

@article{panboonyuen2026caref,
  title     = {CAREF: Calibration-Aware Regularization for
              Explanation Faithfulness Without Rationale
              Supervision},
  author    = {Panboonyuen, Teerapong},
  journal   = {arXiv preprint arXiv:2605.27835},
  note      = {Under ACL Rolling Review (ARR) May 2026},
  year      = {2026},
  url       = {https://kaopanboonyuen.github.io/CAREF}
}