Contrastive Attribution in the Wild:
An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Rongyuan Tan¹*, Jue Zhang²†, Zhuozhao Li¹†, Qingwei Lin², Saravan Rajmohan², Dongmei Zhang²

¹Southern University of Science and Technology ²Microsoft

*Work done during an internship at Microsoft. †Corresponding authors.

juezhang@microsoft.com, lizz@sustech.edu.cn

Abstract

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis.

Key Contributions

Problem Formulation: We formulate LLM failure analysis as a token-level contrastive attribution problem, enabling the application of model interpretability methods to analyze failures in more realistic settings than prior work.

Efficient Extension: We introduce an efficient extension of LRP-based attribution that enables scalable construction of cross-layer hidden-state attribution graphs for long-context inputs.

Systematic Study: We conduct a systematic interpretability-based failure attribution analysis across common benchmarks, characterizing attribution patterns across datasets, model sizes, and training checkpoints.

Framework Overview

We formulate failure analysis as explaining why a model prefers an incorrect token over a correct alternative. The contrastive logit difference between an incorrect target token and a correct contrast token is attributed backward through the transformer stack using AttnLRP, producing both input-level attribution heatmaps and cross-layer attribution graphs.
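
One way to write down the quantity being attributed is sketched below. The notation is ours rather than the paper's: z_t and z_c denote the logits of the incorrect target token and the correct contrast token, and R_i is the relevance assigned to input token i. The second relation is the usual LRP conservation property, which should be read as approximate in practice.

```latex
\Delta(x) = z_{t}(x) - z_{c}(x),
\qquad
\sum_{i=1}^{N} R_i \approx \Delta(x)
```

Under this convention, a positive R_i pushes the model toward the incorrect token and a negative R_i pushes it toward the correct one.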

1. Failure Collection: Collect failure cases from standard benchmarks (IFEval, GAIA2, MATH, EvalPlus).

2. Token Pair Identification: Identify target–contrast token pairs via semi-automated annotation with a strict recovery criterion.

3. Contrastive Attribution: Apply AttnLRP to attribute the logit difference to input tokens and internal hidden states.

4. Analysis: Analyze failure patterns via heatmaps and attribution graphs at coarse-to-fine granularity.
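
The contrastive attribution step can be illustrated with a minimal sketch. It does not implement AttnLRP: it assumes a toy linear output head, for which the logit difference decomposes exactly into per-input contributions. The function names, weights, and inputs are all hypothetical.

```python
# Toy illustration of contrastive attribution on a linear "model".
# ASSUMPTION: logits = W @ x with one scalar feature per input token.
# This is an exact linear decomposition, NOT AttnLRP; all names are hypothetical.

def logits(weights, features):
    """Compute one logit per vocabulary entry for a linear head."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def contrastive_relevance(weights, features, target_id, contrast_id):
    """Per-input relevance for logit[target_id] - logit[contrast_id]."""
    w_target = weights[target_id]
    w_contrast = weights[contrast_id]
    return [(wt - wc) * x for wt, wc, x in zip(w_target, w_contrast, features)]


W = [
    [0.5, -1.0, 2.0],  # vocabulary id 0: the incorrect target token
    [1.5, 0.5, 0.0],   # vocabulary id 1: the correct contrast token
    [0.0, 0.0, 1.0],   # vocabulary id 2: an unrelated token
]
x = [1.0, 2.0, 3.0]    # one feature per input token

z = logits(W, x)
logit_diff = z[0] - z[1]
relevance = contrastive_relevance(W, x, target_id=0, contrast_id=1)

# Conservation: the relevances sum to the logit difference being explained.
print(relevance, sum(relevance), logit_diff)
```

In this toy case the third input carries all of the positive relevance, so it is the one driving the preference for the incorrect token.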

Our method extends AttnLRP with batch-packed multi-target backpropagation, which exploits GPU vectorization to efficiently recover relevance propagation between hidden states across layers. For n attribution targets packed B per batch, this reduces the cost from O(n) to O(⌈n/B⌉) backward passes.
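
The pass count can be illustrated with a small sketch. It assumes that n is the number of hidden-state targets and B is the number of targets packed into one backward pass; the page does not define these symbols, and the helper names are hypothetical. The sketch shows only the packing arithmetic, not the backward pass itself.

```python
import math

# ASSUMPTION: n = number of attribution targets, B = targets packed per backward pass.
# Helper names are hypothetical; this only illustrates the ceil(n / B) pass count.

def pack_targets(target_ids, batch_size):
    """Split target ids into consecutive packs of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        target_ids[start:start + batch_size]
        for start in range(0, len(target_ids), batch_size)
    ]


def num_backward_passes(n_targets, batch_size):
    """Number of backward passes when batch_size targets share one pass."""
    return math.ceil(n_targets / batch_size)


targets = list(range(10))           # 10 hypothetical hidden-state targets
batches = pack_targets(targets, 4)  # packs of up to 4 targets each

# One backward pass per pack, instead of one per target.
print(len(batches), num_backward_passes(len(targets), 4))
```

With B = 1 this reduces to one backward pass per target, which is the O(n) baseline.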

Benchmarks & Models

We study failures across four diverse benchmarks, spanning instruction following, agentic tasks, math reasoning, and code generation:

Benchmark	Target Capability	Model	Failure Cases (Clean Rate)	Avg. Tokens
IFEval	Instruction following	Qwen3-0.6B/1.7B/4B	265 (20.8%)	54
GAIA2	Agentic; long-context	Qwen3-4B	300 (17.0%)	12,374
MATH	Math reasoning	Qwen3-0.6B	91 (40.7%)	116
EvalPlus	Code generation	Qwen3-0.6B	270 (19.6%)	169

Results

Attribution Analysis of Incorrect Token Preference

We categorize attribution outcomes into three types: Manifested by Input Attribution (M-IA), Manifested by Attribution Graph (NC-IA + M-AG), and No Clue from Both (NC-IA+AG). For explainable cases, we further identify failure patterns:

Underweight Relevant Tokens (URT), Overweight Irrelevant Tokens (OIT), and both combined (URT + OIT).

Figure 1: Distribution of attribution outcomes (top) and failure patterns (bottom) across benchmarks. For IFEval, GAIA2, and EvalPlus, input attribution alone suffices in the majority of cases. MATH has a larger fraction of unexplained failures, which call for finer-grained analysis.

Figure 2: Examples of input attribution heatmaps. (a) URT: Qwen3-0.6B underweights the instruction token “commas”. (b) OIT: Qwen3-0.6B overweights an irrelevant token “(a”. (c) NC-IA: input attribution alone offers limited explanation. Color intensity shows each token’s positive (red) or negative (blue) influence on the target–contrast preference.

Attribution Graph Analysis

For cases where input attribution is insufficient, attribution graphs reveal informative internal dynamics by tracing how relevance propagates across layers:

Figure 3: Sample ablated attribution graph for a failure case where input attribution alone was insufficient. The graph reveals layer-wise relevance interactions: the token “First” accumulates larger cumulative relevance than the correct token “<<” through critical layers 18–24, ultimately producing the erroneous output.

Attribution Shifts with Model Scaling

We investigate whether scaling model size resolves failures and whether such improvements are supported by interpretability evidence.

Figure 4: Logit difference comparisons between Qwen3-0.6B and larger models. Most points lie below the y=x line, indicating that larger models correct failures in the smaller model.

Figure 5: Input attribution relevance breakdown by prompt segment. Larger models show increasingly negative relevance, especially in the Instruction segment.
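
A per-segment breakdown of this kind can be computed by summing token-level relevance within each prompt segment. The sketch below assumes every token has already been labeled with a segment. The segment names other than "instruction" and all of the numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# ASSUMPTION: each input token is already labeled with a prompt segment.
# Segment names and relevance values are hypothetical illustration data.

def relevance_by_segment(token_relevance, token_segments):
    """Sum token-level relevance scores within each prompt segment."""
    if len(token_relevance) != len(token_segments):
        raise ValueError("need exactly one segment label per token")
    totals = defaultdict(float)
    for score, segment in zip(token_relevance, token_segments):
        totals[segment] += score
    return dict(totals)


relevance = [0.5, -1.5, -0.25, 0.75, 0.25]
segments = ["system", "instruction", "instruction", "context", "context"]

breakdown = relevance_by_segment(relevance, segments)
print(breakdown)
```

A negative total for a segment means its tokens, taken together, push the model away from the incorrect token.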

Evolution of Failure Attribution Across Training

We analyze how attribution patterns evolve across post-training checkpoints of the Olmo-3-7B-Think model series.

Figure 6: Evolution of logit differences across training checkpoints. Logit differences steadily decrease as training progresses through SFT, DPO, and RLVR stages.

Figure 7: Relevance breakdown across training checkpoints. The most substantial changes occur during early SFT, with DPO further pushing relevance toward more negative values.

Practical Implications

Targeted Prompt Tuning

Masking a small number of top-attributed input tokens (~2 on average) suffices to flip the model's prediction away from the error token, enabling efficient prompt debugging.
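
A minimal sketch of such a masking loop follows. It assumes a greedy strategy that masks tokens in descending order of relevance and stops once the logit difference turns negative. The additive scoring function stands in for a real model forward pass, and all names and values are hypothetical.

```python
# ASSUMPTION: greedy masking in descending relevance order; the toy scorer is
# additive and stands in for re-running the model with tokens masked.

def tokens_to_mask_until_flip(relevance, score_fn, max_masked=None):
    """Return sorted indices of masked tokens once score_fn drops below 0, else None."""
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    limit = len(order) if max_masked is None else max_masked
    masked = set()
    for index in order[:limit]:
        masked.add(index)
        if score_fn(masked) < 0:  # the correct token is now preferred
            return sorted(masked)
    return None  # no flip within the masking budget


contributions = [0.25, 2.0, -1.0, 1.5, -0.5]  # hypothetical per-token relevance


def toy_logit_diff(masked):
    """Target-minus-contrast logit difference with the masked tokens removed."""
    return sum(c for i, c in enumerate(contributions) if i not in masked)


flipped = tokens_to_mask_until_flip(contributions, toy_logit_diff)
print(toy_logit_diff(set()), flipped)
```

In this toy example, masking the two most positively attributed tokens is enough to flip the preference.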

Training Monitoring

Attribution patterns can track how failures evolve during training, revealing whether corrections reflect genuine reasoning improvements.

Alignment Signals

Attribution scores provide a natural token-importance signal for alignment training, enabling token-level preference optimization (e.g., with DPO).

Key Findings

Finding 1: Contrastive attribution reveals meaningful failure patterns in the majority of cases, with underweighting relevant tokens (URT) as the dominant failure mode across all benchmarks.

Finding 2: Attribution graphs expose layer-wise relevance dynamics invisible to input-level analysis, revealing how internal biases (e.g., BOS attention sink) can override contextual signals.

Finding 3: Larger models correct failures through systematic, interpretable attribution shifts, specifically increased negative relevance on instruction-related tokens.

Finding 4: Training progressively corrects failures through attribution shifts, with the largest changes during early SFT and notable further improvements from DPO.

Limitation: A non-trivial fraction of failures (especially in math reasoning) remains unexplained at the coarse-grained hidden-state level, highlighting the need for finer-grained neuron-level analysis.

BibTeX

@misc{tan2026contrastiveattributionwildinterpretability,
      title={Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks},
      author={Rongyuan Tan and Jue Zhang and Zhuozhao Li and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang},
      year={2026},
      eprint={2604.17761},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.17761},
}