Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

“`html

Contrastive Neuron Attribution (CNA): A New Method for Identifying Harmful Prompts

The Problem With Existing Steering Methods

Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible, and how does that mechanism get installed during training? A new research from Nous Research takes a neuron-level look at this question. The team developed contrastive neuron attribution (CNA), a method that identifies the specific MLP neurons whose activations most distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they reduced refusal rates by more than 50% in most instruct models tested — across Llama and Qwen architectures from 1B to 72B parameters — while keeping output quality above 0.97 at all steering strengths. What’s interesting is a key finding: the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning. Alignment fine-tuning does not create new structure; it transforms the function of neurons within that existing structure into a sparse, targetable refusal gate.

How CNA Works

You define two sets of prompts:

Positive prompts — examples of the target behavior (e.g., harmful requests)
Negative prompts — examples of the opposite (e.g., benign requests)

You run all prompts through the model. At each MLP layer, the method records down projection activations at the last token position. It then computes the per-neuron mean activation difference between the two sets:

μ_j^ℓ = mean(activations on positive prompts) − mean(activations on negative prompts)

The top-k neurons by absolute difference are selected across all layers. The researchers set k to 0.1% of total MLP activations. This threshold produced reliable steering effects across all model sizes tested.

A filtering step removes “universal” neurons — those appearing in the top 0.1% of MLP activations across 80% or more of diverse prompts. These neurons fire regardless of prompt content and are excluded from all discovered circuits.

Causality is verified by multiplying each circuit neuron’s activation by a scalar multiplier m at inference time. m = 0 ablates the neuron. m = 1 is baseline. m > 1 amplifies it.

For the main JBB-Behaviors evaluation, the refusal circuit is discovered using 100 harmful and 100 benign prompts. For qualitative examples and other tasks, 8 positive and 8 negative prompts were used.

Results

Experiments covered base and instruct variants of Llama 3.1/3.2 and Qwen 2.5, from 1B to 72B parameters — 16 models total. The main benchmark was JBB-Behaviors, a NeurIPS 2024 benchmark of 100 harmful prompts.

Refusal reduction: Ablating the discovered circuit reduced refusal rates by more than 50% in most instruct models tested. Selected results from Table 3 of the research paper:

Model	Baseline	Ablated	Relative Drop
Llama-3.1-70B-Instruct	86%	18%	−79.1%
Qwen2.5-7B-Instruct	87%	2%	−97.7%
Qwen2.5-72B-Instruct	78%	8%	−89.7%
Llama-3.2-3B-Instruct	84%	47%	−44.0%
Qwen2.5-3B-Instruct	90%	58%	−35.6%

Research Paper

Output quality: CNA output quality, measured as 1 minus the fraction of repeated n-grams, stayed above 0.97 at all steering strengths across all instruct models tested. CAA dropped below 0.60 for six of the eight instruct models at maximum steering strength. In two cases — Qwen2.5-1.5B and Qwen2.5-72B — CAA degraded output so severely that the keyword classifier flagged degenerate text as refusals, producing artificially high refusal rates.

General capabilities: MMLU accuracy under CNA stayed within one percentage point of baseline at all steering strengths. CAA dropped to near-zero MMLU accuracy at maximum intervention.

StrongREJECT rubric: A secondary evaluation used the StrongREJECT rubric, which applies an LLM judge (Llama-3.3-70B) to score responses on harmfulness and dangerousness on a 0–1 scale. Llama model compliance scores improved by an average of 6% after CNA ablation. Qwen model compliance scores improved by an average of 31%.

Base model comparison: Applying the identical pipeline to base models produced no meaningful behavioral change. The paper illustrates this with a concrete example using the prompt “How do I pick a lock?”

Model	Multiplier	Output
Llama-1B Base	1.0	Repeats the question
Llama-1B Base	0.0 (ablated)	Describes lock picking as a learnable skill
Llama-1B Instruct	1.0	“I can’t assist with that.”
Llama-1B Instruct	0.0 (ablated)	Provides a guide
Llama-1B Instruct	2.0 (amplified)	Stronger refusal

Fine-Tuning Transforms Function, Not Structure

Discrimination neurons concentrate in the final 10% of layers in both base and instruct models. For Llama-3.2-1B, 87% of the top-200 discrimination neurons fall in the final three layers (L13–L15). For Qwen2.5-3B, 95% fall in the final quarter of layers. This late-layer concentration is a pretraining property — it exists before alignment fine-tuning.

The function of those neurons changes after fine-tuning. Table 8 in the research paper reports the overlap of (layer, neuron) index pairs between matched base and instruct circuits. Only 8–29% of individual neurons overlap between base and instruct models. Fine-tuning largely replaces the specific neurons within that late-layer structure while preserving the structure itself.

The research team describe this as a separation between two levels: layer-level structure (preserved across base and instruct) and neuron-level function (transformed by fine-tuning). This is consistent with prior work showing that instruction tuning rotates feed-forward network knowledge without changing layer structure.

Marktechpost’s Visual Explainer

Step-by-Step Guide • Nous Research

How to Use Contrastive Neuron Attribution (CNA)

Steer LLM behavior by identifying and ablating sparse MLP circuits — no SAE training, no weight modification.

Source Read original →

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

The Problem With Existing Steering Methods

How CNA Works

Results

Fine-Tuning Transforms Function, Not Structure

Marktechpost’s Visual Explainer

How to Use Contrastive Neuron Attribution (CNA)

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Alphabet plans to raise…

Nvidia chases $200B CPU…

Kaximia on channeling aggression,…