Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

“`html

Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

Key Takeaways

Command A+ has 218B total / 25B active parameters in a Sparse MoE architecture, released under Apache 2.0.
W4A4 applies NVFP4 quantization to MoE experts only with QAD post-training, running on 2× H100s.
τ²-Bench Telecom improved from 37% to 85%; Terminal-Bench Hard from 3% to 25% vs. Command A Reasoning.
TOPS increased up to 63% and TTFT reduced up to 17% vs. Command A Reasoning at matching quantization.
Command A+ is Cohere’s first multimodal reasoning model, expanding language support from 23 to 48 languages.

Architecture

Cohere just released Command A+, a mixture-of-experts (MoE) model targeting enterprise agentic workflows. Available under an Apache 2.0 license, Command A+ is optimized for reasoning, agentic workflows, RAG, multilingual, and multimodal document processing. It unifies capabilities from four prior models — Command A, Command A Reasoning, Command A Vision, and Command A Translate — into a single scalable model.

Hardware Requirements and Quantization

Cohere provides three quantization variants with minimum GPU requirements: BF16 (16-bit) requires 4× B200 or 8× H100 GPUs; FP8 (8-bit) requires 2× B200 or 4× H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations show negligible differences in benchmark quality. Cohere recommends W4A4 for most deployments.

W4A4 Quantization Methodology

Cohere applies NVFP4 W4A4 quantization, a 4-bit weights and activations with two-level scaling, to the MoE experts only. The attention path, including Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. To close residual quality gaps, Cohere uses Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student model is trained to match the full-precision teacher’s output distribution, using fake quantization operators in the forward pass and straight-through estimators on the backward pass.

Performance vs. Prior Command A Models

Cohere’s new model, Command A+, significantly improves performance over its predecessors. On τ²-Bench Telecom, scores improved from 37% to 85% over Command A Reasoning, and Terminal-Bench Hard agentic coding performance reached 25% from 3%. In internal North platform evaluations, all models using LLM-as-a-judge techniques saw improvements: Agentic Question Answering accuracy increased by 20%, Spreadsheet Analysis quality improved by 32%, and Memory Usage Quality scored 54% with Command A+ compared to 39% with Command A Reasoning.

Speed and Latency

Command A+ delivers up to 63% higher Output Tokens per Second (TOPS) and reduces Time To First Token (TTFT) by up to 17% compared with Command A Reasoning. The W4A4 quantization contributes an additional 47% increase in speed and a 13% reduction in latency.

Tokenizer

Cohere’s latest tokenizer is used, reducing the number of tokens required to generate the same response. Tokenization efficiency improved by 20% for Arabic, 16% for Korean, and 18% for Japanese.

Getting Started

The model is supported by vLLM and Transformers. Tool use is handled through chat templates in Transformers using JSON schema for tool descriptions. When reasoning is enabled, the model generates thinking traces between <|START_THINKING|> and <|END_THINKING|> tags before producing a final answer.

Getting Started

The W4A4 variant requires vLLM ≥0.21.0 and cohere_melody>=0.9.0 for accurate response parsing. Cohere recommends the following sampling parameters: temperature=0.9, top_p=0.95, and repetition_penalty=1.04.

The post Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs appeared first on MarkTechPost.

“`

Source Read original →

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

Key Takeaways

Architecture

Hardware Requirements and Quantization

W4A4 Quantization Methodology

Performance vs. Prior Command A Models

Speed and Latency

Tokenizer

Getting Started

Getting Started

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

TinyFish Launches BigSet: An…

Microsoft’s Project Solara is…

Google’s Phone app will…