“`html
Key Takeaways
- Command A+ has 218B total / 25B active parameters in a Sparse MoE architecture, released under Apache 2.0.
- W4A4 applies NVFP4 quantization to MoE experts only with QAD post-training, running on 2× H100s.
- τ²-Bench Telecom improved from 37% to 85%; Terminal-Bench Hard from 3% to 25% vs. Command A Reasoning.
- TOPS increased up to 63% and TTFT reduced up to 17% vs. Command A Reasoning at matching quantization.
- Command A+ is Cohere’s first multimodal reasoning model, expanding language support from 23 to 48 languages.
Architecture
Cohere just released Command A+, a mixture-of-experts (MoE) model targeting enterprise agentic workflows. Available under an Apache 2.0 license, Command A+ is optimized for reasoning, agentic workflows, RAG, multilingual, and multimodal document processing. It unifies capabilities from four prior models — Command A, Command A Reasoning, Command A Vision, and Command A Translate — into a single scalable model.
Hardware Requirements and Quantization
Cohere provides three quantization variants with minimum GPU requirements: BF16 (16-bit) requires 4× B200 or 8× H100 GPUs; FP8 (8-bit) requires 2× B200 or 4× H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations show negligible differences in benchmark quality. Cohere recommends W4A4 for most deployments.
W4A4 Quantization Methodology
Cohere applies NVFP4 W4A4 quantization, a 4-bit weights and activations with two-level scaling, to the MoE experts only. The attention path, including Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision. To close residual quality gaps, Cohere uses Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student model is trained to match the full-precision teacher’s output distribution, using fake quantization operators in the forward pass and straight-through estimators on the backward pass.
Performance vs. Prior Command A Models
Cohere’s new model, Command A+, significantly improves performance over its predecessors. On τ²-Bench Telecom, scores improved from 37% to 85% over Command A Reasoning, and Terminal-Bench Hard agentic coding performance reached 25% from 3%. In internal North platform evaluations, all models using LLM-as-a-judge techniques saw improvements: Agentic Question Answering accuracy increased by 20%, Spreadsheet Analysis quality improved by 32%, and Memory Usage Quality scored 54% with Command A+ compared to 39% with Command A Reasoning.
Speed and Latency
Command A+ delivers up to 63% higher Output Tokens per Second (TOPS) and reduces Time To First Token (TTFT) by up to 17% compared with Command A Reasoning. The W4A4 quantization contributes an additional 47% increase in speed and a 13% reduction in latency.
Tokenizer
Cohere’s latest tokenizer is used, reducing the number of tokens required to generate the same response. Tokenization efficiency improved by 20% for Arabic, 16% for Korean, and 18% for Japanese.
Getting Started
The model is supported by vLLM and Transformers. Tool use is handled through chat templates in Transformers using JSON schema for tool descriptions. When reasoning is enabled, the model generates thinking traces between <|START_THINKING|> and <|END_THINKING|> tags before producing a final answer.
Getting Started
The W4A4 variant requires vLLM ≥0.21.0 and cohere_melody>=0.9.0 for accurate response parsing. Cohere recommends the following sampling parameters: temperature=0.9, top_p=0.95, and repetition_penalty=1.04.
The post Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs appeared first on MarkTechPost.
“`
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




