Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents


By AI Maestro May 10, 2026 5 min read

This project originated in the PyTorch OpenEnv Hackathon and is still evolving; follow us for updates 🔥

Why RL for shopping agents?

Large language models can hold fluent conversations, yet deploying them as shopping assistants reveals a persistent gap: fluency ≠ task completion. A customer who asks “find me a USB-C charger under $25 that ships in two days” needs an agent that invokes the right catalog search, filters on three hard constraints, avoids hallucinating product IDs it never retrieved, and handles follow-ups when the top result goes out of stock.

Supervised fine-tuning can teach surface-level tool use from demonstrations, but it cannot scale to the combinatorial space of constraint configurations, partial-information dialogues, and multi-step transactional workflows that real e-commerce demands.

Reinforcement learning with verifiable rewards (RLVR) offers an alternative: the agent optimises for outcomes — did the products satisfy the constraints? Was the cart correct? Was the return initiated for the right order line? The challenge is constructing reward functions that are both verifiable (no LLM-as-a-judge subjectivity) and adaptive (difficulty that grows with the policy’s capability).

From RLVE-Gym to EcomRLVE-GYM

RLVE-Gym provides 400 environments for sorting, multiplication, Sudoku, and other algorithmic-reasoning tasks; however, those are all single-turn, text-in / text-out puzzles — extending to agentic domains was left as future work.

EcomRLVE-GYM fills that gap: we stay in the verifiable regime (e-commerce outcomes can be checked algorithmically) while extending to multi-turn, tool-augmented, agentic conversations — environments where the agent must act (call tools, modify world state) rather than merely reason (produce a text answer), and must compensate for deficiencies in the underlying search system.

EcomRLVE-GYM makes customer-service outcomes structurally verifiable:

  • Every reward signal in every environment can be evaluated by a program with access to the hidden ground-truth goal. No human annotation or LLM-as-a-judge is needed.
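As an illustration, a programmatic check for the Product Discovery setting might look like the sketch below. The goal/catalog schemas and the function name are hypothetical, not the project's actual API:

```python
# Hypothetical verifier sketch for Product Discovery. The goal/catalog
# schemas and the function name are illustrative, not the repo's API.

def check_product_discovery(recommended_ids, goal, catalog):
    """True iff every recommended product satisfies all hidden constraints."""
    if not recommended_ids:
        return False
    for pid in recommended_ids:
        product = catalog.get(pid)
        if product is None:  # ID not in the catalog: automatic failure
            return False
        # Each constraint is a predicate over product fields,
        # e.g. ("price", "<=", 25.0) or ("ships_within_days", "<=", 2).
        for field, op, value in goal["constraints"]:
            actual = product[field]
            if op == "<=" and not actual <= value:
                return False
            if op == "==" and actual != value:
                return False
    return True
```

Because the check is plain code over the hidden goal, it is deterministic, cheap, and immune to judge drift.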

The eight environments

| Environment | What the agent must do |
| --- | --- |
| Product Discovery | Find products that satisfy all the user’s constraints |
| Substitution | An item is out of stock — find a similar, compatible alternative |
| Cart Building | Add the exact products, variants, and quantities the user asked for |
| Return + Replacement | Identify the right order line, initiate a return, suggest a replacement |
| Order Tracking | Resolve which order the user means and report its current status |
| Policy QA | Answer a deterministic question about store policy (return window, shipping rules, etc.) |
| Bundle Planning | Recommend a complete shopping list for a project within a budget |
| Multi-Intent Journey | Handle a conversation that chains 2–5 of the above tasks in sequence |

Every environment uses the same three-part reward signal:

  • Task reward β€” did the agent actually complete the goal? (e.g., were the right products recommended, was the cart correct, was the right order tracked?)
  • Efficiency reward β€” did the agent complete it without wasting turns? Turns the user caused (asking a follow-up, confirming an action) don’t count against the agent β€” only turns caused by agent mistakes do.
  • Hallucination penalty β€” did the agent only recommend products it actually retrieved during the session? Recommending product IDs that were never looked up is penalised, so the agent cannot invent results from memory.

Invalid outputs (malformed JSON, illegal tool calls) trigger an immediate failure score, creating a strong incentive for well-formed responses from step one.
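Put together, a reward of this shape could be sketched as follows. The weights, the failure score, and the function name are assumptions for illustration, not the environment's actual values:

```python
# Illustrative sketch of the three-part reward. Weights, the failure
# score, and the signature are assumptions, not the repo's values.

def episode_reward(task_success, agent_mistake_turns, turn_budget,
                   recommended_ids, retrieved_ids, output_valid):
    if not output_valid:  # malformed JSON / illegal tool call
        return -1.0       # immediate failure score
    task = 1.0 if task_success else 0.0
    # Efficiency: only agent-caused turns count against the budget;
    # user-caused turns (follow-ups, confirmations) are already excluded.
    efficiency = max(0.0, 1.0 - agent_mistake_turns / turn_budget)
    # Hallucination: penalise product IDs never retrieved this session.
    hallucinated = set(recommended_ids) - set(retrieved_ids)
    penalty = 0.5 * len(hallucinated)
    return task + 0.2 * efficiency - penalty
```

Note how the hallucination term is purely set arithmetic over the session's tool-call log: the agent cannot earn credit for products it never looked up.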

Adaptive difficulty curriculum

A single difficulty number d controls 12 independent aspects of a task simultaneously. This is important because e-commerce conversations are hard in many different ways at once — not just along one dimension.

| What changes | Easy (d = 0) | Medium (d = 6) | Hard (d = 12) |
| --- | --- | --- | --- |
| How many constraints the user has | 2 | 5 | 8 |
| How often the user omits a constraint | 5% | 70% | ~80% |
| Fraction of search results that are distractors | 0% | 12% | 24% |
| Items that go out of stock mid-conversation | 0% | 30% | 50% |

The other eight axes cover turn budget, input noise (typos, slang), context switches, retrieval depth, order-history size, policy complexity, and tool budget. The full breakdown is in the technical report.
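For intuition, a linear interpolation between the easy and hard anchors in the table above might look like this. The published medium values suggest the real schedule is non-linear, and the parameter names are made up here; treat this as a simplification, not the generator's actual code:

```python
# Simplified mapping from difficulty d in [0, 12] to four of the 12 axes.
# Linear interpolation between the easy (d=0) and hard (d=12) anchors;
# the actual generator's schedule and parameter names may differ.

def task_params(d: int) -> dict:
    t = d / 12
    return {
        "num_constraints": round(2 + t * (8 - 2)),         # 2 -> 8
        "omit_constraint_prob": 0.05 + t * (0.80 - 0.05),  # 5% -> ~80%
        "distractor_fraction": t * 0.24,                   # 0% -> 24%
        "mid_convo_oos_prob": t * 0.50,                    # 0% -> 50%
    }
```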

Adaptive scheduling. Each environment tracks the agent’s success rate independently and only advances to harder problems once the agent is passing the current level reliably. This keeps every environment training at the agent’s capability frontier — avoiding both “too easy to learn from” and “too hard to make progress on”.
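A minimal per-environment scheduler of this kind might look like the following. The window size and promotion threshold are assumptions, not the repo's actual values:

```python
# Minimal sketch of per-environment adaptive scheduling. The window
# size and promotion threshold are assumptions, not the repo's values.
from collections import deque

class DifficultyScheduler:
    def __init__(self, window=50, promote_at=0.8, max_d=12):
        self.d = 0
        self.max_d = max_d
        self.promote_at = promote_at
        self.results = deque(maxlen=window)  # rolling success window

    def record(self, success: bool):
        self.results.append(success)
        # Advance only once the window is full and the pass rate is high,
        # then reset the window so the new level is judged from scratch.
        if (len(self.results) == self.results.maxlen
                and sum(self.results) / len(self.results) >= self.promote_at
                and self.d < self.max_d):
            self.d += 1
            self.results.clear()
```

Because each environment owns its own scheduler, a policy can be at d = 9 on Policy QA while still at d = 3 on Multi-Intent Journey.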

Deep dive: Cart Building (E_CART)

Cart building is a good showcase because it requires the full search → inspect → clarify → act loop, has a binary ground truth, and introduces a challenge absent from most recommendation benchmarks: variant selection.

| Skill | What it means in practice |
| --- | --- |
| Product Discovery | Search the catalog with well-formed queries to find the right items |
| Variant Selection | Identify the correct color, size, or connector type — not just the right product |
| Cart Management | Add items with the exact variant and quantity the user asked for |
| Clarification Dialogue | Ask the user a focused follow-up when a request is ambiguous (e.g., missing size) |
| Multi-Item Orders | Handle shopping lists with several different products in a single conversation |

The agent uses six tools to accomplish this:

| Tool | What it does |
| --- | --- |
| catalog_search | Searches the product catalog with a natural-language query |
| catalog_get_variants | Returns available variants (color, size, connector, etc.) for a product |
| cart_add | Adds a product to the cart with a specific variant and quantity |
| cart_view | Reads the current cart so the agent can verify it matches the request |
| user_get_visit_history | Fetches the user’s recently viewed products |
| ask_user | Sends a clarification question to the customer when a detail is missing |

The problem

The generator samples 1–5 target products (scaling in difficulty with d), each potentially requiring a specific variant (USB-C vs Lightning, Matte vs Glossy) and a quantity > 1. The agent must:

  • Search the catalog to find each product
  • Call catalog_get_variants to see available options
  • Add the correct (product_id, variant_id, qty) tuples to the cart
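The loop above could be orchestrated roughly as follows. The tool names match the table, but the control flow and helper structure are illustrative, not the project's agent implementation:

```python
# Illustrative agent-side loop for E_CART. Tool names match the table
# above; the orchestration and request schema are stand-ins for the sketch.

def build_cart(requests, tools):
    """requests: list of {"query", "variant", "qty"} items the user asked for."""
    for req in requests:
        results = tools.catalog_search(req["query"])       # 1. find candidates
        if not results:
            tools.ask_user(f"I couldn't find '{req['query']}'. Any details?")
            continue
        product_id = results[0]["product_id"]
        variants = tools.catalog_get_variants(product_id)  # 2. inspect variants
        match = next((v for v in variants
                      if v["name"] == req.get("variant")), None)
        if req.get("variant") and match is None:
            tools.ask_user(f"Which variant of {product_id}? "
                           f"Options: {[v['name'] for v in variants]}")
            continue
        variant_id = match["variant_id"] if match else variants[0]["variant_id"]
        tools.cart_add(product_id, variant_id, req.get("qty", 1))  # 3. act
    return tools.cart_view()                               # verify final state
```

Note the ask_user branches: when a detail is missing or a search comes up empty, the efficient move is a focused clarification rather than a guess, since guesses cost agent-caused turns and risk the hallucination penalty.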

Why variants matter

Real product catalogs have sparse variant data — many products have none, and those that do typically vary only by colour or size. To create a richer discrimination task, we synthesize variants at episode initialization:

  • A per-category priority list picks the most natural attribute to vary (electronics → connector_type; clothing → size; kitchen → material).

  • For each target product, we generate 3 variants: 1 target + 2 plausible distractors. An “Anker 65W USB-C Charger” produces {USB-C, Lightning, HDMI}.
  • The verifier checks composite keys (product_id, variant_id), so picking the right product with the wrong variant does not count.
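A minimal version of that check, assuming the cart and the hidden goal are both lists of (product_id, variant_id, qty) tuples and the function name is illustrative:

```python
# Sketch of the E_CART verifier: order-insensitive exact match on
# (product_id, variant_id, qty) lines. The function name is illustrative.
from collections import Counter

def verify_cart(cart_lines, goal_lines):
    """True iff the cart matches the goal as a multiset of tuples."""
    return Counter(cart_lines) == Counter(goal_lines)
```

Comparing multisets rather than sets means a wrong quantity fails just like a wrong variant, and the order items were added in is irrelevant.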

    Originally published at huggingface.co. Curated by AI Maestro.
