Building Reflective Prompt Optimization with GEPA: Multi-Component Prompts, Structured Feedback, and Held-Out Validation

For makers and artists building creative tools, the lesson here is clear: stop guessing. When you need an AI to follow strict output rules (generating code, audio stems, or structured data, for example), you cannot rely on a single, static instruction. You need a system that learns from failure. GEPA (Genetic-Pareto) treats prompt engineering not as a one-off task but as an iterative process. It takes a weak starting instruction, tests it against a known set of problems, and uses a separate “reflection” model to diagnose why the output failed. That specific feedback is fed back into the system, and the prompt evolves over successive rounds toward one that produces the desired result more consistently, which makes the creative tools built on it more robust.
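
The evaluate, reflect, revise cycle described above can be sketched in a few lines. This is a toy illustration only: `toy_evaluate`, `toy_reflect`, and `reflective_loop` are hypothetical stand-ins that call no model, and the loop is a simple greedy hill-climb rather than GEPA's own candidate selection.

```python
def toy_evaluate(prompt: str) -> tuple[float, str]:
    """Stand-in evaluator (hypothetical): returns (score, feedback), no model call."""
    if "####" not in prompt:
        return 0.5, "FORMAT VIOLATION: final line must be '#### <integer>'."
    if "step" not in prompt:
        return 0.7, "Arithmetic slips: ask for step-by-step working."
    return 1.0, "Correct and correctly formatted."


def toy_reflect(prompt: str, feedback: str) -> str:
    """Stand-in reflection step (hypothetical): turns feedback into a revision."""
    if "FORMAT" in feedback:
        return prompt + " End with a line of the form '#### <integer>'."
    if "step" in feedback:
        return prompt + " Show each step before answering."
    return prompt


def reflective_loop(seed, evaluate, reflect, rounds=5):
    """Greedy loop: keep a revised prompt only if it scores strictly higher."""
    best = seed
    best_score, feedback = evaluate(best)
    for _ in range(rounds):
        if best_score >= 1.0:
            break
        candidate = reflect(best, feedback)
        score, candidate_feedback = evaluate(candidate)
        if score > best_score:
            best, best_score, feedback = candidate, score, candidate_feedback
    return best, best_score


best_prompt, best_score = reflective_loop("Solve the problem.", toy_evaluate, toy_reflect)
print(best_score)
print(best_prompt)
```

The point of the sketch is the shape of the feedback: the evaluator returns a message as well as a number, and the revision step reads that message instead of mutating the prompt blindly.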

Setting up the Evolutionary Framework

To demonstrate this approach, we configure the environment to handle arithmetic word problems, a task that requires precise logic. We install the necessary libraries and set up two distinct models: one to act as the “task solver” and another as the “critic” or reflection engine. This separation matters: the task model only answers questions, while the reflection model reads the evaluator's feedback and proposes prompt revisions. Scoring itself is done by deterministic code against known answers, not by either model.

!pip install -q gepa litellm
import os, re, json, random, getpass, textwrap
import litellm
import gepa.optimize_anything as oa
from gepa.optimize_anything import (
   optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig,
)
litellm.suppress_debug_info = True
if not os.environ.get("OPENAI_API_KEY"):
   os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
TASK_LM        = "openai/gpt-4o-mini"
REFLECTION_LM  = "openai/gpt-4.1"
MAX_METRIC_CALLS = 100

We cap the budget at 100 metric calls (MAX_METRIC_CALLS) to prevent runaway costs and name the two models explicitly. The task model (TASK_LM) handles the actual generation, while the reflection model (REFLECTION_LM) is tasked with analysing errors and suggesting improvements.
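
To get a feel for what a budget of 100 buys, the arithmetic below estimates how many full passes over a twelve-problem training set it covers. It is a rough sketch that assumes one metric call means one candidate scored on one example; `budget_breakdown` is a hypothetical helper, not part of gepa.

```python
def budget_breakdown(max_metric_calls: int, trainset_size: int) -> tuple[int, int]:
    """Return (full passes over the training set, leftover calls).

    Assumption: one metric call = one candidate scored on one example.
    """
    if trainset_size <= 0:
        raise ValueError("trainset_size must be positive")
    return divmod(max_metric_calls, trainset_size)


full_passes, leftover = budget_breakdown(100, 12)
print(f"{full_passes} full passes, {leftover} calls left over")
# prints: 8 full passes, 4 calls left over
```

Under that assumption the budget is small, so it covers only a handful of candidate prompts. Reflection-model calls and any retries inside the task-model wrapper are extra API traffic on top of this count.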

Creating a Deterministic Benchmark

Before the AI can learn, we must have a reliable test set. We generate a deterministic dataset of arithmetic word problems covering four specific scenarios: discounts, travel distance, wallet calculations, and chained mathematical operations. Because these problems are generated programmatically, we know the exact correct answer for every single instance, removing ambiguity from the evaluation process.

def make_problems(n, seed=0):
   rng = random.Random(seed)
   out = []
   for _ in range(n):
       t = rng.choice(["discount", "travel", "wallet", "chain"])
       if t == "discount":
           unit  = rng.choice([40, 60, 80, 120])
           qty   = rng.choice([5, 6, 8, 10])
           disc  = rng.choice([10, 20, 25, 50])
           total = unit * qty
           gold  = total - total * disc // 100
           q = (f"A shop sells notebooks at {unit} rupees each. You buy {qty} "
                f"notebooks and get a {disc}% discount on the total bill. "
                f"How many rupees do you pay in total?")
       elif t == "travel":
           s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3])
           s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3])
           gold = s1 * h1 + s2 * h2
           q = (f"A car drives at {s1} km/h for {h1} hours, then at {s2} km/h "
                f"for {h2} hours. What is the total distance travelled, in km?")
       elif t == "wallet":
           tens   = rng.choice([3, 5, 7, 9])
           fifties= rng.choice([2, 4, 6])
           spent  = rng.choice([50, 80, 110, 150])
           gold = tens * 10 + fifties * 50 - spent
           q = (f"You have {tens} ten-rupee notes and {fifties} fifty-rupee "
                f"notes. You spend {spent} rupees. How many rupees are left?")
       else:
           x = rng.choice([6, 9, 12, 15]); y = rng.choice([4, 7, 10]); z = rng.choice([3, 8, 11])
           gold = x * 2 - y + z
           q = (f"Start with the number {x}. Double it, then subtract {y}, "
                f"then add {z}. What number do you end with?")
       out.append({"question": q, "answer": gold})
   return out
all_problems = make_problems(18, seed=42)
random.Random(1).shuffle(all_problems)
trainset = all_problems[:12]
valset   = all_problems[12:]
print(f"Dataset: {len(trainset)} train / {len(valset)} val problems\n")

We split this dataset into a training set of twelve problems to drive the optimisation and a validation set of the remaining six to test generalisation later. This structure mimics real-world development, where you optimise on a subset before testing on unseen data.
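
Because the generator draws from small sets of values, two problems can in principle come out identical, and a duplicate that lands on both sides of the split would weaken the held-out check. The sketch below shows one way to detect that; `find_overlap` and the toy data are hypothetical and only illustrate the idea.

```python
def find_overlap(train: list[dict], val: list[dict]) -> list[str]:
    """Return the questions that appear in both splits, sorted for stable output."""
    train_questions = {ex["question"] for ex in train}
    return sorted({ex["question"] for ex in val if ex["question"] in train_questions})


# Toy data in the same {"question", "answer"} shape as the benchmark.
toy_train = [
    {"question": "Double 6, then add 3.", "answer": 15},
    {"question": "Double 9, then add 3.", "answer": 21},
]
toy_val = [
    {"question": "Double 9, then add 3.", "answer": 21},
    {"question": "Double 12, then add 3.", "answer": 27},
]

leaked = find_overlap(toy_train, toy_val)
print(leaked)
# prints: ['Double 9, then add 3.']
```

Running the same check as `find_overlap(trainset, valset)` should return an empty list before any validation score is trusted.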

Building the Evaluator and Structured Feedback Loop

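The core of GEPA is how it interprets failure. We build a system that takes a candidate prompt, assembled from two components (instructions and format rules), and runs it against a question. The output is then parsed to check two things: the mathematical accuracy and the formatting rules. A correct answer in the '#### <integer>' format scores 1.0, the right number without that format scores 0.5, and anything else scores 0.0. If the model fails, the evaluator does not just say “wrong”; it generates a specific, structured feedback message explaining the error.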
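
The scoring rule can be exercised offline on canned outputs, with no API call, which is a cheap way to confirm the parser behaves as intended before spending any budget. In this sketch `score_output` and its labels are hypothetical; the two regular expressions mirror the ones used in the evaluator.

```python
import re


def parse_answers(text: str):
    """Return (value on a '#### <int>' marker, last integer anywhere in the text)."""
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val


def score_output(raw: str, gold: int) -> tuple[float, str]:
    """Hypothetical helper: apply the 1.0 / 0.5 / 0.0 rule to one raw output."""
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None:
        return (1.0, "correct") if fmt_val == gold else (0.0, "wrong_answer")
    if last_val == gold:
        return 0.5, "format_violation"
    return 0.0, "wrong"


canned = [
    "60 * 2 = 120, 30 * 2 = 60, total 180\n#### 180",  # right and formatted
    "Total: 120 + 60 = 180 km",                        # right, marker missing
    "#### 170",                                        # formatted but wrong
    "Roughly 175 km",                                  # wrong, no marker
]
results = [score_output(text, gold=180) for text in canned]
for text, result in zip(canned, results):
    print(result, "<-", text.splitlines()[-1])
```

Note that the fallback takes the last integer anywhere in the text, so an output that ends with a unit or a trailing remark containing a number can be scored on the wrong value.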

def build_system_prompt(candidate: dict) -> str:
    return (f"{candidate['instructions']}\n\n"
            f"OUTPUT FORMAT RULES:\n{candidate['format_rules']}")

def call_task_lm(system_prompt: str, question: str) -> str:
    for attempt in range(3):
        try:
            r = litellm.completion(
                model=TASK_LM,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user",   "content": question}],
                temperature=0, max_tokens=600, timeout=60,
            )
            return r["choices"][0]["message"]["content"] or ""
        except Exception as e:
            if attempt == 2:
                return f"[LM_ERROR] {e}"
    return ""

def parse_answers(text: str):
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums  = re.findall(r"-?\d+", text)
    fmt_val   = int(formatted.group(1)) if formatted else None
    last_val  = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val

def evaluate(candidate: dict, example: dict):
    system = build_system_prompt(candidate)
    raw    = call_task_lm(system, example["question"])
    gold   = example["answer"]
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None and fmt_val == gold:
        score, fb = 1.0, "Correct and correctly formatted."
    elif fmt_val is not None and fmt_val != gold:
        score, fb = 0.0, (f"WRONG ANSWER. You output '#### {fmt_val}' but the "
                          f"correct answer is {gold}. Re-check the arithmetic and "
                          f"the order of the steps.")
    elif last_val == gold:
        score, fb = 0.5, (f"Right number ({gold}) but FORMAT VIOLATION: the final "
                          f"line was not exactly '#### {gold}'. Always end with a "
                          f"line of the form '#### <integer>' and nothing else.")
    else:
        score, fb = 0.0, (f"WRONG. Correct answer is {gold}. The model's final "
                          f"number was {last_val}. Likely a multi-step reasoning "
                          f"slip; show each step and verify before answering.")
    oa.log(f"score={score