For makers and artists building creative tools, the lesson here is clear: stop guessing. When you need an AI to follow strict output rules (for generated code, audio stems, or structured data), you cannot rely on a single, static instruction. You need a system that learns from failure. The GEPA framework (Genetic-Pareto) treats prompt engineering not as a one-off task but as an iterative process. It takes a weak starting instruction, tests it against a set of problems with known answers, and uses a separate “reflection” model to diagnose why each output failed. Feeding that specific feedback back into the loop lets the prompt evolve until it produces the desired result far more consistently, which makes your creative tools more robust and reliable.
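The cycle described above can be sketched without any model calls. In the toy below, `toy_evaluate` and `toy_reflect` are hypothetical stand-ins for the task model plus grader and for the reflection model. This is not GEPA's actual algorithm, only the evaluate, diagnose, revise cycle in its simplest greedy form.

```python
def toy_evaluate(prompt: str) -> tuple[float, str]:
    """Hypothetical grader: rewards prompts that contain two required rules."""
    feedback = []
    if "step by step" not in prompt:
        feedback.append("Reasoning was skipped; ask for step by step work.")
    if "####" not in prompt:
        feedback.append("Format violation; require a final '#### <integer>' line.")
    score = 1.0 - 0.5 * len(feedback)
    return score, " ".join(feedback) or "All checks passed."


def toy_reflect(prompt: str, feedback: str) -> str:
    """Hypothetical critic: turns one diagnosed failure into a prompt revision."""
    if "step by step" in feedback:
        return prompt + " Work step by step."
    if "####" in feedback:
        return prompt + " End with a line '#### <integer>'."
    return prompt


def evolve(seed_prompt: str, max_rounds: int = 5) -> tuple[str, float]:
    """Greedy loop: score, read the feedback, revise, keep only improvements."""
    best_prompt = seed_prompt
    best_score, feedback = toy_evaluate(best_prompt)
    for _ in range(max_rounds):
        if best_score == 1.0:
            break
        candidate = toy_reflect(best_prompt, feedback)
        score, cand_feedback = toy_evaluate(candidate)
        if score > best_score:  # keep a revision only if it scores higher
            best_prompt, best_score, feedback = candidate, score, cand_feedback
    return best_prompt, best_score


final_prompt, final_score = evolve("Solve the problem.")
print(final_score)
print(final_prompt)
```

The point of the sketch is that the feedback text, not just the score, decides what the next revision looks like.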
Setting up the Evolutionary Framework
To demonstrate this approach, we configure the environment to handle arithmetic word problems, a task that requires precise logic. We install the necessary libraries and set up two distinct models: one to act as the “task solver” and another as the “critic” or reflection engine. This separation is crucial: the model that generates a solution is never the one that critiques it.
!pip install -q gepa litellm
import os, re, json, random, getpass, textwrap
import litellm
import gepa.optimize_anything as oa
from gepa.optimize_anything import (
    optimize_anything, GEPAConfig, EngineConfig, ReflectionConfig,
)
litellm.suppress_debug_info = True
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
TASK_LM = "openai/gpt-4o-mini"
REFLECTION_LM = "openai/gpt-4.1"
MAX_METRIC_CALLS = 100

We restrict the budget for metric calls to prevent runaway costs and define the specific models to be used. The task model handles the actual generation, while the reflection model is tasked with analysing errors and suggesting improvements.
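One way to picture a metric-call budget is as a counter wrapped around the scoring function. The sketch below is illustrative only: `BudgetedMetric` and `BudgetExhausted` are hypothetical names, not part of `gepa`, and how the library enforces its own limit is not shown in this article.

```python
class BudgetExhausted(RuntimeError):
    """Raised when the (hypothetical) metric-call budget is used up."""


class BudgetedMetric:
    """Wrap a scoring function and refuse calls beyond a fixed budget."""

    def __init__(self, metric, max_calls: int):
        self.metric = metric
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, *args, **kwargs):
        if self.calls >= self.max_calls:
            raise BudgetExhausted(f"budget of {self.max_calls} metric calls spent")
        self.calls += 1
        return self.metric(*args, **kwargs)


# Toy metric: 1.0 if the prediction matches the gold answer, else 0.0.
exact_match = BudgetedMetric(lambda pred, gold: float(pred == gold), max_calls=3)

scores = [exact_match(p, 42) for p in (42, 41, 42)]
print(scores, exact_match.calls)

try:
    exact_match(42, 42)  # a fourth call exceeds the budget of three
    stopped = False
except BudgetExhausted:
    stopped = True
print("stopped:", stopped)
```

Assuming each scored (prompt, problem) pair costs one call, a budget of 100 covers at most 8 full passes over 12 training problems (100 // 12), so every evaluation needs to return useful feedback.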
Creating a Deterministic Benchmark
Before the AI can learn, we must have a reliable test set. We generate a deterministic dataset of arithmetic word problems covering four specific scenarios: discounts, travel distance, wallet calculations, and chained mathematical operations. Because these problems are generated programmatically, we know the exact correct answer for every single instance, removing ambiguity from the evaluation process.
def make_problems(n, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        t = rng.choice(["discount", "travel", "wallet", "chain"])
        if t == "discount":
            unit = rng.choice([40, 60, 80, 120])
            qty = rng.choice([5, 6, 8, 10])
            disc = rng.choice([10, 20, 25, 50])
            total = unit * qty
            gold = total - total * disc // 100
            q = (f"A shop sells notebooks at {unit} rupees each. You buy {qty} "
                 f"notebooks and get a {disc}% discount on the total bill. "
                 f"How many rupees do you pay in total?")
        elif t == "travel":
            s1, h1 = rng.choice([40, 50, 60]), rng.choice([2, 3])
            s2, h2 = rng.choice([30, 45, 70]), rng.choice([1, 2, 3])
            gold = s1 * h1 + s2 * h2
            q = (f"A car drives at {s1} km/h for {h1} hours, then at {s2} km/h "
                 f"for {h2} hours. What is the total distance travelled, in km?")
        elif t == "wallet":
            tens = rng.choice([3, 5, 7, 9])
            fifties = rng.choice([2, 4, 6])
            spent = rng.choice([50, 80, 110, 150])
            gold = tens * 10 + fifties * 50 - spent
            q = (f"You have {tens} ten-rupee notes and {fifties} fifty-rupee "
                 f"notes. You spend {spent} rupees. How many rupees are left?")
        else:
            x = rng.choice([6, 9, 12, 15])
            y = rng.choice([4, 7, 10])
            z = rng.choice([3, 8, 11])
            gold = x * 2 - y + z
            q = (f"Start with the number {x}. Double it, then subtract {y}, "
                 f"then add {z}. What number do you end with?")
        out.append({"question": q, "answer": gold})
    return out

all_problems = make_problems(18, seed=42)
random.Random(1).shuffle(all_problems)
trainset = all_problems[:12]
valset = all_problems[12:]
print(f"Dataset: {len(trainset)} train / {len(valset)} val problems\n")

We split this dataset into a training set of twelve problems to drive the optimisation and a validation set to test generalisation later. This structure mimics real-world development where you optimise on a subset before testing on unseen data.
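To make the gold-answer arithmetic concrete, the sketch below recomputes one hand-picked instance of each template using the same formulas as `make_problems`, then checks that a seeded `random.Random` replays the same choices. The specific numbers and the helper `draw` are illustrative and are not taken from the generated dataset.

```python
import random

# Discount: 5 notebooks at 80 rupees each, 25% off the total.
total = 80 * 5                               # 400
discount_gold = total - total * 25 // 100    # 400 - 100 = 300

# Travel: 50 km/h for 2 hours, then 45 km/h for 3 hours.
travel_gold = 50 * 2 + 45 * 3                # 100 + 135 = 235

# Wallet: 7 ten-rupee notes and 4 fifty-rupee notes, spend 110.
wallet_gold = 7 * 10 + 4 * 50 - 110          # 70 + 200 - 110 = 160

# Chain: start at 12, double it, subtract 7, add 8.
chain_gold = 12 * 2 - 7 + 8                  # 24 - 7 + 8 = 25

print(discount_gold, travel_gold, wallet_gold, chain_gold)


def draw(seed: int, k: int = 5) -> list[str]:
    """Replay k template choices from a privately seeded generator."""
    rng = random.Random(seed)
    return [rng.choice(["discount", "travel", "wallet", "chain"]) for _ in range(k)]


# Same seed, same sequence: this is what makes the benchmark reproducible.
print(draw(42) == draw(42))
```

The wallet template can go negative (3 tens and 2 fifties minus 150 gives -20). The answer regex in the next section accepts a leading minus sign, so such cases are still scored.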
Building the Evaluator and Structured Feedback Loop
The core of GEPA is how it interprets failure. We build a system that takes a candidate prompt and runs it against a question. The output is then parsed to check two things: the mathematical accuracy and the formatting rules. If the model fails, the evaluator does not just say “wrong”; it generates a specific, structured feedback message explaining the error.
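The outcomes described above can be exercised offline with canned model outputs. The sketch below mirrors the article's regex and scoring tiers but replaces the model call with hard-coded strings. `score_output` is a hypothetical helper name, and the feedback strings are abbreviated.

```python
import re


def score_output(text: str, gold: int) -> tuple[float, str]:
    """Score one raw model output against a known integer answer."""
    formatted = re.search(r"####\s*(-?\d+)", text)
    numbers = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(numbers[-1]) if numbers else None

    if fmt_val is not None:
        if fmt_val == gold:
            return 1.0, "correct and correctly formatted"
        return 0.0, f"wrong answer: got {fmt_val}, expected {gold}"
    if last_val == gold:
        return 0.5, "right number, format violation"
    return 0.0, f"wrong: last number was {last_val}, expected {gold}"


# Canned outputs standing in for the task model (assumed, not real responses).
canned = {
    "formatted_correct": "400 - 100 = 300\n#### 300",
    "unformatted_correct": "You pay 300",
    "formatted_wrong": "400 - 25 = 375\n#### 375",
    "no_number": "I cannot solve this.",
}
results = {name: score_output(text, gold=300) for name, text in canned.items()}
for name, (score, feedback) in results.items():
    print(f"{name}: {score} ({feedback})")
```

Partial credit for a correct but unformatted number gives the reflection model a distinct signal: the arithmetic is fine and only the output rule needs reinforcing. The full evaluator that follows applies the same tiers to live model output.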
def build_system_prompt(candidate: dict) -> str:
    # Combine the candidate's two text fields into one system prompt.
    return (f"{candidate['instructions']}\n\n"
            f"OUTPUT FORMAT RULES:\n{candidate['format_rules']}")

def call_task_lm(system_prompt: str, question: str) -> str:
    # Try up to three times; the final failure is returned as an error string.
    for attempt in range(3):
        try:
            r = litellm.completion(
                model=TASK_LM,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user", "content": question}],
                temperature=0, max_tokens=600, timeout=60,
            )
            return r["choices"][0]["message"]["content"] or ""
        except Exception as e:
            if attempt == 2:
                return f"[LM_ERROR] {e}"
    return ""

def parse_answers(text: str):
    # fmt_val: the integer after '####', if present.
    # last_val: the last integer anywhere in the text, used for partial credit.
    formatted = re.search(r"####\s*(-?\d+)", text)
    all_nums = re.findall(r"-?\d+", text)
    fmt_val = int(formatted.group(1)) if formatted else None
    last_val = int(all_nums[-1]) if all_nums else None
    return fmt_val, last_val

def evaluate(candidate: dict, example: dict):
    system = build_system_prompt(candidate)
    raw = call_task_lm(system, example["question"])
    gold = example["answer"]
    fmt_val, last_val = parse_answers(raw)
    if fmt_val is not None and fmt_val == gold:
        # Right answer in the required format: full credit.
        score, fb = 1.0, "Correct and correctly formatted."
    elif fmt_val is not None and fmt_val != gold:
        # Required format, wrong number: no credit.
        score, fb = 0.0, (f"WRONG ANSWER. You output '#### {fmt_val}' but the "
                          f"correct answer is {gold}. Re-check the arithmetic and "
                          f"the order of the steps.")
    elif last_val == gold:
        # Right number without the '####' line: half credit.
        score, fb = 0.5, (f"Right number ({gold}) but FORMAT VIOLATION: the final "
                          f"line was not exactly '#### {gold}'. Always end with a "
                          f"line of the form '#### <integer>' and nothing else.")
    else:
        # No '####' line and the last number is wrong or missing: no credit.
        score, fb = 0.0, (f"WRONG. Correct answer is {gold}. The model's final "
                          f"number was {last_val}. Likely a multi-step reasoning "
                          f"slip; show each step and verify before answering.")
    oa.log(f"score={score




