Notebook 02 — Prompts That Work¶
Premise: Prompts are a form of code. A bad prompt buries the model. A good prompt is short, specific, and structured. The same model gives a 9/10 answer to one prompt and a 3/10 to another — the difference is craft.
You'll learn:
- Anatomy of a good prompt — role, task, constraints, format, examples.
- The five anti-patterns that wreck most prompts.
- Plan · Do · Check — split a task into three separate prompts and watch quality jump. This is the foundation of a self-correcting agent.
By the end you will have:
- Compared 4 versions of the same prompt and seen output quality climb
- Recognized vague / kitchen-sink / contradictory / buried-lede / hidden-context prompts
- Built a 3-stage Plan → Do → Check pipeline by hand
- Made the agent catch its own mistakes
Setup¶
from textwrap import dedent
from dotenv import load_dotenv
load_dotenv()
from arcllm import load_model, Message
from rich import print
from rich.panel import Panel
from rich.console import Console
console = Console()
model = load_model('anthropic')
async def ask(prompt: str, *, system: str | None = None) -> str:
msgs = []
if system:
msgs.append(Message(role='system', content=system))
msgs.append(Message(role='user', content=prompt))
resp = await model.invoke(msgs)
return resp.content or ''
1. Good vs bad — same task, four prompts¶
We'll ask for a paper summary four ways, from worst to best. Watch the output get more useful with each iteration.
PAPER = dedent('''
We present a method for accelerating tokamak plasma simulations using
a graph-neural-network surrogate trained on 12,000 hours of HPC runs.
The surrogate predicts edge-plasma density 200x faster than the
reference fluid solver, with mean absolute error under 4% across
held-out shots. We do not address pellet injection or RMP coil
configurations. Code available; data released under DOE-cleared
embargo until 2027.
''').strip()
PROMPTS = {
'BAD (vague)':
'Summarize this:\n\n' + PAPER,
'MEDIUM (specifies length)':
f'Summarize this in 2 sentences:\n\n{PAPER}',
'GOOD (structured ask)': dedent(f'''
Summarize this paper as 3 bullets:
- Contribution (what is new)
- Method (how)
- Limitations (what it does NOT do)
Paper:
{PAPER}
''').strip(),
'GREAT (role + format + bound)': dedent(f'''
You are a national-lab program manager skimming proposals.
Read the paper and output exactly this Markdown table:
| Field | Value |
|---|---|
| Contribution | _≤12 words_ |
| Method | _≤12 words_ |
| Speedup | _quote the number_ |
| Accuracy | _quote the number_ |
| Out of scope | _≤12 words_ |
| Availability | _quote the wording_ |
Paper:
{PAPER}
''').strip(),
}
for label, prompt in PROMPTS.items():
out = await ask(prompt)
console.print(Panel(out, title=f'[bold]{label}[/bold]', border_style='cyan'))
2. The five anti-patterns¶
If you can name what's wrong with a prompt, you can fix it. These are the five wrecks I see most often:
| Anti-pattern | Example | Fix |
|---|---|---|
| Vague | "make it better" | Specify what better means: shorter? more cited? more concrete? |
| Kitchen-sink | 30 instructions in one prompt | Split into stages. The model can't track 30 things at once. |
| Contradictory | "be exhaustive but terse" | Pick one. Or rank: "as terse as you can while still naming each gene." |
| Hidden context | "as we discussed earlier" | Restate the context. The model didn't read your last meeting. |
| Buried lede | actual task in line 47 of background | Put the task in the first line. Move background to the end. |
Try it: take the BAD prompt above and rewrite it. Which anti-pattern was it? What did you fix?
3. Plan · Do · Check — three prompts, one task¶
Most quality failures come from asking the model to think and execute and validate in one shot. Split them. Use a different prompt for each stage.
- PLAN: think first, output a structured plan (no execution yet)
- DO: take the plan, execute it
- CHECK: compare the output against the plan, flag any gaps
We'll do structured data extraction as the example. The same pattern applies to anything: code generation, paper review, log analysis, experimental design.
import json
INCIDENT = dedent('''
Around 09:02 on April 12, run 42 hit a DRAM ECC error on node-7.
The job auto-rerouted to node-8 and finished by 09:05.
Operator on shift was Singh. No data loss reported.
''').strip()
Stage 1 — PLAN¶
We don't extract anything yet. We ask the model to list the fields it would extract, and what type each is. This forces it to think before it speaks.
PLAN_SYSTEM = (
'You are a planning step. Do NOT extract data. '
'Output ONLY a JSON array of {"field": ..., "type": ..., "required": ...} objects '
'describing what should be extracted from the incident report.'
)
plan_text = await ask(
f'Plan extraction fields for this incident report:\n\n{INCIDENT}',
system=PLAN_SYSTEM,
)
console.print(Panel(plan_text, title='PLAN', border_style='yellow'))
# parse the JSON (strip code fences if the model used them)
import re
json_text = re.search(r'\[.*\]', plan_text, re.DOTALL).group(0)
plan = json.loads(json_text)
print(f'\nplan parsed: {len(plan)} fields')
Stage 2 — DO¶
Now extract, using the plan as the spec. The model can't drift to a different schema — the plan locks it in.
DO_SYSTEM = (
'You are an extraction step. Extract values for EXACTLY the fields listed in the plan. '
'Output ONLY a JSON object keyed by field name. Use null when a value is missing.'
)
do_text = await ask(
dedent(f'''
Plan:
{json.dumps(plan, indent=2)}
Incident report:
{INCIDENT}
''').strip(),
system=DO_SYSTEM,
)
console.print(Panel(do_text, title='DO', border_style='green'))
extracted = json.loads(re.search(r'\{.*\}', do_text, re.DOTALL).group(0))
print(f'\nextracted {len(extracted)} fields')
Stage 3 — CHECK¶
A fresh prompt with no investment in the prior output. It compares the extracted JSON against the plan and reports any drift. This is your validation.
CHECK_SYSTEM = (
'You are a validation step. Compare extracted output against the plan. '
'Output ONLY JSON: {"ok": bool, "missing": [field...], "extra": [field...], "type_mismatches": [...]}'
)
check_text = await ask(
dedent(f'''
Plan:
{json.dumps(plan, indent=2)}
Extracted:
{json.dumps(extracted, indent=2)}
''').strip(),
system=CHECK_SYSTEM,
)
console.print(Panel(check_text, title='CHECK', border_style='red'))
verdict = json.loads(re.search(r'\{.*\}', check_text, re.DOTALL).group(0))
print(f'\nok={verdict.get("ok")}')
4. Why this matters¶
Three observations from the run above:
- Each stage has a narrow job. PLAN doesn't extract. DO doesn't plan. CHECK doesn't trust either of them. Narrow jobs = fewer failure modes per prompt.
- The output of each stage is structured (JSON, not prose). That's what makes it composable — the next stage can consume the previous stage instead of re-reading it.
- CHECK is independent. It doesn't see the original task — only the plan and the output. That breaks the model's tendency to defend its own answer.
When we put this into a loop in the next notebook, the agent will
automatically run plan → do → check → do → check → … until check
passes. That's a self-correcting agent, and it's just three
prompts and a while.
Takeaway¶
- Prompts are code. Treat them with the same care.
- Anatomy of a good prompt: role · task · constraints · format · examples.
- The five anti-patterns: vague, kitchen-sink, contradictory, hidden context, buried lede.
- Plan · Do · Check turns one fragile prompt into three robust ones. This is the single highest-leverage pattern you'll learn this week.
Next: 03 — The Loop. We hand the planning
and the doing to arcrun.run() and let the model decide which tool
to call when.