Notebook 02 — Prompts That Work¶

Premise: Prompts are a form of code. A bad prompt buries the model. A good prompt is short, specific, and structured. The same model gives a 9/10 answer to one prompt and a 3/10 to another — the difference is craft.

You'll learn:

  1. Anatomy of a good prompt — role, task, constraints, format, examples.
  2. The five anti-patterns that wreck most prompts.
  3. Plan · Do · Check — split a task into three separate prompts and watch quality jump. This is the foundation of a self-correcting agent.

By the end you will have:

  • Compared 4 versions of the same prompt and seen output quality climb
  • Recognized vague / kitchen-sink / contradictory / buried-lede / hidden-context prompts
  • Built a 3-stage Plan → Do → Check pipeline by hand
  • Made the agent catch its own mistakes

Setup¶

In [ ]:
from textwrap import dedent
from dotenv import load_dotenv
load_dotenv()

from arcllm import load_model, Message
from rich import print
from rich.panel import Panel
from rich.console import Console

console = Console()
model = load_model('anthropic')

async def ask(prompt: str, *, system: str | None = None) -> str:
    msgs = []
    if system:
        msgs.append(Message(role='system', content=system))
    msgs.append(Message(role='user', content=prompt))
    resp = await model.invoke(msgs)
    return resp.content or ''

1. Good vs bad — same task, four prompts¶

We'll ask for a paper summary four ways, from worst to best. Watch the output get more useful with each iteration.

In [ ]:
PAPER = dedent('''
    We present a method for accelerating tokamak plasma simulations using
    a graph-neural-network surrogate trained on 12,000 hours of HPC runs.
    The surrogate predicts edge-plasma density 200x faster than the
    reference fluid solver, with mean absolute error under 4% across
    held-out shots. We do not address pellet injection or RMP coil
    configurations. Code available; data released under DOE-cleared
    embargo until 2027.
''').strip()

PROMPTS = {
    'BAD (vague)':
        'Summarize this:\n\n' + PAPER,
    'MEDIUM (specifies length)':
        f'Summarize this in 2 sentences:\n\n{PAPER}',
    'GOOD (structured ask)': dedent(f'''
        Summarize this paper as 3 bullets:
        - Contribution (what is new)
        - Method (how)
        - Limitations (what it does NOT do)

        Paper:
        {PAPER}
    ''').strip(),
    'GREAT (role + format + bound)': dedent(f'''
        You are a national-lab program manager skimming proposals.
        Read the paper and output exactly this Markdown table:

        | Field | Value |
        |---|---|
        | Contribution | _≤12 words_ |
        | Method | _≤12 words_ |
        | Speedup | _quote the number_ |
        | Accuracy | _quote the number_ |
        | Out of scope | _≤12 words_ |
        | Availability | _quote the wording_ |

        Paper:
        {PAPER}
    ''').strip(),
}

for label, prompt in PROMPTS.items():
    out = await ask(prompt)
    console.print(Panel(out, title=f'[bold]{label}[/bold]', border_style='cyan'))

2. The five anti-patterns¶

If you can name what's wrong with a prompt, you can fix it. These are the five wrecks I see most often:

Anti-pattern Example Fix
Vague "make it better" Specify what better means: shorter? more cited? more concrete?
Kitchen-sink 30 instructions in one prompt Split into stages. The model can't track 30 things at once.
Contradictory "be exhaustive but terse" Pick one. Or rank: "as terse as you can while still naming each gene."
Hidden context "as we discussed earlier" Restate the context. The model didn't read your last meeting.
Buried lede actual task in line 47 of background Put the task in the first line. Move background to the end.

Try it: take the BAD prompt above and rewrite it. Which anti-pattern was it? What did you fix?

3. Plan · Do · Check — three prompts, one task¶

Most quality failures come from asking the model to think and execute and validate in one shot. Split them. Use a different prompt for each stage.

  • PLAN: think first, output a structured plan (no execution yet)
  • DO: take the plan, execute it
  • CHECK: compare the output against the plan, flag any gaps

We'll do structured data extraction as the example. The same pattern applies to anything: code generation, paper review, log analysis, experimental design.

In [ ]:
import json

INCIDENT = dedent('''
    Around 09:02 on April 12, run 42 hit a DRAM ECC error on node-7.
    The job auto-rerouted to node-8 and finished by 09:05.
    Operator on shift was Singh. No data loss reported.
''').strip()

Stage 1 — PLAN¶

We don't extract anything yet. We ask the model to list the fields it would extract, and what type each is. This forces it to think before it speaks.

In [ ]:
PLAN_SYSTEM = (
    'You are a planning step. Do NOT extract data. '
    'Output ONLY a JSON array of {"field": ..., "type": ..., "required": ...} objects '
    'describing what should be extracted from the incident report.'
)

plan_text = await ask(
    f'Plan extraction fields for this incident report:\n\n{INCIDENT}',
    system=PLAN_SYSTEM,
)
console.print(Panel(plan_text, title='PLAN', border_style='yellow'))

# parse the JSON (strip code fences if the model used them)
import re
json_text = re.search(r'\[.*\]', plan_text, re.DOTALL).group(0)
plan = json.loads(json_text)
print(f'\nplan parsed: {len(plan)} fields')

Stage 2 — DO¶

Now extract, using the plan as the spec. The model can't drift to a different schema — the plan locks it in.

In [ ]:
DO_SYSTEM = (
    'You are an extraction step. Extract values for EXACTLY the fields listed in the plan. '
    'Output ONLY a JSON object keyed by field name. Use null when a value is missing.'
)

do_text = await ask(
    dedent(f'''
        Plan:
        {json.dumps(plan, indent=2)}

        Incident report:
        {INCIDENT}
    ''').strip(),
    system=DO_SYSTEM,
)
console.print(Panel(do_text, title='DO', border_style='green'))

extracted = json.loads(re.search(r'\{.*\}', do_text, re.DOTALL).group(0))
print(f'\nextracted {len(extracted)} fields')

Stage 3 — CHECK¶

A fresh prompt with no investment in the prior output. It compares the extracted JSON against the plan and reports any drift. This is your validation.

In [ ]:
CHECK_SYSTEM = (
    'You are a validation step. Compare extracted output against the plan. '
    'Output ONLY JSON: {"ok": bool, "missing": [field...], "extra": [field...], "type_mismatches": [...]}'
)

check_text = await ask(
    dedent(f'''
        Plan:
        {json.dumps(plan, indent=2)}

        Extracted:
        {json.dumps(extracted, indent=2)}
    ''').strip(),
    system=CHECK_SYSTEM,
)
console.print(Panel(check_text, title='CHECK', border_style='red'))

verdict = json.loads(re.search(r'\{.*\}', check_text, re.DOTALL).group(0))
print(f'\nok={verdict.get("ok")}')

4. Why this matters¶

Three observations from the run above:

  1. Each stage has a narrow job. PLAN doesn't extract. DO doesn't plan. CHECK doesn't trust either of them. Narrow jobs = fewer failure modes per prompt.
  2. The output of each stage is structured (JSON, not prose). That's what makes it composable — the next stage can consume the previous stage instead of re-reading it.
  3. CHECK is independent. It doesn't see the original task — only the plan and the output. That breaks the model's tendency to defend its own answer.

When we put this into a loop in the next notebook, the agent will automatically run plan → do → check → do → check → … until check passes. That's a self-correcting agent, and it's just three prompts and a while.

Takeaway¶

  • Prompts are code. Treat them with the same care.
  • Anatomy of a good prompt: role · task · constraints · format · examples.
  • The five anti-patterns: vague, kitchen-sink, contradictory, hidden context, buried lede.
  • Plan · Do · Check turns one fragile prompt into three robust ones. This is the single highest-leverage pattern you'll learn this week.

Next: 03 — The Loop. We hand the planning and the doing to arcrun.run() and let the model decide which tool to call when.