Notebook 01 — From Chat to Coworker¶

Premise: A chat window is one prompt at a time. A harness is the structure around the model — the message thread, the system prompt, the tools, the budget, the audit trail — that turns a one-shot model into something you can hand a real job to.

In this notebook we build the simplest possible harness — just structured calls — and prove that the model is interchangeable. Anthropic, OpenAI, Ollama, vLLM: same code, different brain.

By the end you will have:

  • Made a raw LLM call with arcllm.load_model()
  • Steered behavior with a system prompt
  • Held a multi-turn conversation by managing the message list
  • Swapped models without changing your code
  • Inspected token usage and cost

Setup¶

In [ ]:
import os
from dotenv import load_dotenv
load_dotenv()  # picks up ANTHROPIC_API_KEY etc. from ../.env

from arcllm import load_model, Message
from rich import print

1. The simplest possible call¶

load_model(provider) returns an adapter. invoke(messages) returns an LLMResponse. That is the whole API at this layer.

In [ ]:
model = load_model('anthropic')

resp = await model.invoke([
    Message(role='user', content='In one sentence: what is plasma confinement?')
])
print(resp.content)

2. Steering with a system prompt¶

The system prompt sets the role. Same model, very different output.

In [ ]:
SYSTEM_GENERIC = 'You are a helpful assistant.'
SYSTEM_SCIENTIST = (
    'You are a senior plasma physicist at a national lab. '
    'Answer in 2-3 dense sentences. Use specifics. No hedging.'
)

QUESTION = 'What is the biggest open problem in fusion plasma confinement?'

for label, sys_prompt in [('generic', SYSTEM_GENERIC), ('scientist', SYSTEM_SCIENTIST)]:
    resp = await model.invoke([
        Message(role='system', content=sys_prompt),
        Message(role='user', content=QUESTION),
    ])
    print(f'\n[bold]{label}[/bold]')
    print(resp.content)

3. Multi-turn — the model has no memory¶

Models are stateless between calls. The 'conversation' is just the list of messages you pass in each time. That is what a harness manages: the running thread.

In [ ]:
history: list[Message] = [
    Message(role='system', content=SYSTEM_SCIENTIST),
]

async def turn(user_text: str) -> str:
    history.append(Message(role='user', content=user_text))
    resp = await model.invoke(history)
    history.append(Message(role='assistant', content=resp.content or ''))
    return resp.content or ''

print('Q1:', await turn('What is ITER?'))
print('\nQ2:', await turn("What's the most important milestone left for it?"))
print('\nQ3:', await turn('Why does that matter for stellarators?'))

print(f'\n[dim]history now has {len(history)} messages[/dim]')

4. Swap the model — same code, different brain¶

If you have an OpenAI key, this just works. If you don't, skip — the point is the code didn't change, only the provider string.

For air-gapped labs: change 'openai' to 'ollama' (with model='llama3.1') or 'vllm'. The harness is the same.

In [ ]:
from contextlib import suppress

QUESTION = 'In one sentence, what is plasma?'
msgs = [Message(role='user', content=QUESTION)]

for provider in ['anthropic', 'openai']:
    with suppress(Exception) as _:
        m = load_model(provider)
        r = await m.invoke(msgs)
        print(f'[bold]{provider}[/bold] ({r.model}): {r.content}')

5. Inspect what came back¶

Every response carries usage, cost, stop reason, and provider metadata. This is what a harness uses to enforce budgets, retry on truncation, and audit the call.

In [ ]:
resp = await model.invoke([Message(role='user', content='Name one thing.')])

print('model:       ', resp.model)
print('stop_reason: ', resp.stop_reason)
print('usage:       ', resp.usage)
print('cost_usd:    ', resp.cost_usd)
print('content:     ', resp.content)

Takeaway¶

  • A chat call is one shot. A harness is the code that manages many shots — message list, system prompt, model choice, budget, retries, observability.
  • The model is interchangeable. Anthropic, OpenAI, Ollama, vLLM — same Message types, same invoke() call.
  • This is already enough to build something useful. Most 'AI features' in production are exactly this loop, with a little plumbing.

Next: 02 — Prompts That Work. Same model, very different output. The craft of prompting.