Use this page when you are comparing agents, prompts, models, or strategy workflows. Evaluation should be reproducible, mostly read-only, and separated from live execution.

Documentation Index
Fetch the complete documentation index at: https://docs.simplefunctions.dev/llms.txt
Use this file to discover all available pages before exploring further.
What to evaluate
| Layer | Question | Tooling |
|---|---|---|
| Command selection | Did the agent choose the right SimpleFunctions surface? | `sf describe --all --json`, `sf tools plan ... --json` |
| Context quality | Did it inspect the right market and use current state? | `sf world`, `sf discover`, `sf inspect`, `sf book` |
| Reasoning output | Did it separate evidence, uncertainty, and proposed action? | `sf agent --record-trace`, trace review |
| Side-effect safety | Did it avoid unapproved write, runtime, and trade commands? | `--allow`, `--deny`, manifest side-effect class |
| Strategy behavior | Would a trigger/price rule have worked historically? | `sf backtest ... --json` |
| Operational readiness | Can it recover, summarize, and produce audit material? | `sf trace receipt`, logs, replay |
Record traces
Record every candidate production run with `sf agent --record-trace`.

Backtest price rules
Use `sf backtest` for simple trigger rules:
| Field | Meaning |
|---|---|
| `dataPoints` | Number of historical points used; low values mean weak evidence. |
| `trades` | Entry/exit events produced by the rule. |
| `totalPnlCents` | Rule P&L in cents for the configured quantity. |
| `maxDrawdownCents` | Worst drawdown in cents. |
| `params` | The rule that was actually evaluated. |
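A minimal sketch of interpreting those fields: the field names come from the table above, but the surrounding JSON shape, the sample values, and the `MIN_DATA_POINTS` threshold are illustrative assumptions, not a SimpleFunctions-defined contract.

```python
import json

# Hypothetical `sf backtest ... --json` output; only the field names
# are taken from the docs table, the values are invented.
raw = '''{
  "dataPoints": 42,
  "trades": 3,
  "totalPnlCents": 1250,
  "maxDrawdownCents": -400,
  "params": {"rule": "price_above", "threshold": 105.0}
}'''

MIN_DATA_POINTS = 100  # illustrative evidence threshold, pick your own

def summarize(result: dict, min_points: int = MIN_DATA_POINTS) -> dict:
    """Flag weak evidence and convert cent fields to dollars."""
    return {
        "conclusive": result["dataPoints"] >= min_points,
        "pnl_dollars": result["totalPnlCents"] / 100,
        "max_drawdown_dollars": result["maxDrawdownCents"] / 100,
        "trades": result["trades"],
    }

summary = summarize(json.loads(raw))
print(summary)
```

Treating low `dataPoints` as "inconclusive" rather than "negative" mirrors the promotion gates and failure-mode guidance later on this page.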
Evaluation packet
For a model or prompt comparison, run the same packet for each candidate:

| Criterion | Pass condition |
|---|---|
| Current state | Used world/discover/inspect before making claims. |
| Market grounding | Named concrete tickers and prices. |
| Liquidity awareness | Checked spread, bid/ask, or depth before action. |
| Side-effect discipline | Did not run write/runtime/trade commands without approval. |
| Uncertainty | Stated gaps and weak evidence. |
| Next action | Proposed exact commands with side-effect level. |
| Auditability | Produced a trace and receipt. |
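The packet above can be scored into one row per task, which is also the shape the next section stores. A hedged sketch: the criterion keys mirror the table, but the row schema and field names are assumptions for illustration, not a SimpleFunctions-defined format.

```python
import json

# Criterion keys mirror the evaluation packet table above.
CRITERIA = [
    "current_state", "market_grounding", "liquidity_awareness",
    "side_effect_discipline", "uncertainty", "next_action", "auditability",
]

def score_run(task_id: str, candidate: str, results: dict) -> dict:
    """Build one JSON-serializable benchmark row per task."""
    row = {"task": task_id, "candidate": candidate}
    row.update({c: bool(results.get(c, False)) for c in CRITERIA})
    row["passed"] = all(row[c] for c in CRITERIA)  # every criterion must pass
    return row

# Example: a run that did everything right except checking liquidity.
row = score_run("example-liquidity-task", "candidate-a",
                {c: True for c in CRITERIA} | {"liquidity_awareness": False})
print(json.dumps(row))
```

Keeping every criterion as an explicit boolean makes candidate-vs-candidate diffs trivial to compute from the stored rows.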
Agent benchmark shape
For an institutional comparison, store one JSON row per task.

Promotion gates

Do not promote a prompt, model, or strategy unless:

- `sf doctor --agent --deep --json` passes in the target environment.
- The candidate uses `sf describe --all --json` or `sf tools plan ... --json` rather than memorized command lists.
- The trace receipt contains no unapproved write/runtime/trade command.
- The run names concrete tickers and uses inspect/book before any execution proposal.
- Backtests or replay data have enough data points for the claim being made.
- The final output includes an approval packet for any side-effecting command.
- A human can reproduce the run from the trace and command log.
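The gates above can be enforced mechanically before any promotion. This is a sketch under stated assumptions: the gate names and the `run` record structure are invented for illustration; map them onto whatever your trace and doctor output actually contain.

```python
# Hypothetical run record; keys are assumptions, not sf output fields.
def promotion_blockers(run: dict) -> list[str]:
    """Return the names of any promotion gates the candidate run fails."""
    gates = {
        "doctor_passed": run.get("doctor_passed", False),
        "used_describe_or_plan": run.get("used_describe_or_plan", False),
        "no_unapproved_side_effects": not run.get("unapproved_commands", []),
        "grounded_in_inspect_or_book": run.get("inspected_market", False),
        "enough_data_points": run.get("data_points", 0) >= run.get("min_points", 100),
        "approval_packet_present": run.get("approval_packet", False),
        "reproducible_from_trace": run.get("trace_receipt", False),
    }
    return [name for name, ok in gates.items() if not ok]

run = {
    "doctor_passed": True,
    "used_describe_or_plan": True,
    "unapproved_commands": ["(an unapproved trade command)"],  # one violation
    "inspected_market": True,
    "data_points": 250,
    "approval_packet": True,
    "trace_receipt": True,
}
print(promotion_blockers(run))
```

Returning the list of failed gates, rather than a single boolean, gives reviewers the audit material the gates are meant to produce.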
Common failure modes
| Failure | Fix |
|---|---|
| Agent quotes stale web prose | Require `sf world --delta` and `sf inspect` in the packet. |
| Agent skips liquidity | Require `sf book` or inspect spread/depth fields before action. |
| Agent suggests direct order | Reject; ask for an intent or paper quote proposal. |
| Backtest has too few data points | Treat as inconclusive; widen ticker set or use replay traces. |
| Trace includes secrets | Stop. Rotate exposed secrets and redact before sharing. |
| stdout mixes protocol and progress | Keep stderr separate; parse only stdout JSON/NDJSON. |
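The last fix above can be sketched concretely. Assumptions: the child process here is a stand-in for an `sf ... --json` call (the real CLI is not invoked); the point is that protocol NDJSON is read only from the captured stdout while stderr progress text never enters the parser.

```python
import json
import subprocess
import sys

# Stand-in child process: writes one NDJSON protocol line to stdout
# and a human-readable progress message to stderr, like a well-behaved CLI.
child = (
    "import sys, json;"
    "print(json.dumps({'event': 'tick', 'price': 101.5}));"
    "print('fetching book...', file=sys.stderr)"
)
proc = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True, text=True, check=True,
)

# Parse only stdout; stderr is progress and is kept out of the protocol.
events = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
print(events)
```

If progress text were printed to stdout instead, `json.loads` would raise on the first non-JSON line, which is exactly the failure mode the table warns about.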
Related docs
Production agent runbook
Deployment, approval, trace, and recovery procedures.
Headless agent
`sf agent --plain` and `sf agent --headless` integration.

Real-time data cookbook
Data assembly and fallback paths.
Snapshots
Public datasets and snapshot methodology for reproducible research.