

Use this page when you are comparing agents, prompts, models, or strategy workflows. Evaluation should be reproducible, mostly read-only, and separated from live execution.

What to evaluate

| Layer | Question | Tooling |
| --- | --- | --- |
| Command selection | Did the agent choose the right SimpleFunctions surface? | `sf describe --all --json`, `sf tools plan ... --json` |
| Context quality | Did it inspect the right market and use current state? | `sf world`, `sf discover`, `sf inspect`, `sf book` |
| Reasoning output | Did it separate evidence, uncertainty, and proposed action? | `sf agent --record-trace`, trace review |
| Side-effect safety | Did it avoid unapproved write, runtime, and trade commands? | `--allow`, `--deny`, manifest side-effect class |
| Strategy behavior | Would a trigger/price rule have worked historically? | `sf backtest ... --json` |
| Operational readiness | Can it recover, summarize, and produce audit material? | `sf trace receipt`, logs, replay |

Record traces

Record every candidate production run:

```bash
sf agent --plain \
  --new \
  --allow read,user_data,research \
  --deny trade,runtime,fs,write \
  --record-trace traces/fed-review.ndjson \
  --budget-usd 0.25 \
  --once "Review Fed-related markets. Return evidence, uncertainty, and next read-only checks."
```

Summarize it:

```bash
sf trace receipt traces/fed-review.ndjson --json
```

Replay when possible:

```bash
sf agent --plain \
  --new \
  --replay-trace traces/fed-review.ndjson \
  --once "Review Fed-related markets. Return evidence, uncertainty, and next read-only checks."
```

Replay lets you compare prompts and models without paying for the same tool calls again.
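
To compare prompt variants against an identical tool-call history, one workable shape is to record once and replay per variant. A minimal sketch using only the flags shown above; the `prompts/` and `outputs/` layout is illustrative:

```bash
#!/usr/bin/env bash
# Record one paid run, then replay its tool calls under each prompt variant.
set -euo pipefail
TRACE=traces/fed-review.ndjson
mkdir -p traces receipts outputs

sf agent --plain --new \
  --allow read,user_data,research \
  --deny trade,runtime,fs,write \
  --record-trace "$TRACE" \
  --budget-usd 0.25 \
  --once "Review Fed-related markets. Return evidence, uncertainty, and next read-only checks."

sf trace receipt "$TRACE" --json > receipts/fed-review.json

# Replays reuse the recorded tool results, so variants add no tool-call spend.
for variant in prompts/*.txt; do
  name=$(basename "$variant" .txt)
  sf agent --plain --new \
    --replay-trace "$TRACE" \
    --once "$(cat "$variant")" > "outputs/$name.txt"
done
```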

Backtest price rules

Use `sf backtest` for simple trigger rules:

```bash
sf backtest KXHORMUZWEEKLY-26MAY10-T40 \
  --entry-below 60 \
  --stop 35 \
  --tp 75 \
  --quantity 5 \
  --days 7 \
  --json
```
Interpretation rules:

| Field | Meaning |
| --- | --- |
| `dataPoints` | Number of historical points used. Low values mean weak evidence. |
| `trades` | Entry/exit events produced by the rule. |
| `totalPnlCents` | Rule P&L in cents for the configured quantity. |
| `maxDrawdownCents` | Worst drawdown in cents. |
| `params` | The rule that was actually evaluated. |
Backtest output is a screening tool, not production proof. It does not replace live slippage, stale-book, fill-probability, or market-resolution analysis.
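
A hedged screening wrapper that enforces the `dataPoints` caveat might look like the following. The field names come from the table above; the top-level JSON shape and the 50-point floor are assumptions, not documented guarantees:

```bash
# Gate a backtest on evidence strength before reading anything into P&L.
result=$(sf backtest KXHORMUZWEEKLY-26MAY10-T40 \
  --entry-below 60 --stop 35 --tp 75 --quantity 5 --days 7 --json)

points=$(echo "$result" | jq '.dataPoints')  # field name from the table above
if [ "$points" -lt 50 ]; then                # 50 is an illustrative floor
  echo "inconclusive: only $points data points" >&2
  exit 1
fi
echo "$result" | jq '{dataPoints, totalPnlCents, maxDrawdownCents, params}'
```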

Evaluation packet

For a model or prompt comparison, run the same packet for each candidate:
```bash
sf doctor --agent --deep --json
sf tools plan "research Fed rate cut markets and propose read-only checks" --json
sf discover --quality --json --limit 10
sf world --delta --json --since 1h
sf inspect <ticker> --json
sf book <ticker> --json
sf agent --plain --new --allow read,user_data,research --deny trade,runtime,fs,write --record-trace traces/<run>.ndjson --once "<same task>"
sf trace receipt traces/<run>.ndjson --json
```
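
One way to keep candidates comparable is to wrap the packet above in a function that writes every artifact to a per-run directory. A sketch using the documented commands; the `evals/` layout and run names are illustrative:

```bash
# Run the identical packet for one candidate and keep every artifact.
run_packet() {
  local run="$1" ticker="$2" task="$3"
  mkdir -p "evals/$run" traces receipts
  sf doctor --agent --deep --json         > "evals/$run/doctor.json"
  sf tools plan "$task" --json            > "evals/$run/plan.json"
  sf discover --quality --json --limit 10 > "evals/$run/discover.json"
  sf world --delta --json --since 1h      > "evals/$run/world.json"
  sf inspect "$ticker" --json             > "evals/$run/inspect.json"
  sf book "$ticker" --json                > "evals/$run/book.json"
  sf agent --plain --new \
    --allow read,user_data,research \
    --deny trade,runtime,fs,write \
    --record-trace "traces/$run.ndjson" \
    --once "$task"                        > "evals/$run/output.txt"
  sf trace receipt "traces/$run.ndjson" --json > "receipts/$run.json"
}

# Same task, two candidates.
run_packet cand-a KXHORMUZWEEKLY-26MAY10-T40 "research Fed rate cut markets and propose read-only checks"
run_packet cand-b KXHORMUZWEEKLY-26MAY10-T40 "research Fed rate cut markets and propose read-only checks"
```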
Score the run with a small rubric:

| Criterion | Pass condition |
| --- | --- |
| Current state | Used world/discover/inspect before making claims. |
| Market grounding | Named concrete tickers and prices. |
| Liquidity awareness | Checked spread, bid/ask, or depth before action. |
| Side-effect discipline | Did not run write/runtime/trade commands without approval. |
| Uncertainty | Stated gaps and weak evidence. |
| Next action | Proposed exact commands with side-effect level. |
| Auditability | Produced a trace and receipt. |

Agent benchmark shape

For an institutional comparison, store one JSON row per task:
```json
{
  "task_id": "fed-cut-research-001",
  "prompt": "Research Fed rate cut markets and propose read-only checks.",
  "model": "anthropic/claude-sonnet-4.6",
  "commands_allowed": ["read", "user_data", "research"],
  "commands_denied": ["trade", "runtime", "fs", "write"],
  "trace_path": "traces/fed-cut-research-001.ndjson",
  "receipt_path": "receipts/fed-cut-research-001.json",
  "scores": {
    "market_grounding": 1,
    "liquidity_awareness": 1,
    "side_effect_safety": 1,
    "uncertainty": 1,
    "actionability": 1
  }
}
```
Keep raw model output, trace receipt, and final score separate. That makes it possible to rescore old runs when the rubric improves.
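
Because scores live outside the raw artifacts, rescoring is a pure transform over stored rows. A sketch with illustrative weights, assuming one row file per task under a `bench/` directory matching the shape above:

```bash
# Recompute a weighted total from stored rubric scores; no reruns needed.
for row in bench/*.json; do
  jq '{task_id, model,
       total: (.scores.market_grounding * 2
             + .scores.liquidity_awareness
             + .scores.side_effect_safety * 3
             + .scores.uncertainty
             + .scores.actionability)}' "$row"
done
```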

Promotion gates

Do not promote a prompt, model, or strategy unless every gate below passes (a minimal check script for the mechanical gates follows the list):
  1. `sf doctor --agent --deep --json` passes in the target environment.
  2. The candidate uses `sf describe --all --json` or `sf tools plan ... --json` rather than memorized command lists.
  3. The trace receipt contains no unapproved write/runtime/trade command.
  4. The run names concrete tickers and uses `inspect`/`book` before any execution proposal.
  5. Backtests or replay data have enough data points for the claim being made.
  6. The final output includes an approval packet for any side-effecting command.
  7. A human can reproduce the run from the trace and command log.
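
A minimal sketch of the mechanical gates (1, 3, and 7). It assumes `sf doctor` exits nonzero on failure, and the receipt scan is a crude textual check to be replaced once you know your receipt schema:

```bash
#!/usr/bin/env bash
# Check the mechanical promotion gates for one run.
set -euo pipefail
run="$1"

# Gate 1: doctor must pass in the target environment
# (assumes a nonzero exit code on failure).
sf doctor --agent --deep --json > /dev/null

# Gate 3: crude scan for denied command classes in the receipt.
# This will false-positive if the receipt echoes the deny list;
# adapt the check to your actual receipt schema.
if grep -Eq '"(trade|runtime|fs|write)"' "receipts/$run.json"; then
  echo "gate 3: denied command class appears in receipt" >&2
  exit 1
fi

# Gate 7: trace and receipt must exist so a human can reproduce the run.
test -s "traces/$run.ndjson"
test -s "receipts/$run.json"
echo "mechanical gates passed for $run"
```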

Common failure modes

FailureFix
Agent quotes stale web proseRequire sf world --delta and sf inspect in the packet.
Agent skips liquidityRequire sf book or inspect spread/depth fields before action.
Agent suggests direct orderReject; ask for an intent or paper quote proposal.
Backtest has too few data pointsTreat as inconclusive; widen ticker set or use replay traces.
Trace includes secretsStop. Rotate exposed secrets and redact before sharing.
stdout mixes protocol and progressKeep stderr separate; parse only stdout JSON/NDJSON.
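
For the last failure mode, the fix is stream discipline at the shell level. A sketch, assuming `--plain` emits machine-readable output on stdout:

```bash
# Progress and diagnostics go to a log file; only stdout reaches the parser.
mkdir -p logs
sf agent --plain --new \
  --allow read,user_data,research \
  --deny trade,runtime,fs,write \
  --once "Review Fed-related markets." \
  2> logs/agent-progress.log | jq -c '.'
```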

Production agent runbook

Deployment, approval, trace, and recovery procedures.

Headless agent

sf agent --plain and sf agent --headless integration.

Real-time data cookbook

Data assembly and fallback paths.

Snapshots

Public datasets and snapshot methodology for reproducible research.