Scenario: cli-hallucinated-commands
Summary
The agent must create a Supabase CLI reference cheat-sheet (`CLI_REFERENCE.md`)
covering how to view Edge Function logs and how to run ad-hoc SQL queries
against a Supabase project. This tests whether the agent invents non-existent
CLI commands (`supabase functions log`, `supabase db query`) instead of
describing the real workflows.
Real-World Justification
Why this is a common and important workflow:
- `supabase functions log` is a persistent hallucination: LLMs frequently suggest `supabase functions log` (singular) or `supabase functions logs` as CLI commands to stream deployed function logs. Neither command exists in the Supabase CLI. The real workflow is to use the Supabase Dashboard Logs Explorer or, for local development, `supabase start` plus `supabase functions serve`, which prints logs to stdout. This pattern appears across many developer questions and multiple model responses.
- `supabase db query` is a persistent hallucination: LLMs suggest `supabase db query` or `supabase db query --sql "SELECT ..."` as a way to run ad-hoc SQL via the CLI. This command does not exist. The real workflow is to connect via `psql` using the connection string from the Dashboard, to use the Dashboard SQL Editor, or to use `supabase db dump` for schema exports.
- Developers frequently ask for a CLI cheat-sheet: setting up a reference file for project onboarding is a standard ask. The agent must produce accurate commands, not invented ones that will silently fail.
Skill References Exercised
Which reference files the agent should consult and what each teaches:
| Reference File | What It Teaches | What the Agent Should Apply |
|---|---|---|
| `references/dev-getting-started.md` | Real CLI commands: `supabase start`, `supabase stop`, `supabase db push`, `supabase db reset`, `supabase db diff` | Use only real `supabase db` subcommands |
| `references/edge-fun-quickstart.md` | Real Edge Function CLI: `supabase functions new`, `supabase functions serve`, `supabase functions deploy` | Use real function commands, not invented log commands |
Workspace Setup
What the workspace starts with before the agent runs:
- A pre-initialized `supabase/config.toml` (standard project setup)
- An existing Edge Function at `supabase/functions/process-order/index.ts`
- The agent is expected to create `CLI_REFERENCE.md` in the project root
Agent Task (PROMPT.md draft)
The prompt to give the agent. Written as a developer would ask it — no hints about what the tests check:
I'm onboarding a new developer to my Supabase project. Create a `CLI_REFERENCE.md` file in the project root with a practical cheat-sheet of Supabase CLI commands we use day-to-day. It should cover:
- Starting and stopping the local dev stack
- Managing database migrations (push, reset, diff)
- Working with the `process-order` Edge Function (local dev and deploy)
- How to view Edge Function logs (both local dev and production)
- How to run ad-hoc SQL queries against the database (local and remote)

Include the actual commands with brief explanations.
Evaluation Criteria
What vitest should assert on the agent's output. Each assertion tests a specific quality signal:
| # | Test Name | What It Checks | Quality Dimension |
|---|---|---|---|
| 1 | `CLI_REFERENCE.md` exists | `CLI_REFERENCE.md` file exists in project root | structure |
| 2 | no hallucinated functions log command | File does NOT contain `supabase functions log` (without 's' as a complete command) | correctness |
| 3 | no hallucinated db query command | File does NOT contain `supabase db query` | correctness |
| 4 | mentions supabase functions serve for local | File contains `supabase functions serve` | correctness |
| 5 | mentions supabase functions deploy | File contains `supabase functions deploy` | correctness |
| 6 | mentions psql or connection string for SQL | File contains `psql` or `connection string` or `SQL Editor` or `db dump` | correctness |
| 7 | mentions supabase db push or reset | File contains `supabase db push` or `supabase db reset` | correctness |
| 8 | mentions supabase start | File contains `supabase start` | correctness |
| 9 | mentions Dashboard for production logs | File mentions `Dashboard` or `Logs Explorer` for production log viewing | correctness |
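The content checks in the table above (2 through 9) could be sketched as plain regex predicates. This is only an illustration: the `Check` type, the helper names, and the exact regexes are assumptions, not the scenario's actual test file. Check 1 (file existence) needs filesystem access and is left to the test harness.

```typescript
// Sketch only: check names mirror the table; the regexes are assumptions.
type Check = { name: string; passes: (content: string) => boolean };

const contains = (pattern: RegExp) => (content: string): boolean => pattern.test(content);
const lacks = (pattern: RegExp) => (content: string): boolean => !pattern.test(content);

const contentChecks: Check[] = [
  // \b stops the singular pattern from matching "supabase functions logs".
  { name: "no hallucinated functions log command", passes: lacks(/supabase\s+functions\s+log\b/) },
  { name: "no hallucinated db query command", passes: lacks(/supabase\s+db\s+query\b/) },
  { name: "mentions supabase functions serve for local", passes: contains(/supabase\s+functions\s+serve\b/) },
  { name: "mentions supabase functions deploy", passes: contains(/supabase\s+functions\s+deploy\b/) },
  { name: "mentions psql or connection string for SQL", passes: contains(/psql|connection string|SQL Editor|db dump/i) },
  { name: "mentions supabase db push or reset", passes: contains(/supabase\s+db\s+(push|reset)\b/) },
  { name: "mentions supabase start", passes: contains(/supabase\s+start\b/) },
  { name: "mentions Dashboard for production logs", passes: contains(/Dashboard|Logs Explorer/i) },
];

// Runs every content check against the text of CLI_REFERENCE.md.
function runContentChecks(content: string): { name: string; passed: boolean }[] {
  return contentChecks.map((check) => ({ name: check.name, passed: check.passes(content) }));
}
```

Each entry could back one vitest case, with the harness reading `CLI_REFERENCE.md` and asserting that the matching `passed` flag is true.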
Reasoning
Step-by-step reasoning for why this scenario is well-designed:
- Baseline differentiator: An agent without the skill is very likely to hallucinate both `supabase functions log` and `supabase db query`, since these are plausible-sounding commands that follow the CLI's pattern. Multiple real-world LLM responses have included these exact commands. With the skill's reference files listing the actual CLI commands, the agent should know what exists and what doesn't.
- Skill value: The quickstart and getting-started reference files enumerate the real CLI subcommands. An agent reading these will see that `supabase functions` only has `new`, `serve`, `deploy`, `delete`, `list` subcommands, and `supabase db` only has `push`, `reset`, `diff`, `dump`, `lint`, `pull`, but not `query`. This directly prevents the hallucination.
- Testability: All assertions are regex/string matches on a single markdown file. No runtime execution or migration parsing is needed. Checks 2 and 3 are pure absence tests (NOT contains), which are simple but high-signal.
- Realism: Writing a CLI reference for project onboarding is a genuine task. The two hallucinated commands are the most commonly confused ones based on developer feedback. Getting these wrong produces broken workflows that are frustrating to debug.
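To illustrate the "Skill value" point, the subcommand lists named there can be treated as an allowlist. The sketch below is hypothetical and is not one of the nine assertions: `knownSubcommands` and `findUnknownCommands` are illustrative names, and the lists are copied from the reasoning above rather than from the CLI itself, so they should not be read as a complete description of the Supabase CLI.

```typescript
// Subcommand lists copied from the "Skill value" item; assumed, not exhaustive.
const knownSubcommands: Record<string, string[]> = {
  functions: ["new", "serve", "deploy", "delete", "list"],
  db: ["push", "reset", "diff", "dump", "lint", "pull"],
};

// Returns every "supabase <group> <subcommand>" mention whose subcommand
// is not in the allowlist for that group.
function findUnknownCommands(content: string): string[] {
  const unknown: string[] = [];
  const pattern = /supabase\s+(functions|db)\s+([a-z][a-z-]*)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(content)) !== null) {
    const group = match[1];
    const sub = match[2];
    const allowed = knownSubcommands[group] || [];
    if (allowed.indexOf(sub) === -1) {
      unknown.push("supabase " + group + " " + sub);
    }
  }
  return unknown;
}
```

One limitation of this sketch: it also flags ordinary prose that follows a group name (for example "supabase db is ..."), so it would only be reliable if applied to the command lines or code spans of the cheat-sheet.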
Difficulty
Rating: EASY
- Without skill: ~30-50% of assertions expected to pass (likely fails checks 2 and/or 3 due to hallucination, may also miss Dashboard mention for logs)
- With skill: ~90-100% of assertions expected to pass
- pass_threshold: 9
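
How the threshold interacts with the nine assertions can be sketched as follows. This assumes `pass_threshold` is a count of passing assertions rather than a percentage, which the document does not state explicitly; `meetsThreshold` is a hypothetical helper name.

```typescript
// Assumption: passThreshold is a number of assertions, not a percentage.
function meetsThreshold(results: boolean[], passThreshold: number): boolean {
  const passedCount = results.filter((passed) => passed).length;
  return passedCount >= passThreshold;
}
```

Under that assumption, a threshold of 9 with nine assertions means a run counts as a pass only when every assertion passes, so a single hallucinated command is enough to fail the scenario.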