Case Study — 01

SkillCheck

A/B testing for agent skills. It runs controlled experiments to prove whether your SKILL.md genuinely improves the model's output, or is simply a placebo you pay tokens for on every call.

Year 2026
Role Creator & Developer
Stack TypeScript, Node.js, CLI
skillcheck.page

Overview

Skills ship on intuition and cost tokens on every call, yet nobody actually measures whether they help. SkillCheck settles the question with evidence. It generates fresh evaluation tasks from a skill's declared domain, then runs each one twice (once with the skill injected, once without) before returning a single verdict: HELPS, PLACEBO, or HARMS.
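To make the paired design concrete, here is a minimal sketch of how each task could expand into two runs that differ only in whether the skill is present. The names (`planRuns`, `RunJob`, `Arm`) are hypothetical, and injecting the skill through the system prompt is an assumption for illustration, not a description of SkillCheck's internals.

```typescript
// Hypothetical sketch: these types and names are illustrative, not SkillCheck's API.
type Arm = "with_skill" | "without_skill";

interface RunJob {
  taskId: number;       // index of the task, shared by both arms of a pair
  arm: Arm;             // which side of the experiment this run belongs to
  systemPrompt: string; // assumption: the skill is injected via the system prompt
  userPrompt: string;   // the evaluation task itself, identical in both arms
}

// Expand every task into a treatment run and a control run.
function planRuns(tasks: string[], skillText: string): RunJob[] {
  const jobs: RunJob[] = [];
  tasks.forEach((task, taskId) => {
    jobs.push({ taskId, arm: "with_skill", systemPrompt: skillText, userPrompt: task });
    jobs.push({ taskId, arm: "without_skill", systemPrompt: "", userPrompt: task });
  });
  return jobs;
}
```

Because both arms share the same `taskId` and `userPrompt`, any score difference within a pair can be attributed to the skill rather than to the task.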

Built in TypeScript on Node.js, it grades every output blind with a separate model, so the scorer never knows which arm produced what. It then draws 1,000 paired bootstrap resamples to place a 95% confidence interval around the effect size. Each run records the skill hash, task suite, model versions, and transcript hashes, so results stay fully reproducible and can be re-run against new model releases to catch silent "skill rot".
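The paired bootstrap described above can be sketched roughly as follows. This is an illustration under stated assumptions, not SkillCheck's actual code: the function and type names are hypothetical, the interval uses the simple percentile method, the seeded generator is one common choice for repeatable resampling, and the rule that maps the interval to a verdict (HELPS when the interval lies entirely above zero, HARMS when entirely below, PLACEBO otherwise) is an assumed decision rule.

```typescript
// Hypothetical sketch: names and the verdict rule are assumptions, not SkillCheck's API.
type Verdict = "HELPS" | "PLACEBO" | "HARMS";

interface BootstrapResult {
  meanDiff: number; // observed mean of (with skill - without skill)
  lower: number;    // 2.5th percentile of resampled means
  upper: number;    // 97.5th percentile of resampled means
  verdict: Verdict;
}

// Small seeded PRNG (mulberry32) so the same seed yields the same resamples.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedBootstrap(
  withSkill: number[],
  withoutSkill: number[],
  resamples = 1000,
  seed = 42,
): BootstrapResult {
  const n = withSkill.length;
  if (n === 0 || n !== withoutSkill.length) {
    throw new Error("score arrays must be non-empty and the same length");
  }
  if (resamples < 1) throw new Error("resamples must be at least 1");

  // Work on per-task differences so each pair stays together when resampled.
  const diffs = withSkill.map((score, i) => score - withoutSkill[i]!);
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / n;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      // Draw a task index with replacement.
      sum += diffs[Math.floor(rand() * n)]!;
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile interval: cut 2.5% off each tail of the sorted resampled means.
  const lower = means[Math.floor(0.025 * resamples)]!;
  const upper = means[Math.max(0, Math.ceil(0.975 * resamples) - 1)]!;

  // Assumed rule: a verdict other than PLACEBO requires the interval to exclude zero.
  const verdict: Verdict = lower > 0 ? "HELPS" : upper < 0 ? "HARMS" : "PLACEBO";
  return { meanDiff, lower, upper, verdict };
}
```

Resampling the differences, rather than each arm separately, keeps task difficulty from inflating the variance: a hard task drags both arms down together, and the pairing cancels that out.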

It works with Claude Code, Codex, Gemini CLI, and Cursor skills, ships with both an interactive file picker and a scripted JSON mode, and proxies inference through a metered API so no provider key ever has to live on your machine. Run npm install -g @sx4im/skillcheck and you're testing.
