Most apps are built once and stay frozen. What if yours got better every time you used it — automatically, without code changes, without retraining a model?
Claude Code, Codex, and other AI harness systems are no longer useful only for developers. They are becoming important for any kind of work that depends on human judgment.
Research, document review, lead qualification, onboarding, reporting, compliance, and creative work. Any process where a person has to read something, make a decision, and take the next step is a good fit for agentic AI.
Skills make this even more powerful. Skills are sets of instructions that teach an AI how to perform a specific task or operate within a specific industry. They help turn a general AI model into something more practical, consistent, and reliable.
In simple terms, Skills are like procedures or playbooks for AI. They show it how to handle real work, in real companies, at real scale.
But there is still a big problem that none of these tools has fully solved yet. Not everything can be done in a chat window or a terminal. Most processes require a better interface and more structure. The AI is ready. The shell it lives in is holding it back.
Not one or the other. A proper native or web app — with your own UI, your own workflows, your own design — running Claude Code in the background as its brain. The intelligence of the harness, the experience of a real product.
Every time you correct an AI decision, that correction is saved. Next run, the AI reads its own history and uses it as examples. No retraining. No data science team. The app just gets better as you use it.
Update the AI's instructions, rules, and reasoning without touching a line of app code. Ship improvements to how Claude thinks daily — while the app itself stays stable and unchanged.
Self-Evolving Apps are web or native applications built on top of Claude Code or Codex — where the AI reasoning layer lives completely outside the app shell, and improves automatically over time.
The app handles the interface, the data, and the user experience. Claude handles the thinking, the matching, the decisions. They communicate through a shared folder — a structured contract that neither side breaks.
Every user correction feeds back as a future example. Every skill update takes effect immediately. The app you ship today is smarter than the one you shipped last week — without a new release.
Any process that involves reading, deciding, and acting is a candidate. Here are examples across common business functions — but if your process requires judgment, it fits.
Reads deal context and generates tailored quotes with correct pricing, conditions, and terms.
Reviews CRM activity and drafts personalised follow-up sequences based on deal stage and history.
Assembles full sales proposals from a brief — scope, pricing, timeline, and differentiation.
Analyses deal pipeline data and flags at-risk opportunities with recommended next actions.
Creates first-draft contracts from a brief — NDA, MSA, SoW — using company-approved templates and language.
Reads incoming contracts, highlights non-standard clauses, flags risk, and proposes redlines.
Checks NDAs against a policy checklist and returns a pass/flag/reject with reasoning.
Reviews internal documents or processes against regulatory requirements and outputs a gap report.
Reads project updates, status logs, and timelines to produce a concise health summary with risk flags.
Processes time logs and surfaces utilisation patterns, budget overruns, and team allocation issues.
Maps an existing workflow against a standard operating procedure and identifies gaps or inefficiencies.
Analyses team capacity and workload data to recommend project staffing adjustments.
Converts completed project data into formatted invoices with correct line items, taxes, and payment terms.
Compares transaction records across systems, flags discrepancies, and produces a reconciliation report.
Reads receipts and categorises expenses against policy, flagging out-of-policy items before submission.
Pulls financial data and generates a commentary-style variance report ready for leadership review.
Reads applications against a job brief and scores each candidate with a structured shortlist rationale.
Guides new hires through documentation, policy reading, and task completion with AI-assisted Q&A.
Reads self-assessments, manager notes, and goal data to draft structured review summaries.
Turns a role brief into a polished, inclusive job description aligned to company tone and level standards.
Takes a topic and audience brief and produces a structured content brief with angle, outline, and key messages.
Reads campaign data and generates a narrative performance report with insights and next-step recommendations.
Classifies incoming support tickets by urgency, topic, and required skill, then routes or drafts a first response.
Answers customer questions using company documentation, escalating automatically when confidence is low.
Don't see your use case? If your process involves reading information, applying judgment, and producing an output — it can be built as a self-evolving app.
Claude Code, Gemini CLI, Codex, and Copilot CLI are the most capable AI harness systems available today. They're not the same product — but they share the same fundamental shift: the AI doesn't just generate text, it acts.
It really depends on your company policies and existing licenses. The good news: self-evolving apps work with any of them. The harness is swappable — your skills, memory, and architecture stay the same.
| Feature | Claude Code | Codex CLI | Gemini CLI | Copilot CLI |
|---|---|---|---|---|
| Skills | ✅ Markdown skill files | ✅ AGENTS.md + custom commands | ✅ Agent Skills (.md) | ✅ Shared with cloud agent & VS Code |
| Subagents | ✅ Isolated context, custom prompts & tools | ✅ Roles via config.toml + git worktrees | ✅ Custom agents in .gemini/agents/ | ✅ Built-in + custom .agent.md files |
| Parallel Agents | ✅ Agent Teams with direct messaging | ✅ Parallel worktrees + Agents SDK | ⚠️ Experimental | ✅ /fleet + multiple sessions |
| MCP Support | ✅ Native (stdio, SSE) | ✅ stdio + streaming HTTP; can act as MCP server | ✅ Native (stdio, http, sse) | ✅ GitHub MCP built-in + custom |
| Headless Run | ✅ -p flag | ✅ codex exec (dedicated mode) | ✅ gemini -p "prompt" | ✅ -p / --prompt flag |
| Streaming / JSON | ✅ JSON + stdout streaming | ✅ JSONL stream + --output-schema | ✅ --output-format stream-json | ✅ --output-format=json JSONL |
| Open Source | ❌ No | ✅ Apache 2.0 (Rust) | ✅ Apache 2.0 | ❌ No |
| Multi-Model | ❌ Anthropic only | ⚠️ OpenAI only (+ local Ollama) | ❌ Google only | ✅ Anthropic + OpenAI + Google |
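Because the harness is swappable, it helps to keep the app's only knowledge of the CLI behind a thin adapter. The sketch below illustrates one way to do that in TypeScript; the interface and names are hypothetical, and the flags come from the comparison table above (verify them against your installed CLI versions).

```typescript
// Illustrative sketch: a minimal adapter so the app never hardcodes
// one vendor's CLI. All type and variable names here are hypothetical.
interface HarnessAdapter {
  binary: string;
  // Build the argv for a single headless run over the shared workdir.
  headlessArgs(prompt: string): string[];
}

const claudeCode: HarnessAdapter = {
  binary: "claude",
  // "-p" per the table; output flag is an assumption to verify locally.
  headlessArgs: (prompt) => ["-p", prompt, "--output-format", "stream-json"],
};

const codexCli: HarnessAdapter = {
  binary: "codex",
  // "codex exec" is the dedicated headless mode per the table.
  headlessArgs: (prompt) => ["exec", prompt],
};

// The app picks an adapter at startup; skills, memory, and the shared
// folder contract stay the same regardless of which harness runs.
function buildCommand(h: HarnessAdapter, prompt: string): string[] {
  return [h.binary, ...h.headlessArgs(prompt)];
}
```

Swapping Claude Code for Codex CLI then touches one adapter object, not the app.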
The app manages the UI, prepares data files, stores corrections, and renders results. It has no embedded intelligence — it is deliberately dumb. Typical stack: Node.js + Express (web) or Swift + SwiftUI (macOS).
The only shared space between the app and Claude. The app prepares it before each run. Claude walks in, reads everything, and leaves a structured answer. Neither side knows about the other's implementation — they only share this folder.
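A minimal sketch of how the app side might honor that contract before each run. The folder layout in the comment is an assumption based on the file names used in this article (`corrections.json`, `result.json`); adjust for your own contract.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed shared-folder layout (illustrative, not a fixed standard):
//   workdir/
//     input/            <- app writes the files Claude should read
//     corrections.json  <- past corrections, appended by the app
//     result.json       <- Claude's structured answer, read by the app
function prepareWorkdir(workdir: string, inputs: Record<string, string>): void {
  fs.mkdirSync(path.join(workdir, "input"), { recursive: true });
  for (const [name, content] of Object.entries(inputs)) {
    fs.writeFileSync(path.join(workdir, "input", name), content);
  }
  // Remove any stale result so the app never reads a previous run's answer.
  fs.rmSync(path.join(workdir, "result.json"), { force: true });
}
```

Neither side imports the other's code; the folder is the whole interface.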
Spawned by the app per session or kept alive between messages. Reads CLAUDE.md for instructions, credentials from .env, corrections.json for past examples. Streams thinking, tool calls, and text deltas back to the app. Writes structured output to result.json.
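Here is one way the spawn-per-message pattern could look from a Node.js app. The CLI flags mirror the comparison table and are assumptions to verify against your installed version; the line handler is split out so it can be tested without the binary present.

```typescript
import { spawn } from "node:child_process";
import * as readline from "node:readline";

// Parse one JSONL line from the harness and forward it; malformed or
// empty lines are ignored. Split out so it is testable without spawning.
function handleLine(line: string, onEvent: (e: unknown) => void): void {
  if (!line.trim()) return;
  try {
    onEvent(JSON.parse(line));
  } catch {
    // Ignore partial or non-JSON lines rather than crash the stream.
  }
}

// Sketch: one headless run per message, streaming events back to the app.
// Flags are assumptions based on the table above; verify locally.
function runClaude(
  workdir: string,
  prompt: string,
  onEvent: (e: unknown) => void
): Promise<number> {
  return new Promise((resolve, reject) => {
    const child = spawn("claude", ["-p", prompt, "--output-format", "stream-json"], {
      cwd: workdir, // Claude sees the shared folder as its working directory
    });
    const rl = readline.createInterface({ input: child.stdout });
    rl.on("line", (line) => handleLine(line, onEvent));
    child.on("error", reject);
    child.on("close", (code) => resolve(code ?? -1));
  });
}
```

Setting `cwd` to the shared folder is what lets Claude find `CLAUDE.md`, `.env`, and `corrections.json` without the app passing paths around.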
A folder containing SKILL.md (instructions), Python scripts (preprocessing), and reference documents (domain knowledge, matching rules). Symlinked into four locations so Claude finds it from any context. Update the skill — the next run is smarter. The app never changes.
The app passes input files to Claude. Claude reads the skill instructions, loads domain knowledge, and produces a structured result — streamed live to the UI.
The app parses result.json and shows Claude's decisions in a review interface. Most answers are correct. A few need correction.
You change the wrong answer to the right one. The app saves the correction: which signals were present, what Claude thought, and what the correct answer was.
corrections.json grows. The skill instructs Claude: "Read this file before reasoning. If you see similar signals, use these past answers as authoritative examples."
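The correction loop above can be sketched in a few lines. The record shape is illustrative (the article names the ingredients: signals present, what Claude thought, the correct answer) rather than a fixed schema.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical shape of one correction record; field names are
// illustrative, not a schema mandated by any harness.
interface Correction {
  signals: string[];     // what was present in the input
  aiAnswer: string;      // what Claude originally decided
  correctAnswer: string; // what the reviewer changed it to
  correctedAt: string;   // ISO timestamp
}

// Append a correction so the next run can read it as an example.
function saveCorrection(workdir: string, c: Correction): void {
  const file = path.join(workdir, "corrections.json");
  const existing: Correction[] = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, "utf8"))
    : [];
  existing.push(c);
  fs.writeFileSync(file, JSON.stringify(existing, null, 2));
}
```

The skill only needs one standing instruction pointing at this file; the app never has to explain the casebook to Claude per run.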
No model retraining. No data science. Just a casebook that grows with every session — and an AI that reads it before every decision.
For builders who want to understand the implementation. Every pattern is production-tested in a real daily-use application.
Claude streams events line by line. Web apps read via HTTP chunked fetch. Native apps use Swift AsyncStream<ClaudeEvent> — a typed enum covering thinking, toolUse, textDelta, done, and tokenUsage. Every event is rendered live.
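For web apps, the same "no JSON in the view layer" discipline can be expressed as a discriminated union, a TypeScript analogue of the Swift enum. The event kinds mirror the ones listed above; the raw wire field names are assumptions.

```typescript
// Typed events so the UI never touches raw JSON. Kinds mirror the
// Swift ClaudeEvent enum described above; wire fields are assumed.
type ClaudeEvent =
  | { kind: "thinking"; text: string }
  | { kind: "toolUse"; tool: string }
  | { kind: "textDelta"; text: string }
  | { kind: "tokenUsage"; inputTokens: number; outputTokens: number; costUsd: number }
  | { kind: "done" };

// Map one parsed JSONL object into a typed event, or null if unrecognised.
function toEvent(raw: any): ClaudeEvent | null {
  switch (raw?.type) {
    case "thinking":   return { kind: "thinking", text: raw.text ?? "" };
    case "tool_use":   return { kind: "toolUse", tool: raw.name ?? "" };
    case "text_delta": return { kind: "textDelta", text: raw.text ?? "" };
    case "usage":      return {
      kind: "tokenUsage",
      inputTokens: raw.input ?? 0,
      outputTokens: raw.output ?? 0,
      costUsd: raw.cost ?? 0,
    };
    case "done":       return { kind: "done" };
    default:           return null; // unknown event types are dropped, not rendered
  }
}
```

Unknown event types returning `null` means a harness upgrade that adds new events degrades gracefully instead of breaking the UI.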
All API keys live in references/.env inside the working directory. The app reads and writes this file. Claude reads it directly when calling external APIs. No hardcoded secrets. No app-specific credential stores.
Every skill is symlinked into ~/.claude/skills/, ~/.agents/skills/, ~/workdir/.claude/skills/, and ~/workdir/.agents/skills/. All four point to the same real directory. Update once, all contexts update instantly.
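The fan-out can be automated with a few lines of Node. A sketch, assuming the four target paths named above; resolving the real directory to an absolute path keeps the links valid regardless of where they live.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Link one real skill directory into every location a harness might
// resolve skills from. Targets are the four paths named above.
function linkSkill(realSkillDir: string, targets: string[]): void {
  const real = path.resolve(realSkillDir); // absolute, so links work from anywhere
  for (const target of targets) {
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.rmSync(target, { recursive: true, force: true }); // replace stale copies
    fs.symlinkSync(real, target, "dir");
  }
}
```

Because all targets point at the same inode, editing `SKILL.md` once updates every context with no sync step.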
Every run captures input tokens, output tokens, cache read tokens, and total cost in USD from Claude's result event. Displayed after every session. Users always know what processing costs.
Claude never returns freeform text as primary output. Every run ends with a structured JSON envelope: {"message":"...","results":[...]}. The schema is the only hard coupling between app and skill. Change the skill freely — keep the schema stable.
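Since the envelope is the only hard coupling, a small runtime check on the app side catches a skill update that breaks it. A sketch against the `{"message":"...","results":[...]}` shape shown above; extend the checks to match your full schema.

```typescript
// The only contract between app and skill: a stable envelope.
interface Envelope {
  message: string;
  results: unknown[];
}

// Validate result.json at the boundary so a bad skill update fails
// loudly here instead of corrupting the review UI downstream.
function parseEnvelope(raw: string): Envelope {
  const data = JSON.parse(raw);
  if (typeof data?.message !== "string" || !Array.isArray(data?.results)) {
    throw new Error("result.json violates the app/skill contract");
  }
  return data as Envelope;
}
```

This keeps the promise in the text literal: the skill can change freely, and only a schema break is ever fatal.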
Native apps keep one Claude process alive per session via stdin/stdout — eliminating per-message startup delay. Web apps spawn per message. Both patterns supported. Both share the same working directory and skill architecture.
skills/ dir, symlinked into workdir
--session-id / -r flags
AsyncStream<ClaudeEvent> enum — no JSON in view layer
SkillManager
EnvStore — reads references/.env

Not all authentication methods are created equal. Whether you're building for yourself, your team, or external customers — the rules differ significantly across Claude Code, Codex CLI, Gemini CLI, and Copilot CLI. Understanding this upfront saves you from costly architecture decisions later.
Covers personal use, VPS deployment, CI/CD, centralized servers, and commercial product scenarios — with a full breakdown of what's allowed per auth method and vendor.
View Licensing Details →

We're sharing the architecture, the patterns, and the lessons from building in production. If you're building AI tools that need to go beyond the chat window — let's talk.
The full headless app creator skill — production-tested patterns, architecture decisions, and ready-to-use code — packaged and ready to drop into your own Claude Code setup.