AI Agents Need Sandboxes, Not Permissions
AI agents like Claude Code and Codex try to be “safe” by limiting what the agent can do to a small set of operations and asking the user to approve everything else. In practice, the approval flow is so noisy and the requests so complex that it provides almost no real protection. We’ve reinvented Windows Vista’s never-ending UAC dialogs, and just like Vista, the only thing it actually trains users to do is click through.
The permission model is fundamentally broken. The fix isn’t a better dialog — it’s a sandbox.
The Approval Treadmill
When the agent edits a file, the permission UI can show you the exact diff. Approving that is fine. The problem is that most of what an agent actually wants to do isn’t a clean file edit. It’s a multi-step bash invocation chaining find into xargs into sed, or an inline Python script, or a jq pipeline that the UI truncates after the first couple of hundred characters.
Reviewing those properly takes real time. Reading a 30-line shell pipeline, working out what each step does, checking that nothing is being deleted that shouldn’t be, tracing where stdout is going to end up — that’s not a click, that’s a code review. And the agent fires off another one twenty seconds later.
There are two outcomes. Either you do the careful review every time and spend more time vetting the agent than the agent saves you, or you stop reading and click yes. In practice everyone ends up at “click yes” within a couple of hours of real use (if that). That’s exactly where Vista left users — except now the thing on the other side of the dialog is actively generating novel commands you’ve never seen before.
Sandboxes Over Approvals
The alternative is to give the agent access to a safe but sufficient set of resources up front, and then let it work freely inside that boundary. No prompts, no review of every command — this is the box, do whatever you need inside it.
claude-docker is one way to do this — run the agent in a container so the side effects are contained in a layer you can throw away. There are plenty of other options, many of them more polished and with much finer-grained control over what the agent can reach. The built-in sandbox features in agent executors could work just as well, but they all seem to default to “ask permission” rather than “deny”, and the sandbox they create is usually too small to actually get work done. So you end up turning the protections off or whitelisting things case by case, and you’re back where you started.
The bar to set is: everything the agent has access to should be safe enough that you’d happily run it under --dangerously-skip-permissions without a second thought. And the sandbox needs to be big enough that the agent rarely runs out of room to operate. When it does hit the wall, doing the missing step yourself outside the sandbox should be fine — because the alternative, loosening the sandbox so it doesn’t have to ask, defeats the whole point.
Hooks Aren’t a Security Boundary
A tempting middle ground is to use hooks to inspect the bash commands the agent is about to run and reject the dangerous ones. This is a hopelessly insecure approach, because there are too many ways to write the same operation: git push, git -c remote.origin.pushurl=… push, git push origin HEAD:main, a shell variable expansion that hides the word entirely, a Python script that calls the GitHub API directly, a local helper invoked through bash -c. A hook that searches for git push will miss most of these, and a malicious agent — or a benign agent that has been prompt-injected — will find a route through.
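To make this concrete, here is a toy version of such a hook (a hypothetical filter that greps the command text for "git push") alongside a few rewrites that sail straight past it:

```shell
# Hypothetical naive hook: reject any command containing the literal
# string "git push". This is a sketch, not a real Claude Code hook.
blocked() { printf '%s' "$1" | grep -q 'git push'; }

blocked 'git push origin main'           && echo 'caught: literal push'
# Trivial rewrites the pattern never sees:
blocked 'cmd=push; git $cmd origin main' || echo 'missed: variable expansion'
blocked 'git "pu""sh" origin main'       || echo 'missed: quoted word splitting'
```

Each "missed" line is a command that still pushes, but the filter waves it through. Any enumeration of patterns loses this game; the boundary has to live below the command line.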
Hooks are still useful, just not as security. If you want the agent to be able to push but want a confirmation step on the way out, a hook will catch the common case and that’s a perfectly nice workflow as long as you don’t mind it failing every now and then. The mistake is treating that workflow nicety as a boundary that holds against an adversary. Anything that needs a real boundary needs to live at the sandbox layer, not in a regex over command lines.
Fine-Grained Tokens
For sandboxes to be genuinely useful, the services the agent reaches from inside them — GitHub, CI, package registries, chat — need to support narrow credentials. “Read” and “write” aren’t enough resolution. What can be read, what can be written?
I want to be able to give an agent a token that can create issues but not push to protected branches. Or one that can read CircleCI build results from a couple of specific projects but not change their settings or look at unrelated repos. Or a Slack token that can post to one channel and nowhere else. The granularity I’d give a junior engineer for a focused task is roughly the granularity I want to give an agent — and that’s much finer than most platforms expose today.
GitHub’s fine-grained personal access tokens are a step in the right direction, but the granularity is uneven and the UI is painful enough that most people fall back to broad-scope tokens anyway. If we want agents to operate inside a tight sandbox without the human constantly stepping in to do the bits the sandbox can’t reach, the platforms the agent talks to need to make narrow, scoped credentials easy to issue and obvious to use.
The “ask permission for everything” model isn’t going to get fixed with a nicer dialog — it’s the wrong shape. Real protection looks like a sandbox the agent can move freely inside and credentials that can’t do much damage even if the agent goes off the rails. The sandbox tooling is rough today and the credential story is worse, but those are the levers worth pushing on, not another permission prompt.
Moolah Diaries: Vibe Coding a New Moolah
I got Claude to crunch the numbers on the moolah-native project — eight days of vibe coding a SwiftUI replacement for my long-running personal finance tracker, with zero Swift experience. The stats are interesting but they don’t really capture what the experience has been like, so here’s the human version.
Skipping the Learning Curve
Using AI to skip the initial learning curve of a new language and platform has been really freeing. I don’t know Swift. I don’t know SwiftUI. Normally I’d spend a ton of time working out the basic setup and finding which patterns and libraries to use before I could do anything useful, and that’s often enough to kill the motivation entirely. Instead I was working on actual domain problems from day one — straight to the interesting part. The learning curve didn’t go away, I just got to avoid it. Will this come back to bite me? Quite possibly.
Brainstorming with a Pretty Smart Duck
The brainstorming has been the standout. Claude Code has a brainstorming skill that walks you through refining designs interactively, and it’s been great. It found data sources for exchange rates, stock prices, and crypto prices that I’d missed even after searching pretty thoroughly myself. That opened up features I’d been uncertain about.
But the bigger win was at the design level. I’d previously looked at adding multi-currency support and even wrote a lot of the code, but never felt the design got the complexity/benefit trade-off right. Working through the brainstorming process produced a multi-leg transaction model that handles multi-currency much more cleanly and reduces the complexity around the existing transfer support. It’s rubber ducking, but with a reasonably smart duck (one that has opinions and can pull up documentation you didn’t know existed).
The Cost Problem
AI usage limits are incredibly annoying. A Claude Pro plan is basically useless for anything beyond toy code — you’ll hit limits within an hour of real work. Gemini’s free tier is surprisingly capable for casual use, but to sustain any real productivity you wind up needing Claude’s Max 20x plan, and that’s a lot of money for a side project that may be dormant for long periods. For a business the cost is a no-brainer: it makes expensive engineers far more productive, so it pays for itself. For personal use, when you have a family and look at costs in AUD, not so much.
The Bugs
Oh my god, the bugs. The analysis post puts the fix rate at 31%, compared to 2.6% for the hand-written server. That tracks, and maybe understates it. I’m pretty damn impressed with the low bug rate for moolah-server though, given it’s a side project and I always felt like I was cutting corners and skipping tests I should have written.
Today was the worst day yet. At some point recently, the AI had supposedly split iCloud sync profiles into separate CloudKit zones, but they were actually still in a single zone, overlapping with each other. We’d talked about exactly that risk, and it was super confident the design would work, even when pressed on it and told to do deep research. Fixing it required migrating to a lower-level sync API (CKSyncEngine), which then triggered a cascade of stability problems and took pretty much the whole day to sort out.
AI has the same instinct as a lot of engineers (including plenty of senior ones): when something breaks, keep throwing bandaids at the symptoms. Track more state. Add another deduplication pass. Cache another field. Each fix is locally reasonable but the complexity just ratchets up. I wound up having to step in and push it to throw away all the state tracking and just model the problem better so it automatically did the right thing. Less code, fewer bugs, simpler to reason about. But AI won’t get there on its own — it optimises for making the current test pass, not for finding a design where the test wouldn’t have failed in the first place.
Shipping is Still Hard
I’ve started entering transactions through the SwiftUI app but I’m still running the original moolah-server as the backend because I don’t trust the iCloud sync layer. Today is a pretty good example of why.
There’s also a subtler problem: vibe coding makes it very tempting to just keep churning out features because it’s fun to watch AI produce things. But every new feature is more unvalidated code, and that works against ever actually shipping. At some point you have to stop adding and start using. And I have nearly 15 years of historic data at risk, where detecting subtle corruption or a few messed-up transactions can be really hard.
It’s Fun Though
Coding was my hobby before it became my job. I still enjoy the job but it’s serious work with serious consequences. Vibe coding has brought back some of the fun of just playing with technology for its own sake — partly the quick results, but also just experimenting with a powerful new tool and learning how to get the most out of it.
The code quality concerns are real, the costs are high, and the bugs are maddening. But I built a functional multi-platform finance app in eight days with no platform experience, and I had a good time doing it.
Moolah Diaries: Letting AI Analyse Its Own Work
I’ve used custom software called Moolah to track my personal finances for many years now. Originally it was written in JavaScript, with moolah-server providing the backend and moolah the frontend. Over the holidays I’ve been vibe coding a replacement written entirely in Swift. It seemed like a useful experiment for learning more about the trade-offs of extreme AI usage, so I had Claude dig into the available stats from GitHub and its session logs to compare the two projects and see what we can learn. It’s not a controlled experiment, so there’s lots of room for interpretation, but it’s an interesting data point nonetheless.
The initial version it came out with was absurdly positive about AI, along the lines of “I wrote 10x the lines of code in 0.2% of the time, so more code is better!” But with a bit of prompting and some additional background it produced the report below. I’ll write up some more human thoughts on the experience later, but I think the AI-crunched numbers are worth sharing by themselves, both to set the scene and because the key learnings are genuinely useful.
The Cast
| Project | Tech | Purpose | Active dev days | Commits | LOC (prod) |
|---|---|---|---|---|---|
| moolah (web) | Vue.js | Web frontend | 149 days over 8.7 yrs | 763 | ~8,400 |
| moolah-server | Node.js/Hapi | REST API backend | 87 days over 8.8 yrs | 405 | ~2,800 |
| moolah-server-go | Go | Learning exercise, abandoned | 5 days | 21 | ~500 |
| moolah-native | SwiftUI | Native iOS/macOS app | 8 days | 369 | ~20,600 |
Active dev days = days with at least one non-dependency-maintenance commit. The web and server are one project across two repos — 70 days overlap — so combined unique effort is 180 days.
How Each Project Was Built
moolah & moolah-server were started June 24, 2017 by Adrian Sutton, with Brett Henderson contributing the initial web import. Every line across 1,168 combined commits was written by hand with zero AI involvement. Adrian had deep JavaScript experience and was inventing the domain model, API, database schema, and UX simultaneously — greenfield design work. The code is self-documenting: no significant documentation exists because none is needed, which makes the projects easy to pick up after months of absence with no risk of stale docs.
moolah-native was entirely AI-generated starting April 5, 2026. Adrian directed the work but has not read the code and has no Swift, SwiftUI, iOS, or macOS experience. Multiple AI agents were used, switching between them as rate limits were hit. 327 of 369 commits carry a Claude co-author tag; the remaining 42 “solo” commits are manual commits of AI-written code. Effectively 100% AI-authored by someone with no ability to review the output.
moolah-server-go was a 5-day learning exercise started during the holiday between jobs — a way to learn Go before a new role that required it. The goal was achieved regardless of the project being abandoned.
The Effort Question
Raw calendar span is misleading for side projects with multi-month dormancy periods. Active development days varied enormously in intensity:
| Session type | Web+Server (combined) | Native |
|---|---|---|
| Full day (6+ hrs) | 22 days | 8 days |
| Half day (3-5 hrs) | 55 days | 0 |
| Quick (1-2 hrs) | 140 days | 0 |
| Estimated total hours | ~600 hrs | ~84 commit-hours |
But these numbers aren’t comparable. The web+server hours are a developer actively writing and reasoning about code. The native app’s commit-hours are largely the AI working autonomously.
What the Session Logs Reveal
Claude Code keeps local session logs, giving a clearer picture:
| Metric | Value |
|---|---|
| Sessions | 105 |
| Human prompts | 1,496 |
| AI responses | 15,019 |
| AI responses per human prompt | 10:1 |
| Hours with 2+ concurrent sessions | 79% |
| Peak concurrent sessions | 12 |
For every human prompt, the AI averaged 10 responses — reading files, writing code, running tests, fixing issues, committing. Claude Code’s remote-control functionality allowed multiple agents to work in parallel while the human directed new sessions and reviewed completed ones.
Estimated human effort: 37-75 hours (at 1.5–3 minutes per prompt for reading output, thinking, and typing). That’s 6-12% of the web+server’s ~600 hours for 1.8x the code output — though the 600 hours produced code the developer understood and could maintain, while the 37-75 hours produced code no human has read.
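As a quick sanity check on that range, the arithmetic works out as follows (integer hours, rounding down):

```shell
# 1,496 prompts at 1.5-3 minutes (90-180 seconds) each, converted to hours.
echo "low:  $(( 1496 * 90 / 3600 )) hours"
echo "high: $(( 1496 * 180 / 3600 )) hours"
```

That lands at roughly 37 and 74 hours, matching the 37-75 hour estimate above.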
Development Patterns
The Original Build: Power-Law Decay
2017 ████████████████████████████ Explosive start (363 web / 134 server)
2018 ████████████ Feature completion (154 web / 73 server)
2019 █████ Category reports sprint, then silence
2020 ▌ Near-dormant (6 web / 14 server)
2021 ██ Sporadic revivals
2022 █ Sporadic
2023 ██ Investment features
2024 ████ Vue 3 migration / server modernization
2025 █ Maintenance mode
2026 ▌ (web) / █████ (native) The native app takes over
~50% of web commits in the first 6 months, ~70% in the first 18 months. Dormancy periods align across both repos — both go quiet and revive together, driven by holidays and life.
moolah-native: An Accelerating Curve
Day 1 (Apr 5) ██ 20 commits — scaffolding, CI, auth
Day 2 (Apr 6) █ 10 commits — accounts, transactions
Day 3 (Apr 7) █ 13 commits — currency, categories
Day 4 (Apr 8) ███ 28 commits — planning, CRUD, iCloud
Day 5 (Apr 9) ██████ 57 commits — profiles, investments, UI
Day 6 (Apr 10) ███████ 68 commits — contract tests, backend alignment
Day 7 (Apr 11) ███████ 70 commits — stock prices, performance
Day 8 (Apr 12) ██████████ 103 commits — crypto, multi-instrument, analysis
Each day produced more than the last. Day 8 alone exceeds most entire months of the original projects.
Are AI Commits Just More Granular?
No — they’re actually larger. The median native commit is 93 lines vs. 24 (web) and 36 (server). Squashing all commits within 1-hour windows gives 66 logical sessions, compared to 90 for the web app’s first 2 months. The high commit count reflects real throughput, not artificial granularity.
28 fix commits changed fewer than 10 lines — micro-patches a human would fold into the parent commit. These represent ~8% of commits. But even excluding them, the fix rate remains high, and the question of how many bugs were introduced and fixed within a session (never appearing in commit history) remains unanswered.
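The 1-hour squash can be sketched as a small awk filter over commit timestamps; the exact heuristic used in the analysis is an assumption here:

```shell
# Collapse commit timestamps (unix seconds, oldest first, e.g. from
# `git log --reverse --format=%ct`) into "logical sessions": a new
# session starts whenever the gap to the previous commit exceeds an hour.
count_sessions() {
  awk 'NR == 1 || $1 - last > 3600 { n++ } { last = $1 } END { print n + 0 }'
}

# Six commits in three bursts:
printf '%s\n' 1000 1100 1200 5000 5100 10000 | count_sessions
```

The same idea generalises to any window size by changing the 3600-second threshold.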
Is It Bloated? Language vs. Real Bloat
The native app is 1.8x the size of web+server combined. How much is language overhead vs. genuine bloat?
API calls show the starkest language difference:
| Operation | Swift (repo + DTO) | JS (client.js) |
|---|---|---|
| Fetch all accounts | ~67 lines | 5 lines |
| Create account | ~32 lines | 6 lines |
Swift requires DTO structs, Codable conformance, explicit mapping functions, typed error handling, and an explicit decode step. JS just calls fetch().
Models are closer than expected — Swift models are only ~30% larger than server DAOs. Web stores are sometimes larger because they mix model shape with mutation logic.
Breaking Down the 20,600 Lines
| Category | Lines | % | Notes |
|---|---|---|---|
| Language/platform overhead | ~5,400 | 26% | Types, inits, DTOs, CodingKeys, #Preview, platform conditionals |
| CloudKit offline backend | ~2,990 | 14% | Offline-first local computation; no web equivalent |
| Native-only features | ~1,750 | 9% | Crypto prices, data export, multi-platform layout |
| Equivalent application logic | ~10,460 | 51% | Would be ~6,500-7,500 lines in JS |
Rewriting only the web-equivalent functionality in JS would yield ~8,000-9,000 lines — close to the actual 11,200. The 1.8x multiplier is mostly language overhead and the offline backend, not AI-generated bloat.
The Remote backend is properly thin: 1,490 lines, of which only ~40 are business logic (2.7%). It constructs requests, decodes responses, maps to domain models. The CloudKit backend (2,990 lines) necessarily contains real logic — it must replicate server-side computation for offline use.
Defect Rates
| Project | Fix Commits | Fix Rate | Organic Fix Rate |
|---|---|---|---|
| moolah-native | 126 | 31% | 31% |
| moolah (web) | 80 | 10.5% | 6.2% (excl. migration breakage) |
| moolah-server | 15 | 3.5% | 2.6% (logic bugs only) |
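Fix rates like these can be approximated straight from commit subjects. This sketch assumes a simple "subject starts with fix" heuristic, which may differ from the heuristic the analysis actually used:

```shell
# Percentage of commits whose subject line starts with "fix"
# (case-insensitive). Feed it e.g.:  git log --format=%s | fix_rate
fix_rate() {
  awk 'tolower($0) ~ /^fix/ { f++ } { t++ } END { printf "%.1f%%\n", 100 * f / t }'
}

printf '%s\n' 'fix: crash on launch' 'add budget view' 'Fix typo' 'docs' | fix_rate
```

The example input has two fixes out of four commits, so it reports 50.0%.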
What Drives Each Project’s Bugs
Native — the generate-and-patch cycle. CategoryPicker: 7 fix commits + 2 complete rewrites in one day. Budget API: two consecutive fixes (wrong endpoint, then wrong UUID format). Empty budget: fixed to “top” alignment, immediately re-fixed to “center”. The pattern: AI generates → breaks → fixes → fix is wrong → fixes the fix. This can consume 5-10 commits for one feature.
Web — dependency breakage. 22 of 80 fixes from the 2024 Vue 3 migration. 7 of 10 reverts were failed dependency upgrades. Organic fix rate excluding migrations: 6.2%.
Server — remarkably stable. 11 logic fixes in 8.8 years, 5 in the same file (dailyBalances.js). Simple CRUD has essentially zero bugs.
The Unvalidated Iceberg
The 31% fix rate only counts bugs found during development. Much functionality remains unvalidated with no production usage and no human code review. The original projects have been in actual use for years — their bugs are known quantities.
Can We Trust AI-Written Tests?
The native app’s test suite is large (13,653 lines, 0.66:1 ratio) but size doesn’t equal value. When AI writes both implementation and tests, both can encode the same wrong assumption.
Five Cases Where Tests Validated Bugs
- Expense sign convention — Tests asserted expenses as positive; server uses negative. Both implementation and test had to change.
- Investment daily balances — Tests computed from value snapshots; correct behavior is cumulative from transactions. Entire test rewritten. The AI built a wrong mental model and tests faithfully encoded it.
- Scheduled transaction filtering — Tests expected scheduled transactions in regular lists. They should be excluded.
- Category deletion — Tests expected child reparenting; server orphans them. AI guessed “reasonable” behavior instead of checking.
- Return type mismatch — Tests asserted `Int`; the API returns `MonetaryAmount`.
33 fix commits (27% of fixes) required changing test expectations alongside the production fix — 33 times the test suite said “this is correct” when it wasn’t.
The TDD That Wasn’t
TDD was instructed from day 1. The AI ignored this for 5 days. Actual test-first behavior only appeared on day 6, when structured “superpowers” skills were installed — enforcement mechanisms stricter than plain-text instructions. Even then, TDD doesn’t help when the AI’s understanding of correct behavior is wrong: it just writes a wrong test first instead of second.
Where Confidence Actually Comes From
| Source | Confidence | Why |
|---|---|---|
| The server | High | 8.8 years of human-written tests and real-world use. When the native app talks to the server, correctness comes from the server. |
| Test architecture | Medium | Real backends (CloudKitBackend + in-memory SwiftData), not mocks. Structurally sound, but can still assert wrong expected values. |
| Manual testing | Medium | 60% of fixes were production-only (no test changes), meaning bugs were found through use, not tests. |
| Test expectations | Low-Medium | Strong regression protection, weak correctness verification. At least 33 demonstrated cases of tests encoding wrong behavior. |
| CloudKit backend | Low | Reimplements server logic with no human review. All 5 test-encoding-bugs were in this layer. |
The Dependency Divide
The native app has zero third-party packages. Everything comes from Apple’s SDK: SwiftUI, SwiftData, CloudKit, URLSession, Charts, XCTest, etc.
The JS projects have ~570 installed packages across ~25 direct dependencies, and 259 commits (22%) touch package.json. Libraries get abandoned (Vuex → Pinia, webpack → Vite, moment → date-fns), major versions break APIs (Vuetify 1→2→3→4 required 8+ commits with reverts), and transitive vulnerabilities create perpetual maintenance.
This directly killed momentum. The 287-day dormancy starting Dec 2018 follows a reverted dependency upgrade. The 303-day gap after Oct 2019 follows a failed migration. A weekend producing only a partially-working upgrade with no new features makes it hard to come back.
The native app avoids this entirely — for now. Apple’s SDK evolves on a predictable annual cycle, not the constant churn of the JS ecosystem.
The Rhythm of a Side Project
| Time Pattern | Web+Server | Native |
|---|---|---|
| Weekend commits | 37-51% | 52% |
| Longest gap | 303-331 days | 13 hours (sleep) |
The dormancy periods align across web and server — both go quiet and revive together, driven by holidays. The native app hasn’t hit its first dormancy yet.
The question isn’t whether it will slow down, but what happens when it does. The original projects are self-documenting — you pick them up after 10 months and the code tells you how it works. The native app is AI-generated and unread. AI might make re-entry easier (it can explain the codebase), but the owner has no independent ability to verify those explanations.
Key Insights
1. AI Changed Who Can Build, Not What Gets Built
The native app was built by someone with zero platform experience. AI made platform expertise optional for initial construction — but the resulting codebase is opaque to its owner in a way the original projects never were.
2. Speed and Quality Traded Off at 12:1
31% fix rate (native) vs. 2.6% (server). The generate-and-patch cycle reflects genuine instability, not just frequent commits.
3. AI Ignores Instructions Without Enforcement
TDD was instructed from day 1, ignored for 5 days. Only structured skill enforcement changed actual behavior. Plain-text instructions are suggestions, not constraints.
4. AI-Written Tests Can Validate Bugs
33 fix commits required changing test expectations — the tests were asserting buggy behavior was correct. When AI writes both sides from the same wrong model, tests provide false confidence. Good test architecture (real backends, no mocks) helps but doesn’t solve the problem.
5. The 1.8x Size Ratio Is Mostly Language, Not Bloat
~26% is Swift type system overhead, ~14% is the offline CloudKit backend (which the web app doesn’t have), ~9% is native-only features. The Remote backend is properly thin. Feature-level code is comparable to the web equivalents.
6. Plans Are a Supervision Mechanism, Not Documentation
The original projects need no documentation — the code is self-documenting. The native app has 46,700 lines of plans because AI-directed development needs an external record of intent. The AI frequently fails to fully execute plans, so keeping them lets you audit completeness. Plans aren’t documentation — they’re a quality control mechanism for an unreliable implementer.
7. The JS Dependency Treadmill Is a Real Cost
22% of all web+server commits are dependency maintenance. Failed upgrades killed momentum and contributed to dormancy. The native app’s zero-dependency approach avoids this entirely, though Apple’s evolution will eventually impose its own (more predictable) tax.
8. The Risk Is Opacity, Not Size
20,600 lines is a manageable codebase. The risk is that zero of those lines have been read by a human. If AI tools remain capable, this may work. If they don’t — or the codebase outgrows what AI can reason about — the project is stranded. The original projects carry no such risk: self-documenting code that anyone with JS experience can pick up.
9. Side Projects Have a Heartbeat Regardless of Tooling
Dormancy cycles are driven by life, not technology. AI may change the revival cost, but it doesn’t change the fundamental constraint that side projects compete with the rest of life for time and energy.
Making Claude Code Tell You What It's Doing
Claude Code has a status line that sits at the bottom of the terminal showing things like the current directory, git branch, model, and context window usage. It’s driven by a shell script that receives JSON on stdin and prints whatever it wants. I wanted to add one more thing: a short description of what the session is actually working on.
The Simple Way: /rename
The built-in /rename command sets a session name that Claude Code displays above the prompt. Type /rename fix auth bug at the start of each session and you’re done — no scripts needed.
The downside is that it’s manual, and /rename can’t be invoked programmatically by Claude. If you want Claude to automatically describe what it’s working on and update that description as the focus shifts, you need the automated approach below.
The Automated Approach
The goal is for Claude to write a short status like “fix auth bug” that shows up in the status line, updated automatically as the session’s focus changes:
```
op-claude (main) Opus ctx:8% · fix auth bug
```
This turns out to be harder than it should be. The status line script receives a JSON blob on stdin that includes the session_id. Claude’s bash tool calls don’t. There’s no $SESSION_ID environment variable, and $PPID differs between the two because they’re spawned through different process trees.
So we need a way for the status line side (which knows the session ID) to leave a breadcrumb that the bash side (which doesn’t) can find.
The Breadcrumb
Both the status line script and Claude’s bash calls have a common ancestor: the claude process. They just reach it through different paths. The trick is to walk up the process tree until you find a process named claude, then use its PID as a shared key.
A UserPromptSubmit hook runs on every user message and receives the session_id in its input. It walks the process tree to find the ancestor claude PID and writes a breadcrumb file mapping one to the other:
```bash
#!/usr/bin/env bash
# ~/.claude/hooks/session-status.sh
input=$(cat)
session_id=$(echo "$input" | jq -r '.session_id // empty')
[ -z "$session_id" ] && exit 0

# Write breadcrumb mapping ancestor claude PID -> session_id
pid=$PPID
while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
  comm=$(ps -o comm= -p "$pid" 2>/dev/null)
  if [ "$comm" = "claude" ]; then
    echo "$session_id" > "/tmp/claude-sid-${pid}"
    break
  fi
  pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
done

# If no status file exists yet, remind Claude to create one
if [ -f "/tmp/claude-status-${session_id}" ]; then
  exit 0
fi
jq -n '{
  "hookSpecificOutput": {
    "hookEventName": "UserPromptSubmit",
    "additionalContext": "STATUS LINE REMINDER: Run ~/.claude/update-status.sh \"short summary\" to set what this session is working on (under 30 chars)."
  }
}'
```
That last part is important. You can’t just tell Claude in your CLAUDE.md to “please update the status line” and expect it to reliably happen. The hook injects a reminder into the conversation context on every user message until a status file exists. Belt and suspenders.
Writing the Status
Claude calls a small helper script that does the same process-tree walk in reverse — finds the claude ancestor PID, reads the breadcrumb to get the session ID, then writes the status:
```bash
#!/usr/bin/env bash
# ~/.claude/update-status.sh "short summary"
msg="$1"
[ -z "$msg" ] && exit 1

pid=$$
while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
  comm=$(ps -o comm= -p "$pid" 2>/dev/null)
  if [ "$comm" = "claude" ]; then
    sid=$(cat "/tmp/claude-sid-${pid}" 2>/dev/null)
    [ -n "$sid" ] && echo "$msg" > "/tmp/claude-status-${sid}"
    exit 0
  fi
  pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
done
```
The Status Line Script
The full status line script reads the JSON from stdin, extracts the fields it cares about, and builds the output. The session status is just another part appended at the end:
```bash
#!/usr/bin/env bash
# ~/.claude/statusline-command.sh
input=$(cat)
cwd=$(echo "$input" | jq -r '.cwd // .workspace.current_dir // ""')
model=$(echo "$input" | jq -r '.model.display_name // ""')
used_pct=$(echo "$input" | jq -r '.context_window.used_percentage // empty')
vim_mode=$(echo "$input" | jq -r '.vim.mode // empty')
session_id=$(echo "$input" | jq -r '.session_id // empty')

# Per-session status from temp file keyed by session_id
session_status=""
if [ -n "$session_id" ]; then
  session_status=$(cat "/tmp/claude-status-${session_id}" 2>/dev/null || true)
fi

# Directory: basename of cwd
dir=$(basename "$cwd")

# Git branch (skip optional locks)
branch=""
if git_out=$(GIT_OPTIONAL_LOCKS=0 git -C "$cwd" symbolic-ref --short HEAD 2>/dev/null); then
  branch="$git_out"
fi

# Build status line parts
parts=()

# Directory in cyan
parts+=("$(printf '\033[36m%s\033[0m' "$dir")")

# Git branch in yellow if present
if [ -n "$branch" ]; then
  parts+=("$(printf '\033[33m(%s)\033[0m' "$branch")")
fi

# Model
if [ -n "$model" ]; then
  parts+=("$(printf '\033[90m%s\033[0m' "$model")")
fi

# Context usage with color thresholds
if [ -n "$used_pct" ]; then
  used_int=${used_pct%.*}
  if [ "$used_int" -ge 80 ] 2>/dev/null; then
    color='\033[31m'
  elif [ "$used_int" -ge 50 ] 2>/dev/null; then
    color='\033[33m'
  else
    color='\033[32m'
  fi
  parts+=("$(printf "${color}ctx:%s%%\033[0m" "$used_int")")
fi

# Session status (per-session work summary)
if [ -n "$session_status" ]; then
  parts+=("$(printf '\033[90m· %s\033[0m' "$session_status")")
fi

# Vim mode
if [ -n "$vim_mode" ]; then
  parts+=("$(printf '\033[90m[%s]\033[0m' "$vim_mode")")
fi

printf '%s' "${parts[*]}"
```
Wiring It Up
Make both scripts executable:
chmod +x ~/.claude/statusline-command.sh ~/.claude/update-status.sh ~/.claude/hooks/session-status.sh
Register the status line and hook in ~/.claude/settings.json:
{
  "statusLine": {
    "type": "command",
    "command": "bash ~/.claude/statusline-command.sh"
  },
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/session-status.sh"
          }
        ]
      }
    ]
  }
}
And add the instruction to your CLAUDE.md that tells Claude when to update:
## Session Status Line
Update the session status line so the user can see what each session is working on at a glance.
- **After the first user prompt**: run `~/.claude/update-status.sh "short summary"` as part of your first response
- **Periodically**: run it again every ~5 interactions or when focus shifts
- Keep summaries under 30 chars
What I Learned
The interesting constraint here is that Claude Code’s extensibility points — status line scripts, hooks, and bash tool calls — all run as separate processes with no shared environment. There’s no session ID in the environment, no shared memory, no IPC channel. The process tree walk is a hack, but it’s a reliable one. Every subprocess of a Claude Code session shares a common claude ancestor, even if the paths diverge.
The other lesson is that CLAUDE.md instructions alone aren’t enough for “always do X” behaviors. Claude follows them inconsistently, especially across sessions. Hooks that inject reminders into the conversation context are much more reliable. The CLAUDE.md instruction tells Claude what to do; the hook makes sure it actually does it.
Claude Docker
I’ve been using Claude Code a lot lately. It’s become a core part of how I work — planning changes, exploring unfamiliar codebases, writing and reviewing code. But giving an AI agent the ability to run arbitrary shell commands on your machine does make you think a bit more carefully about what’s happening on your host system.
The natural answer is to run it in a container. Not as a security boundary — Claude still needs access to your code, your git config, a GitHub token, and the internet — but as a way to keep all the side effects contained. If it installs random packages, creates temp files, or leaves build artifacts scattered around, that’s all happening inside the container rather than on your actual machine. It also makes the environment completely reproducible and disposable. Something goes wrong? Tear it down and rebuild.
So I built claude-docker to do exactly that.
How It Works
An Ubuntu container runs an SSH server. Your code directory is bind-mounted at the same path inside the container so file references are identical on both sides — Claude can say “edit /Users/aj/Documents/code/foo/bar.go” and it works whether you’re looking at it from inside or outside the container. Your git config, Claude config, and known hosts are all mounted in too, so everything just works as expected.
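As a sketch of the mounting scheme (hypothetical paths and image name - not the project's actual compose file, which is generated by run.sh from your .env), the key idea is that the code directory is mounted at the identical absolute path on both sides:

```yaml
services:
  claude:
    image: ubuntu:24.04            # the real image adds an SSH server and dev tools
    volumes:
      # Same absolute path inside and outside, so file references match exactly
      - /Users/aj/Documents/code:/Users/aj/Documents/code
      # Read-only config mounts so git, Claude and SSH "just work"
      - ~/.gitconfig:/root/.gitconfig:ro
      - ~/.claude.json:/root/.claude.json:ro
      - ~/.ssh/known_hosts:/root/.ssh/known_hosts:ro
    ports:
      - "2222:22"                  # SSH in via localhost:2222
```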
The container comes pre-loaded with the usual development tools: Go, Node.js, mise, gopls, gh, ripgrep, fzf, tmux, and a bunch of others. There’s an EXTRA_PACKAGES option if you need anything else — set it in your .env and it gets installed on the next build.
A be-claude helper script SSHs into the container and launches Claude Code in whatever directory you’re currently in. Symlink it onto your PATH and it works from anywhere. It automatically passes through a GitHub token (from gh auth token or the environment) so Claude can interact with GitHub inside the container.

Build Caches
One thing I wanted to get right was build cache persistence. Rebuilding the container shouldn’t mean re-downloading every Go module and Cargo crate. A single named Docker volume is mounted at ~/.cache and environment variables redirect the various tool caches into it:
- Go module cache via GOMODCACHE
- Cargo registry via CARGO_HOME
- Solc binaries via SVM_HOME
- Foundry and mise already use ~/.cache by default
So you get fast rebuilds without the volume shadowing any binaries installed in the image (like gopls). The distinction matters — you want caches persisted but binaries to come fresh from each build.
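As an illustrative Dockerfile fragment (paths are made up - the repo's actual Dockerfile may differ), the redirection is just a shared volume at ~/.cache with each tool's cache pointed into it:

```dockerfile
# Illustrative: redirect tool caches into the one persistent volume
ENV GOMODCACHE=/home/dev/.cache/go-mod \
    CARGO_HOME=/home/dev/.cache/cargo \
    SVM_HOME=/home/dev/.cache/svm
# Foundry and mise already cache under ~/.cache by default
VOLUME /home/dev/.cache
```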
Getting Started
The setup is pretty minimal:
git clone git@github.com:ajsutton/claude-docker.git
cd claude-docker
cp .env.example .env
# Edit .env — set CODE_PATH to your code directory
./run.sh
./be-claude
If you have SSH keys loaded in your agent, you don’t even need to configure SSH_AUTHORIZED_KEYS — run.sh picks them up automatically.
If your network requires custom root CAs (corporate proxies, internal domains, etc.), drop .crt files into the certs/ directory and they get installed into the container’s trust store on the next build. The directory is gitignored so your certificates stay local.
What It Isn’t
This is a convenience layer, not a security sandbox. Claude has read/write access to your mounted code, a GitHub token, and unrestricted network access. It’s useful for keeping your host system clean and making the environment reproducible, but don’t treat the container boundary as a trust boundary.
The code is up at github.com/ajsutton/claude-docker — it’s intentionally simple and easy to customise for your own setup.
Yes, I was too lazy to write this post myself and got Claude to do it for me. The whole world is just AI slop now.
Types of Tech Debt
The Optimism blog has published an article I wrote discussing the various types of tech debt. I've been finding it very useful lately to be able to put better words to things, whether that's naming concepts more precisely or just explaining them more clearly.
Teku Event Channels
Teku uses a really nice framework for separating different components - Event Channels. It’s based on similar patterns in the Sail library used at LMAX for sending network messages between services. In Teku though, it’s designed to work in-process while still decoupling the components in the system. Turns out I never wrote about it here, so I’m very belatedly catching up.
Event channels are defined by declaring a pretty standard interface:
public interface SlotEventsChannel extends VoidReturningChannelInterface {
void onSlot(UInt64 slot);
}
There are a few simple restrictions:
- It must extend from VoidReturningChannelInterface (or ChannelInterface, but we'll get to non-void returning cases later)
- All methods must return void
- Methods cannot throw any exceptions
There can be any number of methods on the same interface and any number of subscribers to the channel.
The implementing side simply implements the interface, and the calling side has an implementation of the interface injected and calls it as normal. So far, this isn't actually providing any real separation - it's just using a Java interface. You can pass the concrete implementation of the interface to the calling side and it will all work. The interface provides some decoupling between the caller and receiver, but they're still coupled temporally because the call is synchronous, and exceptions on the receiving side would propagate back up through the calling side. Both can be fixed to isolate the components fully, but then you'd have to do that at every call site.
Instead, the event channel system uses reflection to generate an implementation of the interface that ensures complete isolation between caller and receiver. The generated implementation is passed to the caller and implements each method by passing the work to a thread pool, which then calls the actual implementation. It also provides error handling and records metrics to give visibility into the event system. While reflection is used to generate the implementation, most of the code is in abstract classes that the generated implementations extend, so it's easy to maintain. Importantly, the complexity of that reflection is abstracted away from the code using the framework - it's just like an interface where part of the API contract is that calls are always asynchronous and never throw any exceptions. The code for the framework is quite small, all in the infrastructure.events package.

Calls to the interface are added to the queue the thread pool takes work from, in call order. So if the thread pool has a single thread, the calls will all be processed in exactly the same order they were made. In most cases there are multiple threads in the thread pool, so processing happens in parallel (but starts in order); for cases like the StorageUpdateChannel where event order is important, a single thread is used.
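The core of that generated implementation can be sketched with a JDK dynamic proxy (a hypothetical simplification - Teku's real code generates classes that extend shared abstract base classes and adds metrics and proper error handling). Every call on the channel interface is queued onto an executor, so the caller never blocks on the subscriber and never sees its exceptions:

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the idea behind Teku's event channels: a dynamic
// proxy that forwards every interface call onto an executor, fully
// decoupling caller from subscriber.
public class EventChannelSketch {
  interface SlotEventsChannel { void onSlot(long slot); }

  @SuppressWarnings("unchecked")
  static <T> T createAsyncPublisher(Class<T> channel, T subscriber, ExecutorService executor) {
    return (T) Proxy.newProxyInstance(
        channel.getClassLoader(),
        new Class<?>[] {channel},
        (proxy, method, args) -> {
          // Queue the call; any exception from the subscriber stays here
          // rather than propagating back to the caller.
          executor.execute(() -> {
            try {
              method.invoke(subscriber, args);
            } catch (Exception e) {
              // The real implementation would log and record metrics.
            }
          });
          return null; // all methods on this channel type return void
        });
  }

  public static void main(String[] args) throws Exception {
    CountDownLatch latch = new CountDownLatch(1);
    ExecutorService executor = Executors.newSingleThreadExecutor();
    SlotEventsChannel subscriber = slot -> latch.countDown();
    SlotEventsChannel publisher =
        createAsyncPublisher(SlotEventsChannel.class, subscriber, executor);
    publisher.onSlot(42); // returns immediately; work happens on the executor thread
    if (!latch.await(5, TimeUnit.SECONDS)) throw new AssertionError("callback not invoked");
    executor.shutdown();
  }
}
```

A single-threaded executor, as above, gives the strict-ordering behaviour described for StorageUpdateChannel.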
The VoidReturningChannelInterface is an ideal case for maximum decoupling of components - the sender is just notifying when events happen and forgetting about them. But often we need to request data from another component or be able to handle failures - the storage system in Teku is a decoupled component, for example. In that case we use an interface that just extends ChannelInterface. Then methods are allowed to return SafeFuture - the promise type used in Teku. Exceptions are still not allowed, but the returned SafeFuture can be used to return error information as part of the result. The same implementation approach applies - reflection is used to generate an implementation that calls the real implementation via a thread pool - but now when the real implementation completes, the result is used to complete the originally returned SafeFuture. For example:
public interface Eth1DepositStorageChannel extends ChannelInterface {
SafeFuture<ReplayDepositsResult> replayDepositEvents();
SafeFuture<Boolean> removeDepositEvents();
}
Note that the actual implementation still provides a method that returns SafeFuture, which allows it to use an asynchronous implementation when suitable. It can also just use SafeFuture.completedFuture(value) to easily return a value synchronously. The event system will now only allow a single subscriber to the topic, to ensure it knows where the result value should come from. Since publishers and subscribers are created at startup, adding multiple subscribers means Teku fails to start and a lot of tests fail, so it won't go unnoticed.
There’s a bunch of nice things about this framework:
- EventChannels have “click-throughability”. You can easily jump from a call on the interface to the actual implementation (or see all implementations) using an IDE's go-to-implementation functionality. The details of how the decoupling is implemented are all abstracted away.
- The ability to return a value asynchronously is much easier to reason about than having to send responses via a separate event. The request/response is clearly coupled together in the interface rather than piecing together two independent events.
- For testing, the channel interface can be easily mocked, a synchronous event channel passed or a custom stub provided.
One particularly neat trick in Teku is that the validator client can run either within the Teku beacon node process or as a separate process. It's the event channel system that makes that work. The validator client was originally built in-process but as its own component, so all calls to or from it were completely asynchronous and decoupled through the event channel interfaces. To make it run as an external process, we simply wrote an implementation of the channels it called that worked by sending HTTP requests to the beacon node API rather than using the in-process generated ones. The calls to the validator client were all timing information, like the SlotEventsChannel above. For most of those we simply wrote a new publisher that ran on an independent timer inside the validator client. The few that actually depended on the state of the beacon node were produced by subscribing to the beacon node API event stream and sending events based off of that.
The main downside is that the asynchronicity of the call isn't visible in the actual code (only in the reflection-generated implementation). That's why, by convention in Teku, channel interfaces and the variables for them are always suffixed with Channel, so it is clear that asynchronicity is part of the API contract. It isn't immediately obvious to people new to the codebase, but it's quick to learn and easy to remember, so I don't recall it ever causing any problems in practice.
Ultimately event channels are a pretty simple system that provides a lot of power and flexibility.
Home Lab
One of the downsides of moving on from working on the Ethereum consensus layer is that you often need a fully synced execution node, and execution clients don't have the near-instantaneous checkpoint sync that consensus clients do. So recently I bit the bullet and custom-built a PC to run a whole bunch of different Ethereum chains on. I'm really quite happy with the result.
There's actually a really good variety of public endpoints available for loads of Ethereum-based chains these days, so while running your own node is maximally decentralised, it's no longer just a choice between Infura or your own node. Public Node provides very good free JSON-RPC and consensus APIs, and Alchemy and QuickNode both have quite usable free tiers. The downside with all of them, though, is that their servers are in the Americas or Europe, and that's a whole lot of latency away from Australia. When you're syncing L2 nodes, or particularly when running fault proof systems, you wind up making a lot of requests, and that latency becomes very painful very quickly. More than anything, it was wanting to avoid that latency that drove me to run my own nodes locally.
To be useful though, I really want it to run quite a few different chains. Currently it’s running:
- Ethereum MainNet
- Ethereum Sepolia
- OP Mainnet
- OP Sepolia
- Base Mainnet
- Base Sepolia
I'm quite tempted to add a Holesky node just so I can run some validators again - it's a shame most of the L2 stacks and apps use Sepolia, since it has a locked-down validator set.
Hardware-wise, running this many nodes is primarily about disk space, so I wound up with an MSI Pro Z790-P motherboard, which has a rather ridiculous number of ports you can plug SSDs into - not all at full speed, but plenty at fast-enough speeds. It's been nearly 20 years since I built a custom PC, so there are likely a bunch of things that aren't the perfect trade-offs, but I'm quite happy with the overall result. One mistake I'm actually happy about: I mistook the case size names and wound up with a much larger case than I expected. That does give it capacity to shove a heap of spinning-rust drives into, useful for things like historic data that doesn't need the fast disk. It's got an Intel Core i7 CPU which is barely being used. I had wanted 128GB of RAM, since Ethereum nodes do like to cache stuff, but apparently using 4 sticks of RAM can cause instability, so I've stuck to 64GB for now. It seems to be plenty, but is probably the main limiting factor at the moment. For disk it currently has two 4TB NVMe drives.
For software, the L1 consensus nodes are obviously all Teku, and they're doing great. The team has done a great job continuing to improve things since I left, so even with the significant growth in the validator set, it's running very happily with less memory and CPU than it had "back in my day". The L1 Mainnet execution client is a reth archive node, which has been quite successful. I did try a reth node for Sepolia but hit a few issues (which I think have now been fixed), so I've wound up running executionbackup with both geth and reth for Sepolia.
The L2 nodes are all op-node and op-geth - always good to actually run the software I’m helping build. For OP Sepolia, I’m also running op-dispute-mon and op-challenger to both monitor the fault proof system and participate in games to ensure correct outcomes. I really do like the fact that OP fault proofs are fully permissionless so anyone can participate in the process just like my home lab now does.
For coordination, everything is running in Docker via docker-compose, which made it much easier to avoid all the port conflicts that would otherwise occur. Each network has its own docker-compose file, though there's a bunch of Docker networks shared between chains so the L2s can connect to the L1s and everything can connect to metrics. All the compose files and other config are in a local git repo with a hook set up to automatically apply any changes, so I've wound up with a home-grown GitOps kind of setup. I did try using k8s with ArgoCD to "do it properly" at one point, but it just made everything far more complex and less reliable, so I switched back to simple docker-compose.
For monitoring, I've got VictoriaMetrics capturing metrics and Loki capturing logs - both automatically pick up any new hosts. Then there's a Grafana instance to visualise it all. I even went as far as running ethereum-metrics-exporter to give a unified view of metrics when using different clients.
The final piece is an nginx instance that exposes all the different RPC endpoints at easy-to-remember URLs, i.e. /eth/mainnet/el, /eth/mainnet/cl, /op/mainnet/el, etc. All the web UIs for the other services like Grafana are exposed through the same nginx instance. My initial build exposed all the RPCs on different ports, and it was a nightmare trying to remember which chain was on which port, so the friendly URLs have been a big win.
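The routing is simple enough to sketch as an nginx config fragment (upstream ports and the Grafana path here are illustrative, not my actual config): each friendly path just proxies to the port the relevant container listens on.

```nginx
# Hypothetical routing: friendly paths in front of per-chain RPC ports
server {
    listen 80;

    location /eth/mainnet/el { proxy_pass http://127.0.0.1:8545; }
    location /eth/mainnet/cl { proxy_pass http://127.0.0.1:5051; }
    location /op/mainnet/el  { proxy_pass http://127.0.0.1:9545; }

    # Web UIs ride through the same instance
    location /grafana/ { proxy_pass http://127.0.0.1:3000/; }
}
```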
Overall I'm really very happy with the setup, and it is lightning fast even for quite expensive queries like listing every dispute game ever created. Plus it was fun to play with some “from scratch” system admin again, instead of doing everything in the cloud with existing templates and services already set up.
Moving On From ConsenSys
After nearly 5 years working with the ConsenSys protocols group, I’ll be finishing up at the end of January.
So what happens with Teku? It will carry on as usual and keep going from strength to strength. There’s an amazing team of people building Teku and I have complete confidence in their ability to continue building Teku and contributing to the future of the Ethereum protocol. Teku started well before I was involved with it and has always been the work of an amazing team of people. I just wound up doing a lot of the more visible stuff - answering discord questions and reacting to the ad-hoc stuff that popped up.
My time at ConsenSys actually started by working on Besu, back before its initial release when it was called Pantheon. I was part of the team adding the initial support for private networks and then later moved over to join the team focussed on MainNet compatibility, with work on things like fast sync, core EVM work and all that kind of fun. After that I got to help build a new team to focus on setting up tooling to make development and testing easier - modernising build and release systems, automated deployment and monitoring of test nodes and so on.
Then this “Ethereum 2.0” thing seemed like it might actually be ready to move out of the research phase and move towards production. So I joined the research team that was building “Artemis” to start bringing it out of research and to a real production-ready client. Most of the research team moved on to other research topics and we built a mostly new team around what we then called Teku. And so began one heck of a journey leading to the beacon chain launch, Altair and then The Merge. Hearing the crowd cheering in support of the merge at DevCon this year is one of the great highlights of my career.
I’m so lucky to have gotten to work with some truly amazing people. The folks who have been part of the Teku team along our journey share a truly special place in my heart though and I will always be grateful for the shared knowledge, persistence and dedication they have all contributed but even more so the caring, friendly way they contributed it. It’s not just the teams in ConsenSys but right across the Ethereum eco-system. The way the different consensus client teams have come together to push Ethereum forward is particularly amazing. These are ostensibly teams that are competing with each other and yet actively share knowledge to improve both the protocol and other team’s clients.
As I leave ConsenSys, I do so knowing that there are teams of incredible people who will carry on with the work I’m so privileged to have been able to contribute to.
So why the change? Mostly because this is a good time for me personally. As I mentioned, I started working on Teku to bring it out of research and into production. Getting The Merge done is a natural endpoint of that mission and a natural place to start looking for new challenges and opportunities. Obviously there are plenty of remaining things to improve in the Ethereum protocol and clients like Teku, but I’m keen to get a bit further out of my comfort zone.

So what’s next? I’ll be taking up a role as Staff Protocol Engineer with OP Labs to work on Optimism. I started looking at opportunities at Optimism because I’ve seen some of the great work they’ve been doing and I really like their retroactive public goods funding - it shows they’re investing in Ethereum, not just taking what they can get from it. Primarily though for me, finding a great place to work is about finding a great team of people doing interesting work. As I talked with various people from the Optimism team, I found them to be smart, curious, welcoming people who not only wanted to build great software but also wanted to keep improving the way they went about that. Plus I’ll be staying in the Ethereum eco-system so still get to work with all those amazing people. I can already see there’s a ton of stuff I can learn from the Optimism team and I think there’s places where I can bring some useful skills and experience beyond just writing some code.
In fact, given they mostly use Go and I have no real Go experience, "just writing some code" will be one of the first fun challenges. Java has kind of followed me throughout my career, not entirely deliberately, though I do like it as a language, so I'm actually excited to really dig into writing production-grade Go code.
Philosophically, one of the things I dislike about Ethereum (and blockchains in general) is that the high cost of transactions means it often becomes a rich person’s game and it often feels like people just throwing play money around. L2 solutions like Optimism are a big part of solving that by scaling blockchains and dramatically reducing fees. It feels good to me to be contributing to that. So much of the potential of Ethereum is waiting to be unlocked once it really scales. Besides, having worked on execution and consensus layers so far, moving to Layer 2 seems like an obvious next step.
Overall, I’m excited about the future of Teku and will be cheering the team on, and excited about the future of Ethereum and look forward to being part of delivering The Surge.
DevCon VI Talks
Mostly just so that I can find the recordings more easily later, here are the recordings of the DevCon VI talks I gave in Bogotá.
Firstly, Post-Merge Ethereum Client Architecture:
And a panel, It’s 10pm, do you know where your mnemonic is?