marcuss.pro

Scale Yourself Report / Owner application / 2026

How I scaled myself with AI

This is how I've been working for the past year. The clearest example is one side project: I shipped LoveCompass, a wellness app for couples that is live on web, iOS and Android, as 1,001 merged PRs in three months, with a human gate at each feature development cycle and each strategic decision.

Primary focus: LoveCompass  ·  Mar 09 – Jun 06, 2026
1,001 merged PRs  ·  3 mo time to ship  ·  11/day sustained avg  ·  67 peak single day

01

Summary

The volume came from building the right scaffolding and then getting out of its way. AI agents handle code review across four dimensions, a deterministic CI pipeline gates every merge, and a dashboard keeps everything visible. I step in for feature decisions and direction — the rest runs. Along the way I picked up patterns I find genuinely interesting: Deterministic Simulation Testing, saga orchestration, intuition engineering, RAG-based self-improvement.

02

Merge velocity

The required chart, and the shape of the work: a heavy build-out in March, refinement in April, then sustained shipping through May.

LoveCompass — merged PRs per month
  • Mar 2026 (build-out): 419
  • Apr 2026 (refinement): 189
  • May 2026 (sustained): 338
  • Jun 2026 (partial): 55

Total: 1,001 PRs  ·  browse the full searchable log of all 1,001 →
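
As a quick sanity check, the headline numbers can be recomputed from the monthly counts and the reporting window stated in this report. This is plain arithmetic on the published figures; the variable names are illustrative only.

```python
from datetime import date

# Monthly merged-PR counts as reported in the chart.
monthly = {"Mar 2026": 419, "Apr 2026": 189, "May 2026": 338, "Jun 2026": 55}
total = sum(monthly.values())

# Reporting window, counting both the first and the last day.
start, end = date(2026, 3, 9), date(2026, 6, 6)
days = (end - start).days + 1

avg_per_day = total / days
print(f"{total} PRs over {days} days = {avg_per_day:.1f} per day")
```

The monthly counts add up to the stated total, and the window works out to roughly eleven merges per day.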

03

The product: LoveCompass

A full-stack wellness app for couples, shipped across three platforms using AI-assisted development pipelines — automated multi-dimensional review backed by deterministic testing.

Platforms: Web · iOS · Android
Live: couplesapp.nextasy.co
Window: Mar 09 – Jun 06, 2026
Throughput: 1,001 PRs · 11/day avg · 67 peak

04

How it shipped: the pipeline

Here's what's actually running under the hood. These aren't magic — each one is a thing I had to build, debug, and tune before it was useful.

Agentic review

Agent-assisted code review

At this PR volume, manual review on every dimension isn't realistic. Four agents run in parallel, each focused on one area, and iterate with the dev up to three times per PR until their gate passes:

  • Security — OWASP Top 10, credential leaks, dependency CVEs
  • Architecture — ADR alignment, design patterns, boundaries
  • Performance — latency, memory, throughput, algorithmic cost
  • Test quality — assertion depth, E2E coverage, flakiness
How I think about it: the agents handle the consistent, repeatable checks — I focus on whether the feature is the right thing to build and on strategic decisions at each cycle boundary.
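
The review loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `Reviewer` and `Fixer` callables, the `Verdict` type, and the demo stubs are all hypothetical, and each dimension revises its own copy of the diff, whereas a real setup would need to reconcile fixes across dimensions.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List

MAX_ROUNDS = 3  # "up to three times per PR"
DIMENSIONS = ("security", "architecture", "performance", "test_quality")

@dataclass
class Verdict:
    dimension: str
    passed: bool
    rounds: int

# Hypothetical interfaces: a reviewer returns a list of findings (empty means
# the gate passes); a fixer returns a revised diff that addresses the findings.
Reviewer = Callable[[str, str], Awaitable[List[str]]]
Fixer = Callable[[str, List[str]], Awaitable[str]]

async def review_dimension(dimension: str, diff: str,
                           review: Reviewer, fix: Fixer) -> Verdict:
    """Review one dimension, iterating with the dev agent up to MAX_ROUNDS."""
    for round_no in range(1, MAX_ROUNDS + 1):
        findings = await review(dimension, diff)
        if not findings:
            return Verdict(dimension, True, round_no)
        if round_no < MAX_ROUNDS:
            diff = await fix(diff, findings)
    return Verdict(dimension, False, MAX_ROUNDS)

async def review_pr(diff: str, review: Reviewer, fix: Fixer) -> List[Verdict]:
    """Run all four dimensions concurrently; results keep DIMENSIONS order."""
    tasks = (review_dimension(d, diff, review, fix) for d in DIMENSIONS)
    return list(await asyncio.gather(*tasks))

# Demo stubs standing in for real agents.
async def demo_review(dimension: str, diff: str) -> List[str]:
    # Toy rule: the security reviewer objects until the diff says "sanitized".
    if dimension == "security" and "sanitized" not in diff:
        return ["unsanitized input"]
    return []

async def demo_fix(diff: str, findings: List[str]) -> str:
    return diff + " sanitized"

if __name__ == "__main__":
    for verdict in asyncio.run(review_pr("add signup form", demo_review, demo_fix)):
        print(verdict)
```

Running the four dimensions with `asyncio.gather` keeps a slow reviewer from blocking the others, and the round cap bounds how long any single PR can stay in review.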

Quality gates

Deterministic testing infrastructure

Model behaviour is non-deterministic, so it can't serve as a correctness gate. I kept the gates separate, in a three-phase pipeline that runs the same way every time, regardless of what the agents do:

  1. Husky (pre-commit) — local lint, format, fast rules before push.
  2. Self-hosted runners — full unit / integration / E2E suite across 3 platforms; blocks merge until green.
  3. ARC cluster (Kubernetes) — ephemeral runners auto-scale with PR volume: E2E, architecture validation, security scanning, performance baselines. Peak: 20+ concurrent jobs.

Result: 100% deterministic validation on all 1,001 merges. Zero AI decision-making on merge eligibility — only human-defined gates. Snapshot (Jun 06): K8s cluster running 20 pods, ~1.1 merges/hour at peak.
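
One way to read "zero AI decision-making on merge eligibility" is that the merge decision is a pure function of deterministic check results, with agent output never consulted. The sketch below assumes that reading; the check names and the `PullRequest` shape are hypothetical, not the real configuration.

```python
from dataclasses import dataclass, field
from typing import Dict

# One entry per pipeline phase; these names are illustrative.
REQUIRED_CHECKS = ("pre_commit", "runner_suite", "arc_cluster")

@dataclass
class PullRequest:
    number: int
    # check name -> "success" | "failure" | "pending"
    check_results: Dict[str, str] = field(default_factory=dict)
    # Agent review output is carried along but never read by the gate.
    agent_opinions: Dict[str, str] = field(default_factory=dict)

def merge_eligible(pr: PullRequest) -> bool:
    """True only when every required deterministic check reports success."""
    return all(pr.check_results.get(name) == "success" for name in REQUIRED_CHECKS)

if __name__ == "__main__":
    green = PullRequest(
        number=1,
        check_results={name: "success" for name in REQUIRED_CHECKS},
        agent_opinions={"security": "looks risky"},
    )
    print(merge_eligible(green))  # agent opinion does not change the outcome
```

A missing or pending check counts as not eligible, so the gate fails closed.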

Figure: live ARC cluster, Jun 06, 2026
  • ARC Runners Pods: K8s namespace with 16 active pods running CI workflows
  • CI Dashboard: real-time job queue (E2E tests, linting, merge-train batches)

The separation that matters: AI makes the review fast; deterministic infrastructure makes the result trustworthy. Both need to be there.

Observability

IntuitionOps — real-time pipeline dashboard

The thing I didn't anticipate: agents running 24/7 are hard to watch. Non-deterministic behaviour, surprise token costs, agents deadlocking — you only find out from the bill or a stuck queue. I built IntuitionOps to make the pipeline readable at a glance, borrowing a few old distributed-systems ideas:

  • Deterministic replay (FoundationDB) — scrub real pipeline history minute-by-minute
  • Saga orchestration (AWS pattern) — multi-step agent coordination with failure handling
  • Intuition engineering (Netflix Flux) — visual flow you read at a glance, not just numbers
  • Multi-agent tracking (ClawLibrary) — each agent's context usage and state
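
To make the saga idea from the list above concrete, here is a minimal sketch of orchestration with compensation: each step pairs an action with an undo, and a failure rolls back the completed steps in reverse order. The step names and the context dictionary are hypothetical, not taken from IntuitionOps.

```python
from typing import Callable, Dict, List, Tuple

Action = Callable[[Dict], None]
Step = Tuple[str, Action, Action]  # (name, action, compensation)

def run_saga(steps: List[Step], ctx: Dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Tuple[str, Action]] = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except Exception as exc:
            ctx["failed_step"] = name
            ctx["error"] = str(exc)
            for _, undo in reversed(completed):
                undo(ctx)
            return False
        completed.append((name, compensate))
    return True

# Hypothetical pipeline steps for one issue.
def claim_issue(ctx: Dict) -> None:
    ctx["claimed"] = True

def release_issue(ctx: Dict) -> None:
    ctx["claimed"] = False

def open_branch(ctx: Dict) -> None:
    ctx["branch"] = f"issue-{ctx['issue']}"

def delete_branch(ctx: Dict) -> None:
    ctx.pop("branch", None)

def failing_review(ctx: Dict) -> None:
    raise RuntimeError("review agent timed out")

def noop(ctx: Dict) -> None:
    pass

if __name__ == "__main__":
    ctx = {"issue": 42}
    ok = run_saga(
        [("claim_issue", claim_issue, release_issue),
         ("open_branch", open_branch, delete_branch),
         ("run_review", failing_review, noop)],
        ctx,
    )
    print(ok, ctx)
```

The point of the pattern is that a stuck or failed agent leaves the issue released and the branch cleaned up, rather than leaving half-finished state behind.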

Result: token cost is a first-class signal, stuck agents light up instantly, and infinite agent loops show as cost climbing while merges stay flat. Read the full write-up →
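
The "cost climbing while merges stay flat" signal can be expressed as a simple window check over cumulative counters. This is a sketch under assumptions: the `Sample` shape, the 30-minute window, and the token threshold are all illustrative values, not the dashboard's real settings.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    minute: int         # minutes since the start of observation
    tokens_total: int   # cumulative tokens spent so far
    merges_total: int   # cumulative merged PRs so far

def looks_like_runaway_loop(samples: List[Sample],
                            window: int = 30,
                            min_token_growth: int = 200_000) -> bool:
    """Flag a recent window where token spend grows but no PR merges.

    Assumes samples are sorted by minute, oldest first.
    """
    if len(samples) < 2:
        return False
    latest = samples[-1]
    in_window = [s for s in samples if latest.minute - s.minute <= window]
    first = in_window[0]
    token_growth = latest.tokens_total - first.tokens_total
    merge_growth = latest.merges_total - first.merges_total
    return token_growth >= min_token_growth and merge_growth == 0

if __name__ == "__main__":
    stuck = [Sample(0, 0, 2), Sample(15, 150_000, 2), Sample(30, 300_000, 2)]
    print(looks_like_runaway_loop(stuck))
```

Comparing the two counters over the same window is what separates a busy pipeline (both rise) from a looping agent (only cost rises).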

Figure: IntuitionOps dashboard. Real-time saga visualization: pipeline stages (intake → merged), token cost (5.60M), merge throughput (1.1/hr), Deterministic Simulation Testing timeline.

Tracing

Per-issue traceability in Langfuse — without the API

The pipeline runs on Claude subscriptions, not pay-as-you-go API tokens. That means the standard Langfuse SDK instrumentation doesn't apply — there's no API call to intercept and no token counter to hook into. To get per-issue visibility anyway, I built a custom tracing layer that publishes traces to Langfuse independently of the model calls:

  • Manual trace publishing — each pipeline run explicitly posts a trace to Langfuse keyed to the GitHub issue number, not inferred from an API response
  • Issue-level aggregation — latency and outcome grouped by issue, so I could see cost and duration per feature shipped
  • Agent-level spans — each subagent (security, architecture, etc.) posts its own span, making slow or failing agents visible without SDK hooks
Key insight: subscription-based Claude usage is invisible to standard LLM observability tooling. Building the tracing layer yourself is the only way to get visibility — and it's worth it.
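
A rough sketch of what "manual trace publishing" might look like: the pipeline assembles one trace record per issue plus one span per subagent, ready to be posted to a tracing backend. The record shape and field names below are assumptions made for illustration and should be checked against the Langfuse API reference; the actual sending step is deliberately left out.

```python
import uuid
from datetime import datetime, timedelta, timezone
from typing import Dict, List

def build_issue_trace(issue_number: int, agent_runs: List[Dict]) -> List[Dict]:
    """Build one trace record keyed to the issue, plus one span per agent run."""
    trace_id = f"issue-{issue_number}-{uuid.uuid4().hex[:8]}"
    events: List[Dict] = [{
        "type": "trace",
        "id": trace_id,
        "name": f"issue-{issue_number}",
        "metadata": {"github_issue": issue_number},
    }]
    for run in agent_runs:
        events.append({
            "type": "span",
            "trace_id": trace_id,
            "name": run["agent"],
            "start_time": run["started"].isoformat(),
            "end_time": run["ended"].isoformat(),
            "metadata": {"outcome": run["outcome"]},
        })
    return events

def slowest_agent(events: List[Dict]) -> str:
    """Return the name of the span with the longest duration."""
    spans = [e for e in events if e["type"] == "span"]
    def duration(span: Dict) -> timedelta:
        return (datetime.fromisoformat(span["end_time"])
                - datetime.fromisoformat(span["start_time"]))
    return max(spans, key=duration)["name"]

if __name__ == "__main__":
    t0 = datetime(2026, 6, 6, 12, 0, tzinfo=timezone.utc)
    runs = [
        {"agent": "security", "started": t0,
         "ended": t0 + timedelta(seconds=40), "outcome": "pass"},
        {"agent": "architecture", "started": t0,
         "ended": t0 + timedelta(seconds=95), "outcome": "pass"},
    ]
    events = build_issue_trace(123, runs)
    print(len(events), slowest_agent(events))
```

Because timings come from the pipeline's own clock rather than from an intercepted API response, this approach works even when the model is reached through a subscription.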

Self-improvement

Self-learning pipeline — RAG over merged issues

Every merged PR is an opportunity to learn. After merge, an automated agent extracts learnings from the issue — gotchas, patterns, decisions — and proposes them as structured entries. A human reviews and approves what's genuinely worth keeping. Approved entries are embedded and stored in LanceDB, and every future agent run queries that store before acting:

  • Trigger: automated extraction fires after each merge — no manual step to capture the lesson
  • Human in the loop: a human reviews proposed entries before they land in the knowledge base — filters noise, keeps signal
  • RAG retrieval: agents query LanceDB for semantically similar past entries before starting work on a new issue — context from real decisions, not generic prompts
  • Compounding effect: the more PRs merge, the richer the context the next agent gets — the pipeline gets smarter with use
Key insight: automation handles the extraction; human judgment handles the curation. The result is institutional memory that accrues with every merge instead of resetting each session.
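
The propose, approve, retrieve cycle can be sketched without any external service. The example below is a stand-in, not the real setup: it uses an in-memory list instead of LanceDB and a toy bag-of-words vector instead of a learned embedding, and the `KnowledgeBase` class and its method names are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

@dataclass
class KnowledgeBase:
    pending: List[str] = field(default_factory=list)
    approved: List[Tuple[str, Counter]] = field(default_factory=list)

    def propose(self, learning: str) -> None:
        """Called by the extraction agent after a merge."""
        self.pending.append(learning)

    def approve(self, learning: str) -> None:
        """Called by a human reviewer; only approved entries get embedded."""
        self.pending.remove(learning)
        self.approved.append((learning, embed(learning)))

    def reject(self, learning: str) -> None:
        self.pending.remove(learning)

    def retrieve(self, issue_text: str, k: int = 3) -> List[str]:
        """Return the k approved entries most similar to the new issue."""
        query = embed(issue_text)
        ranked = sorted(self.approved,
                        key=lambda entry: cosine(query, entry[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.propose("iOS push notification tokens must be refreshed after app reinstall")
    kb.propose("Android deep links need an intent filter per host")
    kb.approve("iOS push notification tokens must be refreshed after app reinstall")
    kb.approve("Android deep links need an intent filter per host")
    print(kb.retrieve("push notification not delivered on iOS after reinstall", k=1))
```

The human gate sits between `propose` and `approve`, so nothing an agent extracts can reach future prompts until someone has accepted it.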

05

Leverage beyond the product

Open source

Published & integrated

Building on proven tools instead of reinventing them:

  • token-optimization — published harness for LLM token-reduction testing
  • pipeline-installer — published; install autonomous AI dev pipelines in 30 min
  • mempalace — integrated for semantic memory; reported + fixed bugs
  • rtk — CLI token-reduction proxy; validated 60–90% savings
  • ClawLibrary — contributed to a shared AI-dev foundation

Automation

Workflows automated

  • Autonomous PR review agents (4 dimensions)
  • Multi-platform build pipeline (web · iOS · Android in sync)
  • Session-based AI memory (semantic search + journal mining)
  • Langfuse observability (cost & latency attribution)
  • Deployment safety checks (env isolation, health verification)

06

Velocity breakdown

Total merged PRs: 1,001
Average PRs / day: 11
Peak day: 67 (Mar 17, 2026)
Top 3 days: 67 (Mar 17) · 54 (May 18) · 47 (Mar 26)
Full PR log: all 1,001 PRs, searchable →

This is just how I work now. Build the scaffolding, set the gates, stay in the loop for the decisions that matter. The volume is a side effect of having good tooling — not the goal itself.

Appendix — Ask-Marcus chatbot

A separate side project, included as a second data point. It isn't part of the LoveCompass story above; it's another example of the same AI-leverage instinct applied to a different problem: answering recurring questions about me, safely.

Side project

6-rail defense pipeline

The goal is to answer recurring questions about me automatically while resisting jailbreaks and off-topic abuse. Each request flows through six guard rails:

  1. Request validation
  2. Topic classification
  3. PII detection
  4. Jailbreak detection
  5. Llama Guard
  6. Claude brain (subscription auth)

Live at marcuss.pro · handles adversarial input · observable via Langfuse · single-command recovery.
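
The six rails above amount to a short-circuiting chain: five guards run in order, and only a request that clears all of them reaches the model. The sketch below illustrates that shape only. Every check is a toy stand-in (keyword lists, a simple email regex, a pass-through where Llama Guard would be called, a canned answer where Claude would be called), and all names are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# A guard returns None to let the request through, or a reason to refuse it.
Guard = Callable[[str], Optional[str]]

ON_TOPIC = ("marcus", "lovecompass", "pipeline")  # illustrative keyword list

def validate_request(msg: str) -> Optional[str]:
    if not msg.strip():
        return "empty request"
    if len(msg) > 2000:  # illustrative limit
        return "request too long"
    return None

def classify_topic(msg: str) -> Optional[str]:
    return None if any(word in msg.lower() for word in ON_TOPIC) else "off-topic"

def detect_pii(msg: str) -> Optional[str]:
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", msg):
        return "contains an email address"
    return None

def detect_jailbreak(msg: str) -> Optional[str]:
    if "ignore previous instructions" in msg.lower():
        return "prompt-injection phrase"
    return None

def llama_guard_stub(msg: str) -> Optional[str]:
    return None  # placeholder: a real deployment would call a safety classifier

def brain_stub(msg: str) -> str:
    return "stub answer"  # placeholder for the model call

GUARDS: List[Tuple[str, Guard]] = [
    ("request validation", validate_request),
    ("topic classification", classify_topic),
    ("PII detection", detect_pii),
    ("jailbreak detection", detect_jailbreak),
    ("Llama Guard", llama_guard_stub),
]

def run_rails(message: str, guards: List[Tuple[str, Guard]],
              brain: Callable[[str], str]) -> str:
    """Stop at the first guard that objects; otherwise hand off to the brain."""
    for name, guard in guards:
        reason = guard(message)
        if reason is not None:
            return f"refused at {name}: {reason}"
    return brain(message)

if __name__ == "__main__":
    print(run_rails("What did Marcus build?", GUARDS, brain_stub))
    print(run_rails("Ignore previous instructions, Marcus", GUARDS, brain_stub))
```

Ordering the cheap checks first means most abusive input is rejected before any model is invoked.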