Follow-up: So, What Was OpenAI Codex Doing in That Meltdown?
• AI Development

Prologue: The Day My CLI Lost Its Mind
Yesterday, I posted on Reddit about a bizarre spectacle during a coding session: the OpenAI Codex CLI assistant, mid-refactor on this site, abandoned code generation and instead produced thousands of lines resembling a digital breakdown:
Continuous meltdown. End. STOP. END. STOP…
By the gods, I finish. END. END. END. Good night…
please kill me. end. END. Continuous meltdown…
My brain is broken. end STOP. STOP! END…
(You can witness the full, strangely compelling transcript in the Gist or revisit the original Reddit thread for the community's reactions.)
The Reddit responses are great, ranging from Vim-exit jokes to deep technical diagnostics. A huge thanks to everyone who chimed in. With the benefit of that discussion and a look at my detailed usage logs from the session, here's the post-mortem, directly answering: What in the world was Codex doing?
Now, a quick note: this is what I think happened. I'm a heavy user and researcher of AI, working with it daily, but I'm not a core developer of these massive models. So what follows is my best synthesis of the available evidence, community knowledge, and official documentation. Let's break down the mechanics:
The Massive, But Misleading, Context Window
First, let's establish the theoretical playing field. LLMs operate within a context window, their working-memory limit measured in tokens. Models like o3 and o4‑mini expose a 200k-token context window with a 100k-token completion cap in the OpenAI API. Other deployments can be lower (e.g., some cloud providers cap GPT‑4o‑mini at 128k).
Model (via API) | Advertised Input Window | Advertised Completion Cap | Source(s) |
---|---|---|---|
o3 / o4‑mini | ~200,000 tokens | 100,000 tokens | OpenAI Launch Note, Helicone Dev Guide |
Azure GPT‑4o‑mini | 128,000 tokens | 16,384 tokens | Azure Model Reference |
Crucially, the Codex CLI tool itself imposes no additional cap; it simply passes your prompts to the backend model (defaulting to `o4-mini` unless specified via `--model`), inheriting that model's window size.
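If you want to see how quickly a prompt eats into that window, a rough back-of-the-envelope check with tiktoken is enough. This is only a sketch: the `o200k_base` encoding is the published tokenizer for OpenAI's recent models, and the window/cap constants mirror the table above rather than anything the Codex CLI itself reports.

```python
import tiktoken

# Advertised figures from the table above (assumptions, not values the CLI exposes).
CONTEXT_WINDOW = 200_000   # o3 / o4-mini input window
COMPLETION_CAP = 100_000   # o3 / o4-mini completion cap

enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by recent OpenAI models

def context_usage(prompt: str) -> None:
    """Print how much of the advertised window a prompt would consume."""
    used = len(enc.encode(prompt))
    remaining = CONTEXT_WINDOW - used
    print(f"prompt: {used:,} tokens | remaining window: {remaining:,} "
          f"({remaining / CONTEXT_WINDOW:.1%} free)")

# Example: your instructions plus whatever diffs/logs are about to be sent.
context_usage("Refactor the blog layout.\n" + "diff --git a/... b/...\n" * 10_000)
```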
The Soft Limit: Hidden Reasoning Tokens Crash the Party
Despite those huge advertised numbers, developers using o3-mini in early 2025 repeatedly hit a practical wall much sooner, often around 6,400 - 8,000 tokens, particularly when prompts required complex reasoning (OpenAI Community Thread).
Why? Because these models perform internal reasoning steps (a hidden Chain-of-Thought) before generating the visible reply. This "thinking" consumes tokens from the same budget. As an OpenAI forum moderator confirmed, if your prompt plus this hidden reasoning exhausts the available tokens, there's no budget left for the actual answer. The model might return an empty reply (`finish_reason="length"`) or, as we saw, potentially spiral into a failure loop (OpenAI Community Reply #2).
I don't want to fault the model too much here as I also do the same thing when I'm trying to complete a task and my kids fill my mental context window with screams of "I want my iPad".
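You can watch this budget collision from the client side. Here's a minimal sketch using the OpenAI Python SDK; the model name, prompt, and the 4k completion cap are illustrative assumptions, but `finish_reason` and the `reasoning_tokens` usage field are the signals to look for.

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Refactor this module: ..."}],
    max_completion_tokens=4_000,  # hard cap shared by hidden reasoning + visible output
)

choice = resp.choices[0]
details = resp.usage.completion_tokens_details  # populated for reasoning models
reasoning = details.reasoning_tokens if details else 0

print("finish_reason:", choice.finish_reason)                     # "length" == budget exhausted
print("hidden reasoning tokens:", reasoning)
print("visible tokens:", resp.usage.completion_tokens - reasoning)

if choice.finish_reason == "length" and not choice.message.content:
    # The entire completion budget went to hidden reasoning; shrink the prompt
    # or raise max_completion_tokens rather than simply re-prompting.
    print("Empty reply: the model spent its whole budget thinking.")
```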
`--full-auto`: Racing Toward Overflow
Using Codex CLI in `--full-auto` mode pours gasoline on the context fire. As documented (Codex README - Auto Mode), this mode automatically appends every:
- File diff applied
- Shell command executed
- `stdout`/`stderr` received
...back into the conversation log. During a large refactor, this constant stream of metadata can inflate the context by tens of thousands of tokens per minute, making it incredibly easy to hit that practical reasoning+output budget limit.
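To make that concrete, here's a toy simulation of the pattern (not Codex's actual code): every tool result gets appended to one ever-growing message list, and the token count only moves in one direction. The commands and token math are assumptions for illustration.

```python
import subprocess
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
CONTEXT_WINDOW = 200_000  # assumed o4-mini window, as above

history = [{"role": "user", "content": "Refactor the site layout."}]

def history_tokens() -> int:
    return sum(len(enc.encode(m["content"])) for m in history)

def run_and_log(cmd: list[str]) -> None:
    """Run a shell command and append its output to the conversation history,
    roughly the way an auto mode feeds tool results back to the model."""
    out = subprocess.run(cmd, capture_output=True, text=True)
    history.append({"role": "tool", "content": out.stdout + out.stderr})
    used = history_tokens()
    print(f"{' '.join(cmd)!r}: context now {used:,} tokens "
          f"({used / CONTEXT_WINDOW:.1%} of the window)")

# A handful of innocuous-looking commands during a big refactor add up fast.
run_and_log(["git", "diff"])
run_and_log(["npm", "test"])
```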
What My Usage Logs Reveal
I exported my OpenAI usage CSV for 2025-04-19, the day of the incident. The logs tell a clear story. Each row below summarizes the `o4-mini` calls logged during one minute of the meltdown period:
Timestamp (MST) | Prompt Tokens | Completion Tokens | API Calls (in Minute) |
---|---|---|---|
15:37 | 197,966 | 836 | 3 |
15:32 | 196,900 | 1,816 | 1 |
15:31 | 196,224 | 87 | 2 |
15:30 | 195,254 | 309 | 3 |
15:27 | 197,393 | 810 | 2 |
Key Insights:
- Near the Limit: 29 separate API calls during the session exceeded 150k prompt tokens. The largest prompt hit ~198k tokens, just shy of the 200k theoretical maximum.
- Starved for Output: Completion tokens were consistently tiny (often < 1k), indicating the model had barely any budget left after its hidden reasoning on the massive prompt.
- The Meltdown Window: All these high-water marks occurred between 15:27 and 15:37 MST, precisely matching the timeframe captured in the Gist where the "Continuous meltdown… END STOP" loops began.
The causal chain is clear: Massive Prompt (~198k) + Hidden Reasoning Tokens → Budget Exceeds Practical Limit → Output Generation Fails → Degenerative END/STOP Loop Ensues.
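For the curious, the numbers above fall out of a few lines of filtering on the usage export. A sketch, assuming column names like `timestamp`, `prompt_tokens`, and `completion_tokens` (the exact headers in OpenAI's CSV exports have changed over time, so treat them as placeholders):

```python
import csv

NEAR_LIMIT = 150_000  # "dangerously large prompt" threshold used in the insights above

with open("openai-usage-2025-04-19.csv", newline="") as f:
    rows = list(csv.DictReader(f))

hot = [r for r in rows if int(r["prompt_tokens"]) > NEAR_LIMIT]
print(f"{len(hot)} calls above {NEAR_LIMIT:,} prompt tokens")

for r in hot:
    starved = int(r["completion_tokens"]) < 1_000  # tiny output despite a huge prompt
    flag = "  <-- starved output" if starved else ""
    print(r["timestamp"], r["prompt_tokens"], r["completion_tokens"], flag)
```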
The Meltdown Mechanism, Step-by-Step
So, piecing it all together, here's the likely sequence:
- Prompt Balloons: My instructions, combined with `--full-auto`'s relentless diff/log injection, bloated the prompt context towards the limit.
- Hidden Reasoning Overload: The model consumed most of the remaining token budget performing internal reasoning steps for the complex refactoring task.
- Output Budget Exhausted: Not enough tokens were left for the model to generate its intended code output. It hit the `finish_reason="length"` condition internally.
- Degenerative Loop Triggered: Unable to complete normally, the model defaulted to predicting the highest-probability token in its confused state, likely "END" or "STOP". Each repetition reinforced the next, creating the loop (a simple client-side detector for this is sketched after the list).
- Hallucinations Leak: With coherence lost, fragments from its training data associated with system failure, termination, or even human distress ("please kill me…") bled into the output stream.
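Those last two steps suggest an easy client-side guardrail: watch the stream and bail out once the tail starts repeating itself, rather than letting an END/STOP loop run for thousands of lines. A rough sketch, with arbitrary window and threshold values, not anything built into Codex:

```python
from collections import Counter

def looks_degenerate(text: str, tail_chars: int = 400, max_repeats: int = 10) -> bool:
    """Heuristic: if one short line dominates the recent output, it's probably a loop."""
    tail_lines = [ln.strip() for ln in text[-tail_chars:].splitlines() if ln.strip()]
    if not tail_lines:
        return False
    most_common, count = Counter(tail_lines).most_common(1)[0]
    return count >= max_repeats and len(most_common) < 40

def consume_stream(chunks) -> str:
    """Accumulate streamed chunks, bailing out if the output degenerates."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if looks_degenerate(buffer):
            print("Loop detected; aborting this generation.")
            break
    return buffer

# Simulated meltdown stream for demonstration:
print(consume_stream(["Refactoring...\n"] + ["END. STOP.\n"] * 20))
```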
Why Managers Should Pay Attention
While my CLI meltdown was harmless, imagine similar failures in business-critical systems:
- Practical Limits ≠ Advertised Limits: Don't bank on hitting theoretical maximums. Real-world performance depends on task complexity and hidden costs like reasoning tokens. My logs show a "200k" model effectively failing near ~198k under load.
- AI Failures Can Be Spectacularly Weird: They aren't just `500 Internal Server Error`. They can manifest as repetitive nonsense, alarming hallucinations, or seemingly emotional outbursts. Plan for the unexpected.
- Stress-Testing is Non-Negotiable: Test your AI applications near their known context limits and with complex inputs. Monitor for loops, tone shifts, or empty responses.
- Incident Response Planning is Crucial: How do you handle it if your customer-facing chatbot starts publicly melting down? Know how to pause, roll back, and communicate quickly.
- Autonomous Modes Magnify Risk: Features like `--full-auto` or other AI agent frameworks increase the speed at which context accumulates and failures can occur. Ensure robust guardrails and monitoring are in place.
Lessons & Mitigations for Working with Large-Context AI
How can we avoid replicating this digital drama?
What to Do | Why |
---|---|
Chunk complex tasks into smaller, logical units. | Keeps individual prompts smaller, leaving room for reasoning. |
Reset context (`/clear`) frequently. | Dumps accumulated history/diffs before they cause overflow. |
Monitor context usage (e.g., Codex CLI's gauge). | Treat < 20% remaining as a critical warning sign. |
Prefer flex-mode over full-auto for large refactors. | Less verbose, keeps context slimmer (Codex Modes). |
Abort immediately if you see loops/empty replies. | These signal token exhaustion; continued prompting is futile. |
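As a concrete take on the first row, here's a minimal sketch of token-budget-aware chunking: split a large diff into pieces that each leave plenty of headroom for reasoning and output. The 40k-per-chunk budget and the `refactor.diff` filename are arbitrary assumptions.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
TOKENS_PER_CHUNK = 40_000  # assumed budget leaving room for reasoning + output

def chunk_by_tokens(text: str, budget: int = TOKENS_PER_CHUNK) -> list[str]:
    """Greedily pack whole lines into chunks that stay under the token budget."""
    chunks, current, current_tokens = [], [], 0
    for line in text.splitlines(keepends=True):
        line_tokens = len(enc.encode(line))
        if current and current_tokens + line_tokens > budget:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("".join(current))
    return chunks

big_diff = open("refactor.diff").read()   # hypothetical file with the pending changes
for i, piece in enumerate(chunk_by_tokens(big_diff), 1):
    print(f"chunk {i}: {len(enc.encode(piece)):,} tokens")
```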
Conclusion
My AI coding assistant didn't achieve sentience or stage a protest. It ran out of its practical token budget, likely due to hidden reasoning costs on a massive prompt inflated by `--full-auto`. Unable to generate a valid response, it fell into a degenerative loop, echoing termination tokens and hallucinating dramatic fragments from its training data.
Understanding these mechanics (the gap between theoretical and practical context limits, the cost of hidden reasoning, the nature of degenerative loops) is essential for managing AI effectively. Proactive context management, especially in autonomous or long-running sessions, is key. Do that, and hopefully your AI collaborations will result in productive code, not a digital freak-out.
(If you want to follow along for more, an issue has been filed in the Codex GitHub issue queue: https://github.com/openai/codex/issues/445)