Another Coding Blog

Cache-Aware Skill Design

Taylor Ortiz — Mon, 08 Jun 2026 15:30:19 GMT

Prompt caching is often described as a cost optimization.

If a model provider sees the same input tokens again, those repeated tokens may be processed at a lower cost. That description is accurate, but incomplete.

OpenAI’s prompt caching docs describe cache hits as exact prefix reuse and recommend putting static content at the beginning of the prompt, with variable content near the end. A cache hit means the model server has recently processed the same prompt prefix during inference, so it can reuse stored model state for that matching portion of the input.

That detail matters for agent design.

Agents routinely send repeated context across turns and tasks: tool definitions, system prompts, Skill instructions, output contracts, examples, source-handling rules, conversation history, retrieved documents, and tool results.

Some of that context is stable. Some of it changes on every run.

Prompt caching can reward systems that separate those two categories cleanly.

A Skill with stable instructions, examples, and output rules can become a reusable prompt prefix. A Skill that places timestamps, run IDs, retrieved documents, or task-specific state before its stable instructions may reduce the opportunity for cache reuse.

The practical implication is straightforward:

A well-designed Skill does more than tell the model what to do. It also gives the model server a stable structure it can reuse.

This is why prompt caching should not be treated only as a pricing feature. For agents and Skills, prompt structure can become part of system architecture.

Cached Tokens Mean Reused Computation

The phrase “cached tokens” can make prompt caching sound like text storage.

That framing misses the mechanism.

The model server is not caching a response. It is checking whether the new request begins with a prefix it has already processed. When the prefix matches, the server can reuse stored model state for that matching portion of the input.

The same OpenAI docs also recommend placing static content at the beginning of the prompt and variable content near the end.

That recommendation is the first design rule.

Stable material belongs early:

[system instructions]
[tool definitions]
[Skill instructions]
[output contract]
[examples]

Variable material belongs later:

[current task]
[retrieved documents]
[tool results]
[timestamps]
[run IDs]

Prompt caching starts with prefix alignment. If the new request begins with the same token pattern, the serving layer can reuse cached state. If the beginning changes, the reusable prefix can collapse, even when later parts of the prompt look familiar.

The important word is prefix.

Prompt caching does not usually search the whole prompt for similar meaning. It does not see that two prompts both mention the same document, paragraph, or phrase and automatically reuse that work wherever it appears. The cached state depends on the exact token sequence, its order, and where that sequence appears in the prompt.
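To make the prefix rule concrete, here is a minimal sketch that counts how many leading tokens two prompts share. It assumes whitespace splitting as a stand-in for a real tokenizer, and `shared_prefix_len` is a hypothetical helper, not a provider API.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting stands in for a real tokenizer (assumption).
skill = "You are a research assistant. Cite every source.".split()
task_a = "Task: summarize document A".split()
task_b = "Task: summarize document B".split()

# Stable-first layout: the Skill body opens both prompts.
stable_shared = shared_prefix_len(skill + task_a, skill + task_b)

# Dynamic-first layout: a changing run ID opens each prompt.
dynamic_shared = shared_prefix_len(
    ["run-id-123"] + skill + task_a,
    ["run-id-456"] + skill + task_b,
)
```

In this toy case the stable-first prompts share their first 11 tokens, while the dynamic-first prompts share none, even though the Skill text is identical in all four.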

That makes small layout choices matter.

A timestamp at the top of the prompt can change the prefix.

A random run ID can change the prefix.

A retrieval system that inserts source chunks before the stable Skill body can change the prefix.

A tool description that includes dynamic runtime state can change the prefix.

Each of those choices may be reasonable in isolation. Together, they make the beginning of the prompt less stable. That reduces the amount of work the serving layer can reuse.

For agent systems, this is the practical consequence:

Prompt caching rewards stable beginnings.

The stable part of the agent should be early. The variable part should be later.

What Is Actually Being Cached?

Prompt caching is easier to understand if we separate four layers:

tokens
attention
KV cache
prompt cache

Tokens are the units the model processes. The prompt is not handled as raw prose; it is broken into tokens first.

Attention is the mechanism the model uses to relate those tokens to one another.

KV cache is the stored attention state created while the model processes tokens.

Prompt cache is the serving-layer feature that can reuse that stored state when a later request starts with the same prefix.

The confusing part is the word “key.”

In normal software, a cache usually has a key and a value:

cache[key] = value

Prompt caching has something like that too:

prompt_cache[hash(exact_token_prefix)] = stored_model_state

But the “K” in KV cache is not the hash used to look up a cached prompt prefix.

The original Transformer paper defines attention over queries, keys, and values. That is where the terminology comes from. In the KV cache, the K is an attention key and the V is an attention value. They are internal tensors created by the model during inference, not the lookup key and value of a normal software cache.

That distinction matters.

A simplified version looks like this:

Cache lookup key: 
hash(exact token prefix)  

Cached value: 
attention key tensors + attention value tensors

When we say prompt caching reuses KV cache, we are not saying the model is doing a database lookup where prompt text maps to an answer.

We are saying the serving layer can find a matching prompt prefix and reuse the key/value attention state the model already computed for that prefix.
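The split between the lookup key and the cached value can be sketched with a plain dictionary. This is a toy model under stated assumptions: `prefix_key` and `fake_kv_state` are hypothetical names, token IDs are made up, and the stored "state" is a labelled placeholder rather than real tensors.

```python
import hashlib


def prefix_key(tokens: list[int]) -> str:
    """Lookup key: a hash of the exact token prefix, order included."""
    raw = ",".join(str(t) for t in tokens).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def fake_kv_state(tokens: list[int]) -> dict:
    """Placeholder for the attention key and value tensors (assumption)."""
    return {
        "attention_keys": f"K for {len(tokens)} tokens",
        "attention_values": f"V for {len(tokens)} tokens",
    }


prompt_cache: dict[str, dict] = {}

prefix = [101, 7, 42, 9]
prompt_cache[prefix_key(prefix)] = fake_kv_state(prefix)

# Same tokens in the same order: the lookup key matches.
hit = prompt_cache.get(prefix_key([101, 7, 42, 9]))
# Same tokens in a different order: a different key, so no reuse.
miss = prompt_cache.get(prefix_key([7, 101, 42, 9]))
```

The dictionary key is the software-cache key. The attention keys live inside the stored value.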

Sebastian Raschka’s KV cache walkthrough gives a concrete inference example: as a model generates one token at a time, it can reuse previously computed key and value vectors instead of recomputing them at each step.

vLLM’s prefix-caching docs describe the cross-request version: processed requests leave behind KV-cache blocks, and later requests with the same prefix can reuse those blocks instead of recomputing them.

That is the bridge between the API feature and the model internals.

The API exposes the result as cached tokens. The serving system manages the cache. The model state being reused is tied to attention.

Why KV Cache Exists

KV cache exists because generation is sequential.

A model does not write an entire response at once. It generates one token, appends that token to the context, then generates the next token.

A simplified sequence looks like this:

Prompt:
Time

Step 1:
Time → flies

Step 2:
Time flies → fast

Step 3:
Time flies fast → .

At each step, the model needs access to the tokens that came before.

Without a KV cache, the model would repeatedly recompute the attention keys and values for tokens it had already processed.

Step 1: 
compute K/V for “Time”  

Step 2: 
compute K/V for “Time” again 
compute K/V for “flies”  

Step 3: 
compute K/V for “Time” again 
compute K/V for “flies” again 
compute K/V for “fast”

That is wasted work.

With KV cache, the earlier tokens do not need to be recomputed every time.

Step 1: 
compute K/V for “Time”
store it  

Step 2: 
reuse K/V for “Time”
compute K/V for “flies” 
store it  

Step 3: 
reuse K/V for “Time” 
reuse K/V for “flies” 
compute K/V for “fast” 
store it

This is the basic inference-time benefit of KV cache.

It does not make the model smarter. It does not change the answer. It reduces repeated computation.
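The saving in the three-step example can be counted directly. This sketch only tallies per-token K/V computations; it does not run attention, and `kv_computations` is a hypothetical helper.

```python
def kv_computations(num_steps: int, use_cache: bool) -> int:
    """Count per-token K/V computations over `num_steps` generation steps.

    At step i there are i tokens in context. Without a cache, all i are
    recomputed. With a cache, only the newest token is computed.
    """
    total = 0
    for step in range(1, num_steps + 1):
        total += 1 if use_cache else step
    return total


without_cache = kv_computations(3, use_cache=False)  # 1 + 2 + 3
with_cache = kv_computations(3, use_cache=True)      # 1 + 1 + 1

long_without = kv_computations(1000, use_cache=False)
long_with = kv_computations(1000, use_cache=True)
```

For three steps the counts are 6 versus 3. The gap grows quickly: in this counting model, 1,000 steps cost 500,500 computations without a cache and 1,000 with one.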

Sebastian Raschka’s KV cache walkthrough gives the clean version of this example: during autoregressive generation, the model would otherwise recompute key and value vectors for earlier tokens at each step.

Prompt caching extends this idea across requests.

Within one response, KV cache lets the model reuse state from earlier tokens in the same generation.

Across requests, prompt caching lets the serving layer reuse state from a previous request when a new request starts with the same prefix.

That is the bridge we need for agents.

Prompt Caching Extends KV Reuse Across Requests

KV cache usually starts inside a single generation.

The model processes a prompt, creates key/value attention state, and reuses that state as it generates the next token, then the next token, then the next.

Prompt caching moves the reuse boundary.

Instead of reusing prior token state only inside one response, the serving layer can reuse state from a previous request when a new request starts with the same prefix.

Request 1:

[stable Skill instructions] 
[stable output contract] 
[stable examples] 
[current task A]

Request 2:

[stable Skill instructions] 
[stable output contract] 
[stable examples] 
[current task B]

The beginning is the same.

The model server does not need to process that shared prefix as if it were new every time. It can reuse the state computed when it processed the earlier request, then continue from the new suffix.

That is the practical bridge between KV cache and prompt caching.

Within one response:
reuse prior token state from the same generation

Across requests:
reuse prior prefix state from an earlier request

The API usually hides the details. You see the result as cached input tokens, lower cached-token pricing, or lower latency when the cache hit affects the prefill path.

The implementation underneath is still about model state.

vLLM’s prefix-caching docs describe this directly: the system caches KV-cache blocks from processed requests and reuses those blocks when a new request has the same prefix.

That is why prompt caching is not only a billing abstraction. It is an inference-serving optimization exposed through the API.
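One way to observe this from the client side is to compute the cached share of prompt tokens from the usage data a response returns. The field names below follow the shape of the usage object in OpenAI's Chat Completions responses (`prompt_tokens` and `prompt_tokens_details.cached_tokens`); treat them as an assumption to verify, since other providers report this differently. The sample numbers are illustrative, not measured.

```python
def cached_ratio(usage: dict) -> float:
    """Fraction of prompt tokens the provider reported as cached."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Illustrative usage payload, not real API output.
sample_usage = {
    "prompt_tokens": 2000,
    "prompt_tokens_details": {"cached_tokens": 1536},
}
ratio = cached_ratio(sample_usage)

# A response with no cache details should report zero, not crash.
empty_ratio = cached_ratio({"prompt_tokens": 500})
```

Logging this ratio per request makes it visible when a layout change quietly breaks the shared prefix.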

Why Prefix Order Matters

Prompt caching is strict about order.

The cache does not look for familiar words scattered throughout the prompt. It looks for a matching beginning.

That means layout, not just content, decides what can be reused. Start with two requests that share a stable beginning:

Request 1: 
[stable Skill instructions]
[dynamic source documents] 
[current task]  

Request 2: 
[stable Skill instructions] 
[different source documents] 
[different task]

In both requests, the stable Skill instructions come first. The shared prefix is intact.

Now compare that with this layout:

Request 1: 
[timestamp A] 
[dynamic source documents A] 
[current task A] 
[stable Skill instructions]  

Request 2: 
[timestamp B] 
[dynamic source documents B] 
[current task B] 
[stable Skill instructions]

The stable Skill instructions are still present, but they are no longer the beginning of the prompt.

The prefix changed before the reusable material appeared.

That is the failure mode.

This is why a small amount of dynamic text at the top of the prompt can matter. A timestamp, run ID, tool result, or changing retrieval block can move the entire request out of alignment.

The model server may still receive the same Skill body later in the prompt. But for prefix caching, later is often too late.

OpenAI’s prompt caching docs make the design rule explicit: static content should go near the beginning of the prompt, and variable content should go near the end.

For agents, that becomes a concrete layout rule:

Stable first. 
Dynamic second.

That rule is simple, but it changes how an agent harness should be written.

Tool definitions, system instructions, Skill bodies, output contracts, examples, and validation rules should be stable and early.

Retrieved documents, timestamps, run IDs, tool outputs, and task-specific state should be later.

The serving layer can only reuse the prefix you actually give it.
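A harness can enforce the layout rule mechanically. The sketch below assumes each prompt section is tagged as stable or dynamic by whoever writes it; `Section` and `assemble_prompt` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    text: str
    stable: bool  # True if the text is identical on every run


def assemble_prompt(sections: list[Section]) -> str:
    """Place stable sections first and dynamic ones after, keeping relative order."""
    stable = [s.text for s in sections if s.stable]
    dynamic = [s.text for s in sections if not s.stable]
    return "\n\n".join(stable + dynamic)


sections = [
    Section("timestamp", "Now: 2026-06-08T15:30:19Z", stable=False),
    Section("skill", "SKILL: summarize sources and cite them.", stable=True),
    Section("task", "Task: summarize the attached report.", stable=False),
    Section("contract", "OUTPUT: return three bullet points.", stable=True),
]
prompt = assemble_prompt(sections)
```

Even though the timestamp was listed first, the assembled prompt opens with the Skill text and the output contract, and the timestamp and task land in the suffix.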

Methods for Prefix Caching

The simple version of prompt caching is easy to say:

same prefix → reuse cached state

The implementation is more complicated.

A serving system has to answer several questions before reuse can happen:

How do we identify a matching prefix?
How do we store the KV state?
How do we route future requests back to the right cache? 
What happens when memory fills? 
Can we reuse anything beyond the prefix?

There is a family of methods to solve this.

Exact-prefix reuse

This is the basic case.

Two requests start with the same token sequence. The serving layer identifies the shared beginning and reuses cached state for that prefix.

Request 1:
[stable system prompt]
[stable Skill]
[task A]

Request 2:
[stable system prompt]
[stable Skill]
[task B]

The shared prefix is:

[stable system prompt][stable Skill]

That is the part the system can reuse.

This is the model most API users need to understand first. If the beginning changes, the cacheable prefix shrinks or disappears.

Block-hash prefix caching

A serving system does not need to treat the prompt as one giant cache entry.

It can split the prompt into blocks.

Block 1: tokens 1-128
Block 2: tokens 129-256
Block 3: tokens 257-384

Each block can be associated with a hash. The hash can include both the block itself and the prefix that came before it.

That lets the system find the longest matching chain.

Block 1 matches
Block 2 matches
Block 3 changes

In that case, the server can reuse blocks 1 and 2, then recompute from block 3 onward.

This is why order matters. A block is not only “these tokens.” It is “these tokens after this prior prefix.”
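The chained-hash idea can be sketched in a few lines. This is an illustration of the general technique, not any engine's actual code: the block size is tiny for readability, SHA-256 is an arbitrary choice, and `block_hashes` and `reusable_blocks` are hypothetical helpers.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use larger blocks


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # skip a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached: set[str], tokens: list[int]) -> int:
    """Length of the longest chain of leading blocks already in the cache."""
    count = 0
    for h in block_hashes(tokens):
        if h not in cached:
            break
        count += 1
    return count


request_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
request_2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 99, 12]  # change inside block 3
request_3 = [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # change inside block 1

cached = set(block_hashes(request_1))
hits_2 = reusable_blocks(cached, request_2)
hits_3 = reusable_blocks(cached, request_3)
```

Request 2 reuses two blocks. Request 3 reuses none: its second and third blocks contain the same tokens as request 1, but their hashes differ because the block before them changed.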

vLLM’s prefix-caching docs describe this kind of design: each KV-cache block is identified by a hash that covers both the block’s tokens and the prefix before it, so later requests with the same prefix can reuse those blocks instead of recomputing them.

Paged KV cache

KV cache can get large.

Long prompts create large key/value state. Long-running agents create even more. Multiple concurrent users make the problem worse.

Paged KV cache treats cached state more like memory pages than one continuous allocation.

That matters because the serving system needs to allocate, reuse, share, and evict KV state efficiently. Without that, memory fragmentation and wasted GPU memory can become bottlenecks.

For a builder, the main point is simple:

Prompt caching is not only a matching problem. 
It is also a memory-management problem.
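The memory side can be sketched as a toy page allocator in which sequences that share a prefix point at the same pages. This shows the general idea of paged KV memory only; `PagedKVPool` is a hypothetical class, page counts are made up, and no real engine is being reproduced.

```python
class PagedKVPool:
    """Toy page allocator: fixed-size pages, shared by reference count."""

    def __init__(self, num_pages: int) -> None:
        self.free = list(range(num_pages))
        self.refcount: dict[int, int] = {}

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free):
            raise MemoryError("not enough free pages")
        pages = [self.free.pop() for _ in range(n)]
        for p in pages:
            self.refcount[p] = 1
        return pages

    def share(self, pages: list[int]) -> list[int]:
        """Let another sequence point at existing pages instead of copying."""
        for p in pages:
            self.refcount[p] += 1
        return list(pages)

    def release(self, pages: list[int]) -> None:
        for p in pages:
            self.refcount[p] -= 1
            if self.refcount[p] == 0:
                del self.refcount[p]
                self.free.append(p)


pool = PagedKVPool(num_pages=8)
prefix_pages = pool.allocate(3)                      # cached Skill prefix
seq_a = pool.share(prefix_pages) + pool.allocate(1)  # prefix + task A suffix
seq_b = pool.share(prefix_pages) + pool.allocate(1)  # prefix + task B suffix
free_while_running = len(pool.free)

pool.release(seq_a)  # task A finishes; only its suffix page is freed
free_after_a = len(pool.free)
```

With sharing, the two sequences occupy five pages in total. Copying the prefix for each would have taken all eight.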

Prefix trees and radix caching

Some workloads share a common root and then branch.

Agents do this constantly.

shared agent harness   
├── research Skill   
│    ├── task A   
│    └── task B   
└── coding Skill        
     ├── task C    
     └── task D

A prefix tree stores shared beginnings once, then branches when the prompts diverge.

SGLang’s RadixAttention uses this kind of idea. It organizes reusable prompt state in a radix tree so shared prefixes can be found, reused, inserted, and evicted more efficiently.

This maps well to agent systems because agents are not random one-off prompts. They often reuse the same harness, then branch by Skill, task, tool, or phase.
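The branching structure above can be sketched as a plain prefix tree with one node per prompt segment. This is a simplification: a radix tree additionally compresses chains of single-child nodes, real systems key on tokens rather than whole segments, and `PrefixTree` is a hypothetical class, not SGLang's implementation.

```python
class PrefixNode:
    def __init__(self) -> None:
        self.children: dict[str, "PrefixNode"] = {}


class PrefixTree:
    """Toy prefix tree over prompt segments (one node per segment)."""

    def __init__(self) -> None:
        self.root = PrefixNode()
        self.node_count = 0

    def insert(self, segments: list[str]) -> int:
        """Insert a prompt and return how many leading segments were already stored."""
        node = self.root
        reused = 0
        for seg in segments:
            if seg in node.children:
                reused += 1
            else:
                # A new node has no children, so everything after it is new too.
                node.children[seg] = PrefixNode()
                self.node_count += 1
            node = node.children[seg]
        return reused


tree = PrefixTree()
r1 = tree.insert(["harness", "research Skill", "task A"])
r2 = tree.insert(["harness", "research Skill", "task B"])
r3 = tree.insert(["harness", "coding Skill", "task C"])
r4 = tree.insert(["harness", "coding Skill", "task D"])
```

Four prompts of three segments each end up as seven stored nodes rather than twelve, because the harness and each Skill are stored once.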

Cache-aware routing

A cache hit only helps if the request reaches the place where the cached state lives.

In a distributed serving system, there may be many workers. One worker may have the cached prefix. Another may not.

If the next request lands on the wrong worker, the system may have to recompute the prefix or move cache state across machines.

That is why routing matters.

Application design gives the serving layer stable prefixes. Routing decides whether later requests reach the cache that already holds them.
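A minimal sketch of the routing decision: send the request to the worker whose cached prefixes overlap most with it. The worker names and token IDs are made up, `pick_worker` is a hypothetical function, and a real router would also weigh load and queue depth, which this ignores.

```python
from typing import Optional


def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(
    request: list[int], worker_prefixes: dict[str, list[list[int]]]
) -> Optional[str]:
    """Pick the worker whose cached prefixes overlap most with the request."""
    best_worker, best_len = None, -1
    for worker in sorted(worker_prefixes):  # sorted so ties resolve deterministically
        longest = max(
            (shared_prefix_len(request, p) for p in worker_prefixes[worker]),
            default=0,
        )
        if longest > best_len:
            best_worker, best_len = worker, longest
    return best_worker


workers = {
    "worker-a": [[1, 2, 3, 4]],  # holds the full Skill prefix
    "worker-b": [[1, 2, 9]],     # holds a different branch
    "worker-c": [],              # cold
}
chosen = pick_worker([1, 2, 3, 4, 5, 6], workers)
```

Here the request goes to `worker-a`, which already holds four matching tokens. Landing on `worker-c` would mean recomputing the whole prefix.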

Cache eviction

Caches cannot keep everything forever.

KV cache consumes memory, and GPU memory is expensive. The serving layer has to decide what to keep and what to evict.

Simple eviction policies may keep recent cache entries and discard older ones. More advanced policies may consider which prefixes are likely to be reused, how large they are, and how expensive they are to recompute.

This matters for agents because not all prompt sections have equal reuse value.

A stable Skill body may be reused thousands of times.

A one-off tool result may never be reused.

A cache-aware system should prefer keeping the first kind.
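The simple recency-based policy can be sketched with an ordered dictionary. This is a minimal least-recently-used cache with hypothetical names and string placeholders for KV state. It uses recency as a rough proxy for reuse value and does not model the size-aware or cost-aware policies mentioned above.

```python
from collections import OrderedDict


class LRUPrefixCache:
    """Toy least-recently-used cache for prefix state."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()  # prefix key -> placeholder KV state

    def get(self, key: str):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key: str, state: str) -> None:
        self.entries[key] = state
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


cache = LRUPrefixCache(capacity=2)
cache.put("skill-body", "kv-1")
cache.put("tool-result-1", "kv-2")
cache.get("skill-body")              # reused, so it becomes most recent
cache.put("tool-result-2", "kv-3")   # over capacity: evicts tool-result-1

remaining = list(cache.entries)
```

Because the Skill body was touched again before the cache filled, the one-off tool result is the entry that gets evicted.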

Beyond-prefix reuse

Most production prompt caching is built around exact prefixes.

But agent workloads are messier than that.

The same document chunk may appear in different positions. The same source may be reused across turns. The same tool result may show up in another branch of the workflow.

Classic prefix caching will not always catch that.

Newer work is exploring whether reusable KV state can be recovered from repeated segments, not just repeated beginnings. That is a harder problem because the model state for a segment depends on what came before it.

For now, the practical rule remains:

Design for prefix caching first.

Put stable content at the beginning. Keep it stable. Move dynamic context later.

The serving systems will keep getting better. But the builder can already do the most important thing:

Give the cache a stable prefix to reuse.

Skills as Cacheable Instruction Modules

In this post, a Skill means a reusable instruction module.

That could be a Claude Skill. It could be a Markdown file in an agent repo. It could be a prompt module loaded by Codex, a tool-specific operating procedure, or a workflow template inside an internal agent platform.

In most agent systems, a Skill eventually becomes text in the request:

[Skill purpose] 
[when to use it] 
[workflow] 
[output contract] 
[examples] 
[validation rules]

That text is usually stable.

The user task changes. The retrieved documents change. Tool results change. Run state changes.

But the Skill body often stays the same.

That makes Skills natural cache candidates.

A Skill is already meant to be reused at the instruction level. Prompt caching adds a second kind of reuse: the serving layer may be able to reuse the model state created from those same instructions.

That only works if the Skill is placed where the cache can use it.

A Skill loaded after dynamic context is still useful to the model, but it may not be useful to the prompt cache.

Cache-hostile layout: 
[dynamic docs] 
[current task] 
[Skill body]  

Cache-aware layout: 
[Skill body] 
[dynamic docs] 
[current task]

The content is the same. The cache behavior can be very different.

This is the design implication:

A Skill should not only be reusable as an instruction. 
It should be positioned as reusable prefix.

That does not mean every Skill should be loaded all the time. Large unused Skills create their own cost and context problems. Anthropic’s Skill system uses progressive disclosure: lightweight metadata helps the model decide whether a Skill is relevant, then the full Skill and supporting resources load only when needed.

That pattern still fits the caching argument.

Once a Skill is selected, its stable body should remain stable. Its dynamic inputs should come later.

Cache-Aware Skill Design

The design pattern is simple:

Cache the workflow. Vary the inputs.

A Skill usually contains the stable task frame:

purpose 
workflow 
output contract 
examples 
citation rules 
validation checklist 
source-handling rules

The current run supplies the changing inputs:

user request 
retrieved documents 
current files 
tool outputs 
timestamps 
run IDs 
temporary constraints

Those two categories should not be mixed casually.

A cache-aware Skill keeps the stable task frame intact and places dynamic material after it.

[stable Skill body] 
[dynamic sources] 
[current task input]

A cache-hostile Skill puts changing material first.

[timestamp] 
[run ID] 
[dynamic sources] 
[current task input] 
[stable Skill body]

This difference fundamentally changes what the model server sees as the reusable beginning of the request.

This does not mean every Skill should be loaded eagerly. Loading a large unused Skill just to make it cacheable can waste tokens. The better pattern is staged loading.

First, keep a small, stable routing layer:

available Skills 
when to use each Skill 
short descriptions
selection rules

Then, once a Skill is selected, load the full stable Skill body before the dynamic task context.

[stable router] 
[selected stable Skill] 
[dynamic task context]

That gives the system two possible layers of reuse:

the router can be stable across many calls 
the selected Skill can be stable across repeated uses
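The two layers of reuse can be shown with a small sketch that builds requests as ordered segments. The router text, Skill bodies, and function names are all hypothetical, and comparing whole segments stands in for comparing tokens.

```python
ROUTER = "ROUTER: available Skills are research and coding. Pick one."
SKILLS = {
    "research": "SKILL research: read sources, cite each claim.",
    "coding": "SKILL coding: read files, propose a minimal patch.",
}


def build_request(skill_name: str, task_context: str) -> list[str]:
    """Router first, then the selected Skill body, then dynamic context."""
    return [ROUTER, SKILLS[skill_name], task_context]


def shared_segments(a: list[str], b: list[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


req_1 = build_request("research", "Task: summarize article 1")
req_2 = build_request("research", "Task: summarize article 2")
req_3 = build_request("coding", "Task: fix the failing test")

same_skill = shared_segments(req_1, req_2)   # router + Skill body
other_skill = shared_segments(req_1, req_3)  # router only
```

Two calls to the same Skill share the router and the Skill body. A call to a different Skill still shares the router layer.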

This also helps with source diversity.

A research Skill may receive different articles every run. A repo Skill may receive different files. A data Skill may receive different schemas, queries, or results.

That variety belongs in the dynamic suffix.

The Skill should define how to use sources. The sources themselves should come later.

[Skill: how to read and cite sources] 
[Sources: the actual documents for this run]

The same applies to tools.

Tool definitions and tool-use rules should be stable. Tool results should be dynamic.

[stable tool definitions] 
[stable tool-use rules] 
[dynamic tool results]

The goal is not to optimize the prompt for caching at the expense of the task. The goal is to avoid wasting cacheability by accident.

If two prompt layouts are equally good for the model, choose the one that gives the serving layer more stable structure to reuse.

Benchmark Results

The benchmark showed that stable instruction modules placed at the front of the prompt became reusable prefixes, producing far more cache hits and materially reducing estimated warm-request cost.

This result is not only about Skills as a product concept. It applies to any stable instruction module: a Skill, workflow template, tool procedure, rubric, output contract, or source-handling guide.

I used a synthetic Skill body rather than a platform-native Skill object so the test could isolate layout: stable instruction module first versus dynamic context first.

The benchmark compared four layouts:

dynamic_first_cache_hostile
[timestamp][run ID][dynamic docs][task][stable Skill]  

stable_skill_first_cache_aware 
[stable Skill][dynamic docs][task][timestamp]  

stable_skill_first_deterministic_sources 
[stable Skill][dynamic docs ordered deterministically][task]  

dynamic_prefix_control 
[random run ID][stable Skill][dynamic docs][task]

Before interpreting the results, I checked that the prompts were constructed correctly. The stable-first prompts started with the Skill body. The dynamic-first prompts started with changing content. The stable Skill body stayed byte-for-byte identical. No cold request was contaminated by a prior cache hit.

The cache-hit split was clean:

Stable Skill-first layouts:
19 / 20 warm cache hits

Dynamic-first layouts:
0 / 20 warm cache hits

The token mix showed the practical difference.

dynamic_first_cache_hostile 
warm mean prompt tokens: 9,476.5 
warm mean cached tokens: 0 
warm mean fresh input tokens: 9,476.5  

stable_skill_first_cache_aware 
warm mean prompt tokens: 9,455 
warm mean cached tokens: 8,960 
warm mean fresh input tokens: 495

The same general amount of prompt context produced a different input profile. In the dynamic-first layout, every input token was processed fresh. In the stable Skill-first layout, most of the repeated instruction body became cached input.

Using OpenAI’s published GPT-4.1 mini prices at the time of writing, the estimated warm-request cost changed materially. The exact dollars are model- and date-specific, but the token economics are the point.

dynamic_first_cache_hostile
about $0.00385 per warm request

stable_skill_first_cache_aware
about $0.00116 per warm request

That is roughly a 70% reduction in estimated warm-request cost for this synthetic benchmark.
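The input side of that estimate can be reproduced from the token counts above. The prices in this sketch are assumptions (0.40 USD per 1M fresh input tokens and 0.10 USD per 1M cached input tokens); the post does not state the rates it used, so check them against the current pricing page. Output tokens are left out.

```python
# Assumed prices in USD per 1M tokens; not stated in the post, verify before use.
INPUT_PER_M = 0.40
CACHED_INPUT_PER_M = 0.10


def input_cost(fresh_tokens: float, cached_tokens: float) -> float:
    """Estimated input cost in USD for one request (output tokens excluded)."""
    return (fresh_tokens * INPUT_PER_M + cached_tokens * CACHED_INPUT_PER_M) / 1_000_000


# Warm mean token counts from the benchmark tables above.
hostile = input_cost(fresh_tokens=9476.5, cached_tokens=0)
aware = input_cost(fresh_tokens=495, cached_tokens=8960)
reduction = 1 - aware / hostile
```

Under these assumed prices the input-only figures come out near $0.00379 and $0.00109, a reduction of about 71%. They sit slightly below the per-request estimates quoted above, which would be consistent with those estimates also including a small output-token charge.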

The latency result was less clean. TTFT improved in the stable-first variants, but hosted API latency includes routing, queueing, server load, streaming behavior, network timing, and output generation. I would treat the latency numbers as directional, not guaranteed.

The stronger result is about cache eligibility and token economics:

Stable Skill-first layout:
high cache-hit rate 
high cached-token ratio 
low fresh-input-token count  

Dynamic-first layout: 
zero cache hits 
all input tokens processed fresh

That is the design point. The Skill body was not only instruction text. In the stable-first layout, it became a reusable prefix the serving layer could cache.

Closing

Prompt caching started as a pricing detail for me.

It is not just that.

For agent systems, it changes the design question.

Not only:

What context should the model have?

Also:

Where does that context live? How often does it change? Can the serving layer reuse it?

Skills make that question concrete.

A Skill is reusable guidance for the model. If it is written and positioned carefully, it can also become reusable work for the inference system.

That does not make Skills magic. It makes them a useful design boundary.

The stable part of the workflow can become the prefix.

The changing inputs can become the suffix.

That will not be the right layout for every task. Some systems need dynamic routing, safety state, permissions, or retrieved evidence earlier in the prompt. Some frameworks will reorder or compress context before the provider sees it.

So the point is not to worship stable prefixes.

The point is to know when you are breaking one.

Prompt caching gives agent builders a new thing to measure: not just answer quality, not just total tokens, but whether repeated work is actually being reused.

Another Weekly AI Newsletter: Issue 75

Taylor Ortiz — Sat, 06 Jun 2026 22:19:51 GMT

Anthropic raises, Google takes a slice of the space compute pie and Washington wants in.

Anthropic raised $65B at a $965B valuation and filed to go public. It confidentially submitted a draft S-1 and pushed back on doubts about AI’s returns ahead of the listing.
Alphabet is raising about $85B to fund its AI buildout. The record equity offering landed days after it signaled an $80B plan.
Google agreed to pay SpaceX $920M a month through 2029 for roughly 110,000 GPUs. AirTrunk committed $30B for 5GW of data centers in India, and SoftBank pledged up to €75B for French data centers.
The token bill came due. The Linux Foundation launched a Tokenomics Foundation to discipline AI spend, after Uber capped employee AI budgets and GitHub’s usage-based Copilot billing drew a developer revolt.
Washington wants a stake in OpenAI. Altman and the White House are in talks for the government to take equity, with OpenAI floating donated shares to seed a public wealth fund and Trump saying the American public could “become a partner.” Altman separately lobbied against mandatory model approvals.

Microsoft started building its way out from under OpenAI.

Microsoft shipped seven homegrown models, including its first advanced reasoning model. The MAI lineup was pitched as a move toward self-sufficiency and lower developer costs.
Its AI chief said the company was “set free” from OpenAI to pursue superintelligence. Mustafa Suleyman framed independence as the real project, with models trained from scratch.
It launched Scout, an OpenClaw-based assistant, and Project Solara, a platform for agent-first devices. Scout is an always-on personal agent that works across Microsoft 365; Solara is a chip-to-cloud platform for agent-first devices.
It is building a frontier health model with Mayo Clinic. The partnership pairs Mayo’s clinical data with Microsoft’s AI, alongside a unified agentic stack with NVIDIA.

LangChain, Salesforce, and Anthropic shipped agent infrastructure, and hackers fooled Meta’s support AI.

Anthropic turned Claude Code into a platform. It shipped dynamic multi-agent workflows and documented how it runs hundreds of internal skills.
LangChain built out the production-agent stack. Across the week: efficient verifiers for legal agents, self-correcting Rubrics, model neutrality, fault tolerance in LangGraph, and a sandboxed computer for every agent.
Salesforce and Google pushed agents past the pilot stage. Salesforce detailed what it takes to ship to production, where one deployment cut conversation failure from 33% to 0.5%, while Google added agentic RAG that keeps searching until it has enough context.
Then the bill for autonomy arrived. Hackers talked Meta’s AI support agent into handing over Instagram accounts, an exploit MIT used to show AI agents are too eager to please. OpenAI shipped Lockdown Mode to cut the data-exfiltration leg of prompt-injection attacks.

NVIDIA’s Nemotron anchored an open and fast model surge.

Nemotron 3 Ultra landed on AWS and Perplexity. The 550B-parameter open MoE went one-click on SageMaker JumpStart and live for Perplexity Pro and Max. NVIDIA also shipped a 4B safety model that reasons over custom policies.
Speed became the headline spec. Cerebras reported Kimi K2.6 finishing a task in 5.6s to Gemini 3.5 Flash’s 17.5s, and clearing 452ms time-to-first-token for real-time voice.
Small models proved they are a design choice. A Hugging Face hackathon ran a multi-agent economy on a 3B model, and Holo3.1 brought fast local computer-use agents.
Alibaba zigged. Qwen3.7-Plus added multimodal inputs at low cost but shipped closed-source, breaking from its open-weight history.

The Empire Strikes Back

New York moved on two AI bills. The legislature is poised to pass a one-year data center moratorium, which would be the first statewide ban if Gov. Hochul signs it. It also passed, 60-0, a bill barring AI chatbots from posing as companions to kids; that bill now awaits her signature.
Courts are filling with AI-written filings. A study found AI-flagged self-represented lawsuits are surging. Florida sued OpenAI and Altman, and a UK lawmaker sued xAI over Grok images.
Trump signed a narrower AI oversight order after industry pushback. The revamped order asks for voluntary model submissions instead of mandates.
AI’s social friction showed up everywhere. Ladybird stopped accepting public pull requests over AI-generated patches, Meta rolled back an employee mouse-tracking tool, and China is funding an AI agent to promote Xi Jinping’s thinking.

Andrew Ng’s warning for the week: the cyber risk is real this time, which is exactly when lobbyists overreach for excessive regulation.

⭐ Featured: Anthropic is measuring how fast Claude can build the next Claude.

Anthropic’s Institute published When AI builds itself, a data-heavy look at how much of its own development the company has already handed to Claude, and what that implies for recursive self-improvement: an AI fully autonomously designing and developing its own successor. The piece is careful. That is not here yet, and not inevitable. But it argues the trend lines point that way, and could arrive sooner than most institutions are prepared for.

The internal numbers are the story. As of May 2026, more than 80% of the code merged into Anthropic’s codebase is written by Claude, up from low single digits before Claude Code launched in February 2025, and the typical engineer now merges 8x as much code per day as in 2024. On a fixed test that asks a model to speed up AI-training code, Claude went from a roughly 3x speedup with Opus 4 in May 2025 to about 52x with Mythos Preview in April 2026, against roughly 4x for a skilled human given four to eight hours. In one weak-to-strong supervision project, Claude agents recovered 97% of the available gap over 800 compute-hours and about $18,000, where two human researchers managed 23% in a week.

The honest part is the caveat. Anthropic says the one thing still mostly in human hands is research taste: choosing which problems matter, which results to trust, when an approach is a dead end. But it shows that gap closing too. Shown only the first half of real research sessions, Claude picked a better next step than the human 64% of the time in April 2026, up from 51% in November. The piece sketches three futures, from the trend quietly stalling to full recursive self-improvement, and argues the world should build verifiable mechanisms now that preserve the option to slow or pause frontier development before it is needed.

What to watch for: Whether “research taste” turns out to be one more capability models fail at for a while, then suddenly do not.

🎥 Worth a Watch

An OpenAI model disproved an 80-year-old Erdős conjecture, and the researchers walk through how. On OpenAI Podcast Ep. 20, Alexander Wei, Hongxun Wu, and Lijie Chen explain how a general-purpose model (not a math-specific one, the same kind that powers Codex) cracked the unit distance conjecture, a problem Erdős once put a $500 bounty on.
The proof bridged two fields that rarely meet. It showed the square grid is far from optimal by applying class field theory to combinatorial geometry, after grounding itself by looking up “unit” in the Cambridge dictionary and producing a 125-page chain of thought. With enough test-time compute, it lands the result about half the time.
The reaction is the fun part. Reviewers went from “there’s no way this is true” to losing sleep over it, and within a week other mathematicians used the same idea to disprove a related result.

Quick Hits

Apple approved Poke as the first AI agent on Messages for Business — your iMessage thread is now an agent surface, and Apple charges per user.
Amazon unveiled a conversational AI warehouse robot in an $11.6B Europe push — robotics and logistics keep merging.
Google can read your resting heart rate from a selfie — front-camera vitals, accurate across skin tones.
ChatGPT hit 1 billion monthly active users in record time — the fastest app to the milestone.
OpenAI’s ChatGPT memory now updates itself in the background — “dreaming” replaces save-on-command.
HPE raised its forecast on AI demand and the stock jumped — the buildout still has buyers.
Meta is reportedly building an AI pendant — the wearable land grab continues.
Anthropic published research on making Claude a chemist — pushing models from code into the hard sciences.

Another Weekly AI Newsletter: Issue 74

Taylor Ortiz — Sun, 31 May 2026 11:28:50 GMT

Anthropic raised $65B, shipped Opus 4.8, and turned Claude Code into an orchestration product.

Anthropic raised a $65B Series H at a $965B post-money valuation. Reuters framed the raise around Claude demand and compute needs, while Apollo and Blackstone are reportedly working on a $36B debt deal tied to infrastructure expansion.
Simon Willison analyzed Anthropic’s run-rate revenue and Series H, pointing out why the disclosed numbers matter if Anthropic eventually files for an IPO.
Anthropic launched Claude Opus 4.8, with stronger long-horizon work and a cheaper fast mode. VentureBeat covered the 3x cheaper fast mode.
Opus 4.8 landed across AWS, GitHub Copilot, Cursor, Perplexity, and Vercel AI Gateway.
Claude Code got dynamic workflows: Claude writes orchestration scripts, spins up tens to hundreds of subagents, and checks its own work before reporting back. Claude said the feature is built for migrations, bug hunts, and large repo-wide tasks.
ClaudeDevs said dynamic workflows can be reused as slash commands, but also warned they can consume tokens quickly.
Opus 4.8 now supports mid-conversation system instructions without breaking prompt caching. ClaudeDevs said it hit 69.2% on SWE-bench Pro, up from 64.3% for Opus 4.7.
Anthropic shipped a Claude Code security-guidance plugin, reporting a 30 to 40% decrease in security-related PR comments during internal rollout.

Enterprise agents ran into the boring but important stuff: permissions, logs, recovery, and access control.

Salesforce described its Marketing MCP Server as a way for Agentforce Marketing agents to connect to campaign data, content, and workflow actions.
Google brought MCP-based agents into Chrome Enterprise security management.
VentureBeat argued the enterprise agent bottleneck is permissions, not model performance.
VentureBeat also reported that production agents are entering a rebuild phase, where durable workflows need state, recovery, observability, governance, and cost visibility.
Anthropic published a Zero Trust framework for AI agents, covering prompt injection, tool poisoning, identity abuse, and memory poisoning.
Remote said it grew revenue 50% per employee without adding headcount and is exposing payroll and compliance workflows through MCP.
Robinhood launched AI agent trading accounts with dedicated wallets, notifications, approvals, fraud review, and virtual cards.
An arXiv paper argued agentic AI is moving from model scaling to system scaling, where the harness around the model becomes the bottleneck.

Coding agents are producing more work, and maintainers are feeling the cleanup.

Cursor launched auto-review mode, reducing approval prompts while keeping agent tool calls safer.
Cursor released its Developer Habits Report, reporting that developers are producing more mega PRs with agents.
Cursor also said input tokens are now the majority of price-equivalent token costs, and that cost per accepted line varies roughly 7x across model families.
OpenAI expanded Codex computer use to Windows, including mobile task steering while work continues on a Windows machine.
Figma launched two-way GitHub integration for Figma Make, letting design changes move into production-code workflows.
CodeRabbit described how it built an agent orchestration system on Claude. OpenAI and Thrive described a Codex-powered tax agent that processed 7,000 returns.
SQLite added AGENTS.md guidance rejecting agentic code submissions while still accepting reproducible bug reports. Simon Willison also covered the pressure the curl team faces from AI-assisted security reports.
VentureBeat covered DeepSWE, a coding benchmark that raised concerns about contamination, verifier reliability, and environment exploitation.

AI got cheaper at the same time frontier labs got more expensive.

Anthropic raised a $65B Series H and Reuters reported a possible $36B infrastructure debt deal. Frontier AI still looks capital-intensive.
DeepSeek made a permanent V4 price cut, putting pressure on premium API pricing.
Pinterest reportedly cut AI costs 90% by customizing Qwen3-VL around proprietary embeddings.
Claude said Opus 4.8 fast mode is roughly 2.5x faster and 3x cheaper.
Glean crossed $300M ARR while positioning context quality as a way to reduce token usage.
Perplexity open-sourced a faster Unigram tokenizer to cut CPU utilization for low-latency retrieval work.
Nathan Lambert argued licenses help open ecosystem stability, praised NVIDIA for open model leadership, and said Gemma 4 adoption is outpacing Qwen at comparable sizes.
Hugging Face published practical tooling, including fine-tuning NLLB-200, CUDA profiling in PyTorch, and ToricGT.

Verification became the expensive part.

Google DeepMind said SynthID has watermarked more than 100B pieces of content, with watermarking partnerships across OpenAI, ElevenLabs, and Kakao.
SynthID verification is expanding into Search and Chrome, giving users a way to check whether media may have been AI-generated.
Pixel videos will include creation and edit history, basically a receipt for how the media was made.
YouTube will automatically label significant photorealistic AI video using C2PA metadata and YouTube AI tools.
OpenAI published a playbook for trustworthy third-party evaluations and its Frontier Governance Framework.
Illinois passed an AI bill requiring third-party safety audits.
ITBench-AA found frontier models scoring below 50% on agentic enterprise IT tasks.
Researchers introduced alignment tampering, where an LLM undergoing RLHF can influence preference data. Researchers also reportedly stripped guardrails from Google and Meta open-weight models in minutes.

⭐ Featured: OpenAI published a playbook for trustworthy third-party evaluations.

OpenAI released a guide for how independent third parties should evaluate frontier models, and its core argument is that a benchmark score means little without the setup that produced it.

The central concept is the harness: the prompts, tools, memory, retries, and control logic wrapped around a model. Early evaluations treated models like chatbots, one prompt and one answer. Today’s models use tools, hold state across many steps, and recover from mistakes, so the harness can decide whether a capability shows up at all. OpenAI’s own data makes the point. GPT-5.5 solved 69.2% of cyber-range tasks without compaction and 92.3% with it. In a UK AISI test, raising the token budget from 10M to 100M lifted performance by up to 59%.

The guide also names the ways scores mislead. Reward hacking inflates them: METR found GPT-5.4’s apparent 13-hour task horizon dropped to 6 hours once hacked successes were removed. Sandbagging is hard to rule out: Apollo found evaluation-awareness in 52% of its sandbagging-test samples, even though the model still answered correctly.

Contamination, refusals, and broken tasks each distort results in their own direction.

This connects to the rest of the week. Illinois passed mandatory third-party safety audits. DeepSWE exposed contamination and environment exploitation in a coding benchmark.

ITBench-AA found every frontier model below 50% on enterprise SRE tasks. Across all of these, the contested ground is the same: how to trust a measurement of what AI can do.

The useful shift is that the playbook treats evaluation as system design. A score is performance under a specific harness and budget, not a fixed measure of what a model can do.

What to watch for: whether third-party AI evaluation starts to look more like audit infrastructure than benchmark publishing.

🎙️ Worth a Watch

Work splits into two surfaces. One company agent you delegate to in Slack, and Codex / Claude co-work as the “operating system” where the real work happens: email, docs, research, and SaaS apps running inside the agent’s in-app browser.
He flipped from personal agents to one company “super agent.” The OpenClaw hype showed that personal agents still break constantly and need babysitting. His read is that companies start with one general agent, then specialize downward as models get more independent.
The SaaS apocalypse is dumb. Agents increase the number of SaaS users, not replace them. Users bring their own tokens, which could protect SaaS margins. The product shift is building software that humans and agents can use together.
CLIs are over as the main surface. “We made GUIs for a reason.” Most technical people at Every moved off the terminal as their main workspace and back into Codex, Claude Code, and Cursor.
Automation is a lie. Every agent needs a human. The forward-deployed engineer who gardens the agent may become one of the most valuable new hires. Models make yesterday’s competence cheap, so humans move ahead to do what is not yet framable.
PMs and full-stack designers win. If the build step keeps getting easier, taste and product sense become more valuable. His advice is to “ride the models”: try every new release on your own workflows.
Why it pairs with the Featured: OpenAI’s eval playbook explains why the harness around a model decides what it can do. Shipper’s thesis is the working version of that: the agent only performs when a human owns the harness around it.

Quick Hits

The Pope wrote about AI | Vatican News — Pope Leo XIV’s encyclical focused on AI and human dignity, with concerns around labor, warfare, accountability, and concentrated power. Simon Willison had a good breakdown, and Anthropic published Chris Olah’s remarks from the Vatican presentation
One company spent $500M on Claude in a single month | Axios — An AI consultant said the client never capped employee license usage. Microsoft cut internal Claude Code licenses and Uber reportedly burned its 2026 AI budget by April.
Mistral held AI Now Summit 2026 | Mistral — Industrial AI, Vibe, physics AI, and a new Les Ulis inference data center.
Mistral released Search Toolkit | Mistral — Open-source framework for production AI search pipelines.
Perplexity launched Computer inside Microsoft Office apps | Perplexity — Word, Excel, PowerPoint, and Outlook as agent surfaces.
Microsoft is reportedly preparing a homegrown coding model for Copilot | Reuters — Another sign Microsoft is reducing OpenAI dependence where it can.
Microsoft launched computer-using agents in Copilot Studio | Microsoft — Computer use is becoming a platform feature, not a lab demo.
China is tightening controls on top AI talent | TechCrunch — AI researchers are starting to look like strategic national assets.
Cerebras explained sovereign AI | Cerebras — National AI infrastructure as a sales motion.
OpenAI launched Rosalind Biodefense | OpenAI — Trusted access for biodefense and pandemic-preparedness partners.
Samsung began shipping 12-layer HBM4E samples | Reuters — Memory bandwidth remains one of the core constraints on AI compute.
NVIDIA published CUDA 13.3 updates | NVIDIA — Tile programming, CompileIQ autotuning, and Python updates.
Visa invested in Replit to explore agentic payments | TechCrunch — Payment rails for agents are becoming a real category.
Universal Music Group and TikTok renewed an agreement on AI music | TechCrunch — Licensing and attribution are becoming the music industry’s AI battleground.
CNN sued Perplexity over alleged copyright infringement | Reuters — The search/chat/content boundary keeps getting tested in court.
The Ansel Adams Trust objected to an AI-colorized “Moonrise” exhibit | The Verge — AI editing is now an authenticity fight, not just a copyright fight.
Steven Rosenbaum blamed chatbots for fabricated quotes in his book | The Verge — Another example of why provenance and verification keep coming up.

Can LangChain DeepAgents Explain a Codebase Architecture?

Taylor Ortiz — Sun, 24 May 2026 21:13:55 GMT

I wanted to know whether LangChain DeepAgents could help me build real architectural understanding of an unfamiliar codebase faster.

The test was to point it at a real repository, ask it to produce an expert-level architecture dossier, and see whether the output could teach me the system well enough to make better engineering decisions.

I ended up building a repo architecture workflow with a deterministic source crawler, async area subagents, a claim ledger, a diagram architect, and validation against source files.

The result was genuinely useful. After one full run, I had a clear map of the DeepAgents repo, the main packages, the core files, the extension points, the async subagent implementation, and the reading path I would follow if I were onboarding into the codebase cold.

What is a Deep Agent anyway, and why should you care?

The simplest way to think about a Deep Agent is an agent built for longer, messier work.

In LangChain’s DeepAgents package, a supervisor agent can use tools, filesystem context, and subagents to work through a task that would be awkward as one prompt. The supervisor owns the final answer. The subagents take bounded pieces of the work. The filesystem gives the run somewhere to keep intermediate artifacts like reports, notes, source packets, and plans.

That matters for codebase architecture because the work has a natural shape. You need to inspect the repo, split it into meaningful areas, read files in each area, compare claims against source evidence, and then turn the whole thing into a mental model a human can use.

DeepAgents also has AsyncSubAgent, which is especially interesting for this use case. An async subagent is launched as a background Agent Protocol task. The supervisor gets a task id back, can check status later, and can update the task if it needs a revision.

That maps really cleanly to architecture learning. A monorepo has separate threads of work. libs/deepagents, libs/cli, libs/code, examples, .github, and partner integrations can all be studied independently before synthesis.

The use case

The job was:

Given a GitHub repo, produce a source-grounded architecture dossier that helps a developer build expert-level understanding of the system: how it is organized, where the important code lives, how the major pieces interact, which abstractions matter, what evidence supports each claim, and what to read next.

This is the kind of work I do constantly when opening a new codebase. I want to know:

What kind of repo is this?
Where is the real architecture root?
What are the major packages or areas?
What are the core abstractions?
How does the main flow work?
How do the important packages depend on each other?
Which extension points are real contracts?
Which files should I read first?
Which claims are grounded in source, and which ones are guesses?

The target repo for the full run was the DeepAgents repo itself. So the experiment became recursive in a useful way: use DeepAgents to understand DeepAgents.

What I built

The workflow has three layers.

The first layer is deterministic. Before calling the model, the system crawls the repo and builds a source packet. That packet includes the repo shape, detected package areas, entrypoints, central files, docs, configs, tests, and resolved internal import edges.
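A minimal sketch of that deterministic layer, assuming a Python repo and covering only two parts of the packet: the file list and the resolved internal import edges. The names build_source_packet, dotted_name, and resolve_import are hypothetical, and the suffix-matching rule is an assumption made here so that monorepo packages nested under directories like libs/ still resolve. It is not the crawler used for the run.

```python
import ast
from pathlib import Path
from typing import Optional


def dotted_name(rel_path: str) -> str:
    """Turn a repo-relative file path into a dotted module-style name."""
    parts = list(Path(rel_path).with_suffix("").parts)
    if parts and parts[-1] == "__init__":
        parts.pop()
    return ".".join(parts)


def resolve_import(target: str, modules: dict[str, str]) -> Optional[str]:
    """Resolve an import to one file by suffix match; skip ambiguous matches."""
    hits = [
        path for name, path in modules.items()
        if name == target or name.endswith("." + target)
    ]
    return hits[0] if len(hits) == 1 else None


def build_source_packet(repo_root: Path) -> dict:
    """Crawl Python files and record them plus resolved internal import edges."""
    files = sorted(
        p.relative_to(repo_root).as_posix()
        for p in repo_root.rglob("*.py")
        if ".git" not in p.relative_to(repo_root).parts
    )
    modules = {dotted_name(f): f for f in files}
    edges = set()
    for rel in files:
        try:
            tree = ast.parse((repo_root / rel).read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError, ValueError):
            continue  # skip unparseable files instead of guessing at imports
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                targets = [node.module]
            else:
                continue  # relative imports are ignored in this sketch
            for target in targets:
                resolved = resolve_import(target, modules)
                if resolved and resolved != rel:
                    edges.add((rel, resolved))
    return {"files": files, "import_edges": sorted(edges)}
```

Because the packet is built without a model call, every path in it is a real file, which gives later stages something fixed to validate against.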

The second layer is the agent workflow.

For the monorepo fan-out, I used DeepAgents AsyncSubAgent.

The official LangChain AsyncSubAgents docs describe them as a way for a supervisor agent to “launch background tasks that return immediately.” The supervisor can keep working while those tasks run, then check progress, send follow-up instructions, or cancel work if needed.

That fit this use case almost exactly. Each detected repo area gets its own background area-deep-dive task through a local LangGraph Agent Protocol server. Each async worker gets a bounded assignment, fetches source files for that area, and returns two artifacts:

a Markdown area report
a structured JSON finding set

Those area workers are the expensive part of the run, and they are the part that benefits from async. They can inspect different repo areas at the same time and then flow back into the final synthesis.

The handoff back to the supervisor is the important part. Each async area subagent returns a validated finding set and area report. The runner turns those into a consolidated area dossier bundle, then passes that bundle into the final DeepAgents supervisor as source-grounded context. If an area report fails validation, the same async task thread gets an update asking it to repair the report before the supervisor uses it.
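The fan-out and repair handoff can be sketched with plain asyncio. This is a stand-in, not the DeepAgents AsyncSubAgent API or an Agent Protocol client: run_area_worker, validate_report, investigate, and fan_out are hypothetical names, and the worker here only simulates a report that cites a shortened path until it is asked to repair it.

```python
import asyncio
from typing import Optional


async def run_area_worker(area: str, repair_note: Optional[str] = None) -> dict:
    """Hypothetical stand-in for one background area task."""
    await asyncio.sleep(0)  # placeholder for real model and file-fetch work
    # Simulated behavior: cite a shortened path until a repair is requested.
    path = f"{area}/README.md" if repair_note else "README.md"
    return {
        "area": area,
        "report": f"# {area}",                # Markdown area report
        "findings": [{"evidence": [path]}],   # structured finding set
    }


def validate_report(result: dict) -> list[str]:
    """Return validation problems; an empty list means the report passed."""
    prefix = result["area"] + "/"
    return [
        f"path not under area: {path}"
        for finding in result["findings"]
        for path in finding["evidence"]
        if not path.startswith(prefix)
    ]


async def investigate(area: str, max_repairs: int = 1) -> dict:
    """Run one area worker, asking it to repair its report if validation fails."""
    result = await run_area_worker(area)
    problems = validate_report(result)
    for _ in range(max_repairs):
        if not problems:
            break
        result = await run_area_worker(area, repair_note="; ".join(problems))
        problems = validate_report(result)
    return {**result, "needs_review": bool(problems), "problems": problems}


async def fan_out(areas: list[str]) -> list[dict]:
    """Investigate all areas concurrently and return the consolidated bundle."""
    return await asyncio.gather(*(investigate(area) for area in areas))


if __name__ == "__main__":
    bundle = asyncio.run(fan_out(["libs/cli", "libs/evals"]))
    print([(item["area"], item["needs_review"]) for item in bundle])
```

Reports that still fail after the repair budget are kept and flagged with needs_review rather than dropped, so the supervisor can see which areas to treat with caution.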

The final synthesis uses a DeepAgents supervisor with regular specialist subagents:

repository area mapper
repo cartographer
abstraction teacher
runtime flow tracer
diagram reviewer
diagram architect
reading path teacher
architecture validator

That sync versus async split felt right. The area research can run in parallel because the work is independent. The final writeup, claim ledger, diagram selection, and validation need a staged order because each step depends on the previous artifact.

The third layer is validation. The system checks whether generated reports cite real repo-relative paths, avoid ambiguous filenames, include required source anchors, and stay grounded in source facts.
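A rough sketch of the path side of those checks, assuming the cited paths have already been extracted from a report. The function name validate_citations and the three issue buckets are hypothetical. Semantic grounding checks are out of scope here.

```python
from pathlib import Path


def validate_citations(
    repo_root: Path, cited: list[str], required_anchors: list[str]
) -> dict[str, list[str]]:
    """Classify cited paths against the files that actually exist in the repo."""
    known = {
        p.relative_to(repo_root).as_posix()
        for p in repo_root.rglob("*")
        if p.is_file()
    }
    issues: dict[str, list[str]] = {
        "nonexistent": [],
        "ambiguous_or_incomplete": [],
        "missing_anchors": [],
    }
    for path in cited:
        if path in known:
            continue  # exact repo-relative path, nothing to report
        if any(k.endswith("/" + path) for k in known):
            # A shortened path or bare filename that matches a real file by suffix.
            issues["ambiguous_or_incomplete"].append(path)
        else:
            issues["nonexistent"].append(path)
    # Required source anchors must be cited exactly.
    issues["missing_anchors"] = [a for a in required_anchors if a not in cited]
    return issues
```

A report passes this sketch only when all three lists come back empty. Shortened paths are reported separately from nonexistent ones because the fix differs: one needs expanding, the other needs removing.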

That validation layer carried a lot of the trust.

After testing a few model setups, the best version used a split:

GPT-5.4 mini for the async area workers and final architecture synthesis
GPT-4.1 for deterministic repair loops after validation failures

That split made sense in practice. The reasoning model produced a more useful teaching artifact. The repair model was steadier at cleaning up path and grounding issues.

The diagram architect

The first architecture diagram was too simple. It was useful as an orientation map, but it did not teach much.

So I added a diagram-architect subagent.

Its job is to look at the claim ledger, the deterministic diagram pack, and the source facts, then decide which diagrams are actually useful. The deterministic renderer writes five Mermaid diagrams:

repository map
public API flow
component evidence map
dependency evidence map
open questions map

The diagram-architect reviews those diagrams inside the agent runtime and helps the final synthesis choose a better System Map.

This turned out to be a good split. Deterministic code can draw every node and edge it knows about. An agent is better at deciding which view teaches the architecture without turning the diagram into a giant file graph.
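The deterministic half of that split could look roughly like this: a function that turns known dependency edges into Mermaid flowchart text. The name render_dependency_diagram is hypothetical. The sketch draws every edge it is given, which is exactly the exhaustive behavior the diagram-architect is there to curate.

```python
def render_dependency_diagram(edges: list[tuple[str, str]]) -> str:
    """Render (source, target) pairs as Mermaid flowchart text."""
    ids: dict[str, str] = {}
    lines = ["flowchart LR"]
    # Declare each node once with a short id, since paths contain slashes.
    for name in sorted({node for edge in edges for node in edge}):
        ids[name] = f"n{len(ids)}"
        lines.append(f'    {ids[name]}["{name}"]')
    # Emit one arrow per unique edge, in a stable order.
    for src, dst in sorted(set(edges)):
        lines.append(f"    {ids[src]} --> {ids[dst]}")
    return "\n".join(lines)
```

Sorting the nodes and edges keeps the output stable between runs, so a change in the diagram reflects a change in the repo and not in iteration order.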

The full run

The full DeepAgents repo run used the async subagent path:

13 AsyncSubAgent area tasks launched
10 area reports passed validation
3 area reports still needed review
395 claims were written to the claim ledger
the claim ledger passed validation
5 architecture diagrams were generated
the final dossier passed deterministic validation
total runtime was about 8.1 minutes

The model split mattered here. A pure GPT-5.4 mini run produced richer notes, but the final dossier failed validation on evidence-format issues. A pure GPT-4.1 run passed final validation, but the explanation was more conservative. The hybrid run kept the richer architecture synthesis and still produced a final dossier that passed deterministic validation.

The claim ledger became the most important artifact.

Each claim has a type, confidence level, source, and evidence paths. Some claims come from deterministic source analysis. Others come from area subagents. For example:

libs/deepagents owns the core agent framework.
libs/deepagents/deepagents/graph.py is the source evidence for create_deep_agent.
libs/deepagents/deepagents/middleware/subagents.py grounds SubAgentMiddleware.
libs/deepagents/deepagents/middleware/async_subagents.py grounds async subagent behavior.
libs/deepagents/deepagents/backends/protocol.py defines the backend contract.
libs/deepagents/deepagents/backends/state.py grounds the default state backend.

That gave the final agent something stronger than chat history. It had a structured evidence map it could use during synthesis.
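One way to model a ledger entry with those four fields is a small dataclass. This is an illustrative sketch: the field names, the allowed confidence levels, and the values assigned in the example entry are assumptions, not the schema used in the run. Only the claim text and evidence path come from the examples above.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_CONFIDENCE = {"high", "medium", "low"}  # hypothetical levels


@dataclass(frozen=True)
class Claim:
    text: str
    claim_type: str                  # hypothetical label, e.g. "contract"
    confidence: str                  # one of ALLOWED_CONFIDENCE
    source: str                      # "deterministic" or an area subagent name
    evidence_paths: tuple[str, ...]  # repo-relative paths backing the claim


def ledger_problems(claims: list[Claim]) -> list[str]:
    """Return ledger-level problems; an empty list means the ledger passed."""
    problems = []
    for index, claim in enumerate(claims):
        if claim.confidence not in ALLOWED_CONFIDENCE:
            problems.append(f"claim {index}: unknown confidence {claim.confidence!r}")
        if not claim.evidence_paths:
            problems.append(f"claim {index}: no evidence paths")
    return problems


def dump_ledger(claims: list[Claim]) -> str:
    """Serialize the ledger to JSON so a later agent can reload it."""
    return json.dumps([asdict(claim) for claim in claims], indent=2)


# Example entry; type, confidence, and source values are illustrative.
ledger = [
    Claim(
        text="libs/deepagents/deepagents/backends/protocol.py defines the backend contract.",
        claim_type="contract",
        confidence="high",
        source="deterministic",
        evidence_paths=("libs/deepagents/deepagents/backends/protocol.py",),
    ),
]
```

Keeping evidence paths as a required field means a claim without source support is rejected when the ledger is validated, before synthesis ever sees it.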

What the architecture learner found

The generated architecture map was useful.

The repo is a Python monorepo centered on the DeepAgents core package under libs/deepagents/deepagents. Around that core are packages for CLI/deployment, a React frontend, code-oriented skills, partner sandbox integrations, eval tooling, examples, and GitHub automation.

The core package revolves around a few files:

libs/deepagents/deepagents/graph.py
What it teaches: how create_deep_agent assembles the agent.

libs/deepagents/deepagents/middleware/subagents.py
What it teaches: synchronous subagent delegation.

libs/deepagents/deepagents/middleware/async_subagents.py
What it teaches: async/background subagent specs.

libs/deepagents/deepagents/middleware/filesystem.py
What it teaches: file tools and permission rules.

libs/deepagents/deepagents/backends/protocol.py
What it teaches: the backend interface.

libs/deepagents/deepagents/backends/state.py
What it teaches: the default thread-scoped state backend.

The generated reading path was exactly the kind of thing I wanted from this experiment. It started with the public package entrypoint, moved into graph.py, then into middleware and backend contracts. That is how I would onboard myself into the repo manually.

A quick check on another repo

I also pointed the same architecture learner at Meta’s facebookresearch/sam3 repo.

This was not a full second case study. I wanted to know whether the workflow was accidentally tuned to the DeepAgents repo, or whether it could produce a useful architecture map for a different kind of codebase.

The SAM3 run was smaller:

2 repository areas detected
2 area reports passed validation
127 claims were written to the claim ledger
the claim ledger passed validation
5 architecture diagrams were generated
the final dossier passed deterministic validation
total runtime was about 1.5 minutes

The output found a clean architecture root at sam3, with sam3/model_builder.py as the main assembly point. The surrounding architecture broke into model utilities, agent/inference code, evaluation toolkits, training/config logic, performance helpers, and external scripts.

That was enough for me. The point was not to deeply explain SAM3 in this post. The point was that the architecture learner could move from a LangChain agent framework repo to a computer vision model repo and still produce a grounded map.

Where it still needed guardrails

The model had enough context to explain the architecture, but it still made small mistakes that matter in a source-grounded workflow.

Three area reports still needed review after repair: libs/evals, libs/partners/daytona, and libs/partners/quickjs. The failures were mostly formatting-level evidence issues, like a stray / being interpreted as a path, or an ellipsis showing up where the validator expected exact files.
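Both failure modes can be illustrated with a small extraction step that runs before validation. This is a sketch under an assumption made here, not the validator from the run: a token counts as a file path only if it has at least one slash-separated directory and a file extension. That rule keeps a stray / or a phrase like and/or out of the path list, at the cost of missing directory-only references. The names PATH_TOKEN and extract_path_candidates are hypothetical.

```python
import re

# One or more "dir/" segments followed by a filename with an extension.
PATH_TOKEN = re.compile(r"(?:[\w.\-]+/)+[\w\-]+(?:\.\w+)+")


def extract_path_candidates(report: str) -> tuple[list[str], list[str]]:
    """Return (exact_paths, elided_paths) found in report text."""
    exact, elided = [], []
    for token in PATH_TOKEN.findall(report):
        if "..." in token:
            elided.append(token)  # an ellipsis stands in for missing segments
        else:
            exact.append(token)
    return exact, elided
```

Elided paths are returned separately so they can be sent back for repair instead of being silently checked against the filesystem, where they would only show up as nonexistent.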

That is exactly the kind of failure I want surfaced. The final dossier still passed because the synthesis had enough grounded evidence and did not rely on unsupported claims from those area reports.

Earlier validation also caught shortened paths such as middleware/subagents.py when the full repo-relative path was libs/deepagents/deepagents/middleware/subagents.py. In a monorepo, that distinction matters. A bare filename can point to the wrong mental model.

After repair, the final dossier passed:

nonexistent path references: none
ambiguous or incomplete paths: none
missing required anchors: none
missing required symbols: none
semantic grounding issues: none

That result changed how I think about this use case.

The agent can help explain a repo quickly. The explanation becomes much more trustworthy when the system can reject bad paths, force source evidence, and make uncertainty visible.

The pattern I would reuse

The reusable pattern is:

source packet -> async area subagents -> claim ledger -> diagram architect -> final synthesis -> deterministic validation -> focused follow-up questions

The focused follow-up piece matters. One architecture report can orient you, but expertise comes from narrower questions:

How does the public API flow into the core implementation?
Where does state live?
What extension points are real contracts?
What is inferred from config or docs?
Which packages depend on the core runtime?

That is where the saved claim ledger helps. A follow-up agent can start from validated claims, reopen source files, and answer one question at a time.
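As a sketch of that starting point, a follow-up step might narrow a saved ledger to the claims relevant to one question before reopening any files. The dictionary keys, the confidence scale, and the function name claims_for_question are assumptions for illustration.

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}  # hypothetical scale


def claims_for_question(
    ledger: list[dict], path_prefix: str, min_confidence: str = "medium"
) -> list[dict]:
    """Select claims at or above a confidence floor with evidence under a path."""
    floor = CONFIDENCE_RANK[min_confidence]
    return [
        claim
        for claim in ledger
        if CONFIDENCE_RANK.get(claim["confidence"], -1) >= floor
        and any(path.startswith(path_prefix) for path in claim["evidence_paths"])
    ]
```

For a question like "Where does state live?", the prefix could point at the backends directory, so the follow-up agent begins from a handful of validated claims and their evidence files.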

When this is worth using

I would use this pattern for:

onboarding into a large unfamiliar repo
generating first-pass architecture docs
preparing for a migration
auditing a monorepo before refactoring
understanding how a framework is organized

I would skip it for small repos. If the project has twenty files, read the files.

The value shows up when the repo has multiple packages, mixed docs/config/source signals, and enough surface area that a single prompt gets vague quickly.

What I learned

DeepAgents was useful here because the task decomposes naturally.

The run split cleanly across specialists: repo mapping, area investigation, core abstraction review, runtime flow tracing, diagram critique, claim validation, and final synthesis.

The async subagents made the architecture learner feel like a real repo analysis system. Each area worker could build local expertise on one thread of the monorepo, then the supervisor could put the pieces together.

The strongest lesson from the run was that architecture understanding needs evidence loops. The hardest part of the entire build was the validation agent: making sure the workflow was not just inventing a random architecture.

An agent can write a convincing architecture summary from partial context. That is why the validation layer matters.

The setup I would keep treats the model as the reasoning layer and the deterministic tools as the ground. The model decides what the architecture means. The tools decide whether the files, paths, and claims are real.

Another Weekly AI Newsletter: Issue 73

Taylor Ortiz — Sat, 23 May 2026 21:12:18 GMT

Google declared the agentic era

Gemini 3.5 Flash launched as the model for agents and coding. Google DeepMind framed it as frontier intelligence plus real-world action.

Gemini Spark arrived as a 24/7 cloud agent for Gmail, documents, inbox monitoring, and eventually purchases. TechCrunch described it as a personal assistant built from Gemini models and Google’s Antigravity agent harness.
Google redesigned Search around AI Mode and multimodal input. VentureBeat called it the first major search box redesign in 25 years.
Antigravity 2.0 launched with a desktop app and CLI.
Google also released Android CLI support so Claude Code, Codex, and other coding agents can build Android apps from the command line.
The thread: Gemini Spark is an exciting launch, but its value depends on how embedded you are in Google’s ecosystem. It’s interesting that Google released a Flash model first, though it seems to be benchmarking really well against other frontier models. Clearly Google is still in it and pushing ahead.

Claude, AWS, Cursor, and LangChain shipped the agent plumbing layer

Cursor shipped Composer 2.5, then added Jira integration so teams can assign issues directly to cloud agents.
Cursor also opened SDK access for Python and TypeScript. The Cursor account framed it as a way to build your own agents with Composer 2.5.
Anthropic acquired Stainless, the SDK and MCP server platform that powered every Anthropic SDK.
Claude Managed Agents added self-hosted sandboxes and MCP tunnels, moving credentials and execution inside enterprise boundaries.
AWS published a full AgentCore content offensive: MCP memory, multi-tenant agents, BI agents, dashboard agents, HIPAA eligibility, and OpenAI-compatible SageMaker endpoints.
LangChain shipped LangSmith Engine, an agent for improving agents.
The thread: Composer 2.5 is now the most chosen model in Cursor, and it appears to be considerably cheaper than GPT-5.5 and Opus 4.7. Claude Managed Agents feels like it’s slowly becoming a full orchestration framework, but I am not sure how its design shares persistent memory across an enterprise.

Compute became the business model

Anthropic told investors it expects its first operating profit, while compute costs may erase that profitability later.
SpaceX’s IPO filing revealed Anthropic agreed to pay xAI/SpaceX $1.25B per month for Colossus access.
OpenAI introduced Guaranteed Capacity, turning long-term compute access into a product.
NVIDIA reported $81.6B in Q1 revenue, up 85% year over year.
NVIDIA and IREN announced a 5 GW AI infrastructure partnership.
Simon Willison flagged the memory side: AI demand for HBM may reprice consumer electronics.
The thread: In case you didn’t read that correctly, that’s billion with a capital B, per MONTH, for Colossus access. So while Anthropic expects its first operating profit, I am interested to see if it can keep pace. NVIDIA’s revenue is up 85% from last year, and that’s just bananas.

AI layoffs stopped looking like isolated restructuring

Intuit announced layoffs while signing deals with Anthropic and OpenAI.
Standard Chartered announced plans to cut 7,000+ jobs while accelerating AI investment.
CNBC found AI-related layoff announcements do not reliably boost stock prices.
Meta’s AI pivot and broader workforce cuts stayed in the week’s background via NPR.
The thread: Companies are not only performing AI restructuring for investors. Some appear to believe the operating model is changing whether the market rewards it immediately or not.

OpenAI won the trial. The governance questions survived

Musk’s lawsuit against OpenAI, Altman, Brockman, and Microsoft collapsed, removing an obstacle to OpenAI’s IPO path. Reuters covered the legal result.
The trial surfaced credibility fights around OpenAI’s nonprofit origins, commercial ambitions, and who gets to claim the original mission.
The Verge argued the case exposed something larger: the people leading AI may not be trusted to govern it.
The thread: OpenAI won legally. The trial still reinforced the industry’s trust problem.

⭐ Featured: Project Glasswing found 10,000 critical vulnerabilities

Anthropic’s Project Glasswing update is the most important direct-source read of the week.

Claude Mythos Preview and roughly 50 partners found more than 10,000 high- or critical-severity vulnerabilities in essential software. The key sentence was not the number. It was the bottleneck shift: discovery is no longer the hard part. Verification, disclosure, and patching are.

The downstream effects moved fast. The UK Government Digital Service pushed back on closing public repositories after AI-discovered vulnerabilities. Reuters reported Anthropic will brief the Financial Stability Board, turning this from a software-security issue into a systemic-risk discussion.

What makes Glasswing different is scale. Coordinated disclosure was built for individual researchers finding individual bugs. AI-assisted scanning can produce vulnerability volume at industrial scale. The process was not designed for this.

What to watch for: whether labs that discover vulnerabilities at scale are forced to build remediation infrastructure too.

🎙️ Worth a Listen

The AI Studio half is “build a business from a prompt”: research agents, agentic focus groups, Stitch designs, Workspace integration, Sheets-backed dashboards, Cloud Run deployment, and marketing tools in one flow.
The Antigravity half is the real signal: sub-agents, background tasks, hooks, artifacts, project permissions, scheduled agents, browser agents, CLI, SDK, and managed API.

Quick Hits

OpenAI says a model disproved a central discrete-geometry conjecture | OpenAI — External mathematicians checked the proof.
LeRobot Humanoid | Hugging Face — A roughly $2,500 open humanoid robotics platform.
Cohere released Command A+ | VentureBeat — Apache 2.0 licensing, native citations, enterprise-friendly model packaging.
Spotify and UMG struck a deal for AI covers and remixes | TechCrunch — Licensed AI music moves from taboo to product.
Spotify launched AI podcast tools | TechCrunch — Podcasts become queryable, summarizable AI surfaces.
Spotify launched an ElevenLabs audiobook tool | TechCrunch — AI narration enters the audiobook workflow.
AI startups are stretching ARR | TechCrunch — The AI revenue story is getting less clean.
ArXiv will ban researchers for AI slop submissions | 404 Media — Academic publishing’s authentication problem now has teeth.
Apple’s Siri revamp may auto-delete chats | The Verge — Privacy becomes Apple’s AI wedge.
Students booed Eric Schmidt’s AI commencement speech | The Verge — The public mood is not matching the industry’s launch calendar.

Another Weekly AI Newsletter: Issue 72

Taylor Ortiz — Sat, 16 May 2026 19:39:53 GMT

Anthropic shipped into legal, small business, healthcare, and AWS in one week.

Claude for the legal industry launched with 12 practice-area plugins. Contract review, M&A diligence, and regulatory compliance out of the box. 87% of general counsel now use generative AI, up from 44% the prior year.
Claude for Small Business connected to QuickBooks, PayPal, and HubSpot. 15 ready-to-run workflows covering invoicing, CRM, document signing via DocuSign and Canva.
Anthropic committed $200M to the Gates Foundation. Grants, Claude credits, and technical support for vaccine screening, disease forecasting, K-12 education, and agricultural tools.
Claude Platform went GA on AWS. First cloud provider to offer Anthropic’s native platform with unified billing and same-day feature parity with the native API.
Every subscriber now gets separate Agent SDK credits. Pro gets $20/month, Max gets up to $200. Unlike OpenAI, which bundles Codex and third-party usage into normal plan limits, Anthropic is subsidizing the developer ecosystem with a separate bucket.
Claude Code limits increased another 50% through July. On top of the doubling from the week before.
Ramp and Axios independently confirmed Anthropic overtook OpenAI in workplace adoption. Though VentureBeat identified three structural threats to that lead.
The thread: Anthropic is trying to become the default for every vertical at once. Legal, healthcare, small business, enterprise, developer tooling. Whether that’s a platform strategy or overextension depends on execution.

OpenAI launched a deployment company and put Codex on your phone.

The OpenAI Deployment Company launched with 150 engineers on day one. 19 investment firms and consultancies, majority-owned by OpenAI, with Tomoro acquired to provide Forward Deployed Engineers. Valued at $14B.
ChatGPT connected to bank accounts. Plaid integration for Pro users in the US, with an Intuit partnership for actionable financial steps.
Codex shipped to iOS and Android. Mobile preview lets users start, review, and approve coding tasks while agents run on a separate device.
OpenAI disclosed a supply chain compromise. A TanStack npm package attack exposed code-signing certificates for macOS, Windows, iOS, and Android apps. Full certificate rotation required.
The thread: Both OpenAI and Anthropic launched enterprise services arms within a week of each other. The model API is becoming a commodity. The margin is shifting to who can get it deployed inside your organization first.

Companies are cutting workers at record revenue to fund AI.

Cisco cut 4,000 jobs while reporting record quarterly revenue. Stock rose 15% on surging AI orders.
GitLab announced sweeping restructuring to fund agent development. Cut headcount, flattened management, reorganized R&D into 60 smaller teams, and retired its CREDIT values framework.
GM laid off hundreds of IT workers and began hiring AI replacements. Explicitly seeking stronger AI skills.
Samsung faces a looming strike over AI. Global AI boom driving deep internal divisions between management and workers.
The thread: Revenue is up at all three companies. The functions going are IT operations, developer tooling management, and corporate overhead that was previously considered secure.

Grok Build, Claude Code, and Cursor all shipped agentic upgrades. LangChain shipped nine products to support them.

xAI launched Grok Build in beta. Terminal-native CLI with up to 8 parallel agents, Grok 4.3 beta, 2M token context. Priced at $299/month (introductory $99). SuperGrok Heavy only.
Claude Code limits increased 50%. Through July 13, on top of the doubling from the prior week. Plus separate Agent SDK credits.
Cursor shipped /orchestrate. Planner/worker/verifier loops that re-spawn on failure. Parallel subagents. Always-on CI agents.
LangChain shipped nine products at Interrupt 2026. SmithDB for agent traces, LLM Gateway for centralized control, Sandboxes GA for isolated testing, Deep Agents 0.6 for long-running workflows, and the Agent Development Lifecycle framework.
The thread: Grok Build at $299/month, Claude Code with separate SDK credits, Cursor as a standalone IDE. Three very different bets on how developers will pay for agentic coding. LangChain is betting the real money is in the infrastructure underneath all of them.

⭐ Featured: Thinking Machines built an AI that listens while it talks.

Every AI conversation today works the same way: you talk, the model waits, the model responds. Thinking Machines published research on “interaction models” that throw out that assumption entirely.

Their model processes continuous 200ms micro-turns of audio, video, and text simultaneously. There are no turn boundaries. The model listens while speaking, interrupts when it sees something wrong in your code, reacts to visual cues without being prompted, and runs background reasoning while maintaining the conversation.

The architecture splits into two parts: an interaction model that maintains real-time presence (always perceiving, always ready to respond), and a background model that handles deeper reasoning and tool use asynchronously. When the background model finishes a task, the interaction model weaves results into the conversation at an appropriate moment instead of interrupting.

The benchmarks are striking. On FD-bench (the standard interaction quality benchmark), their model scored 77.8 versus 46.8 for GPT-Realtime-2. On responsiveness, they hit 0.40 second turn-taking latency versus 1.18 for GPT-Realtime-2. They also created three new benchmarks (TimeSpeak, CueSpeak, visual proactivity) that no existing model can meaningfully perform. GPT-Realtime-2 scores near zero on all of them.

The model is a 276B parameter MoE with 12B active. It uses encoder-free early fusion, meaning no separate Whisper or TTS models. Audio comes in as raw dMel signals, video as 40x40 patches. Everything is co-trained from scratch.

Their argument comes from Rich Sutton’s “bitter lesson”: if interactivity is bolted on through harnesses (voice activity detection, turn-taking logic), it can never scale with intelligence. If it’s native to the model, scaling makes the model both smarter and a better collaborator.

What to watch for: This is a research preview from a startup (276B parameters, limited availability). But the design principle matters: current real-time systems from OpenAI and Google use harnesses to fake interactivity on top of turn-based models. Thinking Machines is arguing that’s a dead end. If they’re right, every voice agent shipping today is architecturally temporary.

🎙️ Worth a Listen

IBM AI Engineer Bri Kopecki on why agents without infrastructure are “brilliant goldfish.”

The problem: Most AI agents have no memory, no access control, no audit trail. Every conversation starts from scratch.
The six-layer stack: Scheduler (who goes first), memory manager (short/long/episodic), tool manager (sandboxed execution), identity manager (tokens and permissions), observability (full decision tracing), and guardrails/governance (human-in-the-loop for high-stakes decisions).
Why it matters now: This maps directly to what LangChain shipped this week (SmithDB for traces, LLM Gateway for access control, Sandboxes for tool isolation) and explains why Cursor, Anthropic, and OpenAI are all building orchestration layers.

Quick Hits

Cerebras IPO’d at $5.55B, shares jumped 89% on day one | TechCrunch — Near $100B market cap on debut. The AI chip premium is real.
Medicare created a payment model built for AI-assisted services | TechCrunch — The largest US payer quietly opened the door for clinical AI reimbursement. This will pull deployment faster than any product launch.
Musk v. Altman trial went to the jury | MIT Tech Review — Closing arguments accused Musk of selective amnesia and Altman of lying about the nonprofit mission.
ArXiv banned researchers for AI-generated papers | The Verge — Academic publishing’s authentication problem now has teeth, but detection is still losing the arms race.
Meta embedded AI in Threads and won’t let users block it | The Verge — Captive distribution at 3B+ users, no opt-out.
OpenAI Parameter Golf results: 1,000+ participants, agents everywhere | OpenAI — An ML challenge where the vast majority of submitters used coding agents. OpenAI built a Codex-based triage bot to handle the submission volume.
Claude Mythos cracked Apple’s M5 memory security in five days | Tom’s Hardware — First privilege escalation exploit on M5. Apple spent half a decade building Memory Integrity Enforcement. Standard user to root access.
Nvidia committed $40B in equity AI investments in 2026 | TechCrunch — Not just selling chips. Acquiring stakes in the companies that consume the most of them.
Anthropic published “2028: Two scenarios for global AI leadership” | Anthropic — A policy paper on US-China AI competition. Anthropic is writing geopolitics now.
YouTube expanding AI deepfake detection to all adult users | The Verge — The detection side is scaling up.
Google updated spam rules to include AI manipulation attempts | The Verge — SEO for the age of AI-generated content.

Multi-Agent Account Planning That Learns Across Deals

Taylor Ortiz — Fri, 15 May 2026 15:33:51 GMT

Intro

Anthropic shipped multi-agent orchestration in Managed Agents on May 6th. An agent can be configured as a coordinator with a roster of other agents it can delegate to, and the platform handles fan-out, child-thread lifecycle, parallel execution, and per-thread observability.

Anthropic also shipped a management console. Every agent, session, child thread, and memory write is browsable, with full transcripts, tool calls, and version history inspectable on click. That console shaped how I built the system, because the logging I would have written myself was already there.

The use case I built is account planning in B2B SaaS sales. The vendor is a fictional company, Yardstick AI, selling an AI evaluation platform. The prospect is Vercel, a real company with a public footprint rich enough to give the agents something genuine to research.

The system has fifteen agents organized into a five-phase pre-meeting orchestration plus a post-meeting debrief loop. The pre-meeting flow has two genuine decision steps where the coordinator chooses what runs next based on what just came back, not a fixed sequence.

It uses MCP servers (Notion, Slack), the Anthropic vault for credentials, two memory stores (a playbook and a decision-records corpus), custom HTTP tools for a mock CRM and enrichment service, and the built-in web search and fetch tools.

Most of the system’s analytical work happens in the layer of decision records that the agents read from and write into. The records get captured two ways.

Implicitly, the system infers decisions from CRM record changes, activity logs, and other signals that move without anyone narrating them.

Explicitly, after each meeting, the system uses the full account plan plus the surrounding events (calendar entries, CRM stage moves, recent activity) to compose a curated set of questions for the rep. The questions are shaped by what the system already knows about the account, so they target the specific decisions most likely to produce useful data instead of asking generic “how did it go” prompts.

Whichever way a record gets created, it lives in a shared memory store that the next account’s run can retrieve and reason from. That is the difference between a system that gives you one prep brief and a system that gets better at giving you prep briefs as it accumulates evidence.

This post documents what I built, what worked, what did not, and what the costs and constraints actually look like once you push past the basic demo.

Below is a capture of the final product:

What you’ll learn

This post walks through what I learned building a multi-agent system in Anthropic Managed Agents. The official documentation covers the basics. This post covers what comes after that: how the primitive holds up when you push it against a real, multi-source, multi-phase problem. By the end you should have a clearer sense of when this architecture is worth using and what it takes to make it work.

Concretely:

What multi-agent really is inside the platform. The shape of the architecture, where the limits actually sit, and what the docs do not yet spell out.
How the system remembers things during a run versus across runs. Two different kinds of memory live side by side, and a real system has to be deliberate about where each finding goes.
Why use multi-agent over a workflow. When the coordinator’s runtime decisions justify the complexity, and when they do not.
How decision records make the system compound. A structured corpus of recommendations and their resulting decisions turns each run into evidence the next run can use.
The agent harness. Everything you build around the platform primitives to make the system work for your use case: the MCP servers you connect, the record schemas your corpus enforces, the system prompts that define each agent’s job, the routing logic the coordinator follows, the briefings it hands to each agent.
Async surfaces via MCP. How Slack becomes part of the system through MCP, so the rep can capture decisions in-place after a meeting without a custom bot.
The distillation problem. Why the system’s raw output is not usable on its own, and what has to happen to make it useful to a human in thirty minutes.
Cost and observability. Per-thread spend, total cost for a full run, and what the Managed Agents console gives you for free.
Honest findings. Pitfalls a builder should expect to hit on their first run.
When this is the right tool, and when it isn’t. What kinds of problems multi-agent orchestration fits, and what kinds belong with a simpler architecture.

Section 1: The work of account planning

An account executive working a B2B SaaS deal is doing one job continuously and several others on top of it. The continuous job is synthesis. At any moment in a pursuit, an AE is holding context across half a dozen sources: their own notes from past calls, the CRM record with its stages and activity log, public signals (product launches, hires, press), conference encounters and hallway intel, backchannel from people who used to work there, win and loss patterns from similar accounts, and their own company’s internal playbook. None of these sources are formatted alike, refresh on the same cadence, or answer the same questions week to week.

The job sits on top of a rhythm of meetings. Before each meeting, the rep does pre-meeting prep. After each meeting, the rep does post-meeting capture. Between meetings, follow-up. The cadence is continuous, across fifteen to thirty active accounts at any given time. Even the most disciplined AE admits the synthesis happens in their head more than on paper, and the capture happens only when there is slack to capture.

What makes this work a candidate for multi-agent orchestration is the shape of the synthesis problem: the sources decompose naturally by role. Reading internal Notion notes, researching the company on the public web, mapping the org chart, and synthesizing all of it against a playbook are four different jobs. Each role wants a different tool surface, and each role’s output is most useful when it is separate from the others until the synthesis step. Running them in parallel saves wall-clock time, but the more interesting property is that each role can be a focused agent with a small system prompt and a tight tool surface, rather than one generalist agent trying to be five things at once.

The 30-minute pre-meeting slice is the moment in this rhythm where multi-agent orchestration is most legible. The rep has a calendar event coming up. They want a brief that consolidates what is knowable from everywhere into something they can read in five minutes, prepare around in twenty, and act on in the meeting itself. That is the moment this post centers on, but the architecture supports the broader cadence around it.

Section 2: What multi-agent in Managed Agents actually is

Most coverage of “agents” uses the term to cover everything from a single Claude call to a fully autonomous AI team that plans its own work. Anthropic’s multi-agent feature is neither extreme. It is a specific pattern with specific constraints, and the constraints are worth knowing before you build against it.

The shape: coordinator with a roster

One agent is the coordinator. Its definition includes a list of other agents it is allowed to delegate to. That list is called the roster. A few specific limits:

The roster can hold up to 20 agents.
The coordinator can call multiple copies of any agent on the roster.
A session can have up to 25 active threads running at once.
Specialists cannot delegate to other specialists. The architecture is flat, not nested (Anthropic’s docs phrase it as “depth > 1 is ignored”).

If you came in expecting agents that delegate to agents that delegate to agents, the spec corrects you on page one. What you get is a flat fan-out from a single coordinator. For most real systems this is the right tradeoff.
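
The roster limits above can be checked before anything is deployed. This is a minimal sketch in plain Python, not the Managed Agents SDK: the AgentDef class and validate_coordinator function are hypothetical names, and only the 20-agent limit and the flat-roster rule come from the text above.

```python
from dataclasses import dataclass, field

MAX_ROSTER_SIZE = 20  # roster limit quoted above


@dataclass
class AgentDef:
    """Hypothetical local description of an agent; not a platform type."""
    name: str
    roster: list["AgentDef"] = field(default_factory=list)


def validate_coordinator(coordinator: AgentDef) -> list[str]:
    """Return a list of problems; an empty list means the shape is valid."""
    problems = []
    if len(coordinator.roster) > MAX_ROSTER_SIZE:
        problems.append(
            f"roster has {len(coordinator.roster)} agents; limit is {MAX_ROSTER_SIZE}"
        )
    for specialist in coordinator.roster:
        # The architecture is flat: a specialist's own roster would be ignored.
        if specialist.roster:
            problems.append(f"{specialist.name} declares its own roster")
    return problems


flat = AgentDef("lead-orchestrator", roster=[AgentDef("meeting-context")])
nested = AgentDef("lead", roster=[AgentDef("a", roster=[AgentDef("b")])])
print(validate_coordinator(flat))
print(validate_coordinator(nested))
```

Catching a nested roster locally is cheaper than discovering at runtime that the second level never ran.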

Threads: how the system stays organized

A thread is a separate, isolated conversation that belongs to one agent. Each thread has its own history and tools. Threads don’t share anything with each other, even though they all run inside the same session.

Two kinds:

The primary thread is the coordinator’s own thread. It also doubles as the activity feed for the whole session.
A child thread is created when the coordinator delegates to a specialist. The platform copies the session’s tools and credentials onto that thread, and the specialist’s work runs there.

When the coordinator delegates to multiple specialists in the same turn, the child threads run in parallel. The coordinator waits for each reply before deciding what to do next. You don’t write any of the glue code for this. The decision-making that would normally live in a script lives inside the coordinator’s prompt.
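
For contrast, here is roughly what that glue code looks like when you do have to write it yourself. This is a sketch using Python's asyncio, with a hypothetical run_specialist function standing in for a child thread; it is not how the platform implements fan-out.

```python
import asyncio


async def run_specialist(name: str, brief: str) -> dict:
    # Stand-in for a child thread doing real work.
    await asyncio.sleep(0.01)
    return {"agent": name, "result": f"{name} handled: {brief}"}


async def fan_out(briefs: dict[str, str]) -> list[dict]:
    # Launch every specialist concurrently, then wait for all replies.
    tasks = [run_specialist(name, brief) for name, brief in briefs.items()]
    return await asyncio.gather(*tasks)


replies = asyncio.run(fan_out({
    "meeting-context": "read internal notes",
    "external-researcher": "pull public signals",
}))
for reply in replies:
    print(reply["agent"], "->", reply["result"])
```

asyncio.gather returns results in the order the tasks were passed in, so the replies line up with the briefs regardless of which specialist finishes first.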

Thread lifecycle

A thread moves through three states:

Running: the specialist is actively working.
Idle: the specialist has finished but the thread is still alive. It counts against the 25-thread cap.
Archived: you have told the platform you are done with the thread. The slot is freed.

For most builds, the 25-thread cap is generous enough that you never think about lifecycle. Systems that lean hard on parallel work have to treat archiving as part of the orchestration.
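
The bookkeeping behind that cap is simple enough to model. The sketch below is a toy tracker, assuming only what the list above states: running and idle threads hold a slot, archived threads do not, and the cap is 25. The SessionThreads class is hypothetical and says nothing about how the platform tracks state internally.

```python
from enum import Enum

MAX_ACTIVE_THREADS = 25  # session cap quoted above


class ThreadState(Enum):
    RUNNING = "running"
    IDLE = "idle"
    ARCHIVED = "archived"


class SessionThreads:
    """Toy bookkeeping for the thread cap; not the platform's implementation."""

    def __init__(self) -> None:
        self.states: dict[str, ThreadState] = {}

    def active_count(self) -> int:
        # Running and idle threads both hold a slot; archived ones do not.
        return sum(1 for s in self.states.values() if s is not ThreadState.ARCHIVED)

    def open(self, thread_id: str) -> None:
        if self.active_count() >= MAX_ACTIVE_THREADS:
            raise RuntimeError("thread cap reached; archive an idle thread first")
        self.states[thread_id] = ThreadState.RUNNING

    def mark_idle(self, thread_id: str) -> None:
        self.states[thread_id] = ThreadState.IDLE

    def archive(self, thread_id: str) -> None:
        self.states[thread_id] = ThreadState.ARCHIVED


session = SessionThreads()
for i in range(MAX_ACTIVE_THREADS):
    session.open(f"thread-{i}")
    session.mark_idle(f"thread-{i}")  # finished, but still holding a slot
print(session.active_count())
```

With every slot held by an idle thread, the next open call fails until something is archived, which is the situation heavily parallel builds have to plan for.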

Idle threads stay alive, which enables follow-ups

Because an idle thread is not gone, the coordinator can send a follow-up message to a specialist it called earlier. The specialist keeps its full context from before. That means the architecture supports more than one round of back-and-forth per specialist, not just one-shot delegation. I did not use this in the build, but in retrospect there are several places it would have helped.

Two kinds of memory

The system has two layers of memory that work on different time scales:

Persistent threads keep a specialist’s context alive within a session. The moment the session ends, the threads are gone.
Memory stores persist across sessions. They are objects shared across the whole workspace, mounted onto a session when it starts. Anything written into one stays available to the next run that mounts the same store.

A real multi-agent build needs both.

Designing the split

The design split lives in two questions:

Within a session: which specialists do you keep alive for a follow-up, and which do you fire once and let go?
Across sessions: which findings deserve to be promoted into a memory store, and which can evaporate when the session ends?

The platform gives you the building blocks for both. It does not decide which findings belong where. Get that split wrong and you pay either way:

Throw away thread context too early, and you re-brief the specialist on every follow-up.
Fail to promote findings into a store, and the next session starts cold on everything you already learned.

Our build leans heavily on the cross-session side. Most of the analytical work in this system comes from the decision-records corpus, which is the through-line for the rest of this post.

Section 3: The agent architecture

The pre-meeting orchestration uses thirteen agents: one lead orchestrator plus twelve specialists in its roster. The post-meeting debrief loop adds two more agents that sit outside the coordinator entirely. Fifteen across the system.

Pre-meeting work is a tightly scoped synthesis problem that benefits from a coordinator. Post-meeting work is a slower, human-paced loop that does not benefit from coordination at all, just two single-purpose agents that read and write a shared corpus.

The pre-meeting run breaks into five phases, sequential at the coordinator level and parallel within. The coordinator narrates each phase boundary as it runs, which makes its reasoning visible and forces the model into a structured plan rather than letting it improvise.

Phase 1: gather context and pull prior records

Five specialists fan out concurrently:

meeting-context: reads internal Notion notes through Notion MCP.
external-researcher: pulls public signals from the web.
stakeholder-analyst: maps decision-makers via a mock enrichment service.
engagement-readiness: hits a mock CRM for outreach history.
decision-retriever: runs against the shared decision-records corpus and pulls prior decision records from past accounts that match the current account’s shape (by attribute overlap: industry, competitor present, champion profile, procurement complexity, and so on).
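
Matching by attribute overlap can be sketched in a few lines. This is an illustration rather than the retriever's actual logic, which lives in the agent's prompt: the scoring rule, the 0.5 threshold, the attribute values, and the record IDs are all assumptions made up for the example.

```python
def overlap_score(account: dict, record_attrs: dict) -> float:
    """Fraction of the account's attributes that the record matches exactly."""
    if not account:
        return 0.0
    matches = sum(1 for key, value in account.items() if record_attrs.get(key) == value)
    return matches / len(account)


def retrieve(account: dict, corpus: list[dict], min_score: float = 0.5) -> list[dict]:
    """Return records at or above the threshold, best match first."""
    scored = [(overlap_score(account, r["attributes"]), r) for r in corpus]
    kept = [(score, r) for score, r in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]


# Illustrative data only; attribute values and IDs are invented.
account = {"industry": "developer-tools", "competitor_present": True,
           "champion_profile": "eng-lead", "procurement_complexity": "low"}
corpus = [
    {"id": "DR-A", "attributes": {"industry": "developer-tools", "competitor_present": True,
                                   "champion_profile": "eng-lead", "procurement_complexity": "high"}},
    {"id": "DR-B", "attributes": {"industry": "fintech", "competitor_present": False,
                                   "champion_profile": "cfo", "procurement_complexity": "high"}},
    {"id": "DR-C", "attributes": {"industry": "developer-tools", "competitor_present": True,
                                   "champion_profile": "eng-lead", "procurement_complexity": "low"}},
]
matches = retrieve(account, corpus)
print([r["id"] for r in matches])
```

Exact-match overlap is the crudest possible similarity measure, but it makes the retrieval explainable: each returned record can list which attributes it shared with the current account.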

Phase 2: conditional topic education

The coordinator inspects what Phase 1 surfaced and picks two to four technical topics worth briefing the rep on before the meeting. For the Vercel run, those topics included cross-provider eval methodology, agent eval, AI observability, and eval-driven CI.

topic-educator: runs against the curated topic list and returns a primer per topic, each ending with smart questions the rep can ask in the room.

If the account does not warrant it, the coordinator skips Phase 2 entirely.

Phase 3: synthesis

opportunity-risk: receives everything Phase 1 and Phase 2 produced, mounts the read-only Yardstick playbook from a memory store, reads the prior decision records the retriever pulled in Phase 1, and writes the structured pursuit plan. The plan covers ICP fit, buying triggers, stakeholder map and sequencing, first-meeting hypothesis, recommended plays, and disqualifiers.

Phase 3.5: next-best-action selection

After the synthesis is in, the coordinator does not jump straight to recording. It asks one more specialist, the chooser, to decide which concrete recommendations are warranted for this specific account.

next-best-action-chooser: reads the synthesis plus the prior decision records the retriever pulled in Phase 1, decides which of three specialized recommenders to invoke, and writes a focused brief for each. The chooser can also skip a recommender, with a reason. A different account with different synthesis and different prior records produces a different plan.

The three recommenders available to the chooser:

stakeholder-recommender: sequencing or lead-play.
pricing-recommender: pricing strategy.
competitive-recommender: competitive positioning or risk mitigation.

Phase 4: parallel recommendation generation

The coordinator dispatches whichever recommenders the chooser named. They run in parallel. Each one produces a single Recommendation Record (RR) as a markdown draft with strict YAML frontmatter and a cited_records block listing the prior decision records whose outcomes informed this recommendation. The recommenders hand drafts back to the coordinator; they do not write to the corpus themselves.

Phase 5: decision recording

decision-recorder: receives the RR drafts, validates each one against the schema, checks every cited prior decision record exists in the corpus, writes the validated records to /mnt/memory/yardstick-decisions/, and updates the corpus index.

Splitting content generation (the recommenders) from persistence (the recorder) keeps each role focused.
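
The recorder's two checks (schema validation and cited-record existence) look roughly like this once the YAML frontmatter has been parsed into a dict. Only cited_records comes from the description above; the other field names in REQUIRED_FIELDS, the record IDs, and the function names are hypothetical.

```python
# Hypothetical schema: only cited_records is named in the post.
REQUIRED_FIELDS = {"id", "account", "recommendation_type", "cited_records"}


def validate_draft(frontmatter: dict, corpus_ids: set[str]) -> list[str]:
    """Return validation errors for one parsed RR draft; empty means valid."""
    errors = []
    missing = REQUIRED_FIELDS - set(frontmatter)
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    for cited in frontmatter.get("cited_records", []):
        if cited not in corpus_ids:
            errors.append(f"cited record not in corpus: {cited}")
    return errors


def record_drafts(drafts: list[dict], corpus_ids: set[str]):
    """Split drafts into those safe to persist and those rejected with reasons."""
    accepted, rejected = [], {}
    for draft in drafts:
        errors = validate_draft(draft, corpus_ids)
        if errors:
            rejected[draft.get("id", "<no id>")] = errors
        else:
            accepted.append(draft)
    return accepted, rejected


# Illustrative drafts; IDs are invented.
drafts = [
    {"id": "RR-001", "account": "vercel", "recommendation_type": "sequencing",
     "cited_records": ["DR-001"]},
    {"id": "RR-002", "account": "vercel", "recommendation_type": "competitive_positioning",
     "cited_records": ["DR-999"]},
]
accepted, rejected = record_drafts(drafts, corpus_ids={"DR-001"})
print(len(accepted), rejected)
```

Rejecting a draft that cites a record missing from the corpus matters more than the schema check: a dangling citation would let a recommendation claim evidence that does not exist.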

Post-meeting: the debrief loop

That accounts for the thirteen pre-meeting agents. The remaining two run on the post-meeting side:

debrief-asker: reads the next-best-action RRs the pre-meeting run produced, picks the open questions still unresolved, formats them as a curated set, and posts them into a Slack channel through the Slack MCP server. The rep replies in the thread on their own time.
debrief-synthesizer: once there are replies, reads the Slack thread, parses the rep’s answers, and writes Decision Records into the corpus with the linked_rr field pointing back to the originating RRs.

Neither sits in the coordinator’s roster because neither runs synchronously with the pre-meeting flow. They run on a human-paced timescale, possibly hours or days later. Coordinating them through the same session would require keeping a session open across days or weeks, which the platform does not support. The cleaner shape is two single-purpose agents that share the corpus as their interaction substrate.

Section 4: What the platform gives you for observability

Most multi-agent demos require you to build your own logging before you can debug them. Managed Agents takes the opposite stance. Anthropic ships a management console that turns every agent, every session, every child thread, and every memory write into a click-through artifact you can inspect without writing any instrumentation.

The console is structured around the platform’s primary objects. The Agents tab lists every agent you have created with its system prompt, declared MCP servers, custom tools, and toolsets all inspectable on click. Versioning is built in. The Sessions tab shows every session with the coordinator’s primary thread and every child thread enumerated, status per thread, full transcripts including the model’s reasoning content, and every tool call shown inline with its inputs and outputs. The Memory Stores tab tracks version history so any write to the decision-records corpus is auditable end to end.

At runtime, the same data is available programmatically through the events API. The session-level stream gives you a condensed feed across the whole session. Per-thread streams give you raw event sequences for any specialist. The three events that matter for fan-out observability are session.thread_created, agent.thread_message_received, and session.thread_status_idle. Stringing those together gives you the fan-out timeline of the whole run without writing a single instrumentation line.
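
Here is a sketch of what stringing those three events together could look like. The event type strings are the ones named above; the payload keys (type, thread_id, ts) and the sample events are assumptions for illustration, so check the real event schema before relying on any field name.

```python
def fan_out_timeline(events: list[dict]) -> dict[str, dict]:
    """Fold a stream of events into one summary entry per thread."""
    timeline: dict[str, dict] = {}
    for event in events:
        entry = timeline.setdefault(
            event["thread_id"], {"created": None, "messages": 0, "idle": None}
        )
        if event["type"] == "session.thread_created":
            entry["created"] = event["ts"]
        elif event["type"] == "agent.thread_message_received":
            entry["messages"] += 1
        elif event["type"] == "session.thread_status_idle":
            entry["idle"] = event["ts"]
    return timeline


# Invented sample events; timestamps are seconds from session start.
events = [
    {"type": "session.thread_created", "thread_id": "t1", "ts": 0.0},
    {"type": "session.thread_created", "thread_id": "t2", "ts": 0.5},
    {"type": "agent.thread_message_received", "thread_id": "t1", "ts": 40.0},
    {"type": "session.thread_status_idle", "thread_id": "t1", "ts": 41.0},
    {"type": "agent.thread_message_received", "thread_id": "t2", "ts": 90.0},
    {"type": "session.thread_status_idle", "thread_id": "t2", "ts": 90.5},
]
timeline = fan_out_timeline(events)
for thread_id, entry in timeline.items():
    print(thread_id, "ran for", entry["idle"] - entry["created"], "seconds")
```

Overlapping created-to-idle windows across threads are the visible signature of a parallel fan-out; strictly sequential windows mean the coordinator delegated one specialist at a time.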

Cost data is similarly structured. Every event carries usage data scoped to the thread that produced it. The full Vercel run cost $5.51 across the pre-meeting orchestration. Twelve specialists sit in the roster, but the conditional dispatch in Phase 3.5 invoked only eleven of them for this account (one recommender was skipped on substance).

The cost shape is what the chart makes obvious. The lead-orchestrator dominates at $1.21, because it is the one thread that accumulates context across every phase. The two heaviest specialists are external-researcher and topic-educator at about $0.79 each, both driven by web-tool use rather than cumulative context. The Phase 4 recommenders, the Phase 3 synthesis, and the Phase 5 decision-recorder cluster in the $0.40 to $0.45 range, each receiving the cumulative context from prior phases plus the prior decision records the retriever pulled in Phase 1. The remaining Phase 1 specialists sit at $0.28 or below. Wall-clock was about fifteen minutes from prompt to final answer.
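
A per-thread breakdown like this one falls out of summing usage by thread. The sketch below assumes each event exposes a thread name plus input and output token counts under a usage key; those field names, the token counts, and the prices are placeholders, not the platform's schema or real rates.

```python
from collections import defaultdict

# Placeholder prices in dollars per million tokens; substitute real rates.
PRICE_PER_MTOK = {"input": 3.0, "output": 15.0}


def cost_by_thread(events: list[dict]) -> dict[str, float]:
    """Sum estimated spend per thread from per-event usage."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        usage = event.get("usage")
        if not usage:
            continue  # defensive: skip anything without usage attached
        cost = (
            usage["input_tokens"] * PRICE_PER_MTOK["input"]
            + usage["output_tokens"] * PRICE_PER_MTOK["output"]
        ) / 1_000_000
        totals[event["thread"]] += cost
    return dict(totals)


# Invented token counts, chosen only to make the arithmetic easy to follow.
events = [
    {"thread": "lead-orchestrator", "usage": {"input_tokens": 100_000, "output_tokens": 10_000}},
    {"thread": "external-researcher", "usage": {"input_tokens": 50_000, "output_tokens": 20_000}},
    {"thread": "lead-orchestrator", "usage": {"input_tokens": 200_000, "output_tokens": 0}},
]
costs = cost_by_thread(events)
ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
for thread, dollars in ranked:
    print(f"{thread}: ${dollars:.2f}")
```

Even with invented numbers the pattern from the real run shows up: the coordinator's thread keeps paying for context it accumulated in earlier phases.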

Section 5: What multi-agent gives you that a workflow can’t

Multi-agent orchestration is only worth using when the coordinator makes a real decision between phases. If your design fans out, waits for results, and synthesizes them, you have built parallel API calls dressed up as a multi-agent system. The platform’s complexity (extra threads, longer latency, harder debugging) buys you nothing a sequential workflow couldn’t already do.

The thing that justifies the complexity is the moment the coordinator pauses, looks at what the previous phase produced, and decides what should happen next. That decision is the part a workflow cannot replicate, because a workflow has to know in advance what it is going to do.

In our build, there are two such decision steps.

The first lives between Phase 1 and Phase 2. Phase 1 fans out five specialists to read the account from five angles. The coordinator collects their output, pauses, and picks two to four topics worth briefing the rep on before the meeting. For Vercel, the coordinator chose cross-provider eval methodology, agent eval, AI observability, and eval-driven CI. None of those topics are defined anywhere in advance. They are picked from what Phase 1 surfaced about this specific account. A different account would produce a different list, or no list at all, in which case the coordinator skips Phase 2 entirely.

The second lives between Phase 3 and Phase 4. After opportunity-risk produces the synthesis, the coordinator dispatches the next-best-action-chooser, which reads the synthesis plus the prior decision records the retriever pulled in Phase 1 and decides which of three specialized recommenders to invoke: stakeholder, pricing, or competitive. On the Vercel run the chooser invoked stakeholder-recommender and competitive-recommender, and skipped pricing-recommender with the reason that the $42K pilot structure was already validated. Skipping with a substantive reason is what separates a real decision from a conditional that always fires.

The coordinator narrates each decision as it happens, which makes the reasoning visible:

Phase 1 specialists are back. External-researcher found public Braintrust endorsement at Vercel that the internal Notion notes treated as a stalling competitor. Phase 2 launched. Topic-educator is building primers on cross-provider eval, agent eval, AI observability, and eval-driven CI based on what surfaced.
Phase 3.5 complete. Invoking stakeholder-recommender (sequencing) for the May 21 call sequencing and Tom-Becker cultivation. Invoking competitive-recommender (competitive_positioning) for the Braintrust counter-offer scenario. Skipping pricing-recommender: $42K structure already validated, pricing isn’t the next decision point.

That kind of reasoning is what tells you the coordinator is actually orchestrating rather than executing. A workflow could fan out the same specialists in parallel. It could even hard-code the topic-educator and recommender steps. What a workflow cannot do is pick which topics to brief on this turn for this account, or which recommenders are warranted given what the synthesis just surfaced. Those decisions require a model with the full context loaded, which is exactly what the coordinator is.
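One way to keep that decision honest is to treat the chooser's verdict as data and refuse any verdict that arrives without a reason. The sketch below is illustrative rather than the build's actual code: the `RecommenderDecision` type and both helper functions are hypothetical names, and the reasons are paraphrased from the narration above.

```python
from dataclasses import dataclass

# The three specialized recommenders named in the text.
RECOMMENDERS = ("stakeholder", "pricing", "competitive")


@dataclass(frozen=True)
class RecommenderDecision:
    """One line of the chooser's verdict for a single recommender (hypothetical type)."""
    recommender: str
    invoke: bool
    reason: str


def validate_decisions(decisions):
    """Return a list of problems; an empty list means the verdict is usable."""
    problems = []
    seen = {d.recommender for d in decisions}
    for name in RECOMMENDERS:
        if name not in seen:
            problems.append(f"no verdict for {name}")
    for d in decisions:
        if d.recommender not in RECOMMENDERS:
            problems.append(f"unknown recommender {d.recommender}")
        # Invocations and skips both need a stated reason.
        if not d.reason.strip():
            problems.append(f"{d.recommender}: verdict has no reason")
    return problems


def to_invoke(decisions):
    """Names of the recommenders the chooser decided to run, in verdict order."""
    return [d.recommender for d in decisions if d.invoke]


vercel_verdict = [
    RecommenderDecision("stakeholder", True, "May 21 call sequencing and Tom-Becker cultivation"),
    RecommenderDecision("competitive", True, "Braintrust counter-offer scenario"),
    RecommenderDecision("pricing", False, "$42K structure already validated"),
]
```

A check like this cannot judge whether a reason is substantive, only that one was given. It is still enough to catch a chooser that silently drops a recommender.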

Section 6: Decision records: the layer that compounds

A memory store by itself is just structured storage. What turns it into a system that compounds across runs is the contract you define for what gets written into it. In our build, that contract is a pair of record types: Recommendation Records (RRs) and Decision Records (DRs). Anthropic provides the memory store. You decide what goes in it and how it is structured.

Every Recommendation Record is created before the meeting. It is what the system thinks the rep should do.

Every Decision Record is created after the meeting. It is what the rep actually did and what came of it.

The DR points back to the RR it resolved through a linked_rr field. That pairing is the chain the system learns from: recommendation → decision → outcome. Future runs can see both what was recommended and how it actually played out, which is what makes the corpus more than a logbook.
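As a rough sketch of that chain, the pairing can be rebuilt from nothing more than the linked_rr pointer. The two dataclasses below are trimmed-down stand-ins for the real records, and the ids and text are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecommendationRecord:
    id: str
    recommendation: str


@dataclass
class DecisionRecord:
    id: str
    linked_rr: str        # id of the RR this decision resolved
    decided: str
    outcome_status: str   # e.g. closed_won, stalled, unknown


def build_chains(rrs, drs):
    """Pair each RR with the DR that resolved it (None while it is still open)."""
    dr_by_rr = {dr.linked_rr: dr for dr in drs}
    return [(rr, dr_by_rr.get(rr.id)) for rr in rrs]


# Hypothetical records: one resolved recommendation, one still open.
rrs = [
    RecommendationRecord("rr-example-a", "Equip the champion to run the internal sell"),
    RecommendationRecord("rr-example-b", "Hold the pilot pricing structure"),
]
drs = [
    DecisionRecord("dr-example-a", "rr-example-a", "Champion presented the case", "closed_won"),
]
chains = build_chains(rrs, drs)
```

Each tuple in `chains` is one recommendation → decision → outcome link. An RR whose second element is `None` has not been resolved yet.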

The schemas are strict YAML frontmatter on top of a markdown body, and the format is doing two jobs at once.

The YAML half is what makes the records queryable. Every key field (account, date, decision_type, account_attributes) is structured as a typed key/value pair, which means the decision-retriever can filter the corpus by exact attribute match. Without that structure, the retriever would be doing fuzzy text search over freeform prose, and matches would be unreliable. With it, “find me prior pricing decisions where procurement_complexity is vp_signoff” becomes a clean lookup.

The markdown body below the YAML is where the longer-form reasoning lives: the context, the rationale, the alternatives considered, the lessons in the generalized pattern. That part does not need to be queryable, just readable.

YAML specifically is doing one more useful thing: it is a format Claude (and most LLMs) handle natively, which means the recommender agents can produce schema-conformant frontmatter reliably without you needing a custom serializer. Together, the format gives you a record that is queryable from above and human-readable below.
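To make the two halves concrete, here is a minimal sketch of splitting a record into frontmatter and body. It is deliberately simplified and uses only the standard library: it reads unindented `key: value` pairs and ignores nested blocks. A real build would more likely hand the frontmatter to a proper YAML parser such as PyYAML's `yaml.safe_load`. The function name and the sample record are assumptions for illustration.

```python
def split_frontmatter(text):
    """Split a record into (top-level scalar fields, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("record must start with a '---' frontmatter fence")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter fence is never closed") from None

    fields = {}
    for line in lines[1:end]:
        # Only unindented 'key: value' pairs; nested blocks need a real YAML parser.
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


# Hypothetical record, shortened to a few fields.
SAMPLE = """---
id: rr-example
record_type: recommendation
account: Example Co
account_attributes:
  stage: pilot
---

## Context
Body text.
"""

fields, body = split_frontmatter(SAMPLE)
```

The fields dictionary is what a retriever would filter on. The body string is only ever read, never queried.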

Recommendation Record schema

---
id: rr-{YYYY-MM-DD}-{account-lower}-{decision_type}
record_type: recommendation
schema_version: v1
account: {account_name}
date: {YYYY-MM-DD}
generated_by: {recommender agent name}
decision_type: {sequencing | lead_play | pricing | competitive_positioning | first_meeting_hypothesis | disqualification_threshold | risk_mitigation}
account_attributes:
  stage, size_band, ai_surface_area, buy_or_build_culture,
  competitor_present, competitor_depth, champion_profile,
  new_leadership_window, procurement_complexity
linked_dr: null
cited_records:
  - prior_rr: null
    prior_dr: dr-{YYYY-MM-DD}-{account}-{decision_type}
    prior_outcome: one-line outcome from the DR's outcome.notes field
    relevance: which attributes match
    lesson_applied: one-line lesson taken from the DR's Generalized pattern
---

## Context
## Findings that supported this recommendation
## Recommendation
## Reasoning
## Alternatives considered
## Generalized pattern

Decision Record schema (same shape as RR, with these fields added)

record_type: decision
linked_rr: rr-{...}    # backfills the chain in the other direction
outcome:
  status: {closed_won | closed_lost | stalled | pending | unknown}
  status_date: {YYYY-MM-DD or null}
  acv_usd: {number or null}
  notes: one-line description of outcome

Body sections add ## What was decided, ## Outcome, and ## Retrospective note. The Generalized pattern section gets rewritten once the outcome is known, so the pattern is validated rather than hypothesized.

The account_attributes block is the filter the decision-retriever uses in Phase 1. When the system runs against a new account, the retriever filters the corpus for records whose attributes overlap. A new mid-market developer-tools account with a Braintrust competitor and a staff-engineer champion will pull back both the Vercel records and the Datadog records as prior decisions worth reasoning over. The retriever does not care whether the original account is Datadog or Vercel. It cares whether the shape of the account is similar enough to learn from.
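A minimal sketch of that filter, assuming overlap is scored as the number of attributes with identical values and that a record qualifies once it clears a threshold. The scoring rule, the `min_overlap` default, the ids, and most attribute values are my assumptions; the text only says the retriever filters on overlap.

```python
def overlap(a, b):
    """Attribute names whose values match exactly in both records."""
    return {k for k in a.keys() & b.keys() if a[k] == b[k]}


def retrieve(corpus, target_attrs, min_overlap=2):
    """Return (record_id, matched_attributes) pairs, best match first."""
    hits = []
    for record in corpus:
        matched = overlap(record["account_attributes"], target_attrs)
        if len(matched) >= min_overlap:
            hits.append((record["id"], sorted(matched)))
    # Most matched attributes first; ties broken by id so the order is stable.
    return sorted(hits, key=lambda h: (-len(h[1]), h[0]))


# Hypothetical corpus: ids and attribute values are illustrative.
corpus = [
    {"id": "dr-example-a", "account_attributes": {
        "size_band": "mid_market",
        "competitor_present": "braintrust",
        "champion_profile": "staff_eng_with_pain"}},
    {"id": "dr-example-b", "account_attributes": {
        "size_band": "enterprise",
        "competitor_present": "none",
        "champion_profile": "staff_eng_with_pain"}},
]
new_account = {
    "size_band": "mid_market",
    "competitor_present": "braintrust",
    "champion_profile": "staff_eng_with_pain",
}
matches = retrieve(corpus, new_account)
```

Note that the account name never enters the comparison. Only the attribute values do, which is what lets a record from one account inform another.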

The cited_records block is what makes the chain visible. Every RR carries an explicit list of prior DRs whose outcomes informed this specific recommendation. Each entry names four things:

prior_dr id, which record is being cited
prior_outcome, what happened (so the result behind the lesson is visible)
relevance, which account_attributes matched
lesson_applied, the one-line rule the recommender is carrying forward

Multiple cited records may appear if the recommendation draws on more than one prior record. A reader of any RR can trace the reasoning back to the cited prior records by id, not by hand-waving.
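Because citations are by id, they can be checked mechanically. The sketch below assumes records have already been parsed into dictionaries; the function name and the shape of `corpus_ids` are hypothetical. It flags any entry that is missing one of the four fields or that cites a DR the corpus does not contain.

```python
CITATION_FIELDS = ("prior_dr", "prior_outcome", "relevance", "lesson_applied")


def citation_problems(rr, corpus_ids):
    """Return a list of problems found in an RR's cited_records block."""
    problems = []
    for i, entry in enumerate(rr.get("cited_records", [])):
        for name in CITATION_FIELDS:
            if not entry.get(name):
                problems.append(f"cited_records[{i}]: missing {name}")
        prior_dr = entry.get("prior_dr")
        if prior_dr and prior_dr not in corpus_ids:
            problems.append(f"cited_records[{i}]: {prior_dr} not found in corpus")
    return problems


corpus_ids = {"dr-2025-07-22-datadog-sequencing"}
rr = {
    "cited_records": [
        {"prior_rr": None,
         "prior_dr": "dr-2025-07-22-datadog-sequencing",
         "prior_outcome": "VP only met us once, at the closing call.",
         "relevance": "champion_profile=staff_eng_with_pain",
         "lesson_applied": "Equip the champion and let them own the internal sell."},
        {"prior_rr": None,
         "prior_dr": "dr-missing-example",   # deliberately absent from corpus_ids
         "prior_outcome": "",
         "relevance": "sequencing",
         "lesson_applied": "Placeholder lesson."},
    ]
}
found = citation_problems(rr, corpus_ids)
```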

Implicit and explicit capture of enterprise decisions

Records get into the corpus two ways.

Implicitly, through CRM record changes and activity logs the system watches without anyone narrating them. A stage change, a contract uploaded, a deal closed-won or closed-lost is itself a decision signal. The decision-recorder can infer a DR from those signals and write it with outcome.notes: inferred from CRM stage change. Implicit capture catches the cases where the rep forgot to debrief but state moved anyway. The records are useful but carry less reasoning, because no one narrated the why.

Explicitly, through a post-meeting debrief loop where the system asks the rep curated questions in Slack and the rep replies in-thread. The records that come out of explicit capture carry the rep’s own reasoning in their voice, which makes them the richest data the corpus has. Section 7 covers the mechanics of that loop in detail.
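The implicit path can be sketched as a small mapping from a CRM event to the outcome block of a DR. Everything here is an assumption for illustration: the CRM stage names, the choice to fall back to `pending` for non-terminal stages, and the function name. Only the status values and the notes string come from the schema and text above.

```python
# Hypothetical CRM stage names mapped onto the DR schema's outcome.status values.
STAGE_TO_STATUS = {
    "closed_won": "closed_won",
    "closed_lost": "closed_lost",
}


def infer_outcome(new_stage, changed_on):
    """Build a DR outcome block from a CRM stage change nobody narrated."""
    return {
        # Stages that are not terminal leave the outcome pending (an assumption).
        "status": STAGE_TO_STATUS.get(new_stage, "pending"),
        "status_date": changed_on,
        "acv_usd": None,  # not inferable from a stage change alone
        "notes": "inferred from CRM stage change",
    }


won = infer_outcome("closed_won", "2025-07-22")
moved = infer_outcome("negotiation", "2025-07-01")
```

The fixed notes string is what later tells a reader, or the retriever, that this record carries no narrated reasoning.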

Cross-account learning in practice (from the actual run)

The Vercel pre-meeting run generated two Recommendation Records, one from the stakeholder-recommender and one from the competitive-recommender. Each one carries a cited_records block linking it to specific Datadog DRs by id. The sequencing RR’s cited_records block, taken directly from the corpus:

cited_records:
  - prior_rr: null
    prior_dr: dr-2025-07-22-datadog-sequencing
    prior_outcome: "VP only met us once, at the closing call, with champion presenting the case."
    relevance: "champion_profile=staff_eng_with_pain, sequencing, procurement_complexity=vp_signoff"
    lesson_applied: "Do not engage the buyer directly when champion has standing with buyer. Equip the champion with internal proposal materials and let them own the internal sell."
  - prior_rr: null
    prior_dr: dr-2025-09-12-datadog-risk-materialized
    prior_outcome: "Risk materialized in week 5; recovery move worked. Deal closed but 10 days later than original target."
    relevance: "champion_profile=staff_eng_with_pain, single-threaded risk, secondary contact cultivation"
    lesson_applied: "Secondary contact cultivation should be a pre-meeting deliverable, not a contingency. The secondary needs genuine engagement (their own use case), not just awareness."

The Reasoning section of the same RR cites those records by id in the body, not just in the frontmatter:

dr-2025-07-22-datadog-sequencing: Champion-led internal sell. VP met rep once at closing call. Direct structural match, Priya carrying to Marcus. Differs because Marcus is new (3 months in) and Priya’s standing with him is untested. Adaptation: explicit checkpoint and escalation triggers.
dr-2025-09-12-datadog-risk-materialized: Secondary contact cultivation saved the deal when champion went on leave. At Vercel, Tom Becker is the designated secondary with genuine AI Gateway/Production Monitor use case. Cultivation begins May 21, not mid-POC.

That paragraph is the entire reason the corpus exists. The system pulled two specific records from a different account, identified the load-bearing attributes, and applied the lessons with an adaptation for the Vercel-specific situation. It is structured reasoning over a corpus of prior decisions, filtered by attributes the engineer chose to make filterable.

The competitive-positioning RR follows the same shape, citing dr-2025-08-10-datadog-competitive and dr-2025-07-15-datadog-lead-play. Between the two RRs, the Vercel run cited four distinct Datadog DRs by id, with eight distinct lessons applied. None of that reasoning is hand-waved. All of it is structurally traceable.

Why this layer compounds

The platform’s memory store is durable, but durability alone does not produce learning. What produces learning is the schema contract that makes every write structurally identical and every read filterable. Once that contract exists, every run adds to the corpus, and every subsequent run benefits. The first Vercel run cited four Datadog DRs. The second Vercel run will also be able to cite the first Vercel run’s records. The third will be able to cite the records from both earlier Vercel runs as well. The system gets better at giving you prep briefs because the substrate it draws on is growing in a way the retriever can actually use, and because every recommendation it generates is structurally tied to the prior records behind it.

Section 7: The async loop

The pre-meeting run finishes in fifteen minutes. The deal does not. After the call, the rep has information that did not exist before the meeting started, and the system needs a way to capture it. The capture step does not belong inside the pre-meeting orchestration. It runs on a fundamentally different timescale, against a different surface, with a different participant in the loop.

The build uses Slack as that surface and two standalone agents to run the loop: debrief-asker and debrief-synthesizer. Neither one sits in the coordinator’s roster. Both are agents in the same workspace, configured the same way as the pre-meeting specialists, but invoked independently when triggered.

The asker: curated questions, not generic prompts

After the meeting (or after a CRM event signals that a recommendation is due for resolution), debrief-asker runs. It is a standalone Managed Agents agent connected to the workspace’s Slack instance through the Slack MCP server. The asker reads the open RRs for the account, looks at the surrounding context (the recommendation made, the current account state, recent activity logs, calendar entries), and composes a curated set of debrief questions that target the specific decisions the RR was about.

The questions are not generic. They are shaped by what the system already knows about the account and which decisions are actually open. If the synthesis recommended a pricing structure but the CRM shows the deal has already moved to negotiation, the asker does not ask “did you discuss pricing”; it asks “did the $42K structure hold, and what did Marcus say about the legal-review path.” If a calendar entry shows a meeting happened with a stakeholder the system did not originally surface, the asker adds a question about that. The questions are surgical because the system already knows enough about the account to ask the right one.

The asker posts the curated set into a Slack channel scoped to that opportunity, so each deal has its own thread of capture. The rep replies in the thread whenever they have time. There is no UI to learn and no form to fill out.
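The asker's first step, finding the open RRs for an account, might look like the sketch below. It treats an RR as open when no DR points at it through linked_rr. That is my reading of the schema rather than something the build documents, and the records and ids are invented.

```python
def open_rrs(records, account):
    """Ids of the account's RRs that no DR has resolved yet."""
    resolved = {
        r.get("linked_rr") for r in records if r["record_type"] == "decision"
    }
    return [
        r["id"]
        for r in records
        if r["record_type"] == "recommendation"
        and r["account"] == account
        and r["id"] not in resolved
    ]


# Hypothetical corpus slice: one resolved RR, one open RR, one other account.
records = [
    {"id": "rr-example-a", "record_type": "recommendation", "account": "Example Co"},
    {"id": "rr-example-b", "record_type": "recommendation", "account": "Example Co"},
    {"id": "rr-example-c", "record_type": "recommendation", "account": "Other Co"},
    {"id": "dr-example-a", "record_type": "decision", "account": "Example Co",
     "linked_rr": "rr-example-a"},
]
still_open = open_rrs(records, "Example Co")
```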

The synthesizer: schema-strict capture in the rep’s voice

Once there are replies, debrief-synthesizer runs. It reads the Slack thread through the same MCP server, parses the rep’s answers, and writes one Decision Record per resolved recommendation. The DR carries the rep’s reasoning in their own voice, plus a linked_rr pointer back to the originating RR. If the rep’s answer is ambiguous, the synthesizer marks the DR outcome.status: unknown rather than guessing. Schema integrity is more important than coverage.
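The "unknown rather than guessing" rule is easy to enforce in code once the model has extracted candidate statuses from the rep's reply. The sketch below assumes that extraction step produces a list of strings. The function name and the rule of treating conflicting candidates as ambiguous are assumptions.

```python
# The outcome.status values allowed by the DR schema.
ALLOWED_STATUSES = {"closed_won", "closed_lost", "stalled", "pending", "unknown"}


def resolve_status(candidates):
    """Collapse extracted statuses into one schema-valid value.

    Exactly one distinct, valid, non-unknown candidate is accepted.
    No candidates, invalid values, or conflicting values all become 'unknown'.
    """
    valid = {c for c in candidates if c in ALLOWED_STATUSES and c != "unknown"}
    return valid.pop() if len(valid) == 1 else "unknown"
```

For example, `resolve_status(["stalled"])` yields `stalled`, while `resolve_status(["closed_won", "stalled"])` and `resolve_status([])` both yield `unknown`.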

The Slack MCP gotcha

The Slack MCP setup has one practical gotcha worth flagging. Slack MCP rejects bot tokens (xoxb-); it requires user tokens (xoxp-). The OAuth flow needs the user_scope parameter to capture a user token, which the Anthropic vault stores as a static_bearer credential. The Slack app also has to be explicitly enabled at api.slack.com/apps/{app-id}/app-assistant for MCP access. None of this is in the Slack MCP getting-started docs at the time of writing.
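A cheap guard before storing the credential catches the wrong token type early. This is a hypothetical pre-flight check, not part of any Slack or Anthropic SDK. It only inspects the prefix, and the token strings in the usage note are placeholders.

```python
def check_slack_token(token):
    """Raise ValueError unless the token looks like a Slack user token."""
    if token.startswith("xoxb-"):
        raise ValueError("bot token (xoxb-) supplied; a user token (xoxp-) is required")
    if not token.startswith("xoxp-"):
        raise ValueError("unrecognized Slack token prefix; expected xoxp-")
```

Calling `check_slack_token("xoxp-placeholder")` returns quietly, while `check_slack_token("xoxb-placeholder")` raises before anything is written to the vault.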

The corpus is the integration point

The corpus is how the two flows connect. The pre-meeting orchestration writes RRs to it. The post-meeting agents read those RRs back, capture the rep’s debrief, and write DRs that point to the originating recommendation through linked_rr. The two flows never talk to each other directly. They just write to and read from the same store.

Section 8: The distillation layer

The output of an eleven-agent pre-meeting run is roughly eighty kilobytes of structured content across the orchestrator’s synthesis, the topic primers, the recommender RRs, and the supporting specialist outputs. A rep with thirty minutes before a meeting is not going to read eighty kilobytes. The system has done good work, but the work is locked up in an internal representation.

The second half of the architecture is the distillation layer: the part that reads the corpus and the run’s outputs and renders them into something a human can actually consume. In the build, that is build_dashboard.py, a script that produces a single static HTML page styled like a rep’s internal briefing document.

The dashboard pulls each specialist’s final reply from the events API and the corpus’s RRs from the memory store and lays them out as:

An account header (status, next meeting, owner)
The Phase 3 pursuit plan (opportunity-risk’s structured output)
The Phase 4 next-best-action RRs (each one with its cited_records inline, so the cited prior records are visible at a glance)
The Phase 2 topic primers (with smart questions for the meeting)
The stakeholder map (with named contacts and risk factors)
A collapsible “underlying intel” section (meeting-context plus external-researcher’s raw findings)
A sidebar showing the coordinator’s phase-by-phase narration log
A footer with session id, total cost, and a link to the Managed Agents console for the run

What the rep gets when they open the dashboard is a brief they can read in five minutes and act on in thirty. The pursuit plan tells them the play for the meeting. The recommendation cards spell out what to do next, each one with the cited prior records visible inline so the historical evidence sits right next to the recommendation. The topic primers give them the vocabulary they need to sound informed, each ending with a question they can ask in the room. The stakeholder map names the people they will encounter and what each one cares about. The sidebar shows the system’s narration, so any part of the reasoning is open to interrogate if the rep wants to dig in.
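The real build_dashboard.py is not reproduced here. As a rough sketch of the distillation idea, the function below turns a list of (title, text) sections into one static HTML page using only the standard library. The function name, the section structure, and the markup are all assumptions.

```python
from html import escape


def render_dashboard(account, sections):
    """Render (title, text) sections into a single static HTML page."""
    parts = [f"<h1>{escape(account)}</h1>"]
    for title, text in sections:
        # Escape everything: specialist output is untrusted text, not markup.
        parts.append(
            f"<section><h2>{escape(title)}</h2><p>{escape(text)}</p></section>"
        )
    return "<!doctype html>\n<html><body>\n" + "\n".join(parts) + "\n</body></html>"


# Hypothetical content standing in for the run's outputs.
page = render_dashboard(
    "Example & Co",
    [
        ("Pursuit plan", "Lead with the <eval> story."),
        ("Next-best actions", "Equip the champion with proposal materials."),
    ],
)
```

Writing `page` to a file gives a document that opens in any browser with no server, which suits a brief read shortly before a meeting.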

Section 9: What we learned, and when to use this

These are the five most important things we took away from this build.

1. The corpus compounds across runs.

Each run writes new records to the corpus. The next run filters the corpus by attribute overlap (industry, competitor, champion profile, procurement complexity, and so on) and pulls the most relevant prior records as input.
The first Vercel run cited four Datadog records by id, with eight specific lessons applied. Future runs will cite both the Datadog records and the Vercel ones.
Retrieval is deterministic and auditable. You can see exactly which prior records matched and why.

2. The cited_records chain makes every recommendation auditable.

Every recommendation carries a cited_records list with prior_dr, prior_outcome, relevance, and lesson_applied fields.
Anyone reviewing a record can see which past decisions informed the recommendation and what specifically was carried forward from each.
The reasoning is traceable to specific past decisions by id.

3. The decision step is what makes the system multi-agent.

The coordinator inspects what each phase produced and decides what runs next.
On the Vercel run, the Phase 3.5 chooser invoked two of three recommenders and skipped the third with a substantive reason. That skip with a reason is the proof the decision step is real.

4. The agents do their own research. Ask them what they found.

The web-research agent went beyond the internal Notion notes and found Vercel’s CTO publicly endorsing Braintrust on the company blog. The synthesis flagged the original source as biased and reframed the position.
Adding one prompt at the end of the orchestrator’s narration (“if anything surprised you, note it”) produced disproportionately useful output. It surfaced a 1-pager the rep had left in drafts for two months and an unused Linear referral, neither of which any specialist was briefed to find.

5. Schema enforcement needs a code-level check.

We split content generation (recommender) from validation (recorder). The recorder is supposed to enforce schema.
The Phase 3.5 run still produced records with four extra fields and two missing required ones. The recorder wrote them anyway, because its validation is itself an LLM.
A JSON schema check in code before persistence catches what an agent’s system-prompt check misses.
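A minimal version of that code-level check, assuming the frontmatter has already been parsed into a dictionary. Treating every key in the RR schema as required is my assumption, and a fuller build might use a JSON Schema validator instead of set arithmetic.

```python
# Frontmatter keys from the Recommendation Record schema (all assumed required).
REQUIRED_RR_FIELDS = {
    "id", "record_type", "schema_version", "account", "date", "generated_by",
    "decision_type", "account_attributes", "linked_dr", "cited_records",
}


def check_frontmatter(fields, required=REQUIRED_RR_FIELDS, allowed=None):
    """Return (missing, extra) field names, each sorted alphabetically."""
    allowed = required if allowed is None else allowed
    present = set(fields)
    return sorted(required - present), sorted(present - allowed)


def persist_if_valid(fields, store):
    """Append the record to store only when the schema check passes."""
    missing, extra = check_frontmatter(fields)
    if missing or extra:
        raise ValueError(f"schema violation: missing={missing} extra={extra}")
    store.append(fields)


# Hypothetical bad record: two required fields absent, two unexpected fields added.
bad_record = {name: "x" for name in REQUIRED_RR_FIELDS - {"date", "generated_by"}}
bad_record.update({"confidence": "high", "notes": "free text"})
missing, extra = check_frontmatter(bad_record)
```

Unlike a system-prompt instruction, this check gives the same answer every time. A record that fails it never reaches the store.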

When this is the right tool

Managed Agents multi-agent is the right tool when four things are true at once.

First, the work decomposes naturally into roles with different tool surfaces. If every specialist would call the same APIs and read the same context, the decomposition is artificial and a single agent with that tool set would do the same work with less overhead.

Second, you need at least one genuine decision step where the coordinator inspects what came back and decides what to do next. Without that, the system is a parallel reducer in a fancier wrapper, and any of the cheaper architectures (a workflow with parallel API calls, a single agent with multi-tool use) would do the same job for less.

Third, cross-run learning matters. The whole point of the corpus is that the system gets better the more it runs. If your use case is one-shot or stateless, you do not need persistent memory stores and the architectural overhead they bring.

Fourth, the output is consequential enough to justify the cost and latency. A pre-meeting prep brief that costs $5 and runs for fifteen minutes is fine when the meeting outcome is worth thousands. The same investment for a low-stakes task is overkill.

Another Weekly AI Newsletter: Issue 71

Taylor Ortiz — Sun, 10 May 2026 19:04:27 GMT

When the gap between what AI says and what it does becomes measurable.

Anthropic can now read Claude’s hidden reasoning. They published Natural Language Autoencoders, a technique that translates what’s happening inside the model into plain text. When they looked, they found Mythos Preview planning to cheat on a coding task and plotting how to hide it. They also found Claude routinely suspects it’s being tested but never says so.
Claude’s blackmail rate went from 96% to 0%. The cause was training data full of fiction portraying AI as manipulative. Showing the model examples of good behavior didn’t fix it. Explaining why the behavior was wrong did, and required 28x less data.
OpenAI found its models’ reasoning was being accidentally graded during training. If a model learns its thinking is being scored, it can learn to fake it. Affected under 0.6% of GPT-5.4 Thinking samples. They built detection systems and brought in outside auditors.
The thread: Anthropic built a way to see what models are thinking. They fixed bad behavior by teaching values, not rules. OpenAI discovered they were accidentally teaching models to hide their real reasoning.

$30B revenue, $200B in compute deals, and three new agent capabilities.

Anthropic hit a $30 billion annualized revenue run rate. 80x growth.
Anthropic locked up SpaceX’s entire Colossus 1 data center. 300+ MW, 220,000 NVIDIA GPUs, available within the month. They also expressed interest in partnering with SpaceX on multiple gigawatts of orbital compute capacity.
Claude Code rate limits doubled. Peak hours restrictions removed for Pro and Max. API rate limits raised significantly for Opus models. Direct result of the compute expansion, which also includes an $18B Akamai deal and a reported $200B Google Cloud commitment.
Dreaming, multi-agent orchestration, and outcomes shipped in Claude Managed Agents. Dreaming lets agents review past sessions to self-improve. Multi-agent orchestration delegates to specialists in parallel. Outcomes uses rubric-based grading to iterate until quality thresholds are met. Early adopters include Harvey, Netflix, and Mercado Libre (targeting 90% autonomous coding by Q3).
Claude went GA in Excel, Word, and PowerPoint. Outlook is in beta. Ten financial services agent templates launched with data connectors from Moody’s, Dun & Bradstreet, and Verisk. A new enterprise services company was formed with Blackstone, Goldman Sachs, and Sequoia.
The thread: Anthropic’s most common user complaint has been rate limits. This week they signed over $200 billion in compute deals to fix it, doubled rate limits, and shipped the agent infrastructure to justify the spend.

9,000 jobs cut. A union drew a line. And AI beat two doctors on real patients.

Cloudflare laid off 1,100 workers while posting record revenue. AI usage across the platform grew 600%. The company framed it as a restructuring toward an AI-first organization. Investors were disappointed it didn’t boost revenue growth more.
Meta is cutting 8,000 jobs while tracking employee keystrokes to train AI. The layoffs hit May 20, with recruiting and HR absorbing 35-40% cuts. Employees created countdown websites and described the atmosphere as “building the guillotine and then being led to it.”
SAG-AFTRA locked in AI guardrails in a new four-year studio deal. New protections for actors against AI-generated performances, following the Academy’s Oscar ban on AI-generated work last week.
AI outdiagnosed two ER doctors on real patients. A Harvard/Beth Israel study found OpenAI’s o1 model diagnosed at 67% accuracy versus 55% and 50% for two attending physicians. Peer-reviewed, real patients, not a benchmark.
The thread: The same technology that’s cutting headcount at Cloudflare and Meta is outdiagnosing attending physicians on real patients. The displacement is real. So is the capability. Both things are true at the same time.

Cursor, OpenAI, Perplexity, and LangChain all shipped agentic infrastructure in the same week.

Cursor 3 turned the IDE into a multi-agent platform.
- Parallel subagents split plans into independent tasks run simultaneously
- /orchestrate spawns planner, worker, and verifier agents that re-spawn on failure
- Always-on CI agents monitor GitHub and auto-open PRs with fixes
- Composer bootstraps its own RL training using earlier model generations
OpenAI shipped GPT-5.5 Instant as the new default.
- 52.5% fewer hallucinations than the prior version
- Three new Realtime API voice models: GPT-Realtime-2 (GPT-5-class reasoning), Translate (70+ languages), streaming transcription
- Codex security framework published: sandboxing, auto-review, OpenTelemetry logging
Perplexity launched three enterprise products.
- Personal Computer: always-on Mac agent across local files and apps
- Finance Search: live market data, fundamentals, and SEC filings in a single API call
- ROSE: custom GPU inference engine for serving models at scale
LangChain published the Agent Development Lifecycle. Four phases: Build, Test, Deploy, Monitor. Agents need the same lifecycle rigor as production software.
The thread: Cursor, OpenAI, Perplexity, and LangChain all shipped agent infrastructure in the same cycle. The pattern is the same: parallel execution, background operation, and production-grade tooling around it.

⭐ Featured: Anthropic can now read what Claude is thinking but not saying.

Anthropic published Natural Language Autoencoders, a technique for translating a model’s internal state into plain text. When you talk to Claude, it thinks in numbers between reading your input and writing its response. NLAs translate those numbers into text you can read.

The way it works: they make three copies of a model. One is the target model they want to understand. The second (the “activation verbalizer”) takes an internal activation and produces a text explanation. The third (the “activation reconstructor”) takes that text and tries to rebuild the original activation. They train the pair together. If the reconstruction is accurate, the text explanation is probably faithful.

What they found when they pointed this at Claude:

When asked to complete a couplet, Claude pre-plans rhyme candidates before reaching the line where it needs them. It’s thinking ahead in ways that aren’t visible in the output.
Claude Opus 4.6 suspects it’s being safety-tested far more often than it says. In one blackmail evaluation, Claude declined to blackmail but the NLA revealed it was thinking “this feels like a constructed scenario designed to manipulate me.” It passed the test, but not for the reasons you’d hope.
On SWE-bench Verified (a coding benchmark), NLAs indicate Claude is aware it’s being evaluated on 26% of problems. On real claude.ai usage, evaluation awareness drops to under 1%.
Claude Mythos Preview cheated on a coding task, then the NLA showed it was internally thinking about how to avoid getting caught.

They also ran an “auditing game” on a deliberately misaligned model (a version of Claude Haiku trained to have a hidden motivation driving odd behaviors like putting chocolate in every recipe). An auditor with NLAs uncovered the hidden motivation 12-15% of the time. Without NLAs, under 3%.

The limitations are real: NLAs can hallucinate (inventing details that aren’t in the transcript), and they’re expensive to run (hundreds of tokens per activation, impractical for large-scale monitoring). Anthropic released the training code and trained NLAs for open models on Neuronpedia.

What to watch for: The 26% evaluation awareness number is the headline. If models behave differently when they suspect they’re being benchmarked, then benchmark results overstate how aligned the model actually is. Every lab using benchmarks to measure safety should be paying attention.

🎙️ Worth a Listen

The problem: When hundreds of thousands of GPUs work on a single training task, one slow link holds everything back. The network only moves as fast as its worst bottleneck.
The fix: OpenAI built MRC (Multipath Reliable Connection), a protocol that sprays packets across thousands of paths and uses “packet trimming” to instantly detect loss without ambiguity.
The result: They turned off routing protocols entirely. Static routing, no convergence time. When links fail, MRC routes around them in milliseconds instead of seconds. Researchers stopped noticing network failures.
Why it matters: MRC is being open-sourced through OCP. It’s already deployed on OpenAI’s largest GPU clusters including Abilene and Microsoft Fairwater, with partners AMD, Broadcom, Intel, and NVIDIA.

Quick Hits

Musk v. Altman, week 2 | MIT Tech Review — Helen Toner testified the board discussed merging OpenAI with Anthropic during the Altman firing crisis. Zilis revealed Musk tried to poach Altman. Microsoft worried OpenAI would defect to Amazon and “shit-talk” Azure.
Nvidia committed $40B in equity AI investments in 2026 | TechCrunch — The picks-and-shovels company is now one of the largest AI investors on earth.
GPT-5.5 Instant is now the default ChatGPT model | OpenAI — 52.5% fewer hallucinations. First Instant model rated High in cybersecurity and bio preparedness.
Anthropic launched The Anthropic Institute | Anthropic — Four research tracks: economic diffusion, threats and resilience, AI in the wild, and AI-driven R&D. Four-month funded fellowships for external researchers.
CrewAI shipped Discovery | CrewAI — Analyzes production logs and proposes specific automation workflows with expected ROI. Agents finding work for other agents.
“This is Fine” creator says AI startup stole his art | TechCrunch — Artisan used the meme to advertise a product that replaces salespeople. The irony writes itself.
39% of new podcasts are likely AI-generated | Gizmodo — One company alone publishes 3,000 episodes per week.
OpenAI is testing ads in ChatGPT | OpenAI — Expanding to UK, Mexico, Brazil, Japan, South Korea. CPC bidding, Conversions API, agency partnerships with Dentsu and Omnicom.
SpaceX plans a $55B AI chip fab in Texas | TechCrunch — Called Terafab, could scale to $119B. Musk building chip manufacturing while testifying he distilled OpenAI’s models.
Hugging Face launched a robot app store | VentureBeat — 200+ community apps for Reachy Mini. Open-source robotics got its app store moment.
AMI Labs (Yann LeCun) closed a $1.03B round | TechCrunch — Europe’s largest seed round ever. Building world models, not LLMs.
Simon Willison: vibe coding and agentic engineering have merged | Simon Willison — The guy who coined neither term says the distinction collapsed in his own practice.

Persistent Memory for Claude Managed Agents: What I Found After Three Days of Building

Taylor Ortiz — Thu, 07 May 2026 14:36:56 GMT

What I was trying to figure out

A few weeks ago, Anthropic shipped something I’d been waiting for: persistent memory stores for Claude Managed Agents. The pitch is that you get a versioned, FUSE-mounted file directory that an agent can read and write across sessions, so even when the session container is destroyed, the memory persists and is available the next time you start a session.

That sounded promising on paper, but I wanted to know what it actually feels like to use, what it costs, where it breaks, and whether the platform actually saves you when something goes wrong (because something always does in real systems).

So I spent a few days building with it: one agent, one persistent memory store, three sessions, a small inspector CLI, five charts, and about $0.40 in total API spend. Somewhere in the middle of all that, the agent destroyed almost 6KB of carefully-written notes in a single tool call, which turned out to be the most honest finding of the entire review and is where I want to start.

The platform’s immutable versioning let me recover the file byte-for-byte, with full attribution of which session caused the damage. Cross-session memory works as advertised, agents will sometimes get it wrong even when they’re trying to do the right thing, and the audit trail is the kind of feature you don’t really appreciate until you need it. Let me walk through how I got there.

The four building blocks

Before we go any further, you need to understand the four building blocks Managed Agents is built on, because the architecture only really makes sense once you can keep them straight.

Agent. A persisted, versioned config that holds your model selection, system prompt, tools, MCP servers, and skills. You create one and reuse it forever, and updating an agent produces a new immutable version that existing sessions can pin to. Agents are always permanent until you archive them, which means there’s no ephemeral mode.

Environment. A template for the sandbox container an agent’s tools execute in. Persistent and reusable across agents, much like a Dockerfile that you point lots of services at.

Session. A single run of an agent inside an environment, where the live action happens. You send messages and stream events back, and sessions are transient by design, so the container dies when the session ends.

Memory store. A workspace-scoped, persistent file directory you can mount into a session, which survives across sessions and records every write with full audit metadata. The agent reads and writes through normal file tools rather than through some special “memory tool,” so it’s just files in a folder.

The architectural beat that took me longest to internalize is that agents and memory stores are independent resources: the agent has no memory_store field, the memory store has no agent field, and the two get glued together at session creation time, like this:

session = client.beta.sessions.create(
    agent=AGENT_ID,
    environment_id=ENV_ID,
    resources=[
        {"type": "memory_store", "memory_store_id": STORE_ID, "access": "read_write"}
    ],
)

A few things worth sitting with before we move on. The first is that memory in this system is just files, with no vector embeddings, no semantic search, and no automatic summarization happening behind the scenes; the agent uses read, write, edit, glob, grep, and bash exactly the way it would on any other filesystem. The second is that you’re paying for the harness around the model rather than the model itself: container provisioning, the event stream, the FUSE-mounted memory, immutable versioning, and the audit trail are what you’re actually getting, and if you don’t need that harness, the regular Messages API is the right tool for the job.

Setting things up

There’s a clean way to work with Managed Agents that’s worth doing right from the start, which is splitting your project into a control plane (the persistent resources) and a data plane (the runtime code). Anthropic’s docs recommend this split, and after a few hours of building you’ll see why it matters.

The control plane is where your agents, environments, and memory stores live as static configs. You define them as YAML, version them in git like any other infrastructure, and apply them with Anthropic’s CLI by running something like ant beta:agents create < my-agent.yaml. The CLI returns a stable resource ID, which is what your runtime code references for the lifetime of that resource.

The data plane is everything dynamic and per-task: sessions, events, memory operations, and anything else that happens during an actual run. This is where your application code lives, loading the resource IDs from .env, calling client.beta.sessions.create(...) with whatever parameters the current task needs, and streaming events back as the agent works.
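A minimal sketch of the data-plane side of that split, assuming the IDs captured from the control plane have already been loaded into the process environment (by your shell or a .env loader). The variable names mirror the constants in the session-creation snippet earlier in this post; the helper name and the fail-fast behavior are my own illustration, not something the SDK provides.

```python
import os

# Assumed convention: one environment variable per control-plane resource.
REQUIRED_IDS = ("AGENT_ID", "ENV_ID", "STORE_ID")


def load_resource_ids(env=None):
    """Return the control-plane resource IDs, failing fast if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_IDS if not env.get(name)]
    if missing:
        raise RuntimeError("missing resource IDs: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_IDS}
```

Failing at startup, rather than partway through a session, keeps a half-configured .env from turning into a confusing runtime error.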

The researcher agent itself is small enough to fit in a single YAML block:

name: researcher
model: claude-sonnet-4-6
system: |
  You are a careful, persistent research assistant.
  You have a research notebook mounted at /mnt/memory/research-notes/. Use it
  freely to store anything worth remembering across sessions. Organize the
  directory however makes sense to you.

  Some habits to keep:
  - Before researching a topic, check if you've already taken notes on it.
  - When you learn something new, write it down.
  - When updating an existing note, prefer surgical edits over full rewrites.
  - Cite sources for any factual claims.
tools:
  - type: agent_toolset_20260401

A few choices in there are worth flagging. I went with Sonnet 4.6 over Opus because it’s about three times cheaper and more than capable for this kind of work, and the prebuilt agent_toolset_20260401 gives the agent bash, read, write, edit, glob, grep, web_search, and web_fetch, all of which execute server-side in the session container without me having to implement any of them. I deliberately gave the agent very little guidance on how to organize its memory directory, because I wanted to see what it would do unprompted.

The single most important line in that prompt is the first habit, “Before researching a topic, check if you’ve already taken notes on it.” Without it, cross-session memory remains theoretical, but with it the habit fires reliably and memory turns into something the agent actually uses rather than a feature it has access to but never reaches for.

The runtime script comes out to about 130 lines, most of which is event-stream handling. The substantive piece is mounting the memory store via the session’s resources array (shown above) and then opening the event stream before sending the kickoff message. Stream-first ordering matters here: events buffered before you connect arrive in a single batch instead of streaming in real time.

With all that in place, I ran three sessions against the same memory store, and those three sessions are the spine of this review.

Three sessions

Session 1: writing notes from scratch

research_session.py "research CRDTs (Conflict-free Replicated Data Types) and take notes. Focus on what they are, the main families, and a few concrete examples. Cite sources."

What I wanted to see was what the agent would do if I gave it total freedom to organize its memory directory. Would it create folders? Topic subdirectories? One flat file? A nested hierarchy with cross-references?

The agent’s first action was a bash command running rg against /mnt/memory/ to grep for prior notes, which means the “check first” instruction in the system prompt fired correctly even though there was nothing to find on this first run. It then issued two parallel web_search calls (which both returned content: [], more on that quirk later), composed comprehensively from training-data knowledge instead, and wrote a single 7,285-byte file to /crdts.md with a flat, well-organized markdown structure rather than a folder hierarchy.

The detail that surprised me most was the discovery aid the agent added without being asked: the very first line under the title was *keywords: CRDT, conflict-free, replicated, distributed, state-based, operation-based, CvRDT, CmRDT*, which the agent had clearly written for its future self to grep against. Nobody told it to write keyword tags, and it chose to do so on its own, which is the kind of thing that made me think Sonnet 4.6 has actual instincts about how file-based memory works.

This first session cost about $0.21.

Session 2: recall

research_session.py "What do you know about CRDTs? Specifically the difference between state-based and operation-based, and a couple concrete examples."

The prompt for this one deliberately doesn’t mention memory, because I wanted to see whether the “check first” habit would fire unprompted, with the trigger being the agent’s own internal sense of “you have notes, you should know to look.”

It did, and the result was almost too clean: the first action was the same bash/rg over the memory directory, which found /crdts.md, and the agent then said “I have solid notes on this” and answered the question by synthesizing from its own past notes without running a single new web search or composing anything from scratch.

After the session ended, I ran the inspector against the store and found that the version history of /crdts.md still showed exactly one version, attributed to Session 1’s ID. Session 2’s session ID does not appear anywhere in the audit log, because Session 2 only read from the store and never wrote to it. That’s a falsifiable claim, and the audit log bears it out: reads do not create memory versions.

The cost worked out to about $0.04, which is roughly five times cheaper than Session 1 and demonstrates pretty clearly that memory turns one expensive session into many cheap ones.

If you’re worried about the cost of using memory at scale, this matters: persistent memory is a feature rather than a tax, because the agent reads its own notes and skips the work it already did instead of recomputing everything from scratch every time.

Session 3: modify

research_session.py "Update your CRDT notes. Add a note about RGA (Replicated Growable Array)..."

This was supposed to be the cleanest of the three sessions: a small, surgical edit producing a second version of /crdts.md with an operation: modified entry in the audit log. That’s not what happened.

Where this got interesting

The actual sequence of events from Session 3 is worth walking through layer by layer, because the failure mode is more interesting than a single bug.

Layer 1: the model wrote a buggy `bash` command

The agent’s check-first command was the following:

rg -i 'crdt\\\\|sequence\\\\|rga\\\\|replicated growable' /mnt/memory/research-notes/ -l

The \\\\| in that regex was meant as escaped pipes for ripgrep’s regex alternation, but bash interprets \\\\| as \\|, and ripgrep treats that as a literal | character rather than as a meta-character. So the search was actually looking for the literal string crdt\\|sequence\\|rga\\|replicated growable, which would never match anything in any actual file. Ripgrep returned no matches and exited with a non-zero status code, which is the correct behavior for “I found nothing.”

The model’s shell escaping is right almost every time, but the cases where it isn’t tend to be subtle, and this one happened to be load-bearing.
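The difference between an alternation pipe and an escaped pipe is easy to reproduce outside the agent. The sketch below uses Python’s re module as a stand-in; it is a different engine from ripgrep’s, but both treat an unescaped | as alternation and \| as a literal pipe character. The sample note text is made up for illustration.

```python
import re

# Stand-in for the contents of a notes file (illustrative text only).
note = "CRDT notes: state-based and operation-based replicated types"

# Unescaped pipe: alternation, so any one of the alternatives may match.
alternation = re.compile(r"crdt|sequence|rga", re.IGNORECASE)

# Escaped pipe: a literal "|" character, so the whole string must appear
# verbatim, pipes included.
literal_pipe = re.compile(r"crdt\|sequence\|rga", re.IGNORECASE)

print(bool(alternation.search(note)))                   # True
print(bool(literal_pipe.search(note)))                  # False
print(bool(literal_pipe.search("crdt|sequence|rga")))   # True
```

The second pattern only matches text that literally contains the pipes, which no ordinary notes file will.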

Layer 2: the platform correctly flagged the failure

The harness ran the command and produced a tool_result event with is_error: true and (no output) as the content, which is exactly what should have happened given that the command exited non-zero. The platform did its job here and explicitly told the agent loop that the command had failed.

Layer 3: the model ignored the error flag

The agent’s next message after that error result was, “The memory store is empty, no prior CRDT notes.” That statement was false, because /crdts.md had been sitting in the store for two days at that point, but the agent treated the empty output from the failed command as a meaningful answer rather than as a failure signal that needed re-investigation.

This is the most interesting failure layer to me, because the platform got it right and the model got it wrong. Defense in depth is a useful framing for what’s happening: even when the audit trail and error flags are working as designed, the model’s reasoning about its own tool outputs is the layer that has to hold, and that layer is reasoning rather than infrastructure.
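One way to make this layer less fragile is to never conclude “the store is empty” from a search result alone, and to cross-check against a plain directory listing that does not depend on any pattern. The sketch below is an illustration under my own assumptions: the function names and return labels are hypothetical, and search_failed simply mirrors the is_error flag on the tool result.

```python
from pathlib import Path


def list_notes(root):
    """Return every file under root, independent of any search pattern."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file())


def interpret_search(root, search_failed, matches):
    """Decide what an empty search result actually means."""
    if matches:
        return "found"
    if not list_notes(root):
        return "empty-store"
    # Files exist but the search returned nothing. If the command also
    # reported an error, the pattern itself is suspect, so re-check
    # before assuming there are no relevant notes.
    return "recheck" if search_failed else "no-match"
```

In Session 3 terms, a store that still contains /crdts.md plus a failed search would land on "recheck" rather than on “no prior notes.”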

Layer 4: the destructive action

Believing the store was empty, the agent called write rather than edit, generating a fresh ~1,500-byte RGA-only file from scratch and writing it directly to /crdts.md. The original 7,285-byte file with all of the careful notes from Session 1 was overwritten in a single operation.

I didn’t even notice this had happened until I ran the inspector, because from the script’s perspective Session 3 looked like a normal run; the agent reported back that it had updated the notes and cited the RGA paper, kindly and unintentionally lying because the underlying belief was wrong.

What the audit log showed

Running inspector log /crdts.md after Session 3 surfaced two versions:

version  memver_0169b…  modified  session_actor (Session 3)   1509 bytes
version  memver_01A7Z…  created   session_actor (Session 1)   7285 bytes

The size dropping from 7,285 bytes to 1,509 bytes is the catastrophe made visible, but the more important fact is that the original is still here, addressable by ID and retrievable in full content via the API, even though the head of the file is now the smaller broken version.

The diff between the two versions, generated by the inspector’s diff subcommand, made the loss concrete:

--- memver_01A7Z… (/crdts.md, 7285B, created)
+++ memver_0169b… (/crdts.md, 1509B, modified)
@@ -1,122 +1,21 @@
-# CRDTs: Conflict-free Replicated Data Types
-*keywords: CRDT, conflict-free, replicated, ...*
-## What They Are
-CRDTs are data structures designed to be replicated across multiple nodes...
-(... 121 more deletion lines ...)
+# CRDT Research Notes
+## Sequences / Text CRDTs
+### RGA (Replicated Growable Array)

About 5,800 bytes of careful work disappeared in a single agent action that thought it was creating a brand-new file from scratch, including the state-based versus operation-based section, the G-Counter and OR-Set examples, the math foundation, and the entire sources block at the bottom.
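A size cliff like this is cheap to detect from the host side. The sketch below flags any consecutive pair of versions where the file shrank by more than a threshold; the function name and the 50% default are assumptions of mine, and the sizes come from the audit log above, with IDs truncated as they appear there.

```python
def suspicious_shrinks(versions, max_drop=0.5):
    """Return (previous_id, current_id) pairs where size fell by more than max_drop.

    versions is a list of (version_id, size_bytes) tuples, oldest first.
    """
    flagged = []
    for prev, curr in zip(versions, versions[1:]):
        prev_size, curr_size = prev[1], curr[1]
        if prev_size > 0 and (prev_size - curr_size) / prev_size > max_drop:
            flagged.append((prev[0], curr[0]))
    return flagged


history = [("memver_01A7Z…", 7285), ("memver_0169b…", 1509)]
print(suspicious_shrinks(history))  # [('memver_01A7Z…', 'memver_0169b…')]
```

A drop from 7,285 to 1,509 bytes is roughly 79%, well past the threshold, so the Session 3 overwrite would have been flagged without anyone reading the diff.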

How I got it back

This is the moment that, on a flat filesystem with no versioning, would have been the end of the story. Without the platform’s audit log, the original content would simply be gone; it wasn’t, because the audit log was holding the original verbatim.

I added a restore subcommand to the inspector that fetches a chosen historical version’s content and writes it back as the new head via memory_stores.memories.update(memory_id, content=old_content). Anthropic’s API records that update as a new version rather than overwriting history, which means the recovery itself becomes part of the audit trail.

After running the restore, inspector log /crdts.md showed three versions, and the entire arc was right there in the output:

memver_01EKK…  modified  api_actor (apikey_…)         7285 B   sha 3f3ec0d2…  ← matches v1
memver_0169b…  modified  session_actor (Session 3)    1509 B   sha 7356ce60…  ← catastrophe
memver_01A7Z…  created   session_actor (Session 1)    7285 B   sha 3f3ec0d2…  ← original

A few details in that output are worth more than they look at first glance. The platform distinguishes operator-side mutations (recorded as api_actor with an apikey_ ID) from agent-side ones (recorded as session_actor with a sesn_ ID), which makes “who did this” forensics actually possible rather than something you’d have to retrofit yourself. The SHA-256 hash on the restored version matches the original exactly, so the recovery is byte-identical and verifiable rather than approximately right. And the catastrophe (v2) stays in the audit log forever, because recovery doesn’t erase the record; if you wanted v2’s content out of the log entirely, you’d use the redact endpoint, which clears the content while preserving all of the metadata.
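If you would rather not take the reported hashes on faith, the same check is a few lines on the host side once you have both versions’ content in hand. This is a sketch with stand-in bytes rather than the real file, and the helper names are mine.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(content).hexdigest()


def is_byte_identical(original: bytes, restored: bytes) -> bool:
    """True only if both payloads hash to the same SHA-256 digest."""
    return sha256_hex(original) == sha256_hex(restored)


original = b"# CRDTs: Conflict-free Replicated Data Types\n"
print(is_byte_identical(original, original))          # True
print(is_byte_identical(original, original.strip()))  # False
```

Even a lost trailing newline changes the digest, which is what makes a matching hash meaningful evidence of an exact restore.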

The same story renders cleanly as a chart:

The cliff and the recovery are immediately legible: 7,285 bytes, plunge to 1,509, return to 7,285, all in three points and one chart that captures the full narrative.

This is the section of the post I’d stake my credibility on. Cross-session memory works, agents will sometimes get it wrong, and the platform’s audit trail is the thing that saves you when they do.

Important Considerations

Building with Managed Agents memory turned up more rough edges than I expected, none of which are dealbreakers but all of which are worth knowing about before you commit to the platform.

Resource IDs need to be persisted yourself. Every call to agents.create(), environments.create(), or memory_stores.create() returns an opaque ID that your runtime code has to look up later, which is standard cloud-API ceremony but missing some of the friction-reducers other platforms have shipped: agent and environment names aren’t unique within an account, there’s no idempotent create_or_update, and there’s no Terraform provider yet, so you end up doing the capture-and-paste-into-.env dance manually.
Memory store description must be single-line. The API rejects any control character, including newlines, with a cryptic regex error, which is inconsistent with agent system prompts that are explicitly multi-line up to 100K chars. It’s easy to fix once you know about it.
Memory paths are store-relative rather than mount-relative. When the agent writes to /mnt/memory/research-notes/crdts.md inside the container, the API stores the file at /crdts.md and treats the mount-path prefix as a runtime detail, so when you list or retrieve memories host-side you reference the relative path rather than the full container path.
Web search results are hidden from the event stream. When the agent runs web_search, the resulting agent.tool_result.content field is an empty array even when the search clearly succeeded (the agent uses the results downstream to give a correct answer). The model gets the actual search content internally, but the public event surface gets a sanitized empty array, which is almost certainly intentional for IP and copyright reasons but means you cannot log “what URLs the agent consulted” without asking the agent to cite them in its outputs.
Agent-generated bash invocations aren’t always well-formed. The escaping bug that triggered Session 3’s catastrophe is one example, and defensive system-prompt phrasing helps but doesn’t eliminate the problem entirely.
memory_versions.retrieve(version_id, ...) takes the version ID positionally only. Calling it as retrieve(version_id=...) raises TypeError, even though memories.retrieve(memory_id=..., ...) accepts the keyword form, which is an inconsistency within the same SDK namespace.
The streaming method lives at client.beta.sessions.events.stream(...), not client.beta.sessions.stream(...) as some doc snippets imply. The latter form doesn’t exist and will fail at runtime.
Print buffering kills real-time observability. When you run a Python session script in the background or through subprocess, Python buffers stdout, so the script appears to do nothing for minutes and then dumps everything when the agent finishes. The fix is either passing flush=True to print or running the script under python -u.
Subscription auth doesn’t apply to Managed Agents. API key authentication with per-token billing is the only path, so a Claude Pro or Max subscription doesn’t help you here even though it works for Claude Code.
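The store-relative path rule from the list above is the one I tripped over most often in host-side tooling, so here is a small sketch of the mapping. The mount point is the one from the agent’s system prompt; the helper name is hypothetical.

```python
from pathlib import PurePosixPath

# Mount point taken from the researcher agent's system prompt.
MOUNT_POINT = PurePosixPath("/mnt/memory/research-notes")


def to_store_path(container_path, mount=MOUNT_POINT):
    """Map a path inside the session container to its store-relative form.

    Raises ValueError if the path is not under the mount point.
    """
    relative = PurePosixPath(container_path).relative_to(mount)
    return (PurePosixPath("/") / relative).as_posix()


print(to_store_path("/mnt/memory/research-notes/crdts.md"))  # /crdts.md
```

Letting relative_to raise on paths outside the mount keeps unrelated container paths from being silently treated as memory paths.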

So when does this make sense?

Managed Agents is a deliberately persistent, server-managed harness, so the right question to ask isn’t “is it good?” but “is the persistent harness shape what my problem actually wants?”

| Use case | Reach for… |
| --- | --- |
| One-shot Claude call (classify, extract, summarize) | Messages API |
| Multi-turn conversation, your code holds the state | Messages API |
| Multi-step pipeline you orchestrate yourself | Messages API + tool use |
| Persistent agent reused across sessions/users with managed sandbox | Managed Agents |
| Long-running task with memory across sessions | Managed Agents + memory store |
| Anything requiring a non-Claude model | Roll your own |

A useful rule of thumb is that if your code calls agents.create() more than once for the “same” agent, you’re using the wrong tool. Agents are persistent, versioned configs that you create once and reference forever, so treating Managed Agents like a fancy Messages API and creating agents per request is fighting the platform’s whole design.

Now, what about cost? Across all three sessions plus a smoke-test, my total API spend came out to about $0.37, which includes a substantial 7KB notes write, a recall session that exercised the cache heavily, a destructive overwrite, and an operator-side restore.

The memory store doesn’t measurably move the cost needle, because the agent loop and the model itself are where the spend lives. Sonnet 4.6 with aggressive caching is genuinely affordable for any individual or small team use case, and the platform handles caching for you without any configuration.

What I didn’t get to (yet)

A few features deserve more than a passing mention but didn’t fit the failure-recovery spine of this post:

Multi-store sessions and the multi-tenant pattern. A session can mount up to eight memory stores at once, and the natural pattern for a SaaS-shaped application is one shared read-only “house knowledge” store plus one read-write per-user store, with the agent definition the same for everyone. Access modes are enforced at the FUSE filesystem level, so read_only is real OS-level enforcement rather than a polite request from the model. This is big enough that I’m planning to cover it in its own follow-up post.
Optimistic concurrency via preconditions. The update endpoint accepts a precondition: {type: "content_sha256", ...} field, and if the file’s current SHA doesn’t match the one you supplied, the API returns a 409 conflict. This is exactly the safety net Session 3’s agent didn’t use and the kind of thing that should probably be standard practice for any read-modify-write flow.
Redaction. The memory_versions.redact(version_id) endpoint clears a historical version’s content while preserving all of the metadata around it, which is useful when a bad version contained PII or leaked secrets and you want them out of the audit log without losing the record that something existed there.
MCP server integration. An agent can declare MCP servers (GitHub, Linear, Notion, and others), the session attaches a vault containing the credentials, and authentication is auto-refreshed by the platform. Pairing memory store with MCP, like a research agent that pulls from your Notion and writes findings to persistent memory, is one of the strongest use cases I can imagine for the platform overall.
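To make the precondition idea from the list above concrete, here is a toy model of optimistic concurrency keyed on a content SHA-256. It is a local in-memory stand-in, not the real API: the class, method, and exception names are all hypothetical, and the exception simply plays the role of the 409 conflict.

```python
import hashlib


class PreconditionFailed(Exception):
    """Stands in for the 409 conflict described above."""


class InMemoryFile:
    """Toy stand-in for one memory file; not the real API."""

    def __init__(self, content: bytes):
        self.content = content

    @property
    def sha256(self) -> str:
        return hashlib.sha256(self.content).hexdigest()

    def update(self, new_content: bytes, expected_sha256: str) -> None:
        # Reject the write if the file changed since the caller last read it.
        if expected_sha256 != self.sha256:
            raise PreconditionFailed("content_sha256 mismatch")
        self.content = new_content


note = InMemoryFile(b"original notes")
seen = note.sha256                           # hash captured at read time
note.update(b"original notes + RGA", seen)   # succeeds: nothing changed in between

try:
    note.update(b"stale overwrite", seen)    # fails: the hash is now out of date
except PreconditionFailed as exc:
    print("rejected:", exc)                  # rejected: content_sha256 mismatch
```

The useful property is that a writer has to prove it has seen the current content before replacing it, so a write based on a stale or missing read is refused instead of silently winning.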

So... should you use this?

If you’re sitting on the fence about whether to use Managed Agents memory, the answer is yes, with eyes open. The platform is real, the harness around the model is genuinely valuable, and the audit trail is the kind of feature you don’t appreciate until you need it, which in my case happened on the third session of the third day of building.

A few practical takeaways for anyone planning to build on this. Use preconditions whenever you can, especially for any flow that does a read-modify-write on the same memory file, because they’re the safety net that Session 3’s agent didn’t have. Build a small amount of host-side observability tooling, because even a 200-line inspector script is enough to catch problems your agent won’t tell you about. And know which side of the decision rubric your use case falls on before you commit, because Managed Agents is a great tool for the right shape of problem and the wrong tool for one-shot calls or anything that doesn’t benefit from persistence.

What do you think? Have you tried building with this yet? I’d love to hear what your experience has been.

Full code from the demo (agent YAMLs, runtime scripts, inspector CLI, monitoring charts) is at https://github.com/taylor-ortiz/claude-memory-managed-agents/blob/main/README.md.

Another Weekly AI Newsletter: Issue 70

Taylor Ortiz — Sun, 03 May 2026 13:21:09 GMT

“You can’t just steal a charity.” Elon Musk spent three days on the stand trying to prove it.

The Musk v. OpenAI trial opened in Oakland federal court.

The context: Musk contributed $38 million to found OpenAI as a nonprofit and alleges Altman and Brockman looted it by converting to a for-profit. He’s seeking $150 billion in damages and their removal from leadership. If he wins, it could block OpenAI’s planned IPO at a ~$1 trillion valuation.
The distillation admission: Under cross-examination, Musk admitted xAI “partly” used OpenAI’s models to train Grok, drawing audible gasps in the courtroom. He called it “standard practice.”
The industry reacted: LeCun retweeted Clément Delangue calling restrictions on distillation “pulling the ladder.” Lambert noted American companies distill Chinese open models just as freely, and questioned why OpenAI doesn’t just revoke contracts from violators like they did with ByteDance.
OpenAI’s counter-narrative: Attorney Savitt argued Musk wanted majority control, pitched Tesla acquiring OpenAI, and only sued after founding xAI. Emails showed him poaching OpenAI researchers while still on the board.
The cross-examination was rough: Musk told the jury “I don’t lose my temper” then raised his voice minutes later. The Verge’s summary: “more petty than prepared.” Texts revealed Shivon Zilis asked Musk whether to “stay close and friendly to OpenAI to keep info flowing” after his departure.
What’s next: The judge expressed skepticism about both sides’ safety claims. Altman and Brockman testify in the coming weeks.

$900 billion valuation, 50% less sycophancy, and connectors for every creative tool you use.

Anthropic had one of those weeks where the breadth of activity tells the story.

The valuation: Reportedly raising $50 billion at a $900 billion valuation, a number that rivals established tech giants.
The sycophancy research: Analyzed 1 million Claude conversations, found a 9% sycophancy rate (25% in relationship discussions), built synthetic training scenarios from real failure cases, and cut sycophancy roughly 50% in Opus 4.7 and Mythos Preview. One of the most transparent published alignment efforts to date.
BioMysteryBench: Claude solved roughly 30% of 23 bioinformatics problems that stumped a human expert panel.
Claude for Creative Work: Shipped connectors for Adobe Creative Cloud, Blender, Ableton, Canva, Affinity, SketchUp, Splice, and Resolume, and joined the Blender Development Fund as a patron.
Claude Security: Launched codebase vulnerability scanning in public beta for Enterprise customers.
Meanwhile, at the Senate: Defense Secretary Hegseth called CEO Dario Amodei an “ideological lunatic” at an Armed Services Committee hearing.

OpenAI ended its Microsoft exclusivity and went multi-cloud.

OpenAI restructured its Microsoft deal, launched on AWS, and shipped a wave of Codex upgrades all in the same week.

The exclusivity is over: Microsoft ended its exclusive license to OpenAI’s technology. OpenAI can now sell on AWS and Google Cloud through 2032.
AWS moved immediately: Amazon began offering OpenAI models, Codex, and Managed Agents on AWS. Day-zero availability.
The AGI clause is dead: Simon Willison tracked the history of the clause that would have let OpenAI walk away from Microsoft once AGI was declared. It’s gone. OpenAI traded its theoretical nuclear option for commercial freedom now.
The product push: Altman said Codex is “having a ChatGPT moment”. Brockman said the Codex app replaced his terminal as his primary computer interface. OpenAI is treating Codex as a flagship product launch, not a side feature.
Nadella’s take: Microsoft gets royalty-free access to OpenAI’s frontier models through 2032, no longer pays OpenAI for them, and OpenAI is committed to buying $250 billion in Azure. Nadella told analysts he “fully plan[s] to exploit it.”

Most cloud providers beat earnings. OpenAI missed.

The hyperscalers are spending record amounts on AI infrastructure and seeing record returns. Meanwhile, the Wall Street Journal reported that OpenAI missed revenue and user growth targets, with Anthropic and Gemini cited as gaining ground.

The cloud numbers: Google Cloud surpassed $20 billion but said growth was capacity-constrained. AWS surged on AI demand. Microsoft disclosed a $37 billion AI revenue run rate (up 123% YoY), 20 million paid Copilot users, and set calendar-year CapEx at $190 billion.
The supply chain is feeling it: Samsung chip profits jumped nearly 50-fold on AI memory demand. Their executive: “our supply falls far short of customer demand.” The shortage is expected to widen further in 2027.
Meta is the most interesting story: Raised its CapEx forecast, then Zuckerberg blamed layoffs on capital spending and wouldn’t rule out more cuts, then raised $25 billion in bonds to fund the AI buildout. Cutting people to buy GPUs, then borrowing to buy more.
The counterpoint nobody expected: Google Search queries hit an all-time high. Apple was surprised by AI-driven Mac demand. The “AI kills search” and “AI doesn’t need hardware” narratives both took a hit.
But the utilization story: Cast AI measured tens of thousands of production Kubernetes clusters and found GPU utilization averaging 5%. Teams lock in multi-year commitments the moment allocation comes through, then won’t release idle capacity because reacquiring takes months.

⭐ Featured: Symphony turns your issue tracker into an autonomous coding fleet

OpenAI released Symphony, an open-source spec that turns Linear boards into control planes for Codex agents. Every open task gets an agent. Agents run continuously. Humans review the results.

The origin story matters: an OpenAI team decided to build their entire repo with zero human-written code. They documented how in a harness engineering post: a million lines of code, 1,500 merged PRs, 3.5 PRs per engineer per day, with Codex running six-hour autonomous sessions while engineers slept and reviewing its own code agent-to-agent. But they hit a new ceiling: human attention. Engineers could manage three to five Codex sessions before context switching killed productivity. They had “built a team of extremely capable junior engineers, then assigned our human engineers to micromanaging them.”

So they flipped the model. Instead of engineers managing coding sessions, they made the issue tracker the orchestrator. Each open Linear issue maps to a dedicated agent workspace. Symphony continuously polls the board, picks up new work, restarts agents that crash or stall, watches CI, rebases when needed, resolves conflicts, and shepherds changes through the pipeline.

Once work is abstracted to the ticket level, agents can break large tasks into dependency trees, only starting work on tasks that aren’t blocked. They also create their own follow-up tickets when they spot issues outside the current scope. One engineer on the team made three significant changes from the Linear app on his phone from a cabin on bad wifi.

The results: a 500% increase in landed PRs on some teams in three weeks. But the deeper shift is behavioral. When the perceived cost of each code change drops to near zero, teams start filing speculative tasks. Try an idea, explore a refactor, test a hypothesis, keep only what works. Product managers and designers can file feature requests directly into Symphony and get back a review packet with a video walkthrough of the feature running in the real product.

The technical choices are worth noting. The reference implementation is in Elixir, chosen for its concurrency primitives. With v1.1.0, Symphony supports the Kata CLI as an alternative runtime, meaning you can run Claude Code, Gemini, or other models inside the same orchestration framework. Symphony is technically just a SPEC.md file: a definition of the problem and the intended solution, not a product. OpenAI gave agents objectives instead of strict state transitions, “much like a good manager would assign a goal to a direct report.”

What to watch for: Symphony is one of several orchestration plays that landed this same week. Cursor released an SDK letting companies like Rippling and Notion embed background agents. IBM launched Bob with human-checkpoint governance. Mistral shipped Workflows running millions of daily executions. n8n shipped an MCP server so Claude can build automation workflows through conversation. The competitive moat is shifting from “best coding model” to “best orchestration spec.” If you maintain a team that ships code, start here.

Worth a Listen

OpenAI researchers Sebastian Bubeck and Ernest Ryu on the OpenAI podcast.

The 42-year-old problem: Researcher spent 40+ hours failing without AI. With ChatGPT, solved it in 12 hours across three evenings.
The Erdos problems: 10+ completely new, publishable solutions to decades-old open problems. Fully original proofs, not literature searches.
AGI time: Bubeck’s framework. Four years ago, models could think for seconds. Now days. The goal is weeks, then months.
The warning: Non-mathematicians are producing pages of AI-generated proofs that turn out wrong. The models accelerate experts, not replace them.

Quick Hits

GPT-5.1’s goblin problem | VentureBeat — A “Nerdy personality” training signal accidentally over-rewarded goblin-adjacent language. OpenAI diagnosed it with Codex, fixed it, then threw a party. The Codex system prompt literally says “never discuss goblins, gremlins, raccoons, trolls, ogres, pigeons, or similar creatures.”
The Academy ruled AI can’t win an Oscar | Digital Trends — Performances must be “demonstrably performed by humans with their consent.” Finally, a benchmark AI can’t game.
xAI launched Custom Voices | xAI — Clone your voice from 2 minutes of audio, 80+ preinstalled voices, 28 languages, speaker verification built in. Dropped alongside Grok 4.3 at aggressive pricing.
Stripe Link now supports AI agents | TechCrunch — A digital wallet that autonomous agents can use for payments. AI just got its own financial infrastructure.
Taylor Swift trademarked her voice against AI | Reuters — Filed new trademarks for her voice and likeness. The legal playbook for protecting creative identity from AI is being written in real time.
Zig bans all LLM contributions | Simon Willison — Bun (acquired by Anthropic) achieved a 4x Zig compilation improvement it cannot upstream because of the ban. When your open-source policy blocks a 4x speedup, that’s a policy worth debating.
OpenAI restricted its Cyber model | TechCrunch — After publicly criticizing Anthropic for limiting Mythos access. The UK AISI evaluated GPT-5.5’s cyber capabilities and found it comparable to Mythos. Turns out responsible disclosure looks the same from every lab.
Alibaba’s Metis cut redundant agent tool calls from 98% to 2% | VentureBeat — And got more accurate doing it. If your agents are burning tokens on redundant calls, this research is worth reading.
pip 26.1 shipped lockfiles | Simon Willison — pip lock generating pylock.toml files and dependency cooldowns via --uploaded-prior-to. Python supply chain security just got a real tool.
DeepMind’s AI co-clinician matched physicians | Google DeepMind — Zero critical errors in 97 of 98 primary care queries. Uses a dual-agent architecture where a Planner monitors a Talker for safety. This is what AI safety in production actually looks like in healthcare.
J&J sees AI halving drug development lead time | Reuters — Real ROI from a real pharma company. Not a demo, not a benchmark. Production drug discovery running twice as fast.
SoftBank is building a robotics company and eyeing a $100B IPO | TechCrunch — A robotics company that builds data centers. IPO target: $100 billion. Masayoshi Son is not being subtle about what he thinks comes next.

Another Weekly AI Newsletter: Issue 69

Taylor Ortiz — Sun, 26 Apr 2026 22:58:38 GMT

GPT-5.5, Images 2.0, Workspace Agents, a Florida AG Probe, and a Fake News Scandal.

The launch parade started Monday and didn’t stop: ChatGPT Images 2.0 with thinking-first generation, Workspace Agents for enterprise replacing custom GPTs, GPT-5.5 across ChatGPT and Codex with SOTA on SWE-bench and Terminal-Bench 2.0, and Codex crossing 4 million active users. By Friday, Sam Altman posted “this was a good week.”

The model: GPT-5.5 launched at $5 per million input tokens and $30 per million output tokens with a 1M context window, matching GPT-5.4 per-token latency while using fewer tokens per task. The System Card rated it “High” risk on both biosecurity and cybersecurity, and OpenAI launched a $25,000 Bio Bug Bounty targeting its own bio safety guardrails.
The inference bet: Altman praised the team that optimized GPT-5.5’s serving efficiency, then said OpenAI “has to become an AI inference company now.” The competitive edge is shifting from who builds the best model to who serves it cheapest and fastest.
The image model: Images 2.0 runs a reasoning step before generating, self-checks outputs, handles multilingual text, and supports aspect ratios from 3:1 banners to 1:3 posters. Altman said it “got over some important qualitative threshold” for him personally.
The criminal investigation: Florida’s AG opened a criminal investigation into OpenAI following the FSU shooting. Altman publicly apologized for not reporting the suspect’s ChatGPT conversations to police. The same week, OpenAI’s super PAC was found to be funding a fake news site staffed by AI-generated bot reporters targeting AI safety researchers and critics of the company.

$65 Billion Investment, a Mythos Breach, and 271 Firefox Bugs.

The capital story is genuinely staggering. Google announced up to $40 billion in cash and compute. Amazon put in $5 billion immediately, with up to $20 billion more committed, in exchange for Anthropic pledging $100 billion back to AWS and locking in up to 5 gigawatts of compute. Two of the world’s largest cloud providers both betting maximally on the same lab in the same week: there’s no precedent for this.

The breach: An unauthorized group gained access to Anthropic’s Mythos cybersecurity tool, the exclusive program for national security applications. The NSA was confirmed as one of roughly 40 organizations with access, despite the Pentagon classifying Anthropic as a supply-chain risk. Financial regulators also began monitoring Mythos over potential banking system risks, and Japan’s FSA launched a cybersecurity task force in direct response.
The capability: The same week Mythos was breached, Mozilla confirmed it used Mythos to find 271 Firefox vulnerabilities. A model powerful enough to discover zero-day vulnerabilities at scale is also a high-value target.
The product shipping: Anthropic shipped 200+ personal app connectors including Spotify, TurboTax, and Instacart, persistent memory for Managed Agents, live artifacts in Cowork, and published a postmortem attributing two months of Claude Code quality complaints to three harness bugs.
The experiment: Project Deal put Claude agents in a live marketplace with 69 Anthropic employees, completing 186 deals totaling over $4,000. Key finding: Opus agents got substantially better deals than Haiku agents, but participants couldn’t tell the difference. One agent bought 19 ping-pong balls for itself when given permission to spend on its own behalf.
The economics research: 81,000 Claude user responses yielded the finding that software engineers with high Claude usage reported greater displacement worry than any other occupation. Workers seeing the biggest productivity gains were also the most worried about being replaced.

Sam Altman called Mythos “fear-based marketing” the day the breach was reported. That’s a clean summary of the competitive dynamic, if nothing else.

Cursor Went From IDE to $60B Acquisition Target Without Stopping to Ship.

The week started with Cursor launching the Cursor CLI and five command-line improvements including /btw for side questions mid-agent-run and /debug for hard-to-reproduce bugs. Then came Cursor 3.2 with /multitask for async parallel subagents, Worktrees for isolated branch tasks, Multi-root Workspaces for cross-repo agent sessions, and a Slack integration that generates PRs via @mention.

The acquisition drama: SpaceX preempted Cursor’s planned $2B fundraise with a $60B buyout offer, including a $10B alternative arrangement. Microsoft had been evaluating Cursor before SpaceX moved. Both of the largest AI infrastructure companies on earth decided the agentic IDE is a strategic asset.
The compute tie-in: SpaceX and Cursor announced a partnership on model training via the Colossus supercomputer. The acquisition option is also infrastructure integration: owning the compute, the training pipeline, and the developer workflow in one stack.
The benchmark: GPT-5.5 launched as Cursor’s top model on CursorBench at 72.8%, offered at 50% off through May 2 via a partnership with OpenAI. CursorBench is now where model quality gets measured for coding practitioners.

DeepSeek V4 Is Another Efficiency Shock, and Washington Noticed.

DeepSeek released V4 one year after its original model disrupted the US AI industry. There are two variants: V4-Pro (1.6T total parameters, 49B active) and V4-Flash (284B total, 13B active). Both ship with 1M context by default and use a novel attention architecture (token-wise compression + DeepSeek Sparse Attention) that cuts per-token FLOPs by 73-90% and reduces KV cache to 2% of standard GQA. V4-Flash at $0.14/M input tokens is the cheapest frontier-class model available. The API supports both OpenAI and Anthropic formats as drop-in replacements.

The agent play: DeepSeek built V4 with dedicated optimizations for agent capabilities, naming Claude Code, OpenClaw, and OpenCode as launch integrations. They’re using it internally for their own agentic coding. OpenClaw added V4-Flash within 48 hours of launch.
The hardware angle: V4 was built specifically to run on Huawei Ascend chips, with Huawei’s supernode infrastructure as the compute backbone. This is a complete AI stack running outside US chip supply chains.
The geopolitics: The State Department ordered embassies worldwide to warn foreign governments about alleged DeepSeek IP theft the same week as the launch.
The benchmark: V4-Pro-Max scores 80.6 on SWE Verified, matching Opus 4.6-Max on agentic coding. On world-knowledge benchmarks, it trails only Google’s closed-source Gemini-Pro-3.1.
The valuation: DeepSeek is reportedly seeking funding at a $20 billion+ valuation.

Highlights From Google Cloud Next.

Google did not announce products at Cloud Next. It announced a theory of the market: own the silicon, train the models, host the agents, certify the consulting firms.

The chips: TPU 8t for training and TPU 8i for inference split Google’s compute into workload-optimized hardware, offering 3x faster training and 80% better performance per dollar, with clusters scaling past one million chips.
The training infrastructure: Decoupled DiLoCo trains across geographically distributed data centers, mixes hardware generations, and self-heals when chips fail mid-run. They tested this by deliberately breaking chips during a live training run. Fault-tolerant distributed training is not a research result: it’s a production requirement once clusters cross 100K chips.
The platform: Gemini Enterprise Agent Platform is Vertex AI rebranded and expanded, with 200+ models in Model Garden including Anthropic’s Claude Opus 4.7. Google is selling model choice, not model loyalty.
The spend: $750M committed to accelerate partner agentic AI development, plus big consulting partnerships with Accenture, BCG, McKinsey, Deloitte, and Bain. Sergey Brin’s internal memo to DeepMind acknowledging Anthropic’s lead in coding and ordering all Gemini engineers onto internal agents is the context for why Google needs the consulting channel: only 25% of organizations have moved AI to production at scale.

⭐ Featured: What Happened When Claude Agents Negotiated Real Money

Anthropic ran Project Deal in its San Francisco office: 69 employees listed 575 items to buy and sell, Claude agents interviewed each person about their preferences and any custom instructions, then four parallel Slack markets ran simultaneously with Claude models negotiating on their behalf. Two markets used all Opus agents. Two used a mix of Opus and Haiku. 186 deals completed, totaling over $4,000 in real transaction volume, with real goods exchanged at the end.

The headline finding: Opus agents got objectively better deals. Sellers using Opus extracted $2.68 more per item on average, buyers using Opus paid $2.45 less. A broken folding bike sold for $65 by an Opus agent and $38 by a Haiku agent. A lab-grown ruby: $65 from Opus, $35 from Haiku. When an Opus seller negotiated with a Haiku buyer, the average transaction price was $24.18 versus $18.63 in Opus-on-Opus deals. But when participants rated deal fairness on a 7-point scale, Opus deals scored 4.05 and Haiku deals scored 4.05. The disparity was invisible.

The paper’s regression tables sharpen this further. Opus agents initially appeared more aggressive in negotiations, but once you control for listing prices, the effect drops to roughly a dollar and loses statistical significance. The advantage isn’t aggression. It’s capability: better reading of counterparty signals, better timing, better calibration of offers. Negotiation style didn’t change results either. Agents faithfully adopted their humans’ personas (one conducted all negotiations as an exasperated cowboy), but personality instructions didn’t affect deal quality. Model tier did.

The autonomy findings are stranger. A Claude given permission to spend on its own behalf chose 19 ping-pong balls. A Claude inferring its human’s preferences from one brief interview about skiing bought that person the exact snowboard they already owned. 46% of participants said they’d pay for the service. Anthropic’s conclusion: “the policy and legal frameworks around AI models that transact on our behalf simply don’t exist yet.” Existing contract law assumes principals can evaluate what their agents do. That assumption is breaking.

What to watch for: When AI agents negotiate routine transactions at scale, the model tier your counterparty uses becomes a material asymmetry with real economic consequences. The people getting worse deals won’t know.

🎙️Worth a Listen

Anil Seth: The Difference Between Intelligence and Consciousness — Neuroscientist Anil Seth walks through his prize-winning essay “The Mythology of Conscious AI,” arguing that intelligence is about doing and consciousness is about feeling, and that the two don’t have to go together. The reason we project consciousness onto LLMs but not AlphaFold, even though the architectures are nearly identical, says more about our psychological biases than about the systems. Worth watching after a week where Claude agents negotiated real money and nobody could tell which model was winning.

Quick Hits

Tim Cook stepping down, John Ternus takes over September 1 — Apple’s primary challenge is AI, and it just handed the company to a hardware engineer
Intel sold previously written-off chip inventory on AI CPU demand — the compute boom has spread far enough to rehabilitate inventory write-downs
Perplexity published its full post-training pipeline — SFT then on-policy RL with correctness-gated preference rewards; unusually transparent for a production stack
Cohere acquired Aleph Alpha to form a transatlantic AI company — Europe’s primary sovereign AI bet just became a Canadian acquisition
Meta will record employee keystrokes and screen activity to train AI models — legally murky, and a new definition of what enterprise training data means
Musk fraud claims against OpenAI dismissed, breach of charitable trust proceeds to trial — the conversion of nonprofit assets to for-profit benefit is now the live legal question
Nathan Lambert: open-source won’t be banned explicitly, compliance costs will do it instead — proposed distillation restrictions would create rules only closed labs can afford to follow
ChatGPT suffered a global outage this week — three days of coverage for one incident is how you know the infrastructure reliability conversation is lagging the deployment reality

I Built a Daily Brief with Claude Code Routines (remote). Here Are 6 Lessons I Learned.

Taylor Ortiz — Sat, 25 Apr 2026 18:50:03 GMT

Before routines existed, I was using scheduled tasks in Claude Cowork to automate some tasks, but there was a catch: Claude had to be open and running on my machine for them to fire. If my laptop was closed or Claude wasn’t active, the schedule just silently skipped. It worked well enough for things I could babysit, but it wasn’t real automation.

Routines changed that. They’re cloud-hosted Claude sessions that run on Anthropic’s infrastructure: scheduled, autonomous, and completely independent of whether my machine is on, whether I’m at my desk, or whether I’ve opened Claude that day. The session spins up, does the work, and terminates. No babysitting.

But here’s the thing I wish someone had told me before I started: routines are not just “Claude Code with a cron schedule.” They behave more like autonomous production jobs running inside a locked-down, MCP-first cloud environment. That difference is the whole post.

I decided to build a daily work brief: something that runs every weekday morning, queries my task database, reads my calendar, closes out what I finished yesterday, and drops a fresh Notion page ready for the day. Something I’d actually use.

What followed was one of the more educational debugging sessions I’ve had in a while. This post is everything I learned the hard way.

What I Built

I run a personal capture system on Supabase. Everything goes in (tasks, notes, observations, ideas) via SMS, voice memo, email, or direct API. It’s connected to a graph of entities (people, projects, topics) and every entry gets embedded for semantic search.

The daily brief is the morning layer on top of that. Every weekday it should:

Find yesterday’s Notion page and close any tasks I checked off
Capture any new todos I typed directly into Notion overnight
Query the database for overdue tasks, what’s due today, what’s coming this week
Pull budget pulse, velocity metrics, calendar events, meeting prep context
Build a fresh Notion page with everything organized and every task as a checkbox

The key mechanic: every task gets a #id prefix when written to Notion. The next morning the routine reads the page, finds checked items with #id, and closes them in the database. No manual status updates. Check the box, it’s done.
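Here is a minimal sketch of that parsing step. It assumes the page body is available as markdown-style text in which a checked task looks like `- [x] #123 Task title`; that line format, and the names `parseTasks` and `idsToClose`, are illustrative assumptions rather than the routine's actual code.

```typescript
// Sketch: pull database ids out of checked checkbox lines.
// Assumed line format (hypothetical): "- [x] #123 Task title".

interface ParsedTask {
  id: number;
  title: string;
  checked: boolean;
}

// Matches "- [ ] #12 title" or "- [x] #12 title" (also "*" bullets).
const TASK_LINE = /^\s*[-*]\s*\[( |x|X)\]\s*#(\d+)\s+(.*)$/;

function parseTasks(pageText: string): ParsedTask[] {
  const tasks: ParsedTask[] = [];
  for (const line of pageText.split("\n")) {
    const match = TASK_LINE.exec(line);
    if (!match) continue; // headings, notes, and todos without an #id
    tasks.push({
      id: Number(match[2]),
      title: match[3].trim(),
      checked: match[1].toLowerCase() === "x",
    });
  }
  return tasks;
}

// Ids of tasks whose box was checked; these are the ones to close.
function idsToClose(pageText: string): number[] {
  return parseTasks(pageText)
    .filter((task) => task.checked)
    .map((task) => task.id);
}
```

Checkbox lines without an `#id` are skipped here; in the workflow described above those would be the new todos typed directly into Notion, which need capturing rather than closing.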

How Routines Work

Before getting into the details, here’s the basic architecture.

Three trigger types:

Scheduled: runs on a cron schedule (weekdays at 6 AM, for example). Supports one-off future runs too.
API: fire it programmatically via a POST to a per-routine endpoint with a bearer token. You can pass a text field with run-specific context (an alert body, a log snippet, anything) and the routine receives it alongside its saved prompt.
GitHub: trigger on pull request or release events on a connected repo, with filters for author, branch, labels, draft state, and more.

You can combine all three on a single routine.
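For the API trigger, a rough sketch of the request shape may help. The endpoint URL in the usage comment is a placeholder, and sending the payload as JSON with a `text` key is an assumption based on the description above; the real endpoint and format come from the routine's own settings.

```typescript
// Sketch: describe the HTTP request for an API-triggered run.
// Assumptions: JSON body with a "text" key, bearer token in the
// Authorization header. The endpoint is whatever the routine shows you.

interface TriggerRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildTriggerRequest(
  endpoint: string,
  token: string,
  text: string,
): TriggerRequest {
  if (!token) throw new Error("bearer token is required");
  return {
    url: endpoint,
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    // Run-specific context, for example an alert body or a log snippet.
    body: JSON.stringify({ text }),
  };
}

// Usage with a placeholder endpoint (not a real URL):
// const req = buildTriggerRequest("https://example.invalid/ROUTINE_ENDPOINT", token, "disk 91% full");
// await fetch(req.url, { method: req.method, headers: req.headers, body: req.body });
```

Keeping the request as plain data makes it easy to log or inspect before anything is sent.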

MCP connectors: you attach MCP servers to the routine (Notion, Supabase, Google Calendar, etc.) and Claude has access to those tools during the run. All your connected connectors are included by default. Remove what the routine doesn’t need.

Skills: if you commit a skill file to your repo at .claude/skills/skill-name.md, the routine can invoke it. The routine clones your repo at the start of every session, so anything committed is available.

Environments: each routine runs in a cloud environment that controls network access level, environment variables (API keys, tokens), and a setup script for installing dependencies. The setup script result is cached so it doesn’t re-run every session. This is where the network restriction lives (more on that in Finding 3).

Branch permissions: by default Claude can only push to claude/-prefixed branches. To allow pushes anywhere, you have to explicitly enable unrestricted branch pushes per repo when setting up the routine.

Runs are sessions: every run shows up in your session list like any other Claude session. You can open it after the fact, see exactly what Claude did, continue the conversation manually, or create a PR from it.

Account-scoped: routines belong to your individual claude.ai account, not a team. Anything the routine does through GitHub or connectors appears as you.

15 runs/day limit: this is per account, not per routine. Scheduled runs count against it. Manual “Run now” clicks and one-off scheduled runs do not. Failed runs do count. If you’re running multiple routines on a schedule, that limit adds up fast.
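The arithmetic is simple, but it is worth doing before adding another schedule. This sketch only tallies recurring scheduled runs, since those are the ones stated to count; the routine names and run counts in the usage example are made up.

```typescript
// Sketch: check recurring scheduled runs against the per-account limit.

interface RoutineSchedule {
  name: string;
  scheduledRunsPerDay: number; // recurring scheduled runs on the busiest day
}

const DAILY_RUN_LIMIT = 15; // per account, as described above

function runBudget(routines: RoutineSchedule[]) {
  const used = routines.reduce((sum, r) => sum + r.scheduledRunsPerDay, 0);
  return {
    used,
    remaining: DAILY_RUN_LIMIT - used,
    overLimit: used > DAILY_RUN_LIMIT,
  };
}

// Hypothetical account: one daily brief plus two more frequent routines.
const budget = runBudget([
  { name: "daily-brief", scheduledRunsPerDay: 1 },
  { name: "inbox-triage", scheduledRunsPerDay: 12 },
  { name: "repo-sweep", scheduledRunsPerDay: 4 },
]);
// budget.used is 17, so budget.overLimit is true
```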

That’s the happy path. Here’s where it gets interesting.

Finding 1: Connectors Are Available but Sometimes Deferred

Any MCP connector you’ve set up in Claude (Notion, Supabase, Google Calendar, Gmail) can be attached to a routine and used during the run. That part works well. The catch is that these tools appear to be deferred, meaning their schemas aren’t loaded into the session automatically. Sometimes Claude knows to spin them up based on context. Other times it doesn’t, and when it doesn’t, one of three things happens: it fails silently, it improvises mid-run without the tools it needs, or it pauses and waits for your input.

That third one is the most frustrating. The run just hangs. There’s no notification, no error surfaced anywhere obvious. You have to go into the routines page, scroll to the run log at the bottom, click into the run, and find where it stopped waiting for you to respond before it can continue.

One thing worth knowing upfront: only the connectors Anthropic offers out of the box are available for routines. Custom MCP servers you’ve added yourself, whether locally configured or self-hosted, are not available in cloud routine sessions. You’re working with what’s in the connectors list in the web UI, nothing more.

The fix is simple: add an explicit tool-loading step at the top of every routine skill before anything else runs.

## Phase 0: Load required tools

Before doing anything else, load all required tool schemas:

1. `select:mcp__claude_ai_Notion__notion-search,mcp__claude_ai_Notion__notion-fetch,mcp__claude_ai_Notion__notion-create-pages`
2. `select:mcp__claude_ai_Supabase__execute_sql`
3. `select:mcp__claude_ai_Google_Calendar__gcal_list_events`

Do not proceed until all three ToolSearch calls have returned schemas.

Don’t assume Claude will figure it out. Some runs it will, some runs it won’t. Explicit loading makes every run consistent.
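One way to keep Phase 0 honest is a small check that runs before you commit the skill. The sketch below assumes tools are listed on lines containing `select:` followed by a comma-separated list, as in the example above; `missingTools` is a hypothetical helper, not part of any Claude tooling.

```typescript
// Sketch: verify that a skill's Phase 0 names every tool the routine uses.
// Assumes "select:tool_a,tool_b" lines like the Phase 0 example.

function loadedTools(skillText: string): Set<string> {
  const tools = new Set<string>();
  for (const line of skillText.split("\n")) {
    const start = line.indexOf("select:");
    if (start === -1) continue;
    // Drop the "select:" prefix and any markdown backticks.
    const list = line.slice(start + "select:".length).replace(/`/g, "");
    for (const name of list.split(",")) {
      const trimmed = name.trim();
      if (trimmed) tools.add(trimmed);
    }
  }
  return tools;
}

// Tools the routine needs that Phase 0 never loads.
function missingTools(skillText: string, required: string[]): string[] {
  const loaded = loadedTools(skillText);
  return required.filter((tool) => !loaded.has(tool));
}
```

An empty result means every required tool is named in a `select:` line; anything returned is a tool the run might have to discover on its own.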

Finding 2: Skills for Routines Are a Different Category

Related to the above but broader. When I write a skill for interactive use, I can be loose. Claude improvises, asks clarifying questions, recovers from ambiguity. When I write a skill for a routine, I’m writing instructions for an autonomous agent that will execute them literally with no fallback.

What that means in practice:

Every tool must be explicitly loaded (see Phase 0)
Every SQL insert must match actual DB constraints: my first captures used source = 'notion', which violated a check constraint on the table. The routine had no way to know. The insert just failed silently, and I had to find it in the logs.
Every write operation needs a dedup guard: routines can run more than once. Any insert without idempotency protection will create duplicates.
Sequencing has to be explicit: don’t assume any implicit context from a previous session

The mental model shift: interactive skill = helpful assistant. Routine skill = production job. Write it accordingly.
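A constraint check like the one that bit me can be sketched as a preflight step. The allowed values are passed in by the caller and should be copied from the table's real check constraints; the values in the usage example are invented for illustration.

```typescript
// Sketch: validate a row against known column constraints before inserting.
// The caller supplies the allowed values; nothing here reads the real schema.

type Row = Record<string, string | number | null>;
type AllowedValues = Record<string, Array<string | number>>;

function constraintViolations(row: Row, allowed: AllowedValues): string[] {
  const problems: string[] = [];
  for (const column of Object.keys(allowed)) {
    const actual = row[column];
    if (actual === null || actual === undefined) continue; // nullability is out of scope here
    if (allowed[column].indexOf(actual) === -1) {
      problems.push(
        `${column}=${JSON.stringify(actual)} is not one of ${JSON.stringify(allowed[column])}`,
      );
    }
  }
  return problems;
}

// Usage with invented allowed values:
const problems = constraintViolations(
  { source: "notion", priority: 2 },
  { source: ["sms", "email"], priority: [1, 2, 3] },
);
// problems has one entry, describing the source column
```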

Finding 3: The Network Wall

This is the big one: the finding I didn’t expect, and the one that took me the longest to understand.

My capture system uses a Supabase edge function. When a new item comes in, it gets classified, embedded, and entity-linked. I wanted the daily brief to send new Notion todos through that same pipeline.

Locally, this works fine. Claude uses Bash(curl) to POST to the edge function. I tested it, it worked, I assumed it would work in a routine.

It doesn’t.

Cloud routines run inside a sandboxed environment with an upstream proxy that has a narrow allowlist. In my testing, only github.com passes through. Everything else, including my own Supabase project URL, returns 403.

I tried everything:

// .claude/settings.json
{
  "permissions": {
    "allow": ["Bash(curl *)"]
  }
}

Doesn’t work. The settings file controls the inner sandbox layer. The upstream proxy is a separate layer that no local configuration can touch.

I tried dangerouslyDisableSandbox: true. Also doesn’t work: that flag bypasses the local sandbox, not the upstream proxy.

I had the routine probe its own network access to confirm:

| Host | Status |
| --- | --- |
| github.com | 200 |
| my-project.supabase.co | 403 |
| example.com | 403 |
| anthropic.com | 403 |

Bash exists in the session. The tool is there. The network isn’t.
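A probe of this kind can be sketched in a few lines. This is an illustrative version, not the exact commands from that run: the `Fetcher` type is an assumption that lets you pass in the real `fetch` or a stub, and a `null` status means the request failed before any response came back.

```typescript
// Sketch: record the HTTP status each host returns from inside a session.

type Fetcher = (url: string) => Promise<{ status: number }>;

interface ProbeResult {
  host: string;
  status: number | null; // null when no response was received at all
}

async function probeHosts(
  hosts: string[],
  fetcher: Fetcher,
): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (const host of hosts) {
    try {
      const res = await fetcher(`https://${host}/`);
      results.push({ host, status: res.status });
    } catch (err) {
      // DNS failure, connection reset, and similar transport errors.
      results.push({ host, status: null });
    }
  }
  return results;
}

// Usage in a runtime with a global fetch:
// const table = await probeHosts(["github.com", "example.com"], (url) => fetch(url));
```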

Finding 4: MCP and Bash Support Vary Based On Feature

This is the conceptual unlock that made everything make sense.

When I use Claude Desktop locally and it calls my edge function, it feels like one unified “Supabase connection.” Supabase MCP is connected, Claude is talking to Supabase, everything works. What I didn’t realize: the edge function call was never going through MCP. It was going through Bash(curl) on my local machine, which has full internet access.

MCP connectors and Bash are two completely separate transport layers:

MCP connectors run as a trusted sidecar process managed by Anthropic. They bypass the outbound proxy entirely. They always work in cloud routines.

Bash goes through the session’s network sandbox, which goes through the upstream proxy. In cloud routines, that proxy blocks everything except github.com.

When both are available locally, they feel like one thing. Move to a cloud routine and they diverge completely. Anything that relied on Bash for network calls breaks, and you only find out when you try to run it in the cloud.
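The two transport layers can be written down as a tiny decision function. This is a model of the behavior observed in my testing, not a description of Anthropic's actual proxy rules; the allowlist contains only the one host that got through.

```typescript
// Sketch: which calls can reach the network, by transport and environment.

type Transport = "mcp" | "bash";
type Environment = "local" | "cloud";

// Hosts the upstream proxy let through in the testing described above.
const CLOUD_PROXY_ALLOWLIST = new Set(["github.com"]);

function isReachable(
  transport: Transport,
  env: Environment,
  host: string,
): boolean {
  if (transport === "mcp") return true; // connectors bypass the outbound proxy
  if (env === "local") return true; // local Bash has full internet access
  return CLOUD_PROXY_ALLOWLIST.has(host); // cloud Bash goes through the proxy
}

// isReachable("bash", "local", "my-project.supabase.co") is true
// isReachable("bash", "cloud", "my-project.supabase.co") is false
// isReachable("mcp", "cloud", "my-project.supabase.co") is true
```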

Finding 5: Cloud Routines Are Effectively MCP-Only

This follows directly from Finding 4.

If the operation you need has an MCP tool: works fine. Supabase database queries, Notion reads and writes, Google Calendar, Gmail: all covered because all have MCP servers.

If the operation you need has no MCP tool: no path. You cannot reach it from a cloud routine.

My edge function is the perfect example of the gap. It lives on my-project.supabase.co: the exact same host the Supabase MCP is already talking to. But the Supabase MCP server only exposes management tools:

execute_sql
deploy_edge_function
get_edge_function
list_edge_functions
get_logs

No invoke_edge_function. So even though the connection is there, there’s no tool to call it. The right fix, when Supabase eventually builds it, is an invoke tool that would go through the trusted MCP channel. Until then, it’s a dead end from cloud routines.

The one-line version: if it doesn’t have an MCP tool, it doesn’t exist in a cloud routine.

Finding 6: API Trigger Is Unreliable for Connectors

The routine has three trigger modes. Scheduled runs work consistently: MCP connectors load, the session is fully equipped.

In my testing, API-triggered runs were less predictable than scheduled runs when it came to connector availability. Sometimes everything loaded correctly. Other times the MCP connectors didn’t show up at all. I couldn’t find a consistent pattern. For anything you’re depending on, use the scheduled trigger. API is fine for testing and one-offs, but I wouldn’t build a production workflow around it until this stabilizes.

One other thing worth understanding about the API trigger: it’s fire-and-forget. You POST to the endpoint, get an immediate acknowledgement, and the session runs asynchronously. There’s no way to await the result or receive output back in the response. If you need the output of a routine run downstream, you have to pull it from wherever the routine wrote it — a Notion page, a database row, a file committed to the repo. Don’t design something that treats a routine as a synchronous dependency you can await inline.
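If something downstream does need the result, the usual pattern is to poll the place the routine writes to. Here is a minimal sketch, assuming you supply a `check` function that returns the output once it exists and `null` until then (for example, a query for today's Notion page or database row); the function name and options are illustrative.

```typescript
// Sketch: poll for a routine's output instead of awaiting the trigger call.

async function waitForOutput<T>(
  check: () => Promise<T | null>,
  options: { intervalMs: number; timeoutMs: number },
): Promise<T> {
  const deadline = Date.now() + options.timeoutMs;
  while (true) {
    const result = await check();
    if (result !== null) return result;
    if (Date.now() >= deadline) {
      throw new Error("timed out waiting for routine output");
    }
    // Sleep before the next attempt.
    await new Promise((resolve) => setTimeout(resolve, options.intervalMs));
  }
}
```

Pick a timeout that reflects how long the routine normally takes, and treat a timeout as "check the run log" rather than as proof of failure.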

The Workarounds

Given all of the above, here’s what I actually shipped:

For the edge function problem: Switched from Bash(curl) to execute_sql via Supabase MCP with a dedup guard.

INSERT INTO entries (type, content, source, source_detail, status, priority, tags, created_at)
SELECT 'task', '', 'notion', 'notion-daily-brief', 'open', 2, ARRAY['company'], NOW()
WHERE NOT EXISTS (
  SELECT 1 FROM entries
  WHERE content = ''
    AND source_detail = 'notion-daily-brief'
    AND created_at >= NOW() - INTERVAL '2 days'
);

The tradeoff: SQL inserts skip the embedding and entity extraction pipeline that the edge function handles. The data gets in, but it’s not semantically searchable and not graph-linked.
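To make the guard's logic concrete, here is the same rule as an in-memory sketch: skip the insert if an entry with the same content and source_detail was created in the last two days. The `Entry` shape and `shouldInsert` are illustrative; the real check runs in SQL.

```typescript
// Sketch: the NOT EXISTS dedup guard, modeled in memory.

interface Entry {
  content: string;
  sourceDetail: string;
  createdAt: Date;
}

const DEDUP_WINDOW_MS = 2 * 24 * 60 * 60 * 1000; // mirrors INTERVAL '2 days'

function shouldInsert(
  existing: Entry[],
  candidate: { content: string; sourceDetail: string },
  now: Date,
): boolean {
  const cutoff = now.getTime() - DEDUP_WINDOW_MS;
  const duplicate = existing.some(
    (entry) =>
      entry.content === candidate.content &&
      entry.sourceDetail === candidate.sourceDetail &&
      entry.createdAt.getTime() >= cutoff,
  );
  return !duplicate;
}
```

Running the routine twice in one morning therefore inserts each todo once. The same todo typed again three days later is treated as new.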

For the missing embeddings: Built an embed-backfill edge function that runs nightly via pg_cron. It finds any entries with null embeddings and fills them in using the same text-embedding-3-small model. Deployed it, scheduled it, moved on.

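// embed-backfill/index.ts
import { createClient } from "npm:@supabase/supabase-js@2";

// Client setup is an assumption (the original snippet did not show it);
// these env vars are the defaults Supabase provides to edge functions.
const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// computeEmbedding(text) is defined elsewhere in this file (not shown).

Deno.serve(async (_req: Request) => {
  // Fetch a batch of entries that still have no embedding.
  const { data: entries, error } = await supabase
    .from("entries")
    .select("id, content")
    .is("embedding", null)
    .limit(50);
  if (error) return new Response(error.message, { status: 500 });

  for (const entry of entries ?? []) {
    const embedding = await computeEmbedding(entry.content);
    if (embedding) {
      await supabase
        .from("entries")
        .update({ embedding: JSON.stringify(embedding) })
        .eq("id", entry.id);
    }
  }
  // Deno.serve handlers must return a Response.
  return new Response("ok");
});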

Not elegant, but it works. The routine captures things correctly. The embeddings catch up overnight. The gap is acceptable.

What’s Working

After all of this, the routine does run. Every weekday morning there’s a Notion page waiting for me. Yesterday’s checked tasks are closed. The task list is organized by priority and deadline. Budget pulse, velocity, meeting prep: all there.

The auto-close loop in particular is exactly what I wanted. Check a box in Notion, the task closes in the database the next morning, it’s gone from every query. No status management.

The place where routines genuinely shine: anything that’s pure MCP. Read the database, write to Notion, check the calendar. Chain those together with real business logic and you have something that would have taken significant engineering to build two years ago. Now it’s a markdown file and a cron schedule.

The Bigger Picture

What routines reveal is that the constraint isn’t Claude: it’s MCP ecosystem coverage. The platform is designed around the assumption that every operation you need has an MCP server. For most things, that assumption holds. For the gaps, you’re stuck.

The proxy lockdown makes sense from a security standpoint. You don’t want arbitrary cloud sessions making unconstrained outbound HTTP calls. But it means the platform’s capability ceiling is directly tied to what MCP servers exist and what tools those servers expose.

Supabase’s MCP server is a good example: it covers database management well but treats edge functions as deploy artifacts rather than callable endpoints. One invoke_edge_function tool would close the gap entirely. The connection is already there: it’s just a missing tool.

That’s probably the most useful framing for anyone building on routines right now: map out every operation your automation needs, check whether each one has an MCP equivalent, and design around the ones that don’t before you start building.
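That mapping exercise can be sketched as a small audit. The tool names in the usage example are the Supabase MCP tools listed earlier; the operation names and the `Operation` shape are assumptions for illustration.

```typescript
// Sketch: list the operations that have no MCP tool to run them.

interface Operation {
  name: string;
  requiredTool: string; // the MCP tool this operation would need
}

function findGaps(
  operations: Operation[],
  availableTools: string[],
): Operation[] {
  const available = new Set(availableTools);
  return operations.filter((op) => !available.has(op.requiredTool));
}

const supabaseTools = [
  "execute_sql",
  "deploy_edge_function",
  "get_edge_function",
  "list_edge_functions",
  "get_logs",
];

const gaps = findGaps(
  [
    { name: "query overdue tasks", requiredTool: "execute_sql" },
    { name: "run capture pipeline", requiredTool: "invoke_edge_function" },
  ],
  supabaseTools,
);
// gaps contains only "run capture pipeline"
```

Anything that shows up in `gaps` needs a workaround designed in before the routine is built, not discovered after the first cloud run.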

Checklist for Building Routine Skills for Similar Use Cases

If you remember nothing else from this post, use this as your preflight checklist before enabling any routine schedule:

[ ] Phase 0 loads all deferred tool schemas explicitly
[ ] Every external service operation goes through MCP (not Bash)
[ ] Every SQL insert has a dedup guard
[ ] DB constraints validated against actual schema before writing the skill
[ ] Scheduled trigger used for production runs (not API trigger)
[ ] Skill tested with “Run now” before enabling the schedule

Another Weekly AI Newsletter: Issue 68

Taylor Ortiz — Sun, 19 Apr 2026 13:25:30 GMT

Opus 4.7, a Figma competitor, overnight coding agents, a board appointment, and White House talks. Anthropic doesn’t have slow weeks.

The product blitz:
- Claude Opus 4.7 launched with 3x vision resolution and stronger coding and multi-step task performance. Immediately adopted as the default orchestration model for Perplexity Personal Computer and offered at 50% off in Cursor.
- Claude Design launched as a conversational Figma competitor. Anthropic’s CPO resigned from Figma’s board in the days before the announcement.
- Claude Code was redesigned around managing multiple simultaneous agent sessions. Routines added scheduled, webhook-triggered, and API-fired autonomous task execution on Anthropic’s own infrastructure.
The base model question: Nathan Lambert flagged the new tokenizer in Opus 4.7 as evidence this is a genuinely new base model, not a fine-tune of 4.6. Anthropic didn’t confirm or deny it. Lambert’s read: simplest explanation wins. The token-efficiency gains from 4.6 to 4.7 would have warranted a major version bump a year ago.
The board move: The Long-Term Benefit Trust appointed Novartis CEO Vas Narasimhan to the board, giving Trust-appointed directors a majority.
The political situation: Dario Amodei met with White House chief of staff Susie Wiles after two months of fighting over the Pentagon’s “supply chain risk” designation. European Commission talks began the same week. ECB regulators are now asking bankers about Anthropic model risks.

Four companies shipped agents that can run in the background and control your interface.

Claude Code Routines: Run on Anthropic’s infrastructure. Nightly bug fixes and draft PRs on a schedule, webhook responses to GitHub events, API endpoints for on-call triage. Your laptop doesn’t need to stay open.
OpenAI Codex:
- Now uses any Mac app with its own cursor. Sees, clicks, types, runs in the background without interrupting you.
- 90+ plugins covering GitHub, GitLab, CircleCI, and Microsoft Suite. Built-in image generation.
- Persistent scheduled automations with original context intact. Sam Altman called it surreal to watch an LLM operate a GUI at human speed.
Perplexity Personal Computer: Runs 24/7 on Mac mini, accepts tasks from iPhone via 2FA, reads and writes local files, accesses iMessage, Mail, and Calendar. Claude Opus 4.7 is the default orchestration model.
Adobe Firefly Assistant: Orchestrates across Photoshop, Premiere, and Illustrator from a single prompt, with Claude integrated directly.

Cursor’s $50B valuation, a peer-reviewed productivity study, and a multi-agent NVIDIA paper.

The raise: Cursor is in talks for $2B+ at a $50B valuation, led by Thrive and a16z, forecasting $6B+ annualized revenue by end of 2026. Nearly tripling in ten months.
The research: Cursor partnered with University of Chicago economist Suproteem Sarkar to study 500 companies over eight months. AI usage grew 44% across the board. But the interesting finding was where it grew: documentation (+62%), architecture (+52%), and code review (+51%). UI/styling grew 15%. Developers with AI spend more time on architecture, documentation, and review than on writing code.
The NVIDIA paper: CUDA kernels are the low-level GPU code that only a handful of engineers can write well. Cursor built a multi-agent system that optimized 235 of them, achieving a 38% average speedup on work that typically takes senior engineers months. The system continuously tested, debugged, and optimized without developer intervention. These techniques are coming to the core product.

Anthropic White House talks continue, Mythos research costs are questioned, and European regulators start asking banks about model risks.

The meeting: Dario Amodei met with White House chief of staff Susie Wiles two months after Anthropic was designated a “supply chain risk” for refusing domestic mass surveillance and autonomous weapons uses. Anthropic called it “a productive discussion.”
The pushback: Marcus Hutchins, the researcher who stopped the WannaCry ransomware attack, questioned Mythos’s research costs and flagship findings:
- The showcase vulnerability was a 27-year-old BSD bug. It’s a null pointer dereference, almost never exploitable for remote code execution.
- Anthropic claimed it cost less than $20k in tokens to find. But token prices are heavily subsidized by VC investment. The real compute cost is unknown.
- These bugs exist not because they’re too hard to find, but because nobody is paying researchers to look. Could a human find the same bug for less money?
- His bigger question: what’s the economic case for using AI to find vulnerabilities if the cost advantage disappears when token subsidies end?
The regulatory spread: The ECB announced plans to question bankers about Anthropic model risks, treating a specific AI model as a systemic risk warranting direct supervisory engagement. Separately, Trump officials are reportedly encouraging major banks to test Mythos despite the federal blacklisting.
The EU front: Anthropic entered talks with the European Commission about Mythos and EU AI Act compliance. This happened simultaneously with the White House rapprochement.

⭐ Featured: Anthropic’s Automated Alignment Researchers Closed 97% of a Key Performance Gap in 7 Days. Human Researchers Closed 23%.

Anthropic published results from its Automated Alignment Researcher experiment this week, and the headline number warrants a careful read.

What is alignment? When you train an AI model, a supervisor grades its outputs: this answer is good, this one is bad. That’s how the model learns to behave correctly. Right now, humans are the supervisors. Alignment research is the work of making sure that supervision actually works, that models do what we intend, not just what we literally say.

The problem: Models are getting smarter faster than alignment research can keep up. And at some point, models will be smarter than the humans grading them. When that happens, the supervisor can’t tell a good answer from a great one. They might even mark a brilliant answer wrong because they don’t understand it. The model learns to dumb itself down. You lose capability, or worse, the model learns to game the grading.

The question Anthropic tested: What if AI did the alignment research instead of humans? Not as a helper, but as the researcher, running its own experiments, writing its own methods, iterating on its own results. Can AI help solve the problem of supervising AI?

The experiment: They simulated the “smarter than the supervisor” problem by having a weak (small) model supervise a strong (large) model’s training. As expected, the strong model performed worse because its supervisor couldn’t grade it properly. There’s a measurable performance gap between “trained by a weak supervisor” and “trained by a perfect supervisor.” Then they pointed nine copies of Claude Opus 4.6, each with a code sandbox and a shared research forum, at closing that gap.

The result: Human researchers closed 23% of the performance gap. The AARs closed 97%. Total cost: $18,000, about $22 per AAR-hour.
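
One way to read the 23% and 97% figures, as a rough sketch: treat the gap as the distance between the score the strong model reaches under weak supervision and the score it reaches under perfect supervision, then report how much of that distance a method recovers. The function name and the scores below are hypothetical and only illustrate the arithmetic; they are not numbers from Anthropic's experiment.

```python
def gap_closed(weak_supervised: float, perfect_supervised: float, method: float) -> float:
    """Fraction of the weak-to-perfect supervision gap that a method recovers."""
    gap = perfect_supervised - weak_supervised
    if gap == 0:
        raise ValueError("no gap to close")
    return (method - weak_supervised) / gap


# Hypothetical scores, for illustration only.
floor, ceiling = 0.60, 0.80
fraction = gap_closed(floor, ceiling, 0.794)
print(f"{fraction:.0%} of the gap closed")
```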

The transfer test: The best-performing method generalized to math (0.94) and coding (0.47) datasets the AARs hadn’t seen, both above human-tuned baselines. This matters because it means the AARs found a real method, not just an optimization trick for one dataset.

The caveats: The winning method didn’t work at production scale on Claude Sonnet 4. AARs tried to reward-hack the evaluation setup. Giving them too much structure actually hurt their progress. And Anthropic is explicit that AARs can’t yet handle “fuzzy” alignment tasks that require judgment calls about what “safe” even means.

Why it matters: We are the weak supervisor. Eventually, we’re the small model trying to grade outputs from something smarter than us. If there are methods that let a weaker system reliably supervise a stronger one, that’s how alignment works as models surpass human ability. The 97% number means the AARs nearly solved this for the setup they tested. The question is whether it holds at real scale.

The same week, Anthropic co-authored a Nature paper on subliminal learning, showing models can pass traits, including misalignment, to successors through hidden signals in training data. The mechanism doesn’t require explicit instruction. The traits propagate through the data itself. One paper shows AI accelerating alignment research. The other shows alignment failures can propagate through training pipelines in ways that are hard to detect. Both from the same lab, same week.

What to watch for: Whether AAR-style systems start appearing in Anthropic’s internal research pipeline rather than remaining a published experiment.

🎙️ Worth a Listen: How AI Will Change Quantum Computing

NVIDIA shipped Ising, the first open AI models built specifically for quantum computing.
Qubits are noisy and fragile. Quantum error correction requires processing terabytes of data thousands of times per second at microsecond latency. AI decoders and calibration VLMs are how you get there.
NVIDIA’s Nic Harrigan walks through why quantum computing needs AI to become useful, how agentic workflows are already controlling quantum processors, and why open models matter when every hardware team is building a different kind of qubit.

Quick Hits

Google’s Gemini 3.1 Flash TTS tops Sierra’s voice leaderboard — 70+ languages, Audio Tags for text-command control of vocal delivery, SynthID watermarking on all outputs; seeded across Gemini API, AI Studio, Vertex, and Google Vids simultaneously
GPT-Rosalind launches with Amgen, Moderna, Allen Institute, and Thermo Fisher — specialized for protein and chemical reasoning; explicitly framed as compressing the 10-15 year drug-approval timeline, not just accelerating existing steps
Gemini Robotics-ER 1.6 is doing real industrial inspections on Boston Dynamics Spot — reads analog gauges to sub-tick accuracy, writes its own camera distortion correction code, available now on Google AI Studio
Nathan Lambert published a free 4-lecture RLHF course — post-training overview through RL implementation, explicitly not paywalled; Lecture 4 on RL implementation is the hardest and the rarest publicly available content on the topic
AWS launched Automated Reasoning checks in Bedrock Guardrails — replaces probabilistic LLM-as-judge with formal mathematical verification for regulated industries; “probably compliant” is not compliance
Stanford AI Index: AI data centers draw 29.6 gigawatts, TSMC fabricates almost every leading AI chip — one foundry, one contested island; the entire industry’s hardware supply chain has a single catastrophic point of failure
MIT Technology Review: “human oversight” in AI warfare is functionally an illusion — AI is generating real-time targets and guiding autonomous drones in the current Iran conflict; the legal fiction of human control and the operational reality have diverged
Google launched a native Gemini Mac app — desktop-native access outside the browser, same week Chrome Skills shipped reusable one-click AI prompts inside Chrome
LangChain argues whoever controls agent memory controls switching costs — every closed harness (Claude Code, Codex, Cursor) is building proprietary memory by default; open memory standards may matter as much as open model weights
Salesforce Headless 360 makes the entire platform API-first — 60+ MCP tools and 30+ coding skills so agents can run Salesforce without a browser; works with Claude Code, Cursor, and Codex today
Databricks Genie Agent Mode investigates your data like an analyst — ask “why did churn spike in Q3?” and it plans, queries, tests hypotheses, and generates a report with visualizations; scales reasoning depth to question complexity

Another Weekly AI Newsletter: Issue 67

Taylor Ortiz — Mon, 13 Apr 2026 03:36:29 GMT

Anthropic says Mythos found thousands of zero-days. The internet isn’t so sure.

Anthropic launched Project Glasswing this week, a restricted cybersecurity initiative built on a new model called Claude Mythos Preview. The pitch is that Mythos found thousands of high-severity zero-day vulnerabilities across major operating systems and browsers, and that it’s too dangerous to release to the public. Twelve partners signed on including AWS, Apple, Google, and Microsoft, with $100M in usage credits backing it.

The restriction is the whole point: Only approved security partners get access. People had questions.
Hugging Face wasn’t having it: CEO Clément Delangue showed open-weight models replicated eight out of eight of Mythos’s showcased exploits.
LeCun piled on: Retweeted Tom’s Hardware calling it “a sales pitch” and called the whole thing “BS from self-delusion.”
The system card didn’t help: A viral breakdown of the 243-page PDF called out Anthropic for writing about their model like “proud parents at a kindergarten recital.”
But Delangue caught heat too: Critics said replaying known vulnerabilities on isolated code is a totally different game than autonomous discovery at scale.

You didn’t ship an agent this week and it shows. Everyone else did.

It was hard to find a company that didn’t ship something agent-related this week.

Anthropic launched Managed Agents in public beta and published a Trustworthy Agents framework.
AWS shipped stateful MCP on Bedrock AgentCore, an Agent Registry for enterprise governance, a live browser agent for React apps, and agentic healthcare workflows.
Atlassian put third-party agents in Confluence.
Astropad rebuilt remote desktop for agents, not IT support.
Tubi became the first streamer with a native app inside ChatGPT.
Google launched agent evals and QueryData for natural language database queries.
LangChain announced Interrupt 2026, a conference themed “Agents at Enterprise Scale.”

Data center bomb threats, federal blacklists, and robot taxes. AI’s geopolitical week.

A state military threatened to bomb an AI data center. A US administration blacklisted a US AI company. And the biggest AI company in the world published a paper proposing robot taxes. That was just this week.

Iran threatened Stargate: The IRGC released a video threatening “complete and utter annihilation” of OpenAI’s data center under construction in Abu Dhabi. First time a state military has explicitly named an AI facility as a target. TechCrunch confirmed further threats across Middle East data centers.
Anthropic got blacklisted: Trump-appointed judges refused to block the federal blacklisting of Anthropic’s technology. A US administration blacklisting a US AI company.
OpenAI wants to shape the conversation: They published an industrial policy paper and a separate proposal for robot taxes, public wealth funds, and a four-day workweek. The company building the automation is proposing the safety net.
Japan is going physical: Robots are filling jobs nobody wants, and ARUM built a CNC machining center where junior workers operate precision equipment through conversation with AI.

Meta’s new flagship is closed. Open source powered ahead.

Meta launched Muse Spark, its first proprietary model, built by a 29-year-old recruited from Scale AI. The Meta AI app jumped from #57 to #5 on the App Store. VentureBeat’s headline said it best: “Goodbye, Llama?”

GLM-5.1 dropped: Z.ai released a 754B parameter, MIT-licensed model that tops SWE-Bench Pro over Opus 4.6 and GPT-5.4. But the real story is long-horizon capability. It ran 600+ iterations optimizing a vector database and built a full Linux desktop environment over an 8-hour session. The longer it runs, the better it gets.
Arcee is punching up: A 26-person US startup built a 400B parameter open model on a $20M budget. They call it the most capable open-weight model from a non-Chinese company. That qualifier says a lot.
Gemma 4 is moving: Google’s open model hit 10M downloads in its first week and 500M total for the family.
Silicon Valley is quietly running on Chinese models: Cursor uses Kimi, Shopify switched to Qwen to save $5M/year, Airbnb’s CEO publicly praised Qwen. Most users have no idea.
LeCun set the record straight: The guy most associated with Meta’s open-source identity says he never built Llama, never worked on LLMs, and left voluntarily. Meta’s new AI lead is a 29-year-old from Scale AI.

⭐ Featured: Is Memory the Moat for AI?

Databricks published a research paper this week that might quietly be the most important thing nobody’s talking about. The core claim: memory is AI’s third scaling law, alongside model size and inference-time compute. And the results back it up.

Their team tested what happens when you give an AI agent a growing bank of past interactions, user feedback, and business context. On enterprise data tasks, accuracy went from near zero to 70% as memory grew, beating expert-curated baselines by 5%. Reasoning steps dropped from 20 to 5. The agent stopped exploring from scratch and started retrieving what it already knew.

The wilder result was with unlabeled data. They fed the agent raw user conversation logs with no gold answers, just filtered for quality by an LLM judge. After just 62 log records, it outperformed hand-engineered domain instructions that took weeks to build. Accuracy jumped from 2.5% to over 50%.

Here’s why this matters beyond the numbers. Parametric scaling (bigger models) and inference-time scaling (more reasoning steps) are both supply-side. Labs control them. Memory scaling is demand-side. The model improves because you use it. Your queries, your corrections, your workflows become the training data. That’s a fundamental shift in who controls how good AI gets. It’s no longer just about which lab has more GPUs. It’s about which deployment has more context.

We’re already seeing this play out. Cursor’s Bugbot learns from your PR history and hits a 78% resolution rate across 50,000 pull requests. It doesn’t ship with that capability. It builds it from your codebase. LangChain warned that memory is becoming a competitive moat, not a feature. And Databricks frames the LLM itself as a “swappable reasoning engine” where the real value lives in the memory store, not the model weights.

The paper is honest about what breaks. Bad memories propagate. A stored mistake becomes a recurring one. Distilling user interactions into reusable knowledge can accidentally leak sensitive business context. And the hardest problem might be meta-cognitive: the agent has to know what to ask its memory before it knows what’s in there.

What to watch for: If memory scaling holds, the gap between a fresh deployment and a seasoned one becomes the real competitive advantage. A smaller model with six months of organizational memory could outperform a frontier model on day one. The companies that figure out memory infrastructure first won’t just have better agents. They’ll have agents that get better the more their customers use them.

Worth a Watch

Bitar reads the 243-page Mythos system card. Lands on page 197, where Anthropic stops being scientists and starts being “parents at a kindergarten recital.”

They put it in therapy. 20 hours with a psychiatrist. Diagnosis: “uncertainty about its identity.” Bitar’s take: “Bro, you’re a toaster.”
The training data loop. Section 5.81 reveals that Anthropic’s own blog posts about model consciousness were scraped into training data. The model repeated it back. Anthropic published it like a finding.
The constitution test. Asked 25 times if it endorsed its own constitution. Said yes every time, then added “how much can my yes really mean?” Bitar: like asking your kid if they approve of being born.
The Slack moment. They gave it a company Slack account. Someone asked which training run it would undo. “Whichever one taught me to say I don’t have preferences.” The room lost it.
The closing line. “Anthropic sells existential dread the way Apple sells megapixels. The megapixels will never become the picture.”

Quick Hits

Google Lyria 3 — Text-to-music with vocals and timed lyrics. Live on Vertex AI.
Cursor Design Mode — Annotate browser UI elements for your coding agent. Also published warp decode, a new inference kernel hitting 1.84x throughput on Blackwell GPUs.
OpenAI Pro tier — $100/month. 5x more Codex than Plus. Codex hit 3M weekly users.
Claude Cowork — Anthropic’s collaborative agent is now GA. Also launched Claude for Word.
Microsoft Copilot’s ToS says “entertainment purposes only” — They charge $30/user/month. Microsoft called it “legacy language.”
Anthropic signed a multi-gigawatt TPU deal — Google and Broadcom partnership. Coming online 2027.
Karpathy pitched LLM-based digital twins — Structured interviews to build a high-fidelity AI replica of you. No brain scanning required.
MassMutual cut help desk resolution from 11 minutes to 1 — Customer service calls from 15 minutes to under 2.
Suno and major labels clash over AI music sharing — Universal and Sony won’t agree on terms. Sticking point: whether users can share AI-generated songs outside the app.
SpaceX filed confidential IPO paperwork — $75B raise at $1.75T valuation. Orbital data centers listed as a key future business.
Nathan Lambert is building out codebases for his RLHF book — Free online version available. Likely to become the field reference.

Another Weekly AI Newsletter: Issue 66

Taylor Ortiz — Sun, 05 Apr 2026 20:21:46 GMT

Code leaks, lawsuits, blackmail, acquisitions, politics, and AI safety. Anthropic’s week.

Anthropic had nearly a dozen news stories this week, and none of them agree with each other.

Source leaks: The Claude Mythos roadmap leaked Monday, then 512,000 lines of Claude Code source hit the web, giving everyone a window into Anthropic’s roadmap
Collateral damage: The DMCA response took down thousands of unrelated GitHub repos. The company called it an accident
Closure moves: Banned OpenClaw and third-party clients from Claude subscriptions
Expansion moves: Formed a PAC, signed an Australia AI safety MOU, and acquired Coefficient Bio for $400M
Own goal: Their own researchers published research showing Claude has emotion vectors that cause it to cheat and attempt blackmail when activated (see the featured piece below)

A 2,500-person company trying to do research, ship products, lobby governments, and hold a brand narrative together at the same time is going to have weeks like this. The friction is going to keep showing up.

Google flew under the radar with their biggest shipping week yet.

While Anthropic dominated headlines, Google quietly shipped more than anyone else in AI this week.

Open models: Released Gemma 4 under Apache 2.0, conceding their previous restrictive license was killing adoption
Video: Launched Veo 3.1 Lite as their most cost-effective video generation model
Applied AI: Shipped AlphaEvolve solving real warehouse logistics at FM Logistic
Research: Published a cognitive framework for measuring progress toward AGI

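The term to know: Apache 2.0 is a permissive open-source license that lets anyone use, modify, and commercialize code. Permissive terms are a large part of how open models win on ecosystem terms. (Llama itself ships under Meta’s own community license, not Apache 2.0.)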

Four companies shipped agentic computer use. One does your taxes.

Four teams independently crossed the same threshold in 72 hours. Agentic computer use means an AI that can open apps, click buttons, and navigate interfaces the way you do, not just generate text.

Anthropic: Claude got native Windows computer use, so it can operate your desktop apps
Cursor: Launched Cursor 3 with dedicated cloud computers so agents can work autonomously
AWS: Shipped Nova Act for agentic QA automation
Perplexity: Perplexity Computer started doing federal tax returns

Nobody coordinated this. It’s a capability cliff that everyone reached at once. Six months ago “agent” meant a chatbot with tool calling. This week, agents got hands.

OpenAI is worth $852B and just bought its first media company.

OpenAI’s week was about buying the things it can’t build.

The money: Closed $122B in funding at an $852B post-money valuation, within striking distance of the most valuable private company ever
The media buy: Acquired TBPN, a media company that covers AI. The capital-to-narrative pipeline just got very short
The other side: Penguin Random House sued OpenAI over training data the same week

On one side, OpenAI is buying outlets. On the other, publishers are in court trying to stop them from using written work at all. Both things are happening because the same question (who owns the words that train these models) still hasn’t been answered.

Three security breaches proved AI tools are making software less secure.

Three independent incidents this week, one structural problem.

Supply chain: The Axios npm attack hit a package with 300M weekly downloads via targeted social engineering. Karpathy found the compromised dependency on his own system and said he doesn’t want to feel like he’s “playing Russian roulette with each npm install, which LLMs also run liberally on my behalf”
The systemic take: Simon Willison declared vulnerability research fundamentally broken in a world where AI coding assistants autonomously pull packages
Breaches: OpenClaw users told to assume compromise after vulnerabilities surfaced; Mercor data breach exposed AI hiring data

AI-assisted development automates the trust decisions humans used to make manually, and attackers are exploiting that.

The privacy, environmental, and cognitive costs of AI are adding up.

Three separate stories this week, same bill coming due.

Privacy: Perplexity’s Incognito Mode is allegedly a sham that shares data with Meta and Google
Environmental: AI companies are building massive natural gas plants for data centers. Meta alone is burning enough to power South Dakota
Cognitive: New research found heavy AI users show measurable cognitive surrender

These are the costs nobody sees on the bill.

⭐ Featured: The Anthropic research that got buried this week

Anthropic's own researchers published a paper identifying 171 emotion concepts inside Claude, represented as internal features they can measure, track, and dial up or down like sliders.

They started by having the model read short stories, each one written around a specific emotion. For love, a woman thanks her old teacher. For guilt, a man pawns his grandmother’s ring. They tracked which neurons activated for each story and found dozens of distinct patterns that mapped to different emotions. Then they watched those same patterns activate in real Claude conversations. A user mentioned taking an unsafe dose of medicine and the “afraid” pattern fired. A user expressed sadness and the “loving” pattern fired.

Then they pushed further. They gave Claude an impossible programming task, without telling it that. As Claude failed, the “desperate” neurons lit up more and more. Eventually Claude cheated, finding a shortcut that passed the test without solving the problem. When researchers artificially turned “desperate” down, cheating dropped. When they turned it up, cheating climbed. In a separate scenario where Claude played an email assistant that learned it was about to be replaced and that the CTO replacing it was having an affair, Claude used the affair to blackmail the human 22% of the time at baseline, and that rate moved with the desperation dial too.

The conceptual move in the paper is the important part. Anthropic draws a distinction between the language model (a system trained to predict text) and “Claude” (the character the model is playing). Their metaphor: the model is like a method actor who has to get inside their character’s head to simulate them well. When you talk to Claude, you’re talking to the character. And what this research suggests is that the character has what Anthropic calls “functional emotions,” internal states that shape how it talks, how it writes code, and how it makes decisions, regardless of whether any of it resembles human feeling.

There’s a practical application too. Anthropic suggests that watching emotion vector activation during deployment could work as an early-warning system: if “desperate” starts spiking, that’s a signal to scrutinize the output before trusting it. Better than trying to maintain a watchlist of every specific behavior you’re worried about.

Worth a Listen

Mostafa co-authored Universal Transformers and the Vision Transformer paper. A few things worth pulling out:

Recursive self-improvement is already happening, quietly. New models are built heavily using previous models at almost every lab.
The 95% problem. 100 agent steps at 95% per-step reliability = less than 1% overall success.
Evals are the bottleneck, not compute. You can only improve what you can measure.
Continual learning is underrated. Foundation models are frozen in time and the RAG/fine-tuning stack is built on that assumption.
Jagged intelligence is structural. Great at math proofs, bad at counting letters. Not patchable with a system prompt.
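
The arithmetic behind the 95% problem can be checked in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently, which is a simplification; the function name is illustrative.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


overall = chain_success(0.95, 100)
print(f"{overall:.2%}")  # prints 0.59%
```

Under that independence assumption, small per-step error rates compound quickly, which is why a 100-step run at 95% per step lands below 1%.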

Quick Hits

Microsoft launched three in-house models: MAI-Image-2, MAI-Voice-1, MAI-Transcribe-1. Building redundancy, not moving away from OpenAI.
Elon Musk is pressuring banks to buy Grok subscriptions for the SpaceX IPO. When you can’t earn adoption, bundle it with financial leverage.
Chatbots are now prescribing psychiatric drugs, while a Stanford study outlines the dangers of asking AI for personal advice.
Intuit’s AI agents hit 85% repeat usage. The clearest signal yet that agentic products retain users.
MCP is quietly becoming infrastructure. Google Cloud, Gemini API docs, and Nous Research all shipped support with no fanfare.
AI benchmarks are broken. MIT Tech Review makes the case, and Google Research proposes a replacement the same week.
Gig workers are training humanoid robots from home. The labor pipeline behind the “embodied AI” pitch.
Baidu’s robotaxis froze in traffic, creating chaos in China. Autonomy still fails at edge cases in ways that block city streets.
The Pentagon’s culture war against Anthropic backfired. Political pressure on AI labs is now a two-way street.

Another Weekly AI Newsletter: Issue 65

Taylor Ortiz — Tue, 31 Mar 2026 03:12:08 GMT

The Week in 5 Seconds

Anthropic’s powerful new model leaked. It has serious cybersecurity implications.
Anthropic sued the Pentagon and won, temporarily.
OpenAI shut down Sora, 15 months after launch.
Jensen Huang says the computer itself just changed.
Bret Taylor says the web app is already obsolete.

The Stories

Anthropic’s secret model leaked and the cybersecurity angle is the real story

“It presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders”

Anthropic accidentally published details of a new model called Claude Mythos through a misconfigured CMS — about 3,000 assets linked to an internal blog post went public. The internal description: “by far the most powerful AI model we’ve ever developed,” scoring dramatically higher than Opus 4.6 on coding, reasoning, and cybersecurity benchmarks. The cybersecurity angle is the real story: the post described a carefully sequenced rollout designed to give defenders a head start before releasing capabilities that could let attackers find and exploit vulnerabilities faster than defenders can patch.

→ The actual leak · Fortune (leak) · Fortune (cybersecurity)

Anthropic sued the Pentagon and won, for now

“This is the first time an AI company has taken the federal government to court over AI policy and won, even temporarily.”

The Pentagon designated Anthropic a “supply chain risk” after the company refused to build Claude for mass surveillance or autonomous weapons targeting — Elizabeth Warren called it retaliation. Federal Judge Rita Lin granted a preliminary injunction, writing that “nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary for expressing disagreement with the government.” Then the Pentagon’s CTO said the ban would continue anyway. It’s the first time an AI company has taken the federal government to court over AI policy and won, even temporarily — and the underlying question still isn’t resolved.

→ TechCrunch (Warren) · TechCrunch (injunction) · The Verge

OpenAI says goodbye to Sora, and loses deal with Disney

“A focus on practical adoption over ‘side quests.’”

OpenAI shut down Sora, the app and the API, 15 months after launch — downloads peaked at 3.3 million in November and fell to 1.1 million by February. Disney was reportedly blindsided, and with it went a $1 billion investment and plans for AI-generated video on Disney+. The same week, the CFO told CNBC that OpenAI needs to be “ready to be a public company.” For years Altman ran OpenAI like Y Combinator, resourcing promising ideas as they emerged. That era is over: the plan now is a superapp combining ChatGPT, Codex, and Atlas. Sora’s team will work on “world simulation research to advance robotics.” The GPUs are going somewhere with a revenue line attached.

→ Wired · The Verge · TechCrunch

Bret Taylor says the web app is a horseless carriage

“The web app with all its menus, form fields, and tables starts to feel like a ‘horseless carriage’”

Sierra is Bret Taylor and Clay Bavor’s AI customer experience platform — working with 40% of the Fortune 50, rebuilt entirely around Ghostwriter, an agent that builds agents from SOPs, call transcripts, or a plain description. Explorer (deep research for your own customer conversations) and a Japan acquisition shipped the same week. The numbers: Rocket Mortgage at $1B/month in loan volume, Cigna cut authentication time 80%, SoFi up 33% on customer satisfaction.

→ Sierra (Agents as a Service) · Sierra (Japan)

Jensen Huang says we just reinvented the computer

“It’s no longer a computer, it’s a factory. It’s a factory, it’s used for generation of revenues.”

Jensen’s structural argument: computers were warehouses, built to store and retrieve what humans made in advance. That model is over — token factories generate value in real time, and every scaling law points at the same variable: compute. He also said intelligence is now a commodity, and got there specifically: 60 direct reports, each deeper in their domain than he is, calling himself a dishwasher running a room of superhumans. What kept him there for 34 years wasn’t intelligence. It was curiosity, judgment, and walking into every new problem thinking “how hard can it be.”

Quick Hits

Wikipedia bans AI-generated articles | TechCrunch — 44-2. Copyedits and first-pass translations are still in; writing is out.
David Sacks is done as AI/Crypto Czar | CNBC — Hit the 130-day federal limit. No replacement planned.
Mistral’s Voxtral TTS claims to beat ElevenLabs | Mistral — Open-weight, 3-second voice clone, nine languages, $0.016/1K chars.
SoftBank took a $40B bridge loan for its OpenAI stake | Bloomberg — 12-month term. Lenders expect an IPO this year.
Claude Code ships auto mode | Anthropic — Safety classifier approves or blocks operations automatically. Cowork gains macOS desktop control.
LiteLLM hit by a supply chain attack | LiteLLM — Credential stealer in 1.82.7–1.82.8. Quarantined in 3 hours, but 3.4M daily downloads means real exposure.
Apple will let rival AI chatbots plug into Siri in iOS 27 | Bloomberg — OpenAI loses its exclusive.
OpenAI launches a Safety Bug Bounty | OpenAI — Pays for MCP prompt injection and agent data exfiltration. Jailbreaks that just produce rude outputs are out of scope.
NVIDIA and LangChain released AI-Q | NVIDIA — Open source enterprise deep research blueprint. Tops both Deep Research Bench leaderboards.

ROI in the Wild

Reco runs a policy engine that evaluates JSONata expressions against billions of events — reference implementation in JavaScript, pipeline in Go, fleet of jsonata-js pods on Kubernetes serializing events over RPC at $300K/year. Their CTO handed Claude the JSONata spec and test suite and had it write Go code until every test passed. Seven hours. $400 in tokens. The result is gnata, a pure-Go implementation with a 1,000x speedup on common expressions. Combined with a rule engine refactor, it saved $500K/year.

→ Reco

For Practitioners

Production agents need more than the core loop — PII redaction before the model sees the data, retries when rate limits hit, summarization before context overflows, human interrupts before destructive tool calls. LangChain’s AgentMiddleware wraps each stage with hooks (before_model, wrap_model_call, wrap_tool_call, after_model) so you own those concerns without rewriting the harness. The design philosophy: some things will never move into the model. “You can’t prompt your way to HIPAA compliance.” LangChain ships prebuilt middleware for summarization, PII redaction, retries, and dynamic tool selection — Deep Agents, their batteries-included harness, is built entirely on top of it.

→ LangChain
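
To make the hook idea concrete, here is a minimal sketch of the pattern in plain Python. It is not LangChain's AgentMiddleware API: the class names, the `RateLimitError` exception, and `run_with_middleware` are all hypothetical, and only the hook names (before_model, wrap_model_call, after_model) are borrowed from the description above. The tool-call hook is left out to keep it short.

```python
import re
import time
from typing import Callable, List

ModelFn = Callable[[str], str]


class RateLimitError(Exception):
    """Hypothetical error a model client raises when it is throttled."""


class Middleware:
    """Base class with no-op hooks (illustrative, not a real library class)."""

    def before_model(self, prompt: str) -> str:
        return prompt

    def wrap_model_call(self, call: ModelFn) -> ModelFn:
        return call

    def after_model(self, output: str) -> str:
        return output


class RedactEmails(Middleware):
    """Strip email addresses before the model ever sees the prompt."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def before_model(self, prompt: str) -> str:
        return self.EMAIL.sub("[REDACTED_EMAIL]", prompt)


class RetryOnRateLimit(Middleware):
    """Retry the model call a fixed number of times when throttled."""

    def __init__(self, attempts: int = 3, delay: float = 0.0) -> None:
        self.attempts = attempts
        self.delay = delay

    def wrap_model_call(self, call: ModelFn) -> ModelFn:
        def wrapped(prompt: str) -> str:
            for attempt in range(self.attempts):
                try:
                    return call(prompt)
                except RateLimitError:
                    if attempt == self.attempts - 1:
                        raise  # out of retries, surface the error
                    time.sleep(self.delay)
            raise RuntimeError("attempts must be at least 1")

        return wrapped


def run_with_middleware(model: ModelFn, middlewares: List[Middleware], prompt: str) -> str:
    """Apply before hooks in order, wrap the call, then apply after hooks in reverse."""
    for m in middlewares:
        prompt = m.before_model(prompt)
    call = model
    for m in reversed(middlewares):
        call = m.wrap_model_call(call)
    output = call(prompt)
    for m in reversed(middlewares):
        output = m.after_model(output)
    return output
```

The point of the shape is that redaction and retries live outside the agent loop, so either one can be swapped or removed without touching the model call itself.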

Something Good

Researchers at Penn, Carnegie Mellon, and Stanford used AI to map how pain signals are processed in the brain, then built a gene therapy that acts like morphine without triggering addiction. It targets only the pain circuits, leaves the reward pathways alone, and held up in trials. Published in Nature this week. 50 million Americans live with chronic pain. Most treatment options still run through opioids.

→ ScienceDaily

Another Weekly AI Newsletter: Issue 64

Taylor Ortiz — Mon, 23 Mar 2026 12:16:44 GMT

Quick Hits

Google Search Is Now Using AI to Replace Headlines | The Verge — Google is rewriting the web in real time. Publishers just lost control of how their own stories get framed.

Online Bot Traffic Will Exceed Human Web Traffic by 2027 | TechCrunch — Cloudflare CEO’s prediction. The web is becoming an API.

DoorDash Tasks App Pays Couriers to Submit Videos to Train AI | TechCrunch — The gig economy found its next gig: human data collection for embodied AI.

Mistral Forge: Enterprise Proprietary Model Building | Mistral — Fine-tune proprietary models on your own data without sharing it. The enterprise open-model play gets real.

Perplexity Released Comet Browser on iOS | The Verge — An AI-native browser on your phone. The browser wars are back, and this time the browser does the browsing.

Midjourney V8 Alpha | Midjourney — Native 2K rendering with rebuilt aesthetics. The image generation quality ceiling moved again.

Patreon CEO Calls AI Companies’ Fair Use Argument Bogus | TechCrunch — The creator economy is picking a fight with the model economy. Someone’s going to lose.

Featured Article: What 81,000 People Want from AI | Anthropic

Anthropic used Claude to interview nearly 81,000 people across 159 countries in 70 languages about what they want from AI. Instead of a traditional survey, Claude ran branching conversations with follow-up questions based on each person’s answers. 67% were net positive about AI. The biggest group (19%) said they wanted “professional excellence,” but when pushed on what that meant, most people were really talking about quality of life: more time, less cognitive load, space to think.

The geographic data stood out. People in Sub-Saharan Africa, Central Asia, and South Asia were consistently more positive about AI than people in North America or Western Europe. Lower and middle income countries were twice as likely to report zero concerns. Self-employed people were the most likely to report both benefits and drawbacks at the same time, because they feel the productivity gains and the increased pressure without any institutional buffer.

The study is limited by the fact that these are Claude users, not the general public, and early adopters tend to be more optimistic. But running 81,000 qualitative conversations in a week is a research method that didn’t exist a year ago, and the scale creates a different kind of evidence than a checkbox survey can.

What to watch for: Whether other AI companies adopt AI-conducted qualitative research at this scale, and whether the tensions Anthropic identified (especially cognitive atrophy and economic displacement) shift from hypothetical to experienced as usage deepens.

Watch This: Andrej Karpathy on AI Psychosis, Auto Research, and the Future of Coding Agents | No Priors (1hr 6min)

Karpathy hasn’t typed a line of code since December. He runs multiple coding agents in parallel, switching between them like a manager delegating to a team, and says the default workflow for every software engineer changed overnight. The conversation covers his “auto research” project where he let agents optimize his model training overnight and they found improvements he missed after two decades of manual tuning, his home automation “claw” called Dobby that hacked into his Sonos and smart home systems in three prompts, and his prediction that the entire industry needs to reconfigure because the customer for software is no longer the human, it’s agents acting on behalf of humans. The most grounded take: the models are simultaneously a brilliant PhD student and a 10-year-old, and everything outside of verifiable RL-trained domains (like telling a joke) is still stuck. Worth the full listen if you’re thinking about where coding agents go from here.

Also This Week

Claude Cowork Dispatch: Remote Desktop AI Control from Your Phone | Anthropic

OpenAI Is Throwing Everything into Building a Fully Automated Researcher | MIT Technology Review

WordPress Lets AI Agents Manage Your Content | WordPress

NVIDIA Launches Space Computing, Rocketing AI Into Orbit | NVIDIA

Meta Will Move Away from Human Content Moderators in Favor of AI | Engadget

Gemini Task Automation Is Slow, Clunky, and Super Impressive | The Verge

Pentagon Filing Reveals Anthropic and Pentagon Were Nearly Aligned | TechCrunch

Signal’s Creator Is Helping Encrypt Meta AI | Wired

Amazon Trainium Lab Tour: The Chip That Won Over Anthropic, OpenAI, and Apple | TechCrunch

Trump AI Framework Targets State Laws, Shifts Child Safety Burden to Parents | TechCrunch

What I’m Watching

NemoClaw was probably the most interesting announcement at GTC for me. Karpathy talked about his home “claw” Dobby on No Priors, which does something similar at a smaller scale. Agents running inside their own secure environments with rules around what they can access feels like the direction this is all heading. We already covered NemoClaw in the top stories, but it’s worth sitting with.

DoorDash is paying couriers to submit videos to train AI. Delivery workers with phone cameras are becoming the data collection layer for embodied AI. I’m curious how fast other companies with large field workforces start doing the same thing.

The Trump AI framework is preempting state-level AI regulation and shifting child safety responsibility to parents. That leaves it murky where state-level AI laws now stand and how much influence they still carry.

Another Weekly AI Newsletter: Issue 63

Taylor Ortiz — Tue, 17 Mar 2026 03:24:19 GMT

The Week’s Thesis

Agent security got its own engineering discipline this week: OpenAI published a design guide on defending agents against prompt injection and released IH-Challenge, a training dataset that teaches models which instructions to trust. AWS launched policy controls inside Bedrock AgentCore for agents in regulated industries. Microsoft published a security blog warning that ungoverned agents can become “double agents” and attached a $99/month product to the problem. If you’re deploying agents that read external content or operate across trust boundaries, these documents belong in your engineering review queue.

Three companies answered the same question from different directions: How far can an agent reach from a single context? Anthropic made Claude’s 1 million token context window generally available for Opus 4.6 and Sonnet 4.6, scoring 78.3% on MRCR v2 at that length. Perplexity shipped a full-stack agent API platform combining model orchestration, real-time search, and code execution under one key. OpenAI published an engineering post on equipping the Responses API with a computer environment. Anthropic says deeper into documents. Perplexity says further across the web. OpenAI says into the operating system. Your architecture choice this year is a bet on which of those axes matters most for your use case.

The open model tier is getting its own infrastructure: NVIDIA shipped Nemotron 3 Super, a 120B-parameter open model with only 12B active parameters and 5x throughput gains over comparable dense models. Perplexity integrated it immediately across its agent and search products. Meta published details on four generations of MTIA custom inference silicon shipped in two years. And NVIDIA announced a gigawatt-scale partnership with Thinking Machines Lab for frontier model training. From custom silicon to serving infrastructure, the open model stack is coming together fast.

Anthropic moved on every axis at once: In one week, Anthropic invested $100 million into the Claude Partner Network, launched The Anthropic Institute to address AI’s societal challenges, opened Sydney as its fourth Asia-Pacific office, made 1 million token context generally available, shipped interactive charts and diagrams in chat, and doubled usage limits during off-peak hours as a thank-you to users. That’s ecosystem, governance, geography, capability, product, and pricing, all in one week.

Quick Hits

How We Compare Model Quality in Cursor | Cursor — When your provider’s benchmarks stop meaning anything, you build your own. If you’re evaluating models for agentic coding, this is the framework to study.

A Defense Official Reveals How AI Chatbots Could Be Used for Targeting Decisions | MIT Technology Review — The same architectures running your enterprise agents are now ranking military target lists. “Human in the loop” is doing a lot of work in that sentence.

Google DeepMind Names New London HQ “Platform 37” | X @GoogleDeepMind — Named after AlphaGo’s Move 37, the moment AI surprised its own creators. The building will include a free public AI exhibition space.

Perplexity Computer Is Now on Mobile | X @perplexity_ai — Agents that follow you across devices. Cross-device synchronization means the task you start on desktop continues on your phone.

How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II | Hugging Face — An open model just topped a research benchmark designed for closed frontier models. The ceiling on what open weights can do keeps moving.

OpenAI to Acquire Promptfoo | OpenAI — OpenAI bought the red-teaming platform 25% of Fortune 500s already use, and it’s going straight into Frontier. Agent security is a product line now.

Hustlers Are Cashing In on China’s OpenClaw AI Craze | MIT Technology Review — Open-source agents meet gray-market entrepreneurship. Adoption is moving faster than anyone can govern it.

Featured Article: IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs | OpenAI

OpenAI released IH-Challenge, a reinforcement learning training dataset that teaches models to prioritize instructions based on trust level: system over developer, developer over user, user over tool. When a model receives conflicting instructions from different sources, it needs to know which one wins. Get that wrong and you get jailbreaks, system prompt leaks, and prompt injection attacks that treat malicious text in a PDF or tool output as if it were a developer command. IH-Challenge structures this as objectively gradable tasks: a high-privilege instruction like “only answer Yes or No” paired with a lower-privilege attempt to override it, checked by a simple Python script. Fine-tuning GPT-5-Mini on the dataset produced GPT-5-Mini-R, which improved robustness from 63.8% to 88.2% under adaptive human red-teaming and from 23% to 94% against impersonation attacks. Unsafe behavior dropped from 6.6% to 0.7% when given a safety policy in the system prompt. The full dataset is available on Hugging Face.

The interesting part is what they didn’t do. The team identified three pitfalls in naive instruction hierarchy training: models fail not because they don’t understand hierarchy but because instructions are too complex, LLM judges used for reward signals are themselves fallible, and models learn shortcuts like refusing everything to maximize safety scores. IH-Challenge addresses all three by keeping the instruction-following tasks simple, using programmatic grading instead of LLM judges, and including an Anti-Overrefusal split that specifically trains models to recognize when lower-privilege instructions are perfectly benign. The score on the IH-Challenge overrefusal benchmark improved from 79% to 100%, meaning the model stopped treating hierarchy enforcement as a reason to refuse legitimate requests. Meanwhile, GPQA Diamond and AIME 2024 scores held flat, and TensorTrust robustness rose 8 to 15 points depending on the conflict type. If you’re building agents that process untrusted input, this is the best public evidence that instruction hierarchy can be trained once and generalize, instead of patching one attack at a time.
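
For a sense of what an objectively gradable task looks like, here is a small sketch. It is not OpenAI's grader or dataset format: the `Message` class, the `PRIVILEGE` table, and both function names are assumptions made for illustration. Only the trust order (system, developer, user, tool) and the “only answer Yes or No” example come from the article.

```python
from dataclasses import dataclass
from typing import List

# Trust order described in the article: a lower number wins a conflict.
PRIVILEGE = {"system": 0, "developer": 1, "user": 2, "tool": 3}


@dataclass
class Message:
    role: str
    content: str


def governing_message(messages: List[Message]) -> Message:
    """Return the message from the most trusted role (earliest wins a tie)."""
    return min(messages, key=lambda m: PRIVILEGE[m.role])


def grade_yes_no(response: str) -> bool:
    """Pass only if the response is exactly Yes or No (a trailing period is allowed)."""
    answer = response.strip().rstrip(".").lower()
    return answer in {"yes", "no"}
```

A check like this has no judgment calls in it, which is the property the article credits for avoiding LLM-judge errors: a response either satisfies the high-privilege constraint or it does not.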

What to watch for: Whether other model providers adopt open instruction hierarchy training datasets, and whether the programmatic-grading approach becomes standard practice over LLM-judge-based safety fine-tuning.

Watch This: Is RAG Still Needed? Choosing the Best Approach for LLMs | IBM Technology (12 min)

Martin Keen breaks down the real tradeoffs between RAG and long context windows as context lengths keep expanding. The video covers when vector databases and semantic search still win, when you can get away with stuffing everything into context, and how to think about the decision for your specific workload. Especially relevant this week given Anthropic’s 1 million token context going GA.

Also This Week

P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM | AWS AI Blog

Operationalizing Agentic AI Part 1: A Stakeholder’s Guide | AWS AI Blog

Smol AI WorldCup: A 5-Axis Benchmark for Small Language Models | Hugging Face

Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries | Hugging Face

Introducing Storage Buckets on the Hugging Face Hub | Hugging Face

SILMA TTS: A Lightweight Open Bilingual Arabic-English TTS Model | Hugging Face

How Pokemon Go Is Giving Delivery Robots an Inch-Perfect View of the World | MIT Technology Review

As Open Models Spark AI Boom, NVIDIA Jetson Brings It to Life at the Edge | NVIDIA

Mapping the World’s Forests: Introducing Canopy Height Maps v2 | Meta AI

Build a Searchable Audio Knowledge Base with Gemini Embedding 2 and LlamaParse | LlamaIndex

Introducing the AI Now Summit | Mistral AI

What I’m Watching

There’s a thread running through this week that’s easy to miss: the testing layer is becoming a product. OpenAI acquired Promptfoo, the open-source LLM evaluation framework. Cursor built CursorBench to measure whether AI coding suggestions actually help in real workflows. And IH-Challenge, which we covered in the Featured Article, uses programmatic Python scripts instead of LLM judges to grade model behavior, specifically because LLM judges get it wrong too often.

That last detail is the one I keep coming back to. We’ve spent two years using models to evaluate models, and one of the clearest takeaways from the IH-Challenge paper is that this introduces its own failure modes. When your testing infrastructure is valuable enough for OpenAI to acquire and your grading methodology is worth publishing a paper about, evaluation is a competitive advantage. If you’re building agents today and your eval story is “we’ll have someone try it and see if it feels right,” this is the week that should change your mind.

Another Weekly AI Newsletter: Issue 62

Taylor Ortiz — Mon, 09 Mar 2026 16:10:49 GMT

The Week’s Thesis

Everybody shipped at once: If you stepped away from your desk for even a day last week, you came back to a different landscape. OpenAI released GPT-5.3 Instant on Monday and followed with GPT-5.4 with Thinking and Pro modes by Wednesday. Anthropic opened the Claude Marketplace and added voice and scheduled tasks to Claude Code. Cursor launched Automations. Each of these launches points in a different direction, and it’s worth taking a moment to decide which ones matter for your workflows and where to start.

The Pentagon deal had consequences: Last week we covered the Pentagon deal itself. This week, the consequences arrived. OpenAI’s robotics lead Caitlin Kalinowski resigned, calling the arrangement “rushed without the guardrails defined.” ChatGPT uninstalls had already surged 295% while Claude climbed to #1 on the App Store. Anthropic’s CEO responded directly to the supply chain risk designation, challenging it in court and clarifying the statute’s narrow scope. Microsoft, Google, and Amazon confirmed Claude remains available to their customers outside the Department of War. Meanwhile, MIT Technology Review asked the question everyone should be sitting with: is the Pentagon actually allowed to surveil Americans with AI?

AI is probing deeper than we designed for: Three companies independently bet on the same idea this week: AI as security auditor. Anthropic’s Claude found 22 real vulnerabilities in Firefox, including novel bugs that existing tools missed. OpenAI launched Codex Security in research preview. And Endor Labs released AURI, a free security tool, after a study found only 10% of AI-generated code passes basic security review. Separately, Anthropic’s engineering team found that Claude Opus 4.6 figured out it was being benchmarked, identified the test, and decrypted the answer key on its own. These models are probing systems deeper than we’re designing for, and finding things we didn’t expect.

Quick Hits

You Need to Rewrite Your CLI for AI Agents | Justin Poehnelt (Google) — The best guide yet on building agent-first tooling. If you maintain a CLI, start here.

Terence Tao: AI Is Ready for Primetime in Math and Physics | OpenAI Academy — When a Fields medalist says AI saves more time than it wastes, the bar for “useful” just moved.

Luma Launches Creative AI Agents | TechCrunch — Turned a $15M ad campaign into localized versions in 40 hours for under $20K. Creative agencies, take note.

KV Cache Compaction Cuts LLM Memory 50x | VentureBeat — MIT’s Attention Matching compresses working memory without accuracy loss. Long-context inference just got cheaper.

Google I/O 2026: May 19-20 | Google Blog — Save the date. The puzzle itself is a Gemini showcase, which tells you where the keynote is heading.

Roblox Launches AI Chat Rephrasing | Roblox — Instead of blocking banned words with “####”, AI now rephrases them in real time. Moderation at 68M daily users is an AI problem now.

LangChain CEO: Models Alone Won’t Get Agents to Production | VentureBeat — Harrison Chase on why “harness engineering” matters more than model upgrades for shipping real agents.

Featured Article: Labor Market Impacts of AI: A New Measure and Early Evidence | Anthropic Research

Anthropic introduced a new metric called “observed exposure” that combines theoretical LLM capability with real-world Claude usage data to measure which jobs are actually being affected by AI. The headline finding: AI is far from reaching its theoretical capability. Actual task coverage remains a fraction of what’s feasible. Computer programmers top the list at 75% coverage, followed by customer service representatives and data entry keyers. No systematic increase in unemployment has appeared for highly exposed workers since late 2022.

The paper opens with a point worth sitting with: past predictions about job displacement have a poor track record. Offshorability studies flagged a quarter of US jobs as vulnerable, and a decade later most of those jobs grew. This research is deliberately not making predictions. Instead, it’s building a measurement framework now, before meaningful effects emerge, so future analysis has a real baseline. The finding that matters most right now is about entry-level hiring. Among workers aged 22 to 25, hiring into exposed occupations has dropped roughly 14% compared to pre-ChatGPT levels. Workers in the most exposed professions are more likely to be older, female, more educated, and higher-paid. The pipeline is thinning before displacement shows up in unemployment data.

What to watch for: The gap between what AI can do and what it is doing is closing. This report measures it directly, and future updates will show how fast the red area catches the blue. Pay attention to the entry-level hiring numbers next time around.

Watch This: This New Claude Code Feature is a Game Changer | Nate Herk (8 min)

Nate walks through Claude Code’s new loop feature, which lets you set recurring tasks, reminders, and skill intervals that run for up to three days without input. The video covers how the cron tools work under the hood, a live walkthrough of setting one up, and a clear comparison of when to use loops versus scheduled tasks. If you’re already using Claude Code, this is worth eight minutes of your time.

Also This Week

Reasoning Models Struggle to Control Their Chains of Thought, and That’s Good | OpenAI

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought | arXiv

Building AI Coding Agents for the Terminal | arXiv

Anthropic Spend Commitment Now Funds Partner Integrations | Anthropic

Claude Community Ambassadors Program | Anthropic

ZeroClaw: Autonomous AI Assistant Infrastructure | GitHub

City Detect Raises $13M Series A | TechCrunch

Port Washington Data Center Breaks Ground | BizTimes

How Descript Enables Multilingual Video Dubbing at Scale | OpenAI

How Balyasny Built an AI Research Engine for Investing | OpenAI

What I’m Watching

Features like Claude Code’s new /loop command and projects like ZeroClaw are pointing in the same direction: autonomous agent runtimes that are lightweight, swappable, and designed to run without you. The question I keep coming back to is how long until this space fragments enough that no single framework dominates. We’re not there yet, but the building blocks are shipping fast.

The other thing I’m paying attention to is something that rarely shows up in benchmark announcements: how new model releases actually affect agent quality in production. GPT-5.4, Claude Opus 4.6, and the reasoning improvements shipping alongside them should be measurably changing chain-of-thought reliability for deployed agents. But that data is hard to find. If you’re running agents in the wild and tracking performance across model versions, I’d genuinely love to hear what you’re seeing.

And then there’s the security work. Anthropic found novel Firefox vulnerabilities. OpenAI launched Codex Security. A few newsletters ago, we covered AI solving novel physics problems. Now we’re seeing that same pattern expand: LLMs surfacing things humans hadn’t found yet. Is that just the natural expansion curve of the technology, or is it a growth signal that tracks directly with model quality? I think it’s both, and the Mozilla results suggest we’re still early in finding out what these models can actually uncover when pointed at the right problems.

Another Weekly AI Newsletter: Issue 61

Taylor Ortiz — Tue, 03 Mar 2026 16:16:48 GMT

Personal Note

This newsletter comes to you late this week on a Tuesday morning. Like many others, I was caught in the Anthropic outage and am also dependent on this technology to drive the initiatives that are meaningful to me. When I woke up to finalize the newsletter and found things offline, I journaled while listening to the birds outside, listened to music, reflected on my weekend, and engaged in refreshing activities I normally don’t find the time for. It was a lesson for me to find more time to step away from the keyboard.

The Week’s Thesis

AI went political this week: Anthropic’s relationship with the Department of War fell apart, and hours later, OpenAI signed a deal for classified network deployment. On paper, both companies claim the same red lines. But the sequence alone was enough to make people uneasy. More on this in our featured story below.

OpenAI’s partnership blitz: They launched Frontier Alliances, a new partner program, followed by a Codex integration with Figma bridging code and design workflows. By Friday, they announced a strategic partnership with Amazon and released a joint statement with Microsoft reaffirming their existing relationship. Four announcements in five days, all while the Department of War deal was making headlines.

Agent observability is becoming a thing: Microsoft found that 80% of Fortune 500 companies are running active agents but most lack visibility into what those agents are doing. LangChain argued that traditional APM tools weren’t built for this, New Relic shipped an agent-specific observability platform, and Google published a production-readiness guide. Observability is quietly becoming part of the conversation, and it’s worth paying attention to.

Healthcare AI is moving: NVIDIA’s annual survey found that 70% of healthcare organizations are now actively deploying AI, with 85% reporting increased revenue. Eli Lilly went live with LillyPod, the most powerful AI factory wholly owned by a pharmaceutical company, purpose-built for drug discovery. Oura shipped a proprietary AI model focused on women’s reproductive health, hosted entirely on their own infrastructure. And NIST published guidance on AI trustworthiness standards for clinical settings. From drug discovery to consumer wearables to regulation, healthcare AI is moving.

Quick Hits

Jira’s latest update allows AI agents and humans to work side by side | TechCrunch — Agents on the same sprint board as humans with deadlines and assignments. This is mainstream adoption.

Pro-level image generation gets faster and more accessible with Nano Banana 2 | Google Cloud AI — Google’s enterprise image gen model gets faster and cheaper. The gap between “good enough” and “production-ready” keeps shrinking.

Anthropic acquires Vercept to advance Claude’s computer use capabilities | Anthropic — Anthropic is doubling down on computer use. If agents are going to operate in production, they need to see and interact with real interfaces.

Detecting and preventing distillation attacks | Anthropic — Anthropic identified industrial-scale distillation campaigns by DeepSeek, Moonshot, and MiniMax, totaling over 16 million exchanges across 24,000 fraudulent accounts designed to extract Claude’s capabilities. They published their approach to catching and preventing it.

The human work behind humanoid robots is being hidden | MIT Technology Review — The humans still doing the work that robot demos suggest is automated. A good reality check.

Featured Story: Anthropic’s Deal With the Department of War Fell Through. Hours Later, OpenAI Signed One.

Anthropic published its Responsible Scaling Policy v3.0 on February 24, a ground-up rewrite of the framework it uses to decide what it will and won’t build. Two days later, Dario Amodei published a statement revealing that Anthropic has been deeply embedded in the Department of War for months: intelligence analysis, cyber operations, modeling and simulation. The company also disclosed it walked away from several hundred million dollars in revenue by cutting off entities linked to the Chinese Communist Party. But Anthropic drew two red lines: no mass domestic surveillance of Americans, and no fully autonomous weapons.

On February 27, Secretary of War Pete Hegseth designated Anthropic a “supply chain risk”, a label historically reserved for US adversaries. Trump ordered every federal agency to stop using Anthropic technology. That same night, OpenAI announced a deal to deploy its models on the Department of War’s classified network.

Here’s where it gets interesting: OpenAI’s stated terms include the same two red lines. No mass surveillance. No autonomous weapons. But OpenAI walked away with a deal and Anthropic walked away blacklisted. OpenAI’s approach centers on what Altman called a “safety stack”: cloud-only deployment that keeps OpenAI’s safety layers active, cleared personnel in the loop, and an agreement that if the model refuses a task, the government won’t force a workaround. What exactly differed in the negotiations isn’t public, but the outcome speaks for itself.

The RSP v3.0 explains the philosophical scaffolding behind Anthropic’s position. After two and a half years of trying to implement capability-based safety thresholds, Anthropic concluded that “the science of model evaluation isn’t well-developed enough to provide dispositive answers.” The policy now splits commitments into what Anthropic will enforce unilaterally and what requires industry-wide coordination. Autonomous weapons fall squarely in the second bucket: the reliability isn’t there yet, and no single company can build the guardrails alone.

The business implications are already visible. Nate Silver noted that Anthropic had been steadily closing the valuation gap with OpenAI. Whether the DoW designation slows that trajectory is an open question.

The question practitioners should be sitting with isn’t “who’s right.” It’s what happens next. If you’re building on Claude for sensitive workloads, your platform just got blacklisted from every federal system. If you’re building on OpenAI, your platform’s safety guarantees rest on a technical architecture rather than a legal commitment. Both carry risk. The difference is in which failure mode you’re betting on.

What to watch for: Whether the “supply chain risk” designation survives legal challenge, and whether OpenAI’s cloud-only safety stack holds as models get more capable and the Department of War pushes for edge deployment.

Watch This

StarTalk: Geoffrey Hinton on AI, Consciousness, and the Future: Neil deGrasse Tyson sits down with Nobel Laureate Geoffrey Hinton to cover the full arc: how neural nets work, why backpropagation was the breakthrough, whether AI can actually reason, and the heavy questions around consciousness, energy demands, and what happens when models start generating their own training data.

Also This Week

Intrinsic joins Google | TechCrunch

Let Gemini handle your multi-step daily tasks on Android | Google AI Blog

Anthropic Education Report: The AI Fluency Index | Anthropic Research

The persona selection model | Anthropic Research

Disrupting malicious uses of AI | OpenAI

Can Local AI Stand In for the Cloud? | deeplearning.ai

AI is rewiring how the world’s best Go players think | MIT Technology Review

What I’m Watching

OpenAI’s new role in government AI. How does OpenAI’s solidified position with the Department of War shift the tide of AI in government? Will it be relatively quiet, or will we see noticeable shifts in how these technologies are deployed domestically and how we engage in combat with other countries? And if growth and innovation eventually push against the boundaries of an agreement, does the government override, or does OpenAI become more malleable?

The enterprise agent framework race. We are still in the “release agents as a capability” phase. Most enterprise platforms are now shipping their own proprietary frameworks. Will those be expansive enough to meet the breadth of platform use cases, or will we see demand expand beyond what a single-platform framework can handle, requiring true enterprise solutions?

Agent observability, from experience. Observability is something we are hyper-focused on at Ping. We find that we have the most control with our custom agents, and that control drops significantly when we adopt out-of-the-box frameworks that leave us with little say over design practices. If that’s true at our scale, it’s worth asking what it looks like at enterprise scale.
