AI Forecast Tracker
Week 13 · March 23–29, 2026
high signal

The Alignment Tax

Three things happened this week that usually don't share a newsletter. The hardest reasoning benchmark ever built scored frontier AI at 0.26%; humans score 100%. Software stocks fell to a valuation discount against the S&P 500 for the first time in history. And Anthropic got blacklisted by the Pentagon for refusing to let its AI run domestic surveillance. The connecting thread: the widening distance between what AI can do in a lab, what it is doing in production, and what some companies will agree to build.

3 predictions updated · 5 milestones · 2 companies refreshed

Key Developments

1

The Pentagon tried to blacklist Anthropic — and its biggest competitors defended it in court

Secretary Hegseth designated Anthropic as a "supply chain risk" and President Trump ordered all federal agencies to stop using Claude — the penalty for refusing to allow AI-driven domestic mass surveillance and autonomous weapons use. Microsoft, which holds equity in Anthropic, filed an amicus brief calling the designation a use of supply chain law to resolve a contract dispute that "may bring severe economic effects not in the public interest." More surprising: Google and OpenAI — competitors for the same government contracts — also filed supporting court documents. The whole AI industry just publicly agreed that using procurement blacklists to punish companies for ethical constraints sets a precedent none of them can afford.

Counterpoint

Microsoft's equity stake in Anthropic means this is financial self-defense as much as principle — protecting a $4B+ investment from a government designation that would crater its value. Google and OpenAI's support is similarly self-interested: a legal precedent that lets the Pentagon blacklist AI companies for ethical constraints is a weapon any future administration could use against them. When financial interest and ethics align this neatly, calling it a values test overstates the moral clarity.

2

ARC-AGI-3 just proved that 'approaching human-level reasoning' was about the benchmarks, not the reasoning

The team behind ARC-AGI launched their third benchmark on March 25, and every frontier model that supposedly dominated ARC-AGI-2 — GPT-5.4, Claude 4.6, Gemini 3.1 — scored below 1%. Humans score 100%. ARC-AGI-3 replaces fixed puzzles with interactive game environments where the AI must explore, learn the rules, and generalize across difficulty levels with no stated goals — what a human naturally does when picking up a new game. No memorization, no fine-tuning, no pattern-matching on training data. Models don't generalize from first principles; they compress and recall patterns. Every "reasoning" milestone of the last two years describes how well AI pattern-matches, not how well it thinks.
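The kind of evaluation described above, where there is no stated goal and the rules must be discovered through interaction, can be made concrete with a toy sketch. Everything here (the environment, the hidden parity rule, the two-phase agent) is invented for illustration and is not how the actual ARC-AGI-3 tasks work:

```python
import random

class HiddenRuleEnv:
    """Toy environment: reward is 1 only when the agent's action
    matches a hidden parity rule it is never told about."""

    def __init__(self, seed=0):
        self.parity = random.Random(seed).choice([0, 1])  # hidden rule

    def step(self, action):
        # The agent only ever sees reward feedback, never the rule.
        return 1 if action % 2 == self.parity else 0

def explore_then_exploit(env, trials=20):
    # Phase 1: probe each available action once and record the reward.
    scores = {a: env.step(a) for a in (0, 1)}
    # Phase 2: infer the rule from feedback and exploit it.
    best = max(scores, key=scores.get)
    return sum(env.step(best) for _ in range(trials))
```

The point of the toy: the agent scores only because it probes first and infers the rule from feedback, not because the answer was anywhere in its training data. That interactive inference step is what the paragraph above says pattern-recall systems lack.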

Challenges P-003: Systems capable of outperforming Nobel laureates a...
Challenges P-007: AGI is 3-5 years away; current systems lack reason...
Counterpoint

ARC-AGI-3 was designed by François Chollet specifically to be hard for LLMs — it measures a narrow definition of reasoning (novel game generalization) that may not be relevant to the work most businesses care about. GPT-5.4 scored 83% on GDPVal (OpenAI's own benchmark — directionally correct, not independent validation) and solved an open math problem this same week. The 0.26% result says AI can't learn new games from scratch; it says nothing about whether AI can do your accounting, write your code, or draft your legal brief — tasks it's already doing at scale. ARC-AGI-3 tests one specific cognitive capability that happens to be hard to fake with pattern compression.

3

Software stocks just broke below the S&P 500 for the first time in history — the market has a name for why

The iShares IGV software ETF is down 21% year-to-date. Salesforce is down 30%. Adobe's forward P/E collapsed from a five-year average of 30x to 12x. For the first time in market history, software trades at a discount to the broader S&P 500. The market's thesis is "seat compression": one AI agent can replace the work of multiple software seats, breaking the per-seat revenue model that made SaaS stocks worth premium multiples for a decade. The question isn't whether SaaS companies will build AI — Salesforce already has Agentforce at $800M ARR. It's whether bolting AI onto per-seat subscription models is fast enough to offset the structural repricing. The market is saying no.
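The seat-compression arithmetic behind that repricing is easy to sketch. A minimal illustration, using entirely hypothetical seat counts and prices; only the 30x-to-12x multiple move comes from the Adobe figures above:

```python
# Assumption: a 1,000-seat customer at $150/seat/month replaces
# 600 of those seats with one agent subscription at $30k/month.
# All of these figures are invented for illustration.
seats_before = 1000
price_per_seat = 150                               # $/seat/month (assumed)
arr_before = seats_before * price_per_seat * 12    # $1.8M ARR

seats_after = 400
agent_fee = 30_000                                 # $/month agent tier (assumed)
arr_after = (seats_after * price_per_seat + agent_fee) * 12  # $1.08M ARR

revenue_hit = 1 - arr_after / arr_before   # ~40% ARR compression

# Multiple compression stacks on top: the same earnings repriced
# from a 30x to a 12x forward P/E (the Adobe figures above) is a
# 60% drop in what the market pays per dollar of earnings.
multiple_hit = 1 - 12 / 30
```

Under these assumed numbers the customer's ARR falls about 40% even though the vendor keeps the account, and the multiple compression takes a further 60% off what the market pays per dollar of earnings; the two effects stack, which is why the drawdown looks so much larger than the revenue miss alone.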

Counterpoint

Adobe at 12x forward P/E and Salesforce at a 10-year price low look like deep value to any investor who believes the software layer survives AI. These companies have data moats, customer lock-in, and compliance requirements that AI agents can't easily replicate — try telling a regulated financial institution to swap Workday for a custom agent. And 95% of generative AI pilots fail to reach production scale (Deloitte, March 2026). The seat compression thesis assumes enterprises will replace software with AI agents; the data shows most can't even get agents off the ground.


What the Evidence Moved

P-004 · AI systems will be able to discover genuinely novel scientif...

GPT-5.4 Pro solves open FrontierMath problem (Epoch AI verified) + AI Scientist paper in Nature — two independent confirming signals for novel AI scientific insights

70% → 75% (+5pp)
P-021 · 40% of enterprise applications will feature task-specific AI...

DOL DOLA 2.8M cases in production + Salesforce Agentforce embedded in all Suites + Workday Sana Labs + Microsoft Azure Copilot agents + Accenture/Databricks 327% deployment increase. Move reflects vendor-embedded definition being met; rigorous autonomy threshold remains ~0.45.

70% → 80% (+10pp)
P-002 · AI will disrupt 50% of entry-level white-collar jobs over 1-...

Tufts Jobs Risk Index 9.3M at risk (33 tipping-point occupations), Dallas Fed workers 22-25 in AI-exposed roles -16%, HBS automation postings -13%. Counter: CompTIA +1.9% tech employment, ADP Solow paradox (task time +346% in bad implementations)

55% → 58% (+3pp)
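For readers tallying the moves, the updates above reduce to simple percentage-point arithmetic. A small sketch, where the prior and updated probabilities are quoted from this issue and the data structure itself is purely illustrative:

```python
# (prior, updated) probabilities for each prediction, as quoted above.
moves = {
    "P-004": (0.70, 0.75),  # novel scientific discovery
    "P-021": (0.70, 0.80),  # task-specific agents in enterprise apps
    "P-002": (0.55, 0.58),  # entry-level job disruption
}

# Deltas in percentage points; round() absorbs float noise like 0.0499...
deltas_pp = {
    pid: round((updated - prior) * 100)
    for pid, (prior, updated) in moves.items()
}

for pid, pp in deltas_pp.items():
    prior, updated = moves[pid]
    print(f"{pid}: {prior:.0%} → {updated:.0%} ({pp:+d}pp)")
```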

Company Impact

Monitoring: Adobe, Microsoft, Workday

Next issue drops Monday

Subscribe to get the briefing before the market opens.