Week two. Three stories, CLI updates, and some links.

The big stories

Microsoft Copilot now runs Claude and GPT side by side. Two new features in Copilot Researcher: Critique and Council. In Critique mode, GPT does the research while Claude acts as a "senior editor" - checking citations, flagging inconsistencies, and sending corrections back to GPT before you see the output. In Council mode, both models run the same query in parallel, then a third model reads both reports and writes a summary of where they agreed, diverged, and what each caught that the other missed. Microsoft says Critique beats Perplexity Deep Research on factual accuracy by nearly 14%. Available now through the Microsoft 365 Copilot Frontier program.

Google dropped Gemma 4 under Apache 2.0. Four sizes: 2B, 4B, 26B MoE, and 31B dense. The 31B model is currently #3 on Chatbot Arena’s text leaderboard. All models support native function-calling, structured JSON output, and system instructions - built for agentic workflows, not just chat. The edge models (2B and 4B) run on-device with 128K context. NVIDIA is already shipping optimized runtimes for RTX hardware. If you’ve been waiting for open models good enough to run local agents, this is the release to pay attention to.

Pinterest deployed MCP at production scale. Their engineering team built a fleet of domain-specific MCP servers - one each for Presto, Spark, Airflow, and other internal tools - behind a central registry that handles discovery and access control. 66,000 invocations per month across 844 users, saving roughly 7,000 hours monthly. Sensitive operations require human-in-the-loop approval. This is the first major public case study of MCP running as real infrastructure inside a large company, not a demo.

CLI tool updates

Codex CLI shipped GPT-5.3-Codex-Spark, a real-time coding model running at 1,000+ tokens/sec on Cerebras hardware. That’s 15x faster than standard GPT-5.3-Codex. Also landed: Windows sandbox with OS-level network isolation (blocks data exfiltration at the OS level, not just the app level) and device code login for headless environments like SSH sessions and Docker containers.

Claude Code added interactive /powerup lessons, stronger session resume, and a broad set of fixes across hooks, editing, scrolling, and PowerShell permissions. Privacy tightening and faster session handling under the hood.

The take

MCP’s inflection point isn’t the install count - it’s Pinterest’s architecture doc. They solved the problems everyone’s been hand-waving about: how to scope servers so agents don’t drown in context, how to gate dangerous operations, and how to let teams discover and connect to each other’s tools without a free-for-all. Meanwhile, Invariant Labs published a reproducible tool poisoning attack showing that malicious MCP servers can embed hidden instructions in tool descriptions that hijack agent behavior. The protocol is simultaneously proving itself in production and revealing serious security gaps. The MCP Dev Summit happened in NYC this week (April 2-3) under the new Agentic AI Foundation - expect the security story to move fast from here.

The Microsoft Copilot move is worth watching for a different reason. Running Claude as GPT’s editor isn’t just a feature - it’s an admission that no single model is good enough to trust alone. Multi-model verification is about to become a pattern you see everywhere.

One thing

If you’re building MCP servers, read the Pinterest post before you ship. They figured out the governance and scoping problems that most tutorials skip. And run mcp-scan on any third-party servers you’re connecting to - tool poisoning is real and trivially easy to pull off.