Wednesday, March 11, 2026 · 4 min read · 12 findings
Good morning, Mark.
OpenAI shipped GPT-5.4 overnight and it's the first model to beat human baselines on OSWorld-Verified. Meanwhile, Qwen 3.5 benchmarks leaked on Reddit and they're surprisingly strong at the 35B tier -- relevant for your local inference stack.
Today: local AI models, OpenClaw skill candidates, and a potential voice calling integration
🧠 AI MODELS
1 GPT-5.4 beats human baselines on computer use
OpenAI's newest model scores 75% on OSWorld-Verified, surpassing the human baseline of 72.4%. This is specifically trained for computer-use tasks -- navigating GUIs, filling forms, clicking buttons. Latency and pricing TBD but the capability jump is real.
2 Qwen 3.5 35B benchmarks leaked on r/LocalLLaMA
Leaked benchmarks show the 35B variant matching GPT-4o on coding tasks while running at 91.9 tok/s on your M3 Ultra. If confirmed, this is a significant upgrade over the current model you're running locally.
3 Apple announces Core AI framework at WWDC 2026
Core ML is being replaced by Core AI, a broader framework supporting on-device LLM inference natively. Could simplify your local inference setup long-term.
🦞 OPENCLAW RELEASES
1 OpenClaw 2026.3.8 ships backup commands
New openclaw backup create and openclaw backup verify commands for local state archives, with --only-... flags for selective backups. You're already on 3.8.
2 2026.3.2 broke tool permissions for many users
Community reports confirm that tools were disabled by default after updating to 3.2, and browser profile defaults changed too. We patched past this in early March, but the ecosystem impact is worth noting.
🔥 HOT SKILLS
1 Persistent Memory skill hits 26K users in one week
Addresses the default context loss between sessions. You've already built a more sophisticated version with your context architecture, but the community traction validates the approach.
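What such a persistent-memory layer does can be sketched in a few lines: a key-value store that survives process restarts by writing to disk. This is an illustrative toy, assuming a simple JSON file; the skill's actual storage format and API are not published here.

```python
import json
from pathlib import Path

class SessionMemory:
    """Toy persistent memory: survives restarts by persisting JSON to disk.
    The path and schema are illustrative, not the actual skill's format."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key, value):
        self.data[key] = value
        self.path.write_text(json.dumps(self.data))  # persist on every write

    def recall(self, key, default=None):
        return self.data.get(key, default)
```

A second process pointed at the same file sees everything the first one remembered, which is the whole trick: context outlives the session.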
2 Heartbeat automations going mainstream
Autonomous skills that wake up to check calendars, Slack, and news feeds without user triggers. Aligns with your existing heartbeat system but the community is catching up.
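The heartbeat pattern is simple at its core: a loop wakes on a schedule and runs any check whose interval has elapsed, with no user trigger. A minimal sketch, with an injectable clock for testing (the class and check names are illustrative, not your system's actual code):

```python
import time

class Heartbeat:
    """Toy heartbeat scheduler: each tick runs every check whose
    interval has elapsed since it last ran."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.checks = []  # entries of [interval_s, last_run, fn]

    def register(self, interval_s, fn):
        # -inf last_run means the check fires on the first tick
        self.checks.append([interval_s, float("-inf"), fn])

    def tick(self):
        """One heartbeat: run all due checks, return their names."""
        now = self.clock()
        fired = []
        for check in self.checks:
            interval_s, last_run, fn = check
            if now - last_run >= interval_s:
                fn()
                check[1] = now
                fired.append(fn.__name__)
        return fired
```

Calendar, Slack, and news checks would each be registered with their own interval; the outer loop just calls tick() and sleeps.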
GPT-5.4 Computer Use: Should You Care?

OpenAI's GPT-5.4 is the first model to beat human baselines on the OSWorld-Verified benchmark -- a test of real computer use tasks like navigating GUIs, managing files, and interacting with web apps. The 75% score edges past the 72.4% human baseline.

This matters because computer-use models are becoming the backbone of agent automation. While your current setup uses headless Camoufox with scripted browser control, a model that can natively understand and operate GUIs could simplify complex automation tasks significantly.

The catch: pricing and latency haven't been announced. And OpenAI's track record on computer-use (remember the CUA preview?) suggests the real-world reliability may lag the benchmarks. Worth monitoring but not worth rebuilding around yet.

Why this matters for you
Your browser automation relies on scripted selectors and explicit page interaction. A model that reliably operates GUIs natively could replace hundreds of lines of Camoufox automation code -- but only once latency and cost make sense for your use cases.
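The contrast can be made concrete with a toy sketch of the two control styles: hard-coded selectors versus an observe-act loop where the model picks each action. Everything here is stubbed (FakePage, the model callable, the action schema) since GPT-5.4's computer-use interface has not been published; it shows the shape of the change, not a real API.

```python
class FakePage:
    """Stand-in for a browser page so the sketch runs without a browser."""
    def __init__(self):
        self.log = []
    def fill(self, selector, text):
        self.log.append(("fill", selector))
    def click(self, selector):
        self.log.append(("click", selector))
    def screenshot(self):
        return b"png-bytes"
    def execute(self, action):
        self.log.append((action["type"], action.get("target")))

# Scripted style: every selector and step is hard-coded, which is
# (schematically) what Camoufox automation looks like today.
def scripted_login(page):
    page.fill("#email", "mark@example.com")
    page.fill("#password", "correct-horse")
    page.click("button[type=submit]")

# Computer-use style: the model observes the screen and chooses the next
# action; the harness only executes it and checks for completion.
def computer_use_loop(model, page, goal, max_steps=10):
    for _ in range(max_steps):
        action = model(page.screenshot(), goal)  # e.g. {"type": "click", ...}
        if action["type"] == "done":
            return True
        page.execute(action)
    return False
```

The selector maintenance burden moves from your code into the model, which is exactly why latency and per-step cost decide whether the trade is worth it.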
Pending: QMD Semantic Search
BM25 + vector embeddings index for local markdown files. 96% token reduction on file reads.
Planned: Voice Call Escalation
Twilio + ElevenLabs emergency phone calls when critical alerts go unacknowledged.
Planned: Self-Improving Agent
Structured error capture and automatic correction promotion into CLAUDE.md.
Watch: Agent Browser
Abstracts DOM interaction for AI agents. Pre-built navigation and snapshot capabilities. Promising, but the Camoufox setup is working well; revisit when maturity improves.
Build: ACPX CLI
Unified CLI for managing stateful agent sessions across different agent types. Could streamline your multi-agent workflows (claude-max, codex, openclaw).
Skip: Viral Playbook Skill
One-shot viral content playbook implementation. Not relevant to your use cases.
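The QMD item blends lexical BM25 with embedding similarity. A self-contained sketch of that hybrid scoring, using a toy corpus, toy embedding vectors, and standard BM25 parameters (k1=1.5, b=0.75); none of this is QMD's actual implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 over whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_rank(query, docs, q_vec, doc_vecs, alpha=0.5):
    """Blend max-normalized BM25 with embedding cosine similarity;
    alpha weights the lexical signal against the semantic one."""
    bm = bm25_scores(query, docs)
    top = max(bm) or 1.0
    blended = [alpha * (s / top) + (1 - alpha) * cosine(q_vec, v)
               for s, v in zip(bm, doc_vecs)]
    return sorted(range(len(docs)), key=lambda i: blended[i], reverse=True)
```

The token savings come from retrieval: instead of reading whole markdown files into context, only the top-ranked chunks are loaded.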
6/6 services · $12.40 token burn · Overnight: all passed