Wednesday, March 11, 2026 · 4 min read · 12 findings
Good morning, Mark.
OpenAI shipped GPT-5.4 overnight and it's the first model to beat human baselines on OSWorld-Verified. Meanwhile, Qwen 3.5 benchmarks leaked on Reddit and they're surprisingly strong at the 35B tier -- relevant for your local inference stack.
Today: local AI models, OpenClaw skill candidates, and a potential voice calling integration
🧠 AI MODELS
1 GPT-5.4 beats human baselines on computer use
OpenAI's newest model scores 75% on OSWorld-Verified, surpassing the human baseline of 72.4%. This is specifically trained for computer-use tasks -- navigating GUIs, filling forms, clicking buttons. Latency and pricing TBD but the capability jump is real.
2 Qwen 3.5 35B benchmarks leaked on r/LocalLLaMA
Leaked benchmarks show the 35B variant matching GPT-4o on coding tasks while running at 91.9 tok/s on your M3 Ultra. If confirmed, this is a significant upgrade over the current model you're running locally.
3 Apple announces Core AI framework at WWDC 2026
Core ML is being replaced by Core AI, a broader framework supporting on-device LLM inference natively. Could simplify your local inference setup long-term.
🦞 OPENCLAW RELEASES
1 OpenClaw 2026.3.8 ships backup commands
New openclaw backup create and openclaw backup verify commands for local state archives, with --only-... flags for selective backups. You're already on 3.8.
2 2026.3.2 broke tool permissions for many users
Community reports confirm that tools were disabled by default after updating to 3.2, and browser profile defaults changed too. We patched past this in early March, but the ecosystem impact is worth noting.
🔥 HOT SKILLS
1 Persistent Memory skill hits 26K users in one week
Addresses the default context loss between sessions. You've already built a more sophisticated version with your context architecture, but the community traction validates the approach.
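What such a persistent-memory layer does can be sketched in a few lines: a key-value store that survives process restarts by writing to disk. This is an illustrative toy, assuming a simple JSON file; the skill's actual storage format and API are not published here.

```python
import json
from pathlib import Path

class SessionMemory:
    """Toy persistent memory: survives restarts by persisting JSON to disk.
    The path and schema are illustrative, not the actual skill's format."""

    def __init__(self, path):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key, value):
        self.data[key] = value
        self.path.write_text(json.dumps(self.data))  # persist on every write

    def recall(self, key, default=None):
        return self.data.get(key, default)
```

A second process pointed at the same file sees everything the first one remembered, which is the whole trick: context outlives the session.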
2 Heartbeat automations going mainstream
Autonomous skills that wake up to check calendars, Slack, and news feeds without user triggers. Aligns with your existing heartbeat system but the community is catching up.
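The heartbeat pattern is simple at its core: a loop wakes on a schedule and runs any check whose interval has elapsed, with no user trigger. A minimal sketch, with an injectable clock for testing (the class and check names are illustrative, not your system's actual code):

```python
import time

class Heartbeat:
    """Toy heartbeat scheduler: each tick runs every check whose
    interval has elapsed since it last ran."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.checks = []  # entries of [interval_s, last_run, fn]

    def register(self, interval_s, fn):
        # -inf last_run means the check fires on the first tick
        self.checks.append([interval_s, float("-inf"), fn])

    def tick(self):
        """One heartbeat: run all due checks, return their names."""
        now = self.clock()
        fired = []
        for check in self.checks:
            interval_s, last_run, fn = check
            if now - last_run >= interval_s:
                fn()
                check[1] = now
                fired.append(fn.__name__)
        return fired
```

Calendar, Slack, and news checks would each be registered with their own interval; the outer loop just calls tick() and sleeps.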
GPT-5.4 Computer Use: Should You Care?

OpenAI's GPT-5.4 is the first model to beat human baselines on the OSWorld-Verified benchmark -- a test of real computer use tasks like navigating GUIs, managing files, and interacting with web apps. The 75% score edges past the 72.4% human baseline.

This matters because computer-use models are becoming the backbone of agent automation. While your current setup uses headless Camoufox with scripted browser control, a model that can natively understand and operate GUIs could simplify complex automation tasks significantly.

The catch: pricing and latency haven't been announced. And OpenAI's track record on computer-use (remember the CUA preview?) suggests the real-world reliability may lag the benchmarks. Worth monitoring but not worth rebuilding around yet.

Why this matters for you
Your browser automation relies on scripted selectors and explicit page interaction. A model that reliably operates GUIs natively could replace hundreds of lines of Camoufox automation code -- but only once latency and cost make sense for your use cases.
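The contrast can be made concrete with a toy sketch of the two control styles: hard-coded selectors versus an observe-act loop where the model picks each action. Everything here is stubbed (FakePage, the model callable, the action schema) since GPT-5.4's computer-use interface has not been published; it shows the shape of the change, not a real API.

```python
class FakePage:
    """Stand-in for a browser page so the sketch runs without a browser."""
    def __init__(self):
        self.log = []
    def fill(self, selector, text):
        self.log.append(("fill", selector))
    def click(self, selector):
        self.log.append(("click", selector))
    def screenshot(self):
        return b"png-bytes"
    def execute(self, action):
        self.log.append((action["type"], action.get("target")))

# Scripted style: every selector and step is hard-coded, which is
# (schematically) what Camoufox automation looks like today.
def scripted_login(page):
    page.fill("#email", "mark@example.com")
    page.fill("#password", "correct-horse")
    page.click("button[type=submit]")

# Computer-use style: the model observes the screen and chooses the next
# action; the harness only executes it and checks for completion.
def computer_use_loop(model, page, goal, max_steps=10):
    for _ in range(max_steps):
        action = model(page.screenshot(), goal)  # e.g. {"type": "click", ...}
        if action["type"] == "done":
            return True
        page.execute(action)
    return False
```

The selector maintenance burden moves from your code into the model, which is exactly why latency and per-step cost decide whether the trade is worth it.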
Pending: QMD Semantic Search
BM25 + vector embeddings index for local markdown files. 96% token reduction on file reads.
Planned: Voice Call Escalation
Twilio + ElevenLabs emergency phone calls when critical alerts go unacknowledged.
Planned: Self-Improving Agent
Structured error capture and automatic correction promotion into CLAUDE.md.
Watch: Agent Browser
Abstracts DOM interaction for AI agents. Pre-built navigation and snapshot capabilities. Promising, but the Camoufox setup is working well; revisit when maturity improves.
Build: ACPX CLI
Unified CLI for managing stateful agent sessions across different agent types. Could streamline your multi-agent workflows (claude-max, codex, openclaw).
Skip: Viral Playbook Skill
One-shot viral content playbook implementation. Not relevant to your use cases.
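The QMD item blends lexical BM25 with embedding similarity. A self-contained sketch of that hybrid scoring, using a toy corpus, toy embedding vectors, and standard BM25 parameters (k1=1.5, b=0.75); none of this is QMD's actual implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 over whitespace-tokenized documents."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_rank(query, docs, q_vec, doc_vecs, alpha=0.5):
    """Blend max-normalized BM25 with embedding cosine similarity;
    alpha weights the lexical signal against the semantic one."""
    bm = bm25_scores(query, docs)
    top = max(bm) or 1.0
    blended = [alpha * (s / top) + (1 - alpha) * cosine(q_vec, v)
               for s, v in zip(bm, doc_vecs)]
    return sorted(range(len(docs)), key=lambda i: blended[i], reverse=True)
```

The token savings come from retrieval: instead of reading whole markdown files into context, only the top-ranked chunks are loaded.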
6/6 services · $12.40 token burn · Overnight: all passed