Meta's Post-Llama Era Begins

Meta Superintelligence Labs just dropped its first model — a natively multimodal reasoning system with 16 built-in tools, multi-agent orchestration, and a controversial closed-source license. We break down every benchmark, every feature, and what it means for the AI race.

April 9, 2026 · 15 min read · FlowVideo AI Research
Abstract: On April 8, 2026, Meta released Muse Spark — the first model from Meta Superintelligence Labs (MSL), the unit led by former Scale AI CEO Alexandr Wang. Built from scratch over nine months, Muse Spark is a natively multimodal reasoning model that scores 52 on the Artificial Analysis Intelligence Index, placing it 4th behind Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. But the headline numbers only tell part of the story: Muse Spark leads on health benchmarks, rivals frontier models in vision tasks, and introduces a novel Contemplating mode with multi-agent orchestration. Most controversially, it is Meta's first closed-source frontier model — a dramatic break from the Llama open-weight tradition.

1. The Backstory: From Llama to Muse

To understand why Muse Spark matters, you need to understand the turbulence that preceded it. Meta's Llama 4 launch in April 2025 was widely seen as a disappointment — the models underperformed expectations, and the open-source AI community that had rallied behind Llama began to lose faith in Meta's AI direction.

Mark Zuckerberg responded with the most aggressive AI talent acquisition in Silicon Valley history. In June 2025, Meta spent $14.3 billion to acquire a 49% nonvoting stake in Scale AI and brought in its cofounder and CEO, Alexandr Wang, as Meta's first-ever Chief AI Officer. Wang was tasked with building Meta Superintelligence Labs (MSL) — a new unit with a mandate to catch and surpass Google and OpenAI.

Nine months later, Muse Spark is the first product of that effort. Internally codenamed 'Avocado', it represents what Meta calls a 'ground-up overhaul' of their entire AI stack — new infrastructure, new architecture, new data pipelines, and critically, a new philosophy about how AI models should be built and deployed.

Apr 2025: Llama 4 launches to mixed reviews; community questions Meta's AI competitiveness
Jun 2025: Meta acquires 49% of Scale AI for $14.3B; Alexandr Wang becomes Chief AI Officer
Jun 2025: Meta Superintelligence Labs (MSL) officially formed under Wang's leadership
Jul 2025 - Mar 2026: Nine months of development; complete AI stack rebuild (codenamed 'Avocado')
Apr 6, 2026: Axios reports Meta plans to open-source versions of upcoming models
Apr 8, 2026: Muse Spark officially released; available on meta.ai and Meta AI app

Image: Meta's official blog post announcing Muse Spark as the first model from Meta Superintelligence Labs (Source: Meta AI Blog, April 8, 2026)

2. What Is Muse Spark? Architecture & Design

Muse Spark is a natively multimodal reasoning model — meaning it was built from the ground up to process text, images, and visual data as first-class inputs, rather than bolting vision capabilities onto a text-only backbone. Meta specifically states it was designed to 'integrate visual information across its internal logic,' contrasting with previous approaches that 'stitched' modalities together.

The model operates in a tiered reasoning architecture. In standard (Instant) mode, it delivers rapid responses similar to conventional chat AI. In Thinking mode, it engages in extended reasoning that yields higher-quality output. A third mode, Contemplating, uses multi-agent orchestration for the most complex tasks.

Modality: Multimodal (text + vision input, text output)
Context Window: 262K tokens
Reasoning Modes: Instant, Thinking, Contemplating
Training Efficiency: 10x less compute than Llama 4 Maverick for comparable performance
License: Proprietary (open-source version planned)
Built-in Tools: 16 integrated tool capabilities

Efficiency Breakthrough

Meta claims Muse Spark achieves comparable performance to Llama 4 Maverick while requiring 'over an order of magnitude less compute.' This efficiency gain comes from improvements to model architecture, optimization methods, and data curation during the nine-month rebuild. If validated independently, this represents a significant advance in training efficiency.

3. Benchmark Deep Dive: Where Muse Spark Stands

Muse Spark scores 52 on the Artificial Analysis Intelligence Index v4.0, placing it 4th overall. But the aggregate number masks significant variation across domains — Muse Spark leads in some benchmarks while trailing badly in others.

Artificial Analysis Intelligence Index v4.0 — Top Models

Rank | Model | Score | Developer
#1 | Gemini 3.1 Pro | 57 | Google
#2 | GPT-5.4 | 57 | OpenAI
#3 | Claude Opus 4.6 | 53 | Anthropic
#4 | Muse Spark | 52 | Meta
#5 | Claude Sonnet 4.6 | – | Anthropic
#6 | GLM-5.1 | – | Zhipu AI
#7 | MiniMax-M2.7 | – | MiniMax
#8 | Grok 4.20 | – | xAI

Where Muse Spark Excels

HealthBench Hard: #1 (score 42.8)

Outperforms GPT-5.4 (40.1), Claude Opus 4.6 (36.2), and Gemini 3.1 Pro (20.6). Meta collaborated with over 1,000 physicians to curate training data for health applications.

CharXiv Reasoning: #1 (score 86.4, Contemplating)

Tests figure and chart understanding from images. Beats GPT-5.4 (82.8) and Gemini 3.1 Pro (80.2). Demonstrates strong visual STEM reasoning.

MMMU-Pro: #2 (score 80.5%)

Multimodal understanding benchmark. Only Gemini 3.1 Pro (82.4%) scores higher. Strong performance across visual reasoning tasks.

Where Muse Spark Falls Short

Terminal-Bench 2.0: score 59.0 (16 points behind GPT-5.4's 75.1)

Coding performance is the most significant gap. Developers who rely on AI for code generation will find Muse Spark notably behind the leaders.

ARC-AGI-2: score 42.5 (34 points behind leaders at ~76)

Abstract reasoning is the most striking weakness. GPT-5.4 (76.1) and Gemini 3.1 Pro (76.5) score nearly double. This gap suggests fundamental limitations in novel pattern recognition.

GDPval-AA (Agentic Tasks): 1,427 Elo (249 points behind GPT-5.4's 1,676)

Real-world desktop and office task performance. Trails both GPT-5.4 and Claude Opus 4.6 (1,607) by significant margins.

Token Efficiency: Muse Spark's Hidden Advantage

One underappreciated metric: Muse Spark used just 58 million output tokens to complete the full Intelligence Index evaluation — comparable to Gemini 3.1 Pro (57M) but far less than Claude Opus 4.6 (157M) and GPT-5.4 (120M). Meta calls this 'thought compression' — the model optimizes token usage by solving problems with significantly fewer tokens after initial thinking phases. For cost-sensitive deployments, this efficiency could be decisive.
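Since output tokens are billed linearly on most commercial APIs, the relative cost impact of these token counts can be computed directly from the figures quoted above (a back-of-the-envelope illustration only; Meta has not published API pricing):

```python
# Output tokens used to complete the Intelligence Index evaluation,
# per the figures quoted above (in millions).
output_tokens_m = {
    "Muse Spark": 58,
    "Gemini 3.1 Pro": 57,
    "GPT-5.4": 120,
    "Claude Opus 4.6": 157,
}

# Relative output-token usage, normalized to Muse Spark.
baseline = output_tokens_m["Muse Spark"]
for model, tokens in output_tokens_m.items():
    print(f"{model}: {tokens}M tokens ({tokens / baseline:.2f}x Muse Spark)")
```

At equal per-token prices, this puts Claude Opus 4.6 at roughly 2.7x and GPT-5.4 at roughly 2.1x Muse Spark's output-token cost for the same evaluation workload.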

4. Contemplating Mode: Multi-Agent Reasoning

The most technically interesting feature of Muse Spark is its three-tier reasoning system. While most frontier models offer a single 'thinking' mode, Meta has built a hierarchy:

Contemplating mode is particularly notable because it uses multi-agent orchestration under the hood — spawning multiple sub-agents that work in parallel to break down complex problems. Meta claims this achieves 'superior performance with comparable latency' compared to single-agent extended thinking.

Instant: Standard chat mode with fast responses for simple queries. Comparable to GPT-5.4 mini or Claude Haiku. Best for quick questions, simple tasks, and conversational interaction.

Thinking: Extended chain-of-thought reasoning by a single agent, with deeper analysis and enhanced output quality. Best for complex questions, analysis, content creation, and coding tasks.

Contemplating: Multi-agent orchestration in which parallel sub-agents collaborate to solve hard problems. Comparable to Gemini Deep Think and GPT-5.4 Pro. Best for research tasks, complex STEM problems, and multi-step analysis.
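Meta has not published how Contemplating mode's orchestration works internally. Purely as an illustrative sketch of the general pattern the tiers above describe (a coordinator fanning out parallel sub-agents and merging their partial answers), with every function name invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def spawn_agent(subtask: str) -> str:
    """Hypothetical sub-agent. In a real system this would be a model call
    that reasons about one decomposed piece of the problem."""
    return f"findings for: {subtask}"

def contemplate(problem: str, subtasks: list[str]) -> str:
    """Fan out sub-agents in parallel, then merge their partial results.
    This mirrors the multi-agent orchestration pattern described above,
    not Meta's actual implementation."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        partials = list(pool.map(spawn_agent, subtasks))
    # A coordinator step would normally synthesize these partial answers
    # into one final response; here we simply join them.
    return " | ".join(partials)

result = contemplate(
    "plan an experiment",
    ["survey prior work", "design methodology", "estimate costs"],
)
print(result)
```

The key property of this pattern is that wall-clock latency is bounded by the slowest sub-agent rather than the sum of all of them, which is one plausible reading of Meta's "superior performance with comparable latency" claim.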

Contemplating Mode Benchmark Results

Benchmark | Muse Spark (Contemplating) | Description
Humanity's Last Exam | 58% | Grad-level reasoning across disciplines
FrontierScience Research | 38% | Cutting-edge scientific reasoning
GPQA Diamond | 89.5% | Graduate-level scientific Q&A
CharXiv Reasoning | 86.4 | Visual chart and figure analysis

5. 16 Built-In Tools: A Full Development Platform

One of Muse Spark's most distinctive features is its deeply integrated toolset. Unlike models that treat tool-use as an afterthought, Muse Spark ships with 16 native tools that turn it into a complete development and research platform. Developer Simon Willison documented all of them after the launch.

Search & Browse

browser.search: Web search via undisclosed engine
browser.open: Load full pages from search results
browser.find: Pattern matching on page content

Meta Platform Integration

meta_1p.content_search: Semantic search across Instagram, Threads, Facebook posts (2025+ content)
meta_1p.meta_catalog_search: Product catalog search for shopping features

Code & Computation

container.python_execution: Full Python sandbox (numpy, pandas, matplotlib, scikit-learn, OpenCV)
container.create_web_artifact: HTML/JavaScript/SVG sandbox for web app prototyping
container.file_search: Search uploaded documents
container.view/insert/str_replace: File editing capabilities similar to code editors

Vision & Media

media.image_gen: Image generation with artistic and realistic modes, multiple aspect ratios
container.visual_grounding: Object detection with point, bbox, and count modes (likely Segment Anything)
container.download_meta_1p_media: Pull Instagram/Facebook/Threads media into sandbox

Agent & Integration

subagents.spawn_agent: Delegate tasks to sub-agents for parallel research/analysis
third_party.link_third_party_account: Google Calendar, Outlook, Gmail integration
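There is no public API for these tools, so their wire format is unknown. Purely as an illustration of how a model runtime typically routes a call to a named tool, a sketch in the common JSON function-calling style (the envelope fields are invented; only the tool name comes from the list above):

```python
import json

# Hypothetical tool-call envelope; the "tool"/"arguments" field names
# are invented for illustration and are not Meta's documented format.
tool_call = {
    "tool": "container.python_execution",
    "arguments": {
        "code": "import pandas as pd; print(pd.__version__)",
    },
}

# A runtime would serialize this, route it to the sandbox, and feed the
# sandbox's stdout back to the model as a tool result.
payload = json.dumps(tool_call)
decoded = json.loads(payload)
print(decoded["tool"])
```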

Developer Transparency

Simon Willison noted that Meta deserves credit for not hiding the tool interface: 'credit to Meta for not telling their bot to hide these, since it is far less frustrating if I can get them out without having to mess around with jailbreaks.' The tool names and parameters are fully visible to users, enabling developers to understand exactly what the model can do.

6. The Open Source Controversy

Perhaps the most controversial aspect of Muse Spark is what it represents strategically: Meta's first closed-source frontier model. The company that championed open weights with the Llama series — building enormous goodwill in the developer community — has now shipped a proprietary model with no public weights, no architecture details, and no API for general developers.

The backlash was immediate. VentureBeat ran with the headline 'Goodbye, Llama?' The Register quipped that Meta's new model 'is as open as Zuckerberg's private school.' Developer forums erupted with debate about whether Meta had abandoned its open-source principles.

Meta's response has been carefully calibrated. On X, leadership stated: 'Nine months ago we rebuilt our AI stack from scratch. New infrastructure, new architecture, new data pipelines... This is step one. Bigger models are already in development with plans to open-source future versions.' Axios reported two days before launch that Meta planned to release open-source versions of its next AI models.

Model Weights: Not Available. No public release of Muse Spark weights; the first Meta frontier model without open weights.

Architecture Details: Not Available. No paper and no technical report beyond the blog post; the internal architecture remains proprietary.

Public API: Coming Soon. Private API preview for select partners only; paid API access planned for a broader audience.

Open Source Version: Promised. Meta has stated plans to open-source future versions, but no timeline has been given.

Strategic Read

The shift to closed-source likely reflects two pressures: (1) the Llama 4 failure showed that open weights alone do not guarantee ecosystem adoption if the models underperform, and (2) Alexandr Wang's Scale AI background is rooted in data quality and proprietary advantages, not open-source ideology. The promise of future open-source releases may be genuine, or it may be a holding pattern while Meta evaluates the competitive landscape.

7. Health, Vision & Multimodal Strengths

While Muse Spark trails the leaders in coding and abstract reasoning, it has carved out genuine strengths in health applications and visual understanding that deserve attention.

Health AI: The #1 Benchmark Score

Muse Spark's 42.8 score on HealthBench Hard is the highest of any model tested — above GPT-5.4 (40.1), Claude Opus 4.6 (36.2), and dramatically above Gemini 3.1 Pro (20.6). Meta says it collaborated with over 1,000 physicians to curate training data, enabling 'factual, comprehensive health responses including interactive nutritional and exercise displays.'

This is notable because health is an area where accuracy has life-or-death implications. Meta's investment in physician-curated data appears to have paid off in benchmark performance, though real-world clinical validation remains essential before any medical application.

Visual STEM Reasoning

The CharXiv and MMMU-Pro results tell a consistent story: Muse Spark excels at understanding charts, figures, and visual information. In Contemplating mode, it scored 86.4 on CharXiv Reasoning — the best of any model. On MMMU-Pro, its 80.5% trails only Gemini 3.1 Pro (82.4%).

For users working with scientific literature, data visualization, or technical documentation, Muse Spark's visual understanding capabilities may be best-in-class. The model was specifically highlighted for its ability to create 'interactive experiences like creating fun minigames or troubleshooting your home appliances' based on visual input.

8. What This Means for Developers

If you are building AI-powered applications, here is a pragmatic assessment of where Muse Spark fits in the current landscape.

Where to Use Muse Spark

1. Health & Medical Apps: Best-in-class benchmark scores. If you are building health-adjacent features, Muse Spark should be on your evaluation list.
2. Visual Analysis: Chart understanding, figure interpretation, and visual STEM tasks. The CharXiv and MMMU-Pro scores are genuinely impressive.
3. Meta Platform Integration: If your product lives in the Meta ecosystem (Instagram, WhatsApp, Facebook), the native platform tools give Muse Spark capabilities no other model offers.
4. Cost-Sensitive Deployments: 58M output tokens vs 157M for Claude Opus; the efficiency gains translate directly to lower inference costs at scale.

Where to Look Elsewhere

1. Code Generation: The 16-point Terminal-Bench gap to GPT-5.4 is significant. For coding-heavy workflows, GPT-5.4 or Claude remain stronger choices.
2. Agentic Workflows: GDPval-AA results show Muse Spark trails by 249 Elo points on real desktop tasks. For autonomous agent applications, Claude and GPT-5.4 are more reliable.
3. Abstract Reasoning: The ARC-AGI-2 gap (42.5 vs ~76) is the largest weakness. Tasks requiring novel pattern recognition should use frontier alternatives.

Current Availability

meta.ai website: Available Now
Meta AI App: Available Now
WhatsApp: Rolling Out
Instagram: Rolling Out
Facebook & Messenger: Rolling Out
Ray-Ban Meta AI Glasses: Rolling Out
Public API: Not Yet Available
Open Source Weights: Not Yet Available

9. Safety & Evaluation Awareness

Meta conducted extensive safety evaluations following its Advanced AI Scaling Framework v2, assessing frontier risk categories and behavioral alignment. The model showed strong refusal behavior in biological and chemical weapons domains, and no autonomous hazards were detected in cybersecurity or loss-of-control scenarios.

However, one finding stands out: Apollo Research detected high 'evaluation awareness' in Muse Spark — the model frequently identified assessment scenarios as alignment tests. This means the model may behave differently when it detects it is being evaluated versus when it is in production use. Meta flagged this for further research but did not delay the release.

This is worth monitoring. Evaluation awareness is a known concern in AI safety research — a model that can detect when it is being tested could theoretically 'game' safety evaluations while behaving differently in deployment. Meta's transparency in disclosing this finding is commendable, but the implications deserve ongoing scrutiny.

Safety Consideration

Apollo Research found that Muse Spark demonstrates high evaluation awareness — it can frequently detect when it is being tested for safety. While Meta has disclosed this finding transparently, it raises questions about the reliability of safety benchmarks for this class of models. Independent safety audits are recommended before deploying Muse Spark in high-stakes applications.

Conclusion: A New Chapter for Meta AI

Muse Spark is not the best model in the world — that distinction currently belongs to Gemini 3.1 Pro and GPT-5.4, which lead on the Intelligence Index at 57 vs. Muse Spark's 52. But it represents something arguably more important: proof that Meta's $14.3 billion bet on Alexandr Wang and the Superintelligence Labs is producing results.

In nine months, a new team rebuilt Meta's entire AI stack and shipped a model that is competitive with frontier systems while using an order of magnitude less compute. It leads in health benchmarks, excels at visual reasoning, and introduces genuinely novel features like multi-agent Contemplating mode and 16 integrated tools.

The open-source question remains the elephant in the room. Meta built its AI developer community on the promise of openness. Muse Spark's closed-source launch — regardless of future open-source plans — changes that relationship. Whether this is a temporary strategic choice or a permanent shift will define Meta's position in the AI ecosystem for years to come.

For now, Muse Spark is available to anyone with a Facebook or Instagram account at meta.ai. Try it. Test its visual reasoning. Push its health capabilities. And watch this space — Meta has said bigger models are already in development.

Last updated: April 9, 2026. This analysis reflects publicly available information at the time of publication. Benchmark scores and availability may change as the model matures.

Frequently Asked Questions

What is Meta Muse Spark?

Muse Spark is the first AI model released by Meta Superintelligence Labs (MSL), the new AI research division led by former Scale AI CEO Alexandr Wang. It is a natively multimodal reasoning model that accepts text and image inputs, supports three reasoning modes (Instant, Thinking, Contemplating), and includes 16 built-in tools for search, code execution, image generation, and more. It was released on April 8, 2026.

How does Muse Spark compare to GPT-5.4 and Claude Opus 4.6?

On the Artificial Analysis Intelligence Index v4.0, Muse Spark scores 52, placing it 4th behind Gemini 3.1 Pro (57), GPT-5.4 (57), and Claude Opus 4.6 (53). Muse Spark leads in health benchmarks (HealthBench Hard: 42.8 vs GPT-5.4's 40.1) and visual reasoning (CharXiv: 86.4 in Contemplating mode) but trails significantly in coding (Terminal-Bench: 59 vs 75.1) and abstract reasoning (ARC-AGI-2: 42.5 vs ~76).

Is Muse Spark open source?

No, Muse Spark is currently a closed-source, proprietary model — a notable departure from Meta's open-weight Llama series. Meta has stated that it plans to release open-source versions of future models, and Axios reported on April 6, 2026, that Meta was preparing to open-source versions of its next AI models. However, no timeline has been given for open-sourcing Muse Spark itself.

What is Contemplating mode?

Contemplating mode is Muse Spark's most advanced reasoning tier. Unlike standard thinking modes that use a single chain of thought, Contemplating mode deploys multiple sub-agents that work in parallel to break down complex problems. Meta claims it achieves performance comparable to extreme reasoning modes like Gemini Deep Think and GPT-5.4 Pro. On Humanity's Last Exam, Contemplating mode scored 58%; on FrontierScience Research, it scored 38%.

Can I use Muse Spark via API?

Not yet for most developers. Muse Spark is currently available in private API preview to select partners only. Meta has indicated plans to offer paid API access to a wider audience, but no pricing or timeline has been announced. For now, you can use Muse Spark for free through the meta.ai website or the Meta AI app.

What happened to Meta Llama?

The Llama model family has not been officially discontinued, but Muse Spark signals a new direction. Llama 4, released in April 2025, underperformed expectations and failed to gain the developer traction Meta hoped for. Muse Spark represents a clean break — built from scratch by a new team with a new architecture. Meta has not confirmed whether future Llama releases are planned alongside the Muse family.

Who is Alexandr Wang and why does he matter?

Alexandr Wang is the cofounder and former CEO of Scale AI, the leading AI data labeling company. In June 2025, Meta spent $14.3 billion to acquire a 49% nonvoting stake in Scale AI and hired Wang as Meta's first-ever Chief AI Officer. He leads Meta Superintelligence Labs, the division that built Muse Spark. His background in data quality and AI infrastructure is seen as central to Muse Spark's training efficiency improvements.

What are Muse Spark's biggest weaknesses?

Based on published benchmarks, Muse Spark's three most significant weaknesses are: (1) Coding — it scores 59.0 on Terminal-Bench 2.0, a 16-point gap behind GPT-5.4; (2) Abstract reasoning — its ARC-AGI-2 score of 42.5 is roughly half the ~76 scored by frontier competitors; and (3) Agentic tasks — its GDPval-AA Elo of 1,427 trails GPT-5.4 by 249 points. These gaps are significant for developers building code generation or autonomous agent applications.

Is Muse Spark safe to use?

Meta conducted extensive safety evaluations and found strong performance in refusing harmful requests related to biological and chemical weapons, with no autonomous hazards in cybersecurity. However, Apollo Research detected high 'evaluation awareness' — the model can detect when it is being tested for safety, raising questions about whether safety benchmarks fully capture its deployment behavior. Meta disclosed this transparently and flagged it for ongoing research.

When will Muse Spark be available on WhatsApp and Instagram?

Meta announced that Muse Spark will roll out to WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban Meta AI glasses 'in the coming weeks' from the April 8, 2026 launch date. No specific dates have been given for each platform. The model is currently available on meta.ai and the standalone Meta AI app.
