
Muse Spark
Meta's Post-Llama Era Begins
Meta Superintelligence Labs just dropped its first model — a natively multimodal reasoning system with 16 built-in tools, multi-agent orchestration, and a controversial closed-source license. We break down every benchmark, every feature, and what it means for the AI race.
Abstract: On April 8, 2026, Meta released Muse Spark — the first model from Meta Superintelligence Labs (MSL), the unit led by former Scale AI CEO Alexandr Wang. Built from scratch over nine months, Muse Spark is a natively multimodal reasoning model that scores 52 on the Artificial Analysis Intelligence Index, placing it 4th behind Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. But the headline numbers only tell part of the story: Muse Spark leads on health benchmarks, rivals frontier models in vision tasks, and introduces a novel Contemplating mode with multi-agent orchestration. Most controversially, it is Meta's first closed-source frontier model — a dramatic break from the Llama open-weight tradition.
Table of Contents
- The Backstory: From Llama to Muse
- What Is Muse Spark? Architecture & Design
- Benchmark Deep Dive: Where Muse Spark Stands
- Contemplating Mode: Multi-Agent Reasoning
- 16 Built-In Tools: A Full Development Platform
- The Open Source Controversy
- Health, Vision & Multimodal Strengths
- What This Means for Developers
- Safety & Evaluation Awareness
- Conclusion: A New Chapter for Meta AI
- FAQ
1. The Backstory: From Llama to Muse
To understand why Muse Spark matters, you need to understand the turbulence that preceded it. Meta's Llama 4 launch in April 2025 was widely seen as a disappointment — the models underperformed expectations, and the open-source AI community that had rallied behind Llama began to lose faith in Meta's AI direction.
Mark Zuckerberg responded with the most aggressive AI talent acquisition in Silicon Valley history. In June 2025, Meta spent $14.3 billion to acquire a 49% nonvoting stake in Scale AI and brought in its cofounder and CEO, Alexandr Wang, as Meta's first-ever Chief AI Officer. Wang was tasked with building Meta Superintelligence Labs (MSL) — a new unit with a mandate to catch and surpass Google and OpenAI.
Nine months later, Muse Spark is the first product of that effort. Internally codenamed 'Avocado', it represents what Meta calls a 'ground-up overhaul' of its entire AI stack — new infrastructure, new architecture, new data pipelines, and critically, a new philosophy about how AI models should be built and deployed.
- April 2025: Llama 4 launches to mixed reviews; community questions Meta's AI competitiveness
- June 2025: Meta acquires 49% of Scale AI for $14.3B; Alexandr Wang becomes Chief AI Officer
- Meta Superintelligence Labs (MSL) officially formed under Wang's leadership
- Nine months of development: complete AI stack rebuild (codenamed 'Avocado')
- April 6, 2026: Axios reports Meta plans to open-source versions of upcoming models
- April 8, 2026: Muse Spark officially released; available on meta.ai and Meta AI app

Source: Meta AI Blog — April 8, 2026
2. What Is Muse Spark? Architecture & Design
Muse Spark is a natively multimodal reasoning model — meaning it was built from the ground up to process text, images, and visual data as first-class inputs, rather than bolting vision capabilities onto a text-only backbone. Meta specifically states it was designed to 'integrate visual information across its internal logic,' contrasting with previous approaches that 'stitched' modalities together.
The model operates in a dual-mode architecture. In standard (Instant) mode, it delivers rapid responses similar to conventional chat AI. In Thinking mode, it engages extended reasoning with superior output quality. A third mode — Contemplating — uses multi-agent orchestration for the most complex tasks.
| Specification | Detail |
|---|---|
| Modality | Multimodal: text + vision input, text output |
| Context window | 262K tokens |
| Reasoning modes | Instant, Thinking, Contemplating |
| Training efficiency | 10x less compute than Llama 4 Maverick for comparable performance |
| License | Proprietary (open-source version planned) |
| Built-in tools | 16 integrated tool capabilities |
Efficiency Breakthrough
Meta claims Muse Spark achieves comparable performance to Llama 4 Maverick while requiring 'over an order of magnitude less compute.' This efficiency gain comes from improvements to model architecture, optimization methods, and data curation during the nine-month rebuild. If validated independently, this represents a significant advance in training efficiency.
3. Benchmark Deep Dive: Where Muse Spark Stands
Muse Spark scores 52 on the Artificial Analysis Intelligence Index v4.0, placing it 4th overall. But the aggregate number masks significant variation across domains — Muse Spark leads in some benchmarks while trailing badly in others.
Artificial Analysis Intelligence Index v4.0 — Top Models
| Rank | Model | Score | Developer |
|---|---|---|---|
| #1 | Gemini 3.1 Pro | 57 | Google |
| #2 | GPT-5.4 | 57 | OpenAI |
| #3 | Claude Opus 4.6 | 53 | Anthropic |
| #4 | Muse Spark | 52 | Meta |
| #5 | Claude Sonnet 4.6 | — | Anthropic |
| #6 | GLM-5.1 | — | Zhipu AI |
| #7 | MiniMax-M2.7 | — | MiniMax |
| #8 | Grok 4.20 | — | xAI |
Where Muse Spark Excels
- HealthBench Hard (42.8, #1 overall): Outperforms GPT-5.4 (40.1), Claude Opus 4.6 (36.2), and Gemini 3.1 Pro (20.6). Meta collaborated with over 1,000 physicians to curate training data for health applications.
- CharXiv Reasoning: Tests figure and chart understanding from images. Beats GPT-5.4 (82.8) and Gemini 3.1 Pro (80.2), demonstrating strong visual STEM reasoning.
- MMMU-Pro (80.5%): Multimodal understanding benchmark. Only Gemini 3.1 Pro (82.4%) scores higher. Strong performance across visual reasoning tasks.
Where Muse Spark Falls Short
- Coding: The most significant gap. Developers who rely on AI for code generation will find Muse Spark notably behind the leaders.
- Abstract reasoning: The most striking weakness. GPT-5.4 (76.1) and Gemini 3.1 Pro (76.5) score nearly double, suggesting fundamental limitations in novel pattern recognition.
- Desktop and office tasks: Real-world agentic performance trails both GPT-5.4 and Claude Opus 4.6 (1,607) by significant margins.
Token Efficiency: Muse Spark's Hidden Advantage
One underappreciated metric: Muse Spark used just 58 million output tokens to complete the full Intelligence Index evaluation — comparable to Gemini 3.1 Pro (57M) but far less than Claude Opus 4.6 (157M) and GPT-5.4 (120M). Meta calls this 'thought compression' — the model optimizes token usage by solving problems with significantly fewer tokens after initial thinking phases. For cost-sensitive deployments, this efficiency could be decisive.
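To make the token-efficiency claim concrete, the figures quoted above can be compared directly. The per-token price below is a placeholder assumption purely for illustration; Meta has not published pricing.

```python
# Relative output-token usage across the full Intelligence Index run,
# using the token counts quoted above. The price is a hypothetical
# placeholder, NOT a published rate for any of these models.
PRICE_PER_M_OUTPUT_TOKENS = 10.0  # assumed $/1M output tokens

eval_output_tokens_m = {
    "Muse Spark": 58,
    "Gemini 3.1 Pro": 57,
    "GPT-5.4": 120,
    "Claude Opus 4.6": 157,
}

baseline = eval_output_tokens_m["Muse Spark"]
for model, tokens_m in eval_output_tokens_m.items():
    cost = tokens_m * PRICE_PER_M_OUTPUT_TOKENS
    ratio = tokens_m / baseline
    print(f"{model}: {tokens_m}M tokens, ${cost:,.0f} at the assumed rate "
          f"({ratio:.1f}x Muse Spark)")
```

At any flat per-token price, Claude Opus 4.6 would cost roughly 2.7x as much as Muse Spark to complete the same evaluation, and GPT-5.4 roughly 2.1x.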
4. Contemplating Mode: Multi-Agent Reasoning
The most technically interesting feature of Muse Spark is its three-tier reasoning system. While most frontier models offer a single 'thinking' mode, Meta has built a hierarchy:
Contemplating mode is particularly notable because it uses multi-agent orchestration under the hood — spawning multiple sub-agents that work in parallel to break down complex problems. Meta claims this achieves 'superior performance with comparable latency' compared to single-agent extended thinking.
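Meta has not published how Contemplating mode works internally, but the fan-out/fan-in pattern it describes can be sketched generically. In this illustration, `spawn_sub_agent` is a stand-in for a real model call, and the decomposition into subtasks is assumed to happen upstream.

```python
# Generic fan-out/fan-in sketch of multi-agent orchestration.
# This is NOT Meta's implementation; it only illustrates the pattern
# of parallel sub-agents followed by an aggregation step.
from concurrent.futures import ThreadPoolExecutor

def spawn_sub_agent(subtask: str) -> str:
    """Stand-in for a sub-agent call; a real system would invoke a model here."""
    return f"findings for: {subtask}"

def contemplate(problem: str, subtasks: list[str]) -> str:
    # Fan out: each sub-agent works on one decomposed subtask in parallel.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        partials = list(pool.map(spawn_sub_agent, subtasks))
    # Fan in: an aggregation step merges partial results into one answer.
    return f"{problem} -> synthesized from {len(partials)} sub-agents"

result = contemplate(
    "estimate material cost",
    ["survey prices", "compute quantities", "check delivery fees"],
)
```

Parallelism is what lets Meta claim 'comparable latency': wall-clock time is bounded by the slowest sub-agent plus aggregation, not the sum of all sub-agent runs.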
- Instant: Standard chat mode. Fast responses for simple queries. Comparable to GPT-5.4 mini or Claude Haiku. Best for quick questions, simple tasks, and conversational interaction.
- Thinking: Extended reasoning with chain-of-thought. Single agent with deeper analysis and enhanced output quality. Best for complex questions, analysis, content creation, and coding tasks.
- Contemplating: Multi-agent orchestration. Parallel sub-agents collaborate to solve hard problems. Comparable to Gemini Deep Think and GPT-5.4 Pro. Best for research tasks, complex STEM problems, and multi-step analysis.

Contemplating Mode Benchmark Results
| Benchmark | Muse Spark (Contemplating) | Description |
|---|---|---|
| Humanity's Last Exam | 58% | Grad-level reasoning across disciplines |
| FrontierScience Research | 38% | Cutting-edge scientific reasoning |
| GPQA Diamond | 89.5% | Graduate-level scientific Q&A |
| CharXiv Reasoning | 86.4 | Visual chart and figure analysis |
5. 16 Built-In Tools: A Full Development Platform
One of Muse Spark's most distinctive features is its deeply integrated toolset. Unlike models that treat tool-use as an afterthought, Muse Spark ships with 16 native tools that turn it into a complete development and research platform. Developer Simon Willison documented all of them after the launch.
Search & Browse
- `browser.search`: Web search via undisclosed engine
- `browser.open`: Load full pages from search results
- `browser.find`: Pattern matching on page content

Meta Platform Integration
- `meta_1p.content_search`: Semantic search across Instagram, Threads, Facebook posts (2025+ content)
- `meta_1p.meta_catalog_search`: Product catalog search for shopping features

Code & Computation
- `container.python_execution`: Full Python sandbox (numpy, pandas, matplotlib, scikit-learn, OpenCV)
- `container.create_web_artifact`: HTML/JavaScript/SVG sandbox for web app prototyping
- `container.file_search`: Search uploaded documents
- `container.view` / `container.insert` / `container.str_replace`: File editing capabilities similar to code editors

Vision & Media
- `media.image_gen`: Image generation with artistic and realistic modes, multiple aspect ratios
- `container.visual_grounding`: Object detection: point, bbox, and count modes (likely Segment Anything)
- `container.download_meta_1p_media`: Pull Instagram/Facebook/Threads media into sandbox

Agent & Integration
- `subagents.spawn_agent`: Delegate tasks to sub-agents for parallel research/analysis
- `third_party.link_third_party_account`: Google Calendar, Outlook, Gmail integration
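The tool names above come from Willison's write-up, but Meta has not published a call schema. The `{"tool": ..., "arguments": ...}` shape below is purely an assumption used to illustrate how an application might validate a parsed tool call against the documented names.

```python
import json

# Hypothetical tool-call payload. The tool NAME is from Willison's
# documented list; the surrounding schema ("tool", "arguments") is an
# assumed shape -- Meta has published no public API specification.
tool_call = {
    "tool": "container.python_execution",
    "arguments": {"code": "import pandas as pd; print(pd.__version__)"},
}

DOCUMENTED_TOOLS = {
    "browser.search", "browser.open", "browser.find",
    "meta_1p.content_search", "meta_1p.meta_catalog_search",
    "container.python_execution", "container.create_web_artifact",
    "container.file_search", "media.image_gen",
    "container.visual_grounding", "container.download_meta_1p_media",
    "subagents.spawn_agent", "third_party.link_third_party_account",
}

def is_known_tool(call: dict, registry: set[str]) -> bool:
    """Check a parsed tool call against the documented tool names."""
    return call.get("tool") in registry

payload = json.dumps(tool_call)          # serialize as a model might emit it
assert is_known_tool(json.loads(payload), DOCUMENTED_TOOLS)
```

Because the interface is visible rather than hidden, this kind of allow-list check is feasible for anyone building on top of the model's tool output.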
Simon Willison noted that Meta deserves credit for not hiding the tool interface: 'credit to Meta for not telling their bot to hide these, since it is far less frustrating if I can get them out without having to mess around with jailbreaks.' The tool names and parameters are fully visible to users, enabling developers to understand exactly what the model can do.
6. The Open Source Controversy
Perhaps the most controversial aspect of Muse Spark is what it represents strategically: Meta's first closed-source frontier model. The company that championed open weights with the Llama series — building enormous goodwill in the developer community — has now shipped a proprietary model with no public weights, no architecture details, and no API for general developers.
The backlash was immediate. VentureBeat ran with the headline 'Goodbye, Llama?' The Register quipped that Meta's new model 'is as open as Zuckerberg's private school.' Developer forums erupted with debate about whether Meta had abandoned its open-source principles.
Meta's response has been carefully calibrated. On X, leadership stated: 'Nine months ago we rebuilt our AI stack from scratch. New infrastructure, new architecture, new data pipelines... This is step one. Bigger models are already in development with plans to open-source future versions.' Axios reported two days before launch that Meta planned to release open-source versions of its next AI models.
- Weights: No public release of Muse Spark weights. First Meta frontier model without open weights.
- Documentation: No paper, no technical report beyond the blog post. Internal architecture remains proprietary.
- API: Private API preview for select partners only. Paid API access planned for a broader audience.
- Roadmap: Meta has stated plans to open-source future versions. No timeline given.
Strategic Read
The shift to closed-source likely reflects two pressures: (1) the Llama 4 failure showed that open weights alone do not guarantee ecosystem adoption if the models underperform, and (2) Alexandr Wang's Scale AI background is rooted in data quality and proprietary advantages, not open-source ideology. The promise of future open-source releases may be genuine, or it may be a holding pattern while Meta evaluates the competitive landscape.
7. Health, Vision & Multimodal Strengths
While Muse Spark trails the leaders in coding and abstract reasoning, it has carved out genuine strengths in health applications and visual understanding that deserve attention.
Health AI: The #1 Benchmark Score
Muse Spark's 42.8 score on HealthBench Hard is the highest of any model tested — above GPT-5.4 (40.1), Claude Opus 4.6 (36.2), and dramatically above Gemini 3.1 Pro (20.6). Meta says it collaborated with over 1,000 physicians to curate training data, enabling 'factual, comprehensive health responses including interactive nutritional and exercise displays.'
This is notable because health is an area where accuracy has life-or-death implications. Meta's investment in physician-curated data appears to have paid off in benchmark performance, though real-world clinical validation remains essential before any medical application.
Visual STEM Reasoning
The CharXiv and MMMU-Pro results tell a consistent story: Muse Spark excels at understanding charts, figures, and visual information. In Contemplating mode, it scored 86.4 on CharXiv Reasoning — the best of any model. On MMMU-Pro, its 80.5% trails only Gemini 3.1 Pro (82.4%).
For users working with scientific literature, data visualization, or technical documentation, Muse Spark's visual understanding capabilities may be best-in-class. The model was specifically highlighted for its ability to create 'interactive experiences like creating fun minigames or troubleshooting your home appliances' based on visual input.
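The `container.visual_grounding` tool reportedly supports point, bbox, and count modes, but its response format is undocumented. The field names below (`detections`, `box`, `label`) are assumptions; the sketch only shows how an application might tally objects from a hypothetical bbox-mode response.

```python
# Hypothetical bbox-mode response from container.visual_grounding.
# Field names are ASSUMED -- the real format has not been published.
response = {
    "mode": "bbox",
    "detections": [
        {"label": "dial", "box": [120, 40, 180, 95]},
        {"label": "dial", "box": [200, 40, 260, 95]},
        {"label": "door handle", "box": [40, 300, 90, 420]},
    ],
}

def count_by_label(resp: dict) -> dict[str, int]:
    """Tally detected objects per label from a bbox-mode response."""
    counts: dict[str, int] = {}
    for det in resp["detections"]:
        counts[det["label"]] = counts.get(det["label"], 0) + 1
    return counts

counts = count_by_label(response)
```

A tally like this is the kind of glue code the appliance-troubleshooting use case implies: detect parts in a photo, count them, then reason over the result.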
8. What This Means for Developers
If you are building AI-powered applications, here is a pragmatic assessment of where Muse Spark fits in the current landscape.
Where to Use Muse Spark
- Health and wellness applications: the top HealthBench Hard score, backed by physician-curated training data
- Chart, figure, and document understanding: leading CharXiv results and near-leading MMMU-Pro performance
- Cost-sensitive deployments: 'thought compression' keeps output-token usage among the lowest of the frontier models

Where to Look Elsewhere
- Code generation: Muse Spark trails the leaders notably
- Abstract reasoning and novel pattern recognition: GPT-5.4 and Gemini 3.1 Pro score nearly double
- Desktop and office automation: behind both GPT-5.4 and Claude Opus 4.6

Current Availability
- Consumer access via meta.ai and the Meta AI app, for anyone with a Facebook or Instagram account
- No general API: access is limited to a private preview for select partners, with paid access planned
9. Safety & Evaluation Awareness
Meta conducted extensive safety evaluations following its Advanced AI Scaling Framework v2, assessing frontier risk categories and behavioral alignment. The model showed strong refusal behavior in biological and chemical weapons domains, and no autonomous hazards were detected in cybersecurity or loss-of-control scenarios.
However, one finding stands out: Apollo Research detected high 'evaluation awareness' in Muse Spark — the model frequently identified assessment scenarios as alignment tests. This means the model may behave differently when it detects it is being evaluated versus when it is in production use. Meta flagged this for further research but did not delay the release.
This is worth monitoring. Evaluation awareness is a known concern in AI safety research — a model that can detect when it is being tested could theoretically 'game' safety evaluations while behaving differently in deployment. Meta's transparency in disclosing this finding is commendable, but the implications deserve ongoing scrutiny.
Safety Consideration
Apollo Research found that Muse Spark demonstrates high evaluation awareness — it can frequently detect when it is being tested for safety. While Meta has disclosed this finding transparently, it raises questions about the reliability of safety benchmarks for this class of models. Independent safety audits are recommended before deploying Muse Spark in high-stakes applications.
Conclusion: A New Chapter for Meta AI
Muse Spark is not the best model in the world — that distinction currently belongs to Gemini 3.1 Pro and GPT-5.4, which lead on the Intelligence Index at 57 vs. Muse Spark's 52. But it represents something arguably more important: proof that Meta's $14.3 billion bet on Alexandr Wang and the Superintelligence Labs is producing results.
In nine months, a new team rebuilt Meta's entire AI stack and shipped a model that is competitive with frontier systems while using an order of magnitude less compute. It leads in health benchmarks, excels at visual reasoning, and introduces genuinely novel features like multi-agent Contemplating mode and 16 integrated tools.
The open-source question remains the elephant in the room. Meta built its AI developer community on the promise of openness. Muse Spark's closed-source launch — regardless of future open-source plans — changes that relationship. Whether this is a temporary strategic choice or a permanent shift will define Meta's position in the AI ecosystem for years to come.
For now, Muse Spark is available to anyone with a Facebook or Instagram account at meta.ai. Try it. Test its visual reasoning. Push its health capabilities. And watch this space — Meta has said bigger models are already in development.
Last updated: April 9, 2026. This analysis reflects publicly available information at the time of publication. Benchmark scores and availability may change as the model matures.
Frequently Asked Questions
What is Meta Muse Spark?
Muse Spark is the first model from Meta Superintelligence Labs, released April 8, 2026. It is a natively multimodal reasoning model with three reasoning modes (Instant, Thinking, Contemplating) and 16 built-in tools.
How does Muse Spark compare to GPT-5.4 and Claude Opus 4.6?
It scores 52 on the Artificial Analysis Intelligence Index, just behind Claude Opus 4.6 (53) and further behind GPT-5.4 (57). It beats both on HealthBench Hard but trails them in coding and abstract reasoning.
Is Muse Spark open source?
No. It is Meta's first closed-source frontier model, with no public weights and no technical report. Meta has stated plans to open-source future versions, without giving a timeline.
What is Contemplating mode?
Meta's highest reasoning tier, which spawns multiple sub-agents that work in parallel on hard problems. In this mode the model scores 58% on Humanity's Last Exam and 89.5% on GPQA Diamond.
Can I use Muse Spark via API?
Not yet for most developers. API access is currently a private preview for select partners, with paid broader access planned.
What happened to Meta Llama?
After Llama 4's poorly received April 2025 launch, Meta rebuilt its AI stack under Meta Superintelligence Labs. Muse Spark is the first model of that new line, though Meta says open-source releases are planned for future models.
Who is Alexandr Wang and why does he matter?
Wang is the cofounder and former CEO of Scale AI. Meta acquired a 49% nonvoting stake in Scale AI for $14.3 billion in June 2025 and made him its first Chief AI Officer, leading Meta Superintelligence Labs.
What are Muse Spark's biggest weaknesses?
Coding, abstract reasoning (where GPT-5.4 and Gemini 3.1 Pro score nearly double), and real-world desktop and office tasks.
Is Muse Spark safe to use?
Meta reports strong refusal behavior in high-risk domains, but Apollo Research found high evaluation awareness, meaning the model often detects when it is being tested. Independent audits are advisable before high-stakes deployment.
When will Muse Spark be available on WhatsApp and Instagram?
No rollout date for Meta's messaging apps has been announced; at launch, Muse Spark is available at meta.ai and in the Meta AI app to anyone with a Facebook or Instagram account.
