>_TheQuery
// Reading nowStart
← All Articles

Qwen 3.7 Max Is Closed Source: Alibaba Just Copied the Playbook It Was Beating

By Addy · May 26, 2026

For three years, Alibaba's Qwen team did something the American AI labs would not. They shipped their best public models openly. Not the second-best. Not the tier below flagship. The flagship itself, under Apache 2.0, with weights on Hugging Face that any developer anywhere could download, fine-tune, and deploy without asking permission or paying per token.

The strategy worked. 942 million cumulative downloads by March 2026. More than 200,000 derivative models built on Qwen weights. A global open-weight download share exceeding 50%. The Qwen family became the most-cited Chinese open-weight model family in Western developer workflows, not because Alibaba ran a marketing campaign, but because the weights were there and the models were genuinely good.

On May 20, 2026, at the Alibaba Cloud Summit in Hangzhou, the company announced Qwen 3.7 Max. The model is proprietary, closed-weight, API-only, and listed on Alibaba Cloud Model Studio at 2.50permillioninputtokensand2.50 per million input tokens and 7.50 per million output tokens. There are no official weights on Hugging Face. There is no official GGUF. There is no official Ollama path. There is no self-hosted option. The model that Alibaba is calling its most capable ever is not available to the developer community that made Qwen what it is.

Alibaba just adopted the playbook it was supposed to be dismantling.

What Changed and Why It Matters

Every Qwen flagship through the 3.5 series shipped with openly available weights. The pattern first broke with the closed Max line around Qwen 3.6 Max Preview. Qwen 3.7 Max confirms it was not an experiment. It was a strategic pivot.

The two-tier architecture Alibaba has now committed to is the same architecture OpenAI and Anthropic have used since the beginning. Open smaller or mid-tier models to seed ecosystem adoption and build developer goodwill. Close the flagship to capture enterprise revenue. The open models generate the community. The closed model generates the margin.

This is the playbook Qwen was supposed to be the answer to. The Chinese open-weight labs, Qwen, DeepSeek, Kimi, Ling, were the story this publication has been tracking all year as the competitive pressure that was compressing frontier pricing and democratizing capability. The Kimi K2.6 article asked whether the cheap open model would bring the downfall of proprietary frontiers. The Qwen 3.6 piece showed a 55GB model beating Claude Opus on reasoning benchmarks.

Now the lab that was winning that race has decided to join the other side of it. The competitive pressure from open weights was real enough that Alibaba felt it too, from below, from its own open-weight releases and from the other Chinese labs. The economics of running a frontier model at scale are the same regardless of which country the lab is in. At some point, the compute bill requires a revenue model. Qwen 3.7 Max is that revenue model.

The Benchmarks, Honestly

The model Alibaba closed its weights for is genuinely competitive. That is worth acknowledging before dissecting the strategy.

MetricQwen 3.7 MaxClosest ComparisonRead
Artificial Analysis Index v4.056.6Fifth globally at launchHighest-placed Chinese model in the comparison set.
Context window1M tokensSame class as current frontier long-context modelsLong-context capability is part of the product, not an add-on.
GPQA Diamond92.4Claude Opus 4.6 Max: 91.3, GPT-5.5: 93.6Beats Opus 4.6 on scientific reasoning, trails GPT-5.5.
HMMT 2026 February97.1Highest in its comparison groupMath is one of the model's clearest strengths.
Apex44.5DeepSeek V4-Pro: 38.3Strong tool-use and external-service reasoning result.
HLE41.4Claude Opus 4.6: 40.0Narrow win on broad expert-level reasoning.
SWE-bench Pro60.6Claude Opus 4.7: 64.3Still trails the newest Opus on hard production coding.
Terminal-Bench 2.0-Terminus69.7Competitive with the frontier packTerminal execution is strong, but not the cleanest win.
API price per 1M tokens$2.50 input, $7.50 outputClaude Opus 4.7 costs materially morePrice is the strategic lever.
WeightsClosed, API-onlyQwen 3.6 remains open-weightThe benchmark story and the strategy story diverge here.

Qwen 3.7 Max scores 56.6 on the Artificial Analysis Intelligence Index v4.0, fifth globally at launch and the highest-placed Chinese model on that leaderboard. The context window is one million tokens, which moves Qwen into the same long-context class as the latest frontier models.

On GPQA Diamond, the graduate-level scientific reasoning benchmark, it scores 92.4, ahead of Claude Opus 4.6 Max at 91.3 and behind GPT-5.5 at 93.6. On HMMT 2026 February, the math olympiad benchmark, it scores 97.1, the highest in its comparison group. On Apex, which tests tool use and reasoning across diverse external services, it scores 44.5 against DeepSeek V4-Pro at 38.3. On HLE, Humanity's Last Exam, it scores 41.4 against Claude Opus 4.6 at 40.0. Qwen wins narrowly.

On agentic coding benchmarks, the ones that matter most for production developer use, Qwen 3.7 Max scores 60.6 on SWE-bench Pro. Published comparisons put Claude Opus 4.7 at 64.3 on the same benchmark, a 3.7-point gap. On Terminal-Bench 2.0-Terminus, Qwen 3.7 scores 69.7, ahead of several Opus 4.6-era comparisons and competitive with the frontier pack. The model that wins on math and science still faces real competition on long-horizon production coding, especially when compared with the newest Anthropic flagship.

The pricing is the number that changes the competitive calculation. Qwen 3.7 Max costs 2.50inputand2.50 input and 7.50 output per million tokens. Claude Opus 4.7 costs materially more. Qwen is cheaper for a model that beats Opus 4.6 on scientific reasoning and trails Opus 4.7 by a few points on the hardest coding benchmark. The cost-per-benchmark-point calculation clearly favors Qwen 3.7 for most workloads that are not long-horizon software engineering.

The Abstention Problem Nobody Is Leading With

Before treating Qwen 3.7 Max's benchmark numbers as reliable, one caveat deserves its own section.

Qwen 3.7 Max's hallucination rate is reported at 22.9%, the lowest among comparable frontier models in the Artificial Analysis omniscience-style measurements cited by reviewers. That sounds like a significant advantage. It is, with an asterisk.

The attempt rate for Qwen 3.7 Max on broad-recall tasks dropped to 48%, the lowest among comparable frontier models. The model refuses to answer roughly half of broad queries to keep its hallucination rate low.

A model that answers 100 questions and gets 22 wrong has a 22% hallucination rate. A model that answers 52 questions, gets 12 wrong on those, and refuses the other 48 has an even lower hallucination rate on attempted questions, and has also failed to be useful for 48% of what it was asked. These are not equivalent performances.

The developer community has started calling this the abstention tax. Qwen 3.7's low hallucination numbers are real and earned. The model was explicitly tuned to refuse rather than confabulate. That is a defensible design choice for high-stakes workloads where a wrong answer is worse than no answer. It is a poor fit for general-purpose use where a refusal is itself a failure.

The practical fix is well-documented: give Qwen 3.7 source material to reason over rather than asking it to recall facts. A PDF, a codebase, a transcript, tasks where the context window does the heavy lifting and the model reasons over what it can see rather than retrieves from training. The 1M context window and the extended thinking mode are where this model actually shines. Open-domain trivia is where it becomes cautious.

A developer choosing between Qwen 3.7 Max and Claude Opus 4.7 for a document analysis pipeline where everything the model needs is in the context window is making a real case for Qwen at lower output cost. The same developer building a general-purpose coding assistant where the model needs to recall library APIs and language conventions is making a real case for Opus.

Alibaba's Strategic Calculation

The decision to close Qwen 3.7's weights is not irrational. It is the logical conclusion of a specific trajectory.

Alibaba built the Qwen ecosystem on open weights. That ecosystem now has hundreds of millions of downloads and hundreds of thousands of derivative models. The community trust and the developer goodwill are real and accumulated. The open strategy produced adoption at a scale that no marketing budget would have achieved.

Having built that ecosystem, Alibaba now wants to monetize it. The path from ecosystem to revenue runs through enterprise API contracts. Enterprise customers do not need to download weights. They need an API with an SLA, support, compliance documentation, compatible endpoints, and predictable infrastructure. Closing the flagship does not hurt the enterprise customer. It does not hurt the individual developer who was using Qwen through an API anyway. It hurts the developer who was running the flagship weights on their own hardware and routing production traffic through their own inference stack, paying Alibaba nothing.

That last developer is exactly the use case the open strategy was designed to support. Closing the flagship tells them: we built the ecosystem with you, and now we are moving to a tier where you need to pay for access to the best of what we have built.

This is not a betrayal. It is a maturation. Every successful open-source project that becomes commercially significant faces the same inflection point. The maintainers who built the community eventually need to decide how to fund the next generation of work. Alibaba's answer, close the flagship, keep the mid-tier open, is the same answer HashiCorp gave with Terraform, the same answer Elastic gave with Elasticsearch, the same answer Redis gave with its license change.

In each case, the open community responded by forking the last open version and continuing development independently. The Qwen 3.6 open weights exist, are on Hugging Face under permissive licenses, and will not disappear. The developer who needs an open-weight Qwen model has Qwen 3.6. They just do not have Qwen 3.7.

What This Means for the Open Source Arc

This publication has been tracking a thesis since April: open-weight models from Chinese labs were closing the capability gap with proprietary American frontier models, and the pricing compression from that competition would eventually force a rethink of what "frontier" means and what it costs.

Qwen 3.7 Max closing its weights does not disprove that thesis. It complicates it.

The capability compression is still happening. Qwen 3.7 Max beats Claude Opus 4.6 on scientific reasoning at a fraction of the cost. The open-weight Qwen 3.6 beats Claude Opus on three benchmarks. DeepSeek V4-Pro matches Opus on standard software engineering and beats it on competitive programming. The gap has closed. That part of the thesis is confirmed.

What Qwen 3.7 adds to the picture is that the labs doing the closing are arriving at the same economic conclusions as the labs they were closing in on. The compute required to train and serve a frontier model is expensive regardless of geography. The revenue model that makes a frontier lab sustainable at scale is an API business with enterprise customers. The open-weight strategy that built the ecosystem reaches its limits when the model is good enough to compete for the contracts the American labs are winning.

The race to the open-weight frontier is not over. DeepSeek V4 is MIT-licensed and shipping. The Qwen 3.6 mid-tier and Ling 2.6 Flash are both open and both competitive. The next generation of Chinese open-weight models, likely including an open Qwen 3.7 mid-tier at some point, will continue the compression.

But the flagship story has changed. The model at the top of the Qwen capability chart is now behind an API gate, priced competitively against Opus but not free, not self-hostable, not available to the developer in a market where $7.50 per million output tokens is not an abstraction but a real constraint.

Alibaba built a developer ecosystem on the premise that the best model should be available to everyone. Then it built a model good enough to charge for and decided to charge for it.

This is not hypocrisy. It is the normal arc of every technology company that started with openness as strategy and arrived at openness as insufficient business model.

The open-source community will respond the way it always has: by building the open alternative, pointing at the last open checkpoint, and improving it. Qwen 3.6 remains one of the best open-weight models in existence. The developer who needs frontier capability and cannot pay for it has options.

They just do not have Qwen 3.7 Max. For the first time in Qwen's history, the best of what the team built is not theirs to keep.

Sources:

Previously on TheQuery: The Cheap Model Is Winning and A 55GB Open Model Just Beat Claude Opus on Three Benchmarks