Alibaba introduces Qwen-3 Max, targets code and reasoning tasks

Alibaba introduces Qwen-3 Max, a trillion-parameter AI model with a 1M-token context window, positioned against GPT-5, Gemini, and Claude in reasoning and coding.

25 Sep 2025 09:00 IST

Alibaba has released Qwen-3 Max, the latest model in its Qwen line of large language models (LLMs). Positioned as the company’s most advanced release to date, the model is designed to strengthen reasoning, coding, and multilingual capabilities while extending the scope of long-context understanding.

Qwen-3 Max model scale and training

Qwen-3 Max is the first model in the series to exceed one trillion parameters, and it was trained on 36 trillion tokens. Its context length is one million tokens, which lets it handle large inputs such as sizable codebases or long documents.

Qwen-3 Max performance benchmarks

The model ranks third on LMArena’s text leaderboard, behind Google’s Gemini 2.5 Pro and Anthropic’s Claude Opus 4.1, but ahead of OpenAI’s standard GPT-5. On SWE-Bench Verified, which evaluates real-world coding problem-solving, Qwen-3 Max scored 69.6, above DeepSeek V3.1’s non-thinking model but slightly below Claude Opus 4 non-thinking.

On Tau2-Bench, which measures AI agent tool-calling proficiency, the model scored 74.8, outperforming both Claude Opus 4 and DeepSeek V3.1.

Alibaba also revealed a variant, Qwen-3 Max Thinking, currently in training, which it says has demonstrated "remarkable potential", particularly in reasoning benchmarks like AIME 25 and HMMT.

Multilingual and reasoning focus

Beyond performance scores, Qwen-3 Max is designed to reduce hallucinations and improve instruction following in English and Chinese. It also seeks to enhance accuracy in math, logic, and science-based tasks.

Comparison with competitors

Alibaba's Qwen-3 Max is narrowing the performance gap in reasoning and code-related tasks. In general-purpose conversational ability, however, GPT-5 leads on the LMArena leaderboard, while Google's Gemini 2.5 Pro performs strongly in more advanced categories. Claude Opus 4 remains ahead on some select benchmarks, though Qwen-3 Max reports better results in tool-calling proficiency.

With its emphasis on long-context processing and coding accuracy, Alibaba is positioning Qwen-3 Max as a direct competitor to Western-developed models, marking another step in the increasingly competitive AI model landscape.

Access and availability

Users can try the model at no cost via the Qwen app or website. In the iOS and Android apps, Qwen-3 Max is set as the default model. Users who do not see it immediately can select it manually from the app menu.