Microsoft's new MAI models

Simon Willison examines Microsoft's new MAI models and finds that, despite claims of 'clean, commercially licensed data', the training corpus still relies on web crawls, reopening the debate over AI and copyright.

Large Language Models · Training Data · Model Release · Microsoft · Coding Assistants · Industry Trends

KEY POINTS

MAI-Thinking-1 has 1T total parameters (35B active) and MAI-Code-1-Flash has 137B (5B active); both use a mixture-of-experts (MoE) architecture to cut inference cost.
Despite Microsoft's claims of 'clean, commercially licensed data', the technical paper confirms training on web crawls, similar to other LLMs.
MAI-Code-1-Flash is already integrated into GitHub Copilot, potentially improving the coding experience for individual developers.
This case highlights the AI industry's conflict between the need for high performance and the difficulty of obtaining fully compliant training data.

ANALYSIS

The Trigger: Microsoft's new models and the gap between marketing and reality

At last week's Build conference, Microsoft unveiled two new large language models: MAI-Thinking-1, a trillion-parameter reasoning model (with 35B active parameters), and MAI-Code-1-Flash, a 137B model (5B active) purpose-built for coding and already integrated into GitHub Copilot. Noted developer Simon Willison quickly covered the news but soon realized he had been misled by Microsoft's messaging. He initially believed these were 'small' models, only to discover that their total parameter counts are in fact very large; a mixture-of-experts (MoE) architecture simply keeps the number of active parameters low at inference time. More importantly, while Microsoft's press release emphasized that the models were trained on 'clean, commercially licensed data,' the technical paper plainly stated that the bulk of the training corpus came from crawls of the public web.

Technical Breakdown: How MoE makes giant models affordable

To grasp the significance, we need to understand MoE. In a traditional dense model, every parameter is used for every token, so compute cost grows roughly in proportion to model size. MoE splits parts of the model into multiple 'expert' sub-networks and a router activates only a few of them for each token. So, although MAI-Thinking-1 has 1 trillion total parameters, only about 35B do work for any given token. This lowers compute per token, and with it serving cost, but not the memory footprint: all 1 trillion parameters still have to be stored and reachable, so this is not a model that fits on a consumer-grade GPU. Simon's mix-up, reading the active-parameter count as the total, shows how easily the 'small but mighty' framing can trip up even experts.
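The routing idea can be sketched in a few lines of Python. This is a toy illustration of generic top-k expert routing, not Microsoft's implementation: the eight scalar "experts", the router scores, and k=2 are all made up for the example, and the memory estimate assumes 16-bit weights (2 bytes per parameter), which the article does not state. Only the parameter counts come from the article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_scores, k):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))

# Hypothetical toy setup: expert n simply multiplies its input by n.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
router_scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_top_k(router_scores, k=2)
output = moe_layer(1.0, experts, router_scores, k=2)
print("experts used:", [i for i, _ in selected], "of", len(experts))
print("layer output:", round(output, 3))

def active_fraction(total_billion, active_billion):
    """Share of parameters that do work for a single token."""
    return active_billion / total_billion

def weight_memory_gb(total_billion, bytes_per_param):
    """Billions of parameters times bytes per parameter gives gigabytes."""
    return total_billion * bytes_per_param

# Parameter counts as quoted in the article.
thinking_fraction = active_fraction(1000, 35)
flash_fraction = active_fraction(137, 5)
thinking_memory_gb = weight_memory_gb(1000, 2)  # assumes 16-bit weights

print(f"MAI-Thinking-1 active share: {thinking_fraction:.1%}")
print(f"MAI-Code-1-Flash active share: {flash_fraction:.1%}")
print(f"MAI-Thinking-1 weights at 2 bytes/param: {thinking_memory_gb:.0f} GB")
```

Only two of the eight toy experts run for this input, which is where the compute saving comes from. The memory estimate uses the total parameter count, not the active count, because any expert may be selected for the next token.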

The Data Puzzle: 'Clean data' is just another web crawl

In the press release, Microsoft claimed: 'We trained MAI-Thinking-1 from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.' This sounded like a resolution to the ongoing copyright debate in AI training. But when Simon dug into the technical paper (page 80 onward), he found otherwise. The data section reveals that the main corpus comes from Microsoft's proprietary web crawl: roughly 1.2 trillion pages were crawled, then filtered down to 794 billion. Microsoft also processed Common Crawl data, retaining 24.2 billion pages after filtering. This appears to be essentially the same kind of pipeline behind GPT, Claude, and other LLMs: large-scale scraping of publicly available web content. The 'clean' part mainly refers to filtering out adult content and AI-generated pages; what exactly 'licensed' means remains unexplained.
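To put those page counts in proportion, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted above. The Common Crawl share assumes the two retained sets do not overlap and can simply be added, which the article does not confirm.

```python
# Page counts in billions, as quoted in the article.
PROPRIETARY_CRAWLED = 1200.0   # "roughly 1.2 trillion pages"
PROPRIETARY_KEPT = 794.0       # pages kept after filtering
COMMON_CRAWL_KEPT = 24.2       # Common Crawl pages kept after filtering

def retention_rate(kept, crawled):
    """Fraction of crawled pages that survived filtering."""
    return kept / crawled

retention = retention_rate(PROPRIETARY_KEPT, PROPRIETARY_CRAWLED)
discarded = PROPRIETARY_CRAWLED - PROPRIETARY_KEPT

# Assumption: no overlap between the two retained sets.
combined_kept = PROPRIETARY_KEPT + COMMON_CRAWL_KEPT
common_crawl_share = COMMON_CRAWL_KEPT / combined_kept

print(f"Proprietary crawl retained: {retention:.1%}")
print(f"Proprietary pages discarded: {discarded:.0f} billion")
print(f"Common Crawl share of retained pages: {common_crawl_share:.1%}")
```

On these numbers, filtering removed about a third of the proprietary crawl, and Common Crawl contributes only a few percent of the retained pages, so the corpus is dominated by Microsoft's own crawl of the web.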

Trend Insight: The grey area of training data copyright is here to stay

This episode highlights a deep industry dilemma: model performance increasingly depends on vast and diverse data, yet the most readily available source, the open web, is often used without explicit permission. Meanwhile, legal and public pressure around data rights is mounting. Microsoft's wording reads like a word game: phrases such as 'commercially licensed' and 'appropriately licensed' never define what the licensing actually covers. The same tension surrounds MAI-Code-1-Flash's home in GitHub Copilot, which was itself trained on public GitHub repositories, a practice already steeped in controversy. Until clear data licensing standards emerge, this 'scrape and use' approach will likely remain the norm.

Practical Impact: What developers gain, and what they should watch out for

For everyday developers, the integration of MAI-Code-1-Flash into GitHub Copilot in VS Code could mean better code completions and generation, with a small active-parameter count that helps keep latency and cost down under heavy load. If MAI-Thinking-1 becomes publicly available, its 35B active parameters would keep per-token compute modest, though hosting all 1T parameters would still call for server-class hardware rather than a typical local machine. However, if your organization has strict requirements around code copyright or data compliance, the web-crawled data lineage behind these models and Copilot could pose legal risks. Another key point from Simon: Microsoft says MAI-Thinking-1 was trained without distillation from third-party models, which is good news for those worried about model 'inbreeding.'

The Uncomfortable Takeaway: Why an expert's mistake is worth pondering

The irony is that even someone like Simon, who follows AI closely, was initially misled by the marketing narrative: he assumed the models were small and only later discovered the truth. It serves as a reminder that when tech giants wrap their products in phrases like 'clean licensed data' and 'trained from scratch,' the real details are often buried on page 80 of a paper. Next time you see such claims, it might be wise to ask: 'Whose website did your crawler visit?'

Originally from Simon Willison · Analyzed by BitByAI