In the early days of large language models (LLMs), we grew accustomed to massive 10x jumps in reasoning and coding capability with every new model iteration. Today, those jumps have flattened into incremental gains. The exception is domain-specialized intelligence, where true step-function improvements are still the norm.

When a model is fused with an organization’s proprietary data and internal logic, it encodes the company’s history into its future workflows. This alignment creates a compounding advantage: a competitive moat built on a model that understands the business intimately. This is more than fine-tuning; it is the institutionalization of expertise into an AI system. This is the power of customization.

Intelligence tuned to context

Every sector operates within its own specific lexicon. In automotive engineering, the “language” of the firm revolves around tolerance stacks, validation cycles, and revision control. In capital markets, reasoning is dictated by risk-weighted assets and liquidity buffers. In security operations, patterns are extracted from the noise of telemetry signals and identity anomalies.

Custom-adapted models internalize the nuances of the field. They recognize which variables dictate a “go/no-go” decision, and they think in the language of the industry.

Domain expertise in action

The transition from general-purpose to tailored AI centers on one goal: encoding an organization’s unique logic directly into a model’s weights.

Mistral AI partners with organizations to incorporate domain expertise into their training ecosystems. A few use cases illustrate customized implementations in practice:

Software engineering and assisting at scale: A network hardware company with proprietary languages and specialized codebases found that out-of-the-box models could not grasp their internal stack. By training a custom model on their own development patterns, they achieved a step function in fluency. Integrated into Mistral’s software development scaffolding, this customized model now supports the entire lifecycle—from maintaining legacy systems to autonomous code modernization via reinforcement learning. This turns once-opaque, niche code into a space where AI reliably assists at scale.

Automotive and the engineering copilot: A leading automotive company uses customization to revolutionize crash test simulations. Previously, specialists spent entire days manually comparing digital simulations with physical results to find divergences. By training a model on proprietary simulation data and internal analyses, they automated this visual inspection, flagging deformations in real time. Moving beyond detection, the model now acts as a copilot, proposing design adjustments to bring simulations closer to real-world behavior and radically accelerating the R&D loop.

Public sector and sovereign AI: In Southeast Asia, a government agency is building a sovereign AI layer to move beyond Western-centric models. By commissioning a foundation model tailored to regional languages, local idioms, and cultural contexts, they created a strategic infrastructure asset. This ensures sensitive data remains under local governance while powering inclusive citizen services and regulatory assistants. Here, customization is the key to deploying AI that is both technically effective and genuinely sovereign.

The blueprint for strategic customization

Moving from a general-purpose AI strategy to a domain-specific advantage requires a structural rethinking of the model’s role within the enterprise. Success is defined by three shifts in organizational logic.

1. Treat AI as infrastructure, not an experiment.  Historically, enterprises have treated model customization as an ad hoc experiment—a single fine-tuning run for a niche use case or a localized pilot. While these bespoke silos often yield promising results, they are rarely built to scale. They produce brittle pipelines, improvised governance, and limited portability. When the underlying base models evolve, the adaptation work must often be discarded and rebuilt from scratch.

In contrast, a durable strategy treats customization as foundational infrastructure. In this model, adaptation workflows are reproducible, version-controlled, and engineered for production. Success is measured against deterministic business outcomes. By decoupling the customization logic from the underlying model, firms ensure that their “digital nervous system” remains resilient, even as the frontier of base models shifts.

    2. Retain control of your own data and models. As AI migrates from the periphery to core operations, the question of control becomes existential. Reliance on a single cloud provider or vendor for model alignment creates a dangerous asymmetry of power regarding data residency, pricing, and architectural updates.

    Enterprises that retain control of their training pipelines and deployment environments preserve their strategic agency. By adapting models within controlled environments, organizations can enforce their own data residency requirements and dictate their own update cycles. This approach transforms AI from a service consumed into an asset governed, reducing structural dependency and allowing for cost and energy optimizations aligned with internal priorities rather than vendor roadmaps.

    3. Design for continuous adaptation. The enterprise environment is never static: regulations shift, taxonomies evolve, and market conditions fluctuate. A common failure is treating a customized model as a finished artifact. In reality, a domain-aligned model is a living asset subject to model decay if left unmanaged.

    Designing for continuous adaptation requires a disciplined approach to ModelOps. This includes automated drift detection, event-driven retraining, and incremental updates. By building the capacity for constant recalibration, the organization ensures that its AI does not just reflect its history, but it evolves in lockstep with its future. This is the stage where the competitive moat begins to compound: the model’s utility grows as it internalizes the organization’s ongoing response to change.

    Control is the new leverage

    We have entered an era where generic intelligence is a commodity, but contextual intelligence is a scarcity. While raw model power is now a baseline requirement, the true differentiator is alignment—AI calibrated to an organization’s unique data, mandates, and decision logic.

    In the next decade, the most valuable AI won’t be the one that knows everything about the world; it will be the one that knows everything about you. The firms that own the model weights of that intelligence will own the market.

    This content was produced by Mistral AI. It was not written by MIT Technology Review’s editorial staff.

    Read more

    This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

    There are more AI health tools than ever—but how well do they work? 

    In the last few months alone, Microsoft, Amazon, and OpenAI have all launched medical chatbots. 

    There’s a clear demand for these tools, given how hard it is for many people to access advice through the existing medical system—and they could make safe and useful recommendations. But concerns have surfaced about how little external evaluation they undergo before being released to the public.  

    Read the full story to understand what’s at stake

    —Grace Huckins 

    The Pentagon’s culture war tactic against Anthropic has backfired 

    A judge has temporarily blocked the Pentagon from labeling Anthropic a supply chain risk and ordering government agencies to stop using its AI. Her intervention suggests that the feud never needed to reach such a frenzy. 

    It did so because the government disregarded the existing process for such disputes—and fueled the fire on social media. Find out how it happened and what comes next

    —James O’Donnell 

    This story is from The Algorithm, our weekly newsletter giving you the inside track on all things AI. Sign up to receive it in your inbox every Monday. 

    The must-reads 

    I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology. 

    1 California has defied Trump to impose new AI regulations 
    Governor Newsom signed off on the new standards yesterday.  (Guardian
    + Firms seeking state contracts will need extra safeguards. (Reuters $) 
    + States are installing guardrails despite Trump’s order to stop. (NYT $)  
    + An AI regulation war is brewing in the US. (MIT Technology Review)  

    2 Experiments have verified quantum simulations for the first time 
    It’s a breakthrough for quantum computing applications. (Nature
    + Which could one day help solve healthcare problems. (MIT Technology Review

    3 The new White House app is a security and privacy nightmare 
    It extensively tracks users and relies on external code. (Gizmodo
    + The new app promises “unparalleled access” to Trump. (CNET
    + It also invites users to report people to ICE. (The Verge

    4 Big Tech’s $635 billion AI spending faces an energy shock test 
    The Middle East crisis is clouding prospects for growth. (Reuters $) 
    + Here are three big unknowns about AI’s energy burden. (MIT Technology Review

    5 Meta and Google have been accused of breaking child safety rules 
    Australia suspects they flouted a social media ban. (Bloomberg $) 
    + Indonesia is also investigating non-compliance. (Reuters $) 

    6 Nebius is building a $10 billion AI data center in Finland 
    The company is rapidly expanding Europe’s AI infrastructure. (CNBC

    7 South Korea’s chipmakers’ helium stocks will last until June 
    Beyond that? Who knows. (Reuters $) 
    + Shortages caused by the Iran war threaten the chip industry. (NYT $)  

    8 Another Starlink satellite has inexplicably exploded  
    SpaceX suffered a similar episode in December. (The Verge
    + We went inside Ukraine’s largest Starlink repair shop. (MIT Technology Review

    9 Bluesky’s new AI tool is already its most blocked account—after JD Vance 
    About 83 times as many users have blocked it as have followed it. (TechCrunch

    10 An AI agent banned from Wikipedia has lashed out in angry blogs 
    The bot accused its human editors of “uncivil behavior.” (404 Media)  

    Quote of the day 

    “Is any of this illegal? Probably not. Is it what you’d expect from an official government app? Probably not either.” 

    —Security researcher Thereallo reviews the White House’s new app.

    One More Thing 

    CHANTAL JAHCHAN

    Inside Amsterdam’s high-stakes experiment to create fair welfare AI 

    When Hans de Zwart, a digital rights advocate, saw Amsterdam’s plan to have an algorithm evaluate every welfare applicant for potential fraud, he nearly fell out of his chair. He believed the system had “unfixable problems.”  

    Meanwhile, Paul de Koning, a consultant to the city, was excited. He saw immense potential to improve efficiencies and remove biases. 

    These opposing viewpoints epitomize a global debate about whether algorithms can ever make fair decisions that shape people’s lives. Read the full story.  

    —Eileen Guo, Gabriel Geiger, and Justin-Casimir Braun 

    We can still have nice things 

    A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line.) 

    + A newly authenticated Rembrandt had been hiding in plain sight for years. 
    + This debunking of guitar legends is musical enlightenment for strummers. 
    Smoking into bubbles looks oddly satisfying. 
    + The man who made the front page twice exposes the thin line between heroes and villains. 

    Read more

    For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. 

    This framing is seductive: An AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines. 

    But there’s a problem: AI is almost never used in the way it is benchmarked. Although   researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these  innovations resolve only part of the issue. That’s because they still evaluate AI’s performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds. 

    While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use. This misalignment leaves us misunderstanding AI’s capabilities, overlooking systemic risks, and misjudging its economic and social consequences.

    To mitigate this, it’s time to shift from narrow methods to benchmarks that assess how AI systems perform over longer time horizons within human teams, workflows, and organizations. I have studied real-world AI deployment since 2022 in small businesses and health, humanitarian, nonprofit, and higher-education organizations in the UK, the United States, and Asia, as well as within leading AI design ecosystems in London and Silicon Valley. I propose a different approach, which I call HAIC benchmarksHuman–AI, Context-Specific Evaluation.

    What happens when AI fails 

    For governments and businesses, AI benchmark scores appear more objective than vendor claims. They’re a critical part of determining whether an AI model or application is “good enough” for real-world deployment. Imagine an AI model that achieves impressive technical scores on the most cutting-edge benchmarks—98% accuracy, groundbreaking speed, compelling outputs. On the strength of these results, organizations may decide to adopt the model, committing sizable financial and technical resources to purchasing and integrating it. 

    But then, once it’s adopted, the gap between benchmark and real-world performance quickly becomes visible. For example, take the swathe of FDA-approved AI models that can read medical scans faster and more accurately than an expert radiologist. In the radiology units of hospitals from the heart of California to the outskirts of London, I witnessed staff using highly ranked radiology AI applications. Repeatedly, it took them extra time to interpret AI’s outputs alongside hospital-specific reporting standards and nation-specific regulatory requirements. What appeared as a productivity-enhancing AI tool when tested in a vacuum introduced delays in practice. 

    It soon became clear that the benchmark tests on which medical AI models are assessed do not capture how medical decisions are actually made. Hospitals rely on multidisciplinary teams—radiologists, oncologists, physicists, nurses—who jointly review patients. Treatment planning rarely hinges on a static decision; it evolves as new information emerges over days or weeks. Decisions often arise through constructive debate and trade-offs between professional standards, patient preferences, and the shared goal of long-term patient well-being. No wonder even highly scored AI models struggle to deliver the promised performance once they encounter the complex, collaborative processes of real clinical care.

    The same pattern emerges in my research across other sectors: When embedded within real-world work environments, even AI models that perform brilliantly on standardized tests don’t perform as promised. 

    When high benchmark scores fail to translate into real-world performance, even the most highly scored AI is soon abandoned to what I call the “AI graveyard.” The costs are significant: Time, effort and money end up being wasted. And over time, repeated experiences like this erode organizational confidence in AI and—in critical settings such as health—may erode broader public trust in the technology as well. 

    When current benchmarks provide only a partial and potentially misleading signal of an AI model’s readiness for real-world use, this creates regulatory blind spots: Oversight is shaped by metrics that do not reflect reality. It also leaves organizations and governments to shoulder the risks of testing AI in sensitive real-world settings, often with limited resources and support. 

    How to build better tests 

    To close the gap between benchmark and real-world performance, we must pay attention to the actual conditions in which AI models will be used. The critical questions: Can AI function as a productive participant within human teams? And can it generate sustained, collective value? 

    Through my research on AI deployment across multiple sectors, I have seen a number of organizations already moving—deliberately and experimentally—toward the HAIC benchmarks I favor. 

    HAIC benchmarks reframe current benchmarking in four ways: 

    1.     From individual and single-task performance to team and workflow performance (shifting the unit of analysis)

    2.     From one-off testing with right/wrong answers to long-term impacts (expanding the time horizon)

    3.     From correctness and speed to organizational outcomes, coordination quality, and error detectability (expanding outcome measures)

    4.     From isolated outputs to upstream and downstream consequences (system effects)

    Across the organizations where this approach has emerged and started to be applied, the first step is shifting the unit of analysis. 

    For example, in one UK hospital system in the period 2021–2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically assessed coordination and deliberation in human teams using and not using AI. Multiple stakeholders (within and outside the hospital) decided on metrics like how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices. 

    This shift is fundamental. It matters a lot in high-stakes contexts where system-level effects matter more than task-level accuracy. It also matters for the economy. It may help recalibrate inflated expectations of sweeping productivity gains that are so far predicated largely on the promise of improving individual task performance. 

    Once that foundation is set, HAIC benchmarking can begin to take on the element of time. 

    Today’s benchmarks resemble school exams—one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously inside real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a specific context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions. 

    I saw this aspect of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within real workflows, with particular attention to how detectable its errors were—that is, how easily human teams could identify and correct them. This long-term “record of error detectability” meant the organizations involved could design and test context-specific guardrails to promote trust in the system, despite the inevitability of occasional AI mistakes.

    A longer time horizon also makes visible the system-level consequences that short-term benchmarks miss. An AI application may outperform a single doctor on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it may introduce systemic distortions: anchoring teams too early in plausible but incomplete answers, adding to people’s  cognitive workloads, or generating downstream inefficiencies that offset any speed or efficiency gains at the point of the AI’s use. These knock-on effects—often invisible to current benchmarks—are central to understanding real impact. 

    The HAIC approach, admittedly promises to make benchmarking more complex, resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not just what a model can do alone, but what it enables—or undermines—when humans and teams in the real world work with it.

     Angela Aristidou is a professor at University College London and a faculty fellow at the Stanford Digital Economy Lab and the Stanford Human-Centered AI Institute. She speaks, writes, and advises about the real-life deployment of artificial-intelligence tools for public good.

    Read more

    How to Gain Superpowers With AI by Social Media Examiner

    Tired of not seeing real results from AI experiments in your day-to-day work? Wondering how to use AI to fundamentally change the way you work? In this article, you’ll discover the five-step ADOPT framework that turns scattered AI experimentation into a structured, sustainable approach to using AI to change how you work. The Biggest Misconceptions […]

    The post How to Gain Superpowers With AI appeared first on Social Media Examiner.

    Read more
    1 78 79 80 81 82 3,186