Self-improving language models are becoming reality with MIT’s updated SEAL technique

by Carl Franzen
October 13, 2025

Researchers at the Massachusetts Institute of Technology (MIT) are gaining renewed attention for developing and open-sourcing a technique that allows large language models (LLMs) — like those underpinning ChatGPT and most modern AI chatbots — to improve themselves by generating synthetic data to fine-tune on.

The technique, known as SEAL (Self-Adapting LLMs), was first described in a paper published back in June and covered by VentureBeat at the time.

A significantly expanded and updated version of the paper was released last month, along with open-source code posted on GitHub (under an MIT License, allowing commercial and enterprise use), and it is making new waves among AI power users on the social network X this week.

SEAL allows LLMs to autonomously generate and apply their own fine-tuning strategies. Unlike conventional models that rely on fixed external data and human-crafted optimization pipelines, SEAL enables models to evolve by producing their own synthetic training data and corresponding optimization directives.

The development comes from a team affiliated with MIT’s Improbable AI Lab, including Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Their research was recently presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Background: From “Beyond Static AI” to Self-Adaptive Systems

Earlier this year, VentureBeat first reported on SEAL as an early-stage framework that allowed language models to generate and train on their own synthetic data — a potential remedy for the stagnation of pretrained models once deployed.

At that stage, SEAL was framed as a proof-of-concept that could let enterprise AI agents continuously learn in dynamic environments without manual retraining.

Since then, the research has advanced considerably. The new version expands on the prior framework: it demonstrates that SEAL’s self-adaptation ability scales with model size, integrates reinforcement learning more effectively to reduce catastrophic forgetting, and formalizes SEAL’s dual-loop structure (inner supervised fine-tuning and outer reinforcement optimization) for reproducibility.

The updated paper also introduces evaluations across different prompting formats, improved stability during learning cycles, and a discussion of practical deployment challenges at inference time.

Addressing the Limitations of Static Models

While LLMs have demonstrated remarkable capabilities in text generation and understanding, their adaptation to new tasks or knowledge is often manual, brittle, or dependent on context.

SEAL challenges this status quo by equipping models with the ability to generate what the authors call “self-edits” — natural language outputs that specify how the model should update its weights.

These self-edits may take the form of reformulated information, logical implications, or tool configurations for augmentation and training. Once generated, the model fine-tunes itself based on these edits. The process is guided by reinforcement learning, where the reward signal comes from improved performance on a downstream task.

The design mimics how human learners might rephrase or reorganize study materials to better internalize information. This restructuring of knowledge before assimilation serves as a key advantage over models that passively consume new data “as-is.”
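
As a rough illustration of the idea, the sketch below generates a “self-edit” by asking a model to restate a passage as standalone facts, which then become fine-tuning data. The helper prompt, placeholder model, and overall shape are assumptions for illustration; this is not the authors’ implementation.

```python
# Minimal sketch of the self-edit idea (placeholder model and prompt; not the SEAL codebase).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; SEAL's experiments use larger instruction-tuned models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

passage = (
    "The Apollo program was the United States human spaceflight program "
    "that landed the first humans on the Moon in 1969."
)

# 1) The model proposes a self-edit: the passage restated as self-contained implications.
prompt = (
    "Rewrite the following passage as a list of short, self-contained facts:\n"
    f"{passage}\nFacts:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True)
self_edit = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)

# 2) In SEAL, the model is then fine-tuned on `self_edit` (the inner loop), and the
#    resulting gain on a downstream task becomes the reward used to improve how
#    future self-edits are generated (the outer loop).
print(self_edit)
```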

Performance Across Tasks

SEAL has been tested across two main domains: knowledge incorporation and few-shot learning.

In the knowledge incorporation setting, the researchers evaluated how well a model could internalize new factual content from passages similar to those in the SQuAD dataset, a benchmark reading comprehension dataset introduced by Stanford University in 2016, consisting of over 100,000 crowd-sourced question–answer pairs based on Wikipedia articles (Rajpurkar et al., 2016).

Rather than fine-tuning directly on passage text, the model generated synthetic implications of the passage and then fine-tuned on them.

After two rounds of reinforcement learning, the model improved question-answering accuracy from 33.5% to 47.0% on a no-context version of SQuAD — surpassing results obtained using synthetic data generated by GPT-4.1.
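
To make the evaluation concrete, the following sketch shows one way a “no-context” SQuAD-style check could be run: the model is asked each question without the supporting passage, so it can only answer from what it internalized while fine-tuning on its self-edits. The dataset slice and answer-matching heuristic are assumptions for illustration, not the authors’ evaluation harness.

```python
# Sketch of a no-context QA evaluation over SQuAD (illustrative, not the paper's harness).
from datasets import load_dataset

def no_context_accuracy(model, tokenizer, num_examples=100):
    examples = load_dataset("squad", split=f"validation[:{num_examples}]")
    correct = 0
    for ex in examples:
        # The passage (ex["context"]) is deliberately NOT shown to the model.
        prompt = f"Question: {ex['question']}\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=20)
        prediction = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        # Loose containment match against any of the gold answers.
        if any(ans.lower() in prediction.lower() for ans in ex["answers"]["text"]):
            correct += 1
    return correct / len(examples)
```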

In the few-shot learning setting, SEAL was evaluated using a subset of the ARC benchmark, where tasks require reasoning from only a few examples. Here, SEAL generated self-edits specifying data augmentations and hyperparameters.

After reinforcement learning, the success rate in correctly solving held-out tasks jumped to 72.5%, up from 20% using self-edits generated without reinforcement learning. Models that relied solely on in-context learning without any adaptation scored 0%.
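
In this setting a self-edit reads less like prose and more like a small training recipe. The snippet below shows a hypothetical shape such a directive might take; the field names and values are assumptions for illustration, not the schema used in the paper or codebase.

```python
# Hypothetical structure of a few-shot self-edit: which augmentations to apply to the
# demonstration examples and which hyperparameters to use for the inner fine-tuning step.
self_edit = {
    "augmentations": ["rotate_90", "flip_horizontal", "permute_colors"],
    "include_original_examples": True,
    "hyperparameters": {
        "learning_rate": 1e-4,
        "epochs": 3,
        "lora_rank": 16,
    },
}
```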

Technical Framework

SEAL operates using a two-loop structure: an inner loop performs supervised fine-tuning based on the self-edit, while an outer loop uses reinforcement learning to refine the policy that generates those self-edits.

The reinforcement learning algorithm used is based on ReSTEM, which combines sampling with filtered behavior cloning. During training, only self-edits that lead to performance improvements are reinforced. This approach effectively teaches the model which kinds of edits are most beneficial for learning.
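
A simplified view of that outer loop is sketched below: sample several candidate self-edits per task, keep only those whose inner fine-tuning actually improves the downstream score, and then train the edit-generating policy on the survivors. The helper functions (`sample_self_edit`, `finetune_copy`, `evaluate`, `finetune_on_edits`) are hypothetical stand-ins for SEAL’s components.

```python
# ReSTEM-style outer loop: rejection sampling plus filtered behavior cloning (sketch).
def restem_outer_loop(model, tasks, num_rounds=2, samples_per_task=4):
    for _ in range(num_rounds):
        accepted = []
        for task in tasks:
            baseline = evaluate(model, task)                 # score before any update
            for _ in range(samples_per_task):
                edit = sample_self_edit(model, task)         # model proposes a self-edit
                candidate = finetune_copy(model, edit)       # inner loop: SFT on the edit
                if evaluate(candidate, task) > baseline:     # reinforce only helpful edits
                    accepted.append((task, edit))
        # Filtered behavior cloning: supervised training of the edit-generating
        # policy on the (task, self-edit) pairs that improved performance.
        model = finetune_on_edits(model, accepted)
    return model
```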

For efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, enabling rapid experimentation and low-cost adaptation.
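
As a point of reference, a LoRA adapter can be attached with the Hugging Face `peft` library in a few lines; the hyperparameter values and target modules below are illustrative choices, not those reported in the paper.

```python
# Minimal LoRA setup for a cheap inner fine-tuning step (illustrative hyperparameters).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
lora_config = LoraConfig(
    r=16,                       # low-rank dimension of the adapter matrices
    lora_alpha=32,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2; varies by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```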

Strengths and Limitations

The researchers report that SEAL can produce high-utility training data with minimal supervision, outperforming even large external models like GPT-4.1 in specific tasks.

They also demonstrate that SEAL generalizes beyond its original setup: it continues to perform well when scaling from single-pass updates to multi-document continued pretraining scenarios.

However, the framework is not without limitations. One issue is catastrophic forgetting, where updates to incorporate new information can degrade performance on previously learned tasks.

In response to this concern, co-author Jyo Pari told VentureBeat via email that reinforcement learning (RL) appears to mitigate forgetting more effectively than standard supervised fine-tuning (SFT), citing a recent paper on the topic. He added that combining this insight with SEAL could lead to new variants where SEAL learns not just training data, but reward functions.

Another challenge is computational overhead: evaluating each self-edit requires fine-tuning and performance testing, which can take 30–45 seconds per edit — significantly more than standard reinforcement learning tasks.

As Jyo explained, “Training SEAL is non-trivial because it requires 2 loops of optimization, an outer RL one and an inner SFT one. At inference time, updating model weights will also require new systems infrastructure.” He emphasized the need for future research into deployment systems as a critical path to making SEAL practical.

Additionally, SEAL’s current design assumes the presence of paired tasks and reference answers for every context, limiting its direct applicability to unlabeled corpora. However, Jyo clarified that as long as there is a downstream task with a computable reward, SEAL can be trained to adapt accordingly—even in safety-critical domains. In principle, a SEAL-trained model could learn to avoid training on harmful or malicious inputs if guided by the appropriate reward signal.

AI Community Reactions

The AI research and builder community has reacted with a mix of excitement and speculation to the SEAL paper. On X, formerly Twitter, several prominent AI-focused accounts weighed in on the potential impact.

User @VraserX, a self-described educator and AI enthusiast, called SEAL “the birth of continuous self-learning AI” and predicted that models like OpenAI’s GPT-6 could adopt similar architecture.

In their words, SEAL represents “the end of the frozen-weights era,” ushering in systems that evolve as the world around them changes.

They highlighted SEAL’s ability to form persistent memories, repair knowledge, and learn from real-time data, comparing it to a foundational step toward models that don’t just use information but absorb it.

Meanwhile, @alex_prompter, co-founder of an AI-powered marketing venture, framed SEAL as a leap toward models that literally rewrite themselves. “MIT just built an AI that can rewrite its own code to get smarter,” he wrote. Citing the paper’s key results — a 40% boost in factual recall and outperforming GPT-4.1 using self-generated data — he described the findings as confirmation that “LLMs that finetune themselves are no longer sci-fi.”

The enthusiasm reflects a broader appetite in the AI space for models that can evolve without constant retraining or human oversight — particularly in rapidly changing domains or personalized use cases.

Future Directions and Open Questions

In response to questions about scaling SEAL to larger models and tasks, Jyo pointed to experiments (Appendix B.7) showing that as model size increases, so does their self-adaptation ability. He compared this to students improving their study techniques over time — larger models are simply better at generating useful self-edits.

When asked whether SEAL generalizes to new prompting styles, he confirmed it does, citing Table 10 in the paper. However, he also acknowledged that the team has not yet tested SEAL’s ability to transfer across entirely new domains or model architectures.

“SEAL is an initial work showcasing the possibilities,” he said. “But it requires much more testing.” He added that generalization may improve as SEAL is trained on a broader distribution of tasks.

Interestingly, the team found that only a few reinforcement learning steps already led to measurable performance gains. “This is exciting,” Jyo noted, “because it means that with more compute, we could hopefully get even more improvements.” He suggested future experiments could explore more advanced reinforcement learning methods beyond ReSTEM, such as Group Relative Policy Optimization (GRPO).

Toward More Adaptive and Agentic Models

SEAL represents a step toward models that can autonomously improve over time, both by integrating new knowledge and by reconfiguring how they learn. The authors envision future extensions where SEAL could assist in self-pretraining, continual learning, and the development of agentic systems — models that interact with evolving environments and adapt incrementally.

In such settings, a model could use SEAL to synthesize weight updates after each interaction, gradually internalizing behaviors or insights. This could reduce the need for repeated supervision and manual intervention, particularly in data-constrained or specialized domains.

As public web text becomes saturated and further scaling of LLMs becomes bottlenecked by data availability, self-directed approaches like SEAL could play a critical role in pushing the boundaries of what LLMs can achieve.

You can access the SEAL project, including code and further documentation, at: https://jyopari.github.io/posts/seal

