A study published in March 2025 found that the common practice of scaling up the pre-training of large language models does not always improve performance after post-training.
This study, posted on arXiv (a repository for sharing scientific papers before formal peer review), suggested that models with extensive pre-training may suffer from "catastrophic overtraining."
This phenomenon refers to the decline in performance of post-trained models as the length of the pre-training phase increases.
The study reported average performance metrics from standard LLM benchmarks and proposed that overtraining occurs because of a gradual increase, over the course of pre-training, in the sensitivity of the model's parameters to modification.
When fine-tuning is then applied, this heightened sensitivity leads to a significant loss of the capabilities acquired during pre-training, and overtrained language models also become harder to improve through fine-tuning.
The authors of the study argue that their findings necessitate a critical reassessment of pre-training methodologies, emphasizing the importance of considering how well models can adapt to downstream tasks.
Through controlled experiments and theoretical analyses, the researchers demonstrated that overtraining is linked to a systematic increase in the sensitivity of pre-trained parameters to changes, including fine-tuning.
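To make the reported mechanism concrete, the sketch below estimates parameter sensitivity by measuring how much a model's loss rises when its weights receive small random perturbations. It is a minimal illustration of the idea rather than the authors' experimental protocol; the toy model, function name, noise scale, and trial count are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical illustration of the reported mechanism, not the study's protocol:
# estimate how sensitive a model's loss is to small random changes in its weights.
# Under "catastrophic overtraining", this sensitivity is said to grow with longer
# pre-training, so later modifications such as fine-tuning erase more capability.

def perturbation_sensitivity(model, loss_fn, inputs, targets, sigma=1e-3, trials=10):
    """Average loss increase after adding Gaussian noise of scale sigma to all weights."""
    base_loss = loss_fn(model(inputs), targets).item()
    increases = []
    for _ in range(trials):
        saved = [p.detach().clone() for p in model.parameters()]
        with torch.no_grad():
            for p in model.parameters():
                p.add_(sigma * torch.randn_like(p))   # perturb weights
            noisy_loss = loss_fn(model(inputs), targets).item()
            for p, original in zip(model.parameters(), saved):
                p.copy_(original)                      # restore weights
        increases.append(noisy_loss - base_loss)
    return sum(increases) / trials

# Toy usage; a real comparison would evaluate checkpoints saved at
# increasing pre-training token budgets and compare their sensitivities.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
print(perturbation_sensitivity(model, nn.CrossEntropyLoss(), x, y))
```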
In April 2024, MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) released a study indicating that training large language models (LLMs) could be significantly less expensive than previously thought.
The study found that while companies like OpenAI and Meta invest billions in developing their models, researchers at CSAIL and MyShell, a company that commercializes AI agents, demonstrated that roughly $0.1 million is sufficient to train a LLaMA2-level large language model.
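As a rough sense of scale, the sketch below decomposes a training budget into GPU count, wall-clock hours, and an hourly rate. Every number in it is an illustrative assumption rather than a figure reported by CSAIL or MyShell; it only shows how a budget on the order of $0.1 million can arise.

```python
# Back-of-envelope training-cost estimate. All numbers are illustrative
# assumptions, NOT the figures reported by CSAIL or MyShell; actual costs
# depend on cluster size, training duration, and GPU pricing.

gpus = 96                # assumed number of accelerators in the cluster
hours = 14 * 24          # assumed two weeks of wall-clock training
usd_per_gpu_hour = 3.0   # assumed effective cost per GPU-hour

total_cost = gpus * hours * usd_per_gpu_hour
print(f"Estimated training cost: ${total_cost:,.0f}")  # ~$97,000 with these assumptions
```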
Furthermore, a study published in the summer of 2024 as part of the proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics—the premier international conference in natural language processing—found that LLMs possess a superficial ability to follow instructions and demonstrate proficiency in a language.
However, they lack the potential to master new skills without explicit instruction. This characteristic means they remain inherently controllable, predictable, and safe.
The research team concluded that LLMs, even as they are trained on increasingly large datasets, can continue to be deployed without significant safety concerns, although the technology can still be misused.
As these models grow, they are likely to generate more sophisticated language and improve in following explicit and detailed prompts. However, they are unlikely to develop complex reasoning skills.
Training with more data = better LLMs, right? 🚨
False! Scaling language models by adding more pre-training data can decrease performance after post-training!
Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇
1/9 pic.twitter.com/TpCDgZ862C
— Jacob Springer (@jacspringer) March 26, 2025
Training LLMs can be much cheaper than previously thought.
While companies like @OpenAI and @Meta use billions of dollars to train theirs, CSAIL & @myshell_ai research shows that just 0.1 million USD is sufficient for training LLaMA2-level LLMs.
Introducing the open-source… pic.twitter.com/dLjoGprBxA
— MIT CSAIL (@MIT_CSAIL) April 4, 2024
Sometimes, the obvious must be studied so it can be asserted with full confidence:
– LLMs can not answer questions whose answers are not in their training set in some form,
– they can not solve problems they haven't been trained on,
– they can not acquire new skills our…
— Yann LeCun (@ylecun) August 13, 2024