Norway is Building a ‘Sovereign AI’ to Save Its Language from English-Centric Models

The Battle for Linguistic Sovereignty
For most of the world, the current AI revolution feels like an English-language project. While models from OpenAI and Google can translate dozens of languages, there is a fundamental gap between translation and true cultural understanding. Norway is attempting to bridge that gap by building its own sovereign Large Language Model (LLM), ensuring that its history, news, and cultural nuances aren’t lost in the weights of a globally trained, English-centric system.
Marius Husnes, Head of IT Platform at Norway’s National Library (Nasjonalbiblioteket), recently detailed the technical and philosophical hurdles of this project at Huawei’s ID Forum 2026 in Paris. The premise is straightforward but urgent: any nation that relies solely on foreign AI for its language risks a form of digital erasure, where the AI understands the words of a language but not the specific historical context or cultural identity of the people who speak it.
The Data Advantage
The Norwegian Ministry of Culture tasked the National Library with this project because the institution possesses something no private Silicon Valley firm can replicate: a comprehensive legal mandate to collect every piece of published content in the country. From books and newspapers to broadcast media and web pages, the library is the central custodian of Norway’s intellectual heritage.
Crucially, the library has secured agreements with Norwegian newspapers to train the LLM on copyrighted content—a legal hurdle that has plagued commercial AI developers worldwide. This unique access to high-quality, curated data provides a moat that commercial LLM providers simply cannot cross.
The scale of the archive is immense. Since 2005, the library has been digitizing its collection, amassing 20 PB of unique data. Following a strict 3-2-1 backup strategy—three copies, two different media types, and one off-site—the total footprint expands to roughly 60 PB of data, including OCR-scanned text, audio, and moving images.
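The footprint arithmetic above can be checked with a short sketch. The 20 PB and three-copy figures come from the article; the function name is hypothetical, and the calculation assumes every copy is a full, uncompressed replica.

```python
# Figures from the article: 20 PB of unique data under a 3-2-1 backup scheme.
UNIQUE_DATA_PB = 20
COPIES = 3  # the "3" in 3-2-1: three copies of every object


def total_footprint_pb(unique_pb: float, copies: int) -> float:
    """Raw capacity needed when every unique byte is stored `copies` times.

    Assumes full replicas with no compression or deduplication across copies.
    """
    return unique_pb * copies


footprint = total_footprint_pb(UNIQUE_DATA_PB, COPIES)
redundant = footprint - UNIQUE_DATA_PB
print(f"{footprint} PB stored in total, {redundant} PB of it redundant copies")
```

Under those assumptions, two thirds of the stored capacity exists purely for resilience, which is why the archive is optimized for durability rather than read speed.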
Solving the Pipeline Bottleneck
While the public conversation around AI often focuses on compute power, Husnes argues that the real bottleneck is data throughput and quality. The challenge isn’t just having the data, but moving it from a dormant state of preservation into a state of active training.
The library’s 60 PB preservation system is designed for durability and cost-efficiency, not speed. It is a “cold” archive with high read latency. To transform this into training-ready data, Husnes implemented a high-performance intermediary layer. This involves an Nvidia DGX H200 system and a 384-core CPU cluster supported by 2 PB of Huawei OceanStor Dorado all-flash storage.
This all-flash layer serves as the engine for the data pipeline, handling ingestion, cleaning, deduplication, and format normalization. Once the data is scrubbed and validated in this low-latency environment, it is pushed to the actual training site: Norway’s national supercomputer, the Sigma2 Olivia system.
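The cleaning, deduplication, and normalization stages can be illustrated with a minimal sketch. This is not the library's actual pipeline, which the article does not describe in detail: the function names, the 20-character threshold, and the choice of exact-match hashing are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

MIN_CHARS = 20  # hypothetical quality threshold, not a figure from the article


def normalize(text: str) -> str:
    """Format normalization: canonical Unicode form and collapsed whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_and_dedupe(docs: Iterable[str]) -> Iterator[str]:
    """Yield normalized documents, dropping short fragments and exact duplicates."""
    seen = set()  # SHA-256 digests of documents already emitted
    for raw in docs:
        text = normalize(raw)
        if len(text) < MIN_CHARS:
            continue  # cleaning: skip fragments too short to be useful
        # Case-insensitive exact-match fingerprint of the normalized text.
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # deduplication: this text was already emitted
        seen.add(digest)
        yield text


sample = [
    "Nasjonalbiblioteket  samler inn alt som publiseres i Norge.",
    "nasjonalbiblioteket samler inn alt som\npubliseres i Norge.",  # same text, different case and spacing
    "OCR-støy",  # short fragment
]
result = list(clean_and_dedupe(sample))
print(result)
```

Exact hashing only catches documents that are identical after normalization. A production pipeline working with OCR output would likely also need near-duplicate detection, since two scans of the same page rarely produce byte-identical text.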
The Infrastructure of a Nation’s Mind
Olivia is an HPE Cray Supercomputing EX system equipped with 448 GPUs and more than 64,000 CPU cores. This is where the model training itself happens, backed by a 5.3 PB Cray ClusterStor E1000 storage system.
The project highlights a significant technical gap in current AI discourse. Husnes noted that while many are discussing GPU counts, very few are talking about the logistical nightmare of moving petabyte-scale datasets from a long-term archive through a cleaning pipeline and into a supercomputer. His team essentially had to map this territory from scratch.
The ongoing effort is more than just a technical exercise in storage and compute. It is a blueprint for other non-English-speaking nations. As Husnes puts it, AI needs custodians, not just builders. By treating the LLM as a public utility and a cultural archive, Norway is positioning itself to maintain its intellectual autonomy in an era of algorithmic consolidation.