Study Finds Chinese State Media Content Is Embedded in AI Training Data

A pro-democracy protester uses a laptop computer as he sits on an occupied road in the Admiralty district of Hong Kong early on Oct. 8, 2014. (Ed Jones/AFP via Getty Images)

By Michael Zhuang

Michael Zhuang is a contributor to The Epoch Times with a focus on China-related topics.

May 24, 2026 | Updated: May 25, 2026

New research suggests that content from Chinese state media is deeply embedded in the datasets used to train major artificial intelligence (AI) systems and may be subtly shaping how some models respond to politically sensitive questions.

A study published in the scientific journal Nature on May 13 found that large volumes of material from Chinese state outlets—including Xinhua News Agency and People’s Daily—appear in the training datasets of large language models.

According to the research, when prompted in Chinese on topics related to China’s political system or sensitive domestic issues, several leading AI systems, including ChatGPT, Claude, and Gemini, were more likely to generate responses that closely aligned with the Chinese regime’s official framing. English-language responses to the same questions, the study found, often differed in tone or emphasis.

The researchers stressed that they did not find evidence of hacking or direct manipulation of AI systems. Instead, they argued that the effect likely stems from the structure of the underlying training data.

Since Chinese state media outlets publish large volumes of content that is freely accessible, widely syndicated, and consistently formatted, their content is more easily collected by web crawlers used in AI training pipelines. By contrast, independent news organizations are more likely to operate behind paywalls, enforce copyright restrictions, or block automated scraping, limiting their presence in training datasets.

This asymmetry, the study suggests, may unintentionally give state-aligned narratives a greater footprint in machine learning systems that rely on open internet data.

How Training Data May Shape Model Behavior

Researchers analyzing a large open-source Chinese-language dataset known as CulturaX found that it contains roughly 189 million documents. Within that dataset, content from Chinese state media appeared at a scale far exceeding that of Chinese-language Wikipedia.

The analysis also found that in politically charged contexts—including references to the Chinese Communist Party or Chinese leadership—state media content accounted for a significant share of the dataset’s relevant material.
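To make the idea of a source's "share" of a dataset concrete, the following is a minimal sketch of how such a figure could be computed. It is not the researchers' method: it assumes each record carries a `url` field, the domain list is a hypothetical stand-in for whatever list the study used, and the sample records are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical domain list for illustration; the study's actual list is not given here.
STATE_MEDIA_DOMAINS = {"xinhuanet.com", "people.com.cn"}


def host_matches(host, domains):
    """Return True if host equals, or is a subdomain of, any listed domain."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def state_media_share(records, domains=STATE_MEDIA_DOMAINS):
    """Fraction of records whose URL host falls under the given domains."""
    total = 0
    matched = 0
    for rec in records:
        host = urlparse(rec["url"]).hostname or ""
        total += 1
        if host_matches(host, domains):
            matched += 1
    return matched / total if total else 0.0


# Toy records, invented for this example only.
sample = [
    {"url": "http://www.xinhuanet.com/politics/a.htm"},
    {"url": "http://paper.people.com.cn/b.html"},
    {"url": "https://zh.wikipedia.org/wiki/C"},
    {"url": "https://example.org/d"},
]

# Count documents per host, then compute the matched share.
hosts = Counter(urlparse(r["url"]).hostname for r in sample)
share = state_media_share(sample)
print(hosts.most_common(1), f"{share:.2f}")
```

Restricting the same count to documents that mention a given political term would give the kind of topic-specific share described above, though the filtering rule itself would be another assumption.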

When researchers tested multiple AI models using comparable prompts in Chinese and English, they observed notable differences. In some cases, Chinese-language responses appeared more likely to incorporate official terminology or reflect narratives commonly used in Beijing’s political discourse. English responses, by contrast, tended to be more neutral or varied in framing.

Unlike traditional media channels, such as television or newspapers, AI systems generate synthesized answers that can appear neutral, even when they reflect patterns embedded in training data.

The study also extends its analysis to dozens of countries, suggesting a broader pattern: in environments with lower press freedom, AI models trained on local-language data were more likely to produce outputs that reflect state-aligned framing.

Sun Chen contributed to this report.
