Vanishing Archives, Faltering Algorithms: The Hidden Cost of China’s AI Development Model
Part one in a series on AI in China.
In recent years, global media coverage of China’s AI development has been dominated by sweeping narratives of geopolitical tension and technological one-upmanship, in which the Communist Party mobilizes China’s vast state apparatus and civilian ingenuity to push its AI capabilities—hardware, software, deployment, and application—toward parity with the West. Beneath these narratives, however, lies a critical but widely overlooked issue: because the successful development and application of AI models depends fundamentally on high-quality data, shortcomings in Chinese data quality and availability will be a key limiting factor in the effective development and deployment of China’s foundational AI technologies. These shortcomings amount to a technological ceiling imposed by authoritarian control over information, one with significant implications not only for the trajectory of China’s AI sector but also for the broader geopolitical balance in the years ahead.
Let us first briefly discuss why high-quality data is so important. At the heart of computing lies a fundamental principle: “Garbage in, garbage out.” Unlike human reasoning, which can draw on broad experience to make inferences and logical leaps, AI systems are entirely dependent on the quality of their training data. Poor-quality data can undermine even the most sophisticated algorithms, compromising the accuracy, performance, and reliability of AI systems. A 2024 Appen report surveying 300 companies found data quality to be a top challenge in the construction of AI applications. A 2025 survey from Qlik noted that 81% of AI professionals acknowledged significant data quality challenges within their organizations, threatening returns on investment and business stability. As Drew Clarke, EVP and GM of Qlik’s Data Business Unit, explained, “Companies are essentially building AI skyscrapers on sand... The most advanced algorithms can’t compensate for unreliable data inputs.” Data quality has become one of the main obstacles to AI project success.
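The “garbage in, garbage out” dynamic can be illustrated with a toy experiment (a minimal sketch constructed for this piece, not drawn from any of the surveys cited above): the same simple nearest-neighbor classifier is trained twice on an easy two-class task, once with clean labels and once with a large fraction of labels corrupted. The algorithm never changes; only the data degrades.

```python
import random

random.seed(0)  # fixed seed so the experiment is repeatable

def make_points(n, center, label):
    # Draw n two-dimensional points from a Gaussian cloud around `center`.
    return [((random.gauss(center[0], 1.0), random.gauss(center[1], 1.0)), label)
            for _ in range(n)]

# Two well-separated classes: the task itself is easy.
train = make_points(100, (0, 0), 0) + make_points(100, (3, 3), 1)
test = make_points(50, (0, 0), 0) + make_points(50, (3, 3), 1)

def predict(train_set, x):
    # 1-nearest-neighbor: copy the label of the closest training point.
    nearest = min(train_set,
                  key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    return nearest[1]

def accuracy(train_set):
    return sum(predict(train_set, x) == y for x, y in test) / len(test)

# "Garbage in": flip roughly 40% of the training labels at random.
noisy = [(x, 1 - y if random.random() < 0.4 else y) for x, y in train]

acc_clean, acc_noisy = accuracy(train), accuracy(noisy)
print(f"clean labels: {acc_clean:.2f}, corrupted labels: {acc_noisy:.2f}")
```

With the seed above, the model trained on corrupted labels loses a substantial share of its accuracy even though the underlying task and algorithm are unchanged: no amount of algorithmic sophistication can recover information that the training data no longer contains.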
In China, these issues are more pronounced, for a number of reasons. First, there is simply less Chinese-language information available. On Common Crawl, a massive, open-access web archive that contains petabytes of data collected from billions of webpages across the internet, Chinese comprises only 5.2% of content (compared to 43.2% for English). As of March 2025, English accounted for about 49% of all internet content, while Chinese made up only 1.1%. This trend shows no sign of improving; indeed, much of the Chinese-language web is being systematically erased. In 2023, China had only 3.9 million websites, down from 5.3 million in 2017. Although Chinese internet users make up nearly one-fifth of the global total, a mere 1.3% of global websites use Chinese, down from 4.3% in 2013—a 70% decline in a decade. To make matters worse, many of China’s public databases have been shut down or restricted in recent years. While the Supreme People’s Court used to publish an open database of court verdicts, with 23 million rulings posted online in 2020, only 3 million were released in 2023. Access to public health data is similarly restricted. And, due to concerns about IP and commercial competition, Chinese firms like Tencent and ByteDance, each of which controls huge swathes of the Chinese-language internet, are reluctant to share data with third parties for the purpose of training LLMs.
Second, long-term political censorship and authoritarian control of information have systematically degraded the quality of Chinese-language source data. A large portion of Chinese-language text in the public domain consists of policy propaganda, news reports, and official documents that often feature abstract, vague, and highly repetitive political language, such as “high-quality development,” “building a new development pattern,” “positive political ecology,” and “modernization of the social governance system.” This language lacks clear, verifiable semantic boundaries and is highly time- and agenda-sensitive. Meanwhile, prolonged and far-reaching censorship has resulted in the removal or suppression of vast amounts of content that does not align with a narrowly defined state-sanctioned narrative. This large-scale gap in public language corpora prevents AI models from developing complete knowledge structures and conceptual depth. All of these issues have contributed to what some critics have called “the death of the Chinese internet.”
This deterioration in data quality is already having visible consequences. The Chinese AI model DeepSeek has seen a marked drop in performance since its launch. Once praised as China’s most advanced homegrown LLM, DeepSeek is now widely criticized for generating false, fabricated, or conceptually confusing content. Anecdotal stories of DeepSeek hallucinations abound. When one user in Xi’an asked the model to explain why the city’s main ring road detours at Andingmen, rather than taking a shorter, more direct path, DeepSeek reportedly cited a fictitious “silent zone” designation from an official urban plan, complete with false technical data. Upon investigation, the user discovered that this “silent zone” didn’t exist. When confronted, DeepSeek admitted the error—but then offered an even more fanciful series of justifications involving “quantum entanglement” and entropy.
Although AI-generated hallucinations are not unique to DeepSeek, its hallucination rate is notably high. According to an April 29, 2025 Vectara report, DeepSeek-R1’s hallucination rate was 8%, compared with 0.7% for Google’s Gemini 2.0 pro-exp and 1.5% for OpenAI’s GPT-4o; in earlier tests, DeepSeek-R1’s rate ran as high as 14.3%. Its accuracy on news and information is worse still. NewsGuard found that in news-related prompts, DeepSeek repeated false claims in 30% of responses and failed to provide any answer in another 53%, for a combined failure rate of 83% (an accuracy rate of just 17%), the second-worst result among the 11 models tested. Moreover, when responding to disinformation about China, Russia, and Iran, DeepSeek affirmed falsehoods in 35% of cases, and in 60% of responses, even those that did not repeat disinformation, it framed issues from the Chinese government’s perspective, including on prompts with no explicit connection to China.
What, then, are the broader economic, social, and strategic implications of this issue for China in the coming years? First, China’s ambition to become a global leader in artificial intelligence is fundamentally challenged by the low quality and limited availability of high-value training data—particularly in Chinese. This has restricted the potential of domestic AI models and placed China at a structural disadvantage when compared to countries with more open access to diverse, high-quality datasets. The consequences are not merely technical. As numerous surveys highlight, data quality issues significantly impede AI deployment, leading to diminished returns on vast AI investments by both Chinese tech firms and the state. Tools like DeepSeek exemplify the wasted R&D spending and reduced reliability that result. As AI increasingly underpins core economic sectors like logistics and manufacturing, these foundational flaws risk stalling productivity gains and delaying broader digital transformation.
Second, China's shrinking digital archive is creating a "knowledge vacuum" that is further isolating citizens and researchers from global discourse. Such intellectual narrowing suppresses critical thinking and diminishes academic, technological, and cultural innovation, already major issues of concern for Beijing as it seeks to reduce technological and economic reliance on the West.
How, then, is Beijing responding? Recent policy directives, such as the 2024 “AI+ Initiative,” suggest that the CCP is shifting tactics, pivoting from a singular emphasis on cutting-edge innovation to an “application-oriented” AI strategy in which state-led directives channel resources toward deploying existing AI tools across industries, public services, and governance. Rather than pouring disproportionate funding into frontier model research, Beijing is increasingly prioritizing rapid prototyping and large-scale rollouts, mandating, for example, that ministries embed AI into smart-city management, social surveillance, healthcare, and industrial automation, so that even with imperfect data, AI strengthens social stability, administrative efficiency, and digital governance.
Thus, the erosion of data quality and the rising unreliability of AI outputs in China, exemplified by platforms like DeepSeek, reflect deeper systemic issues tied to state control over information. Beijing, however, is demonstrating that it is not without options. By moving toward an application-based approach to AI, it can selectively curate and synthesize the narrowly scoped datasets needed for specific uses rather than relying on real-world text whose breadth and cleanliness are compromised by censorship. As we will discuss next week, this could have serious geopolitical ramifications: by demonstrating that AI systems can be tailored to enforce and export a values-aligned, “digital authoritarian” model rather than pursue open-ended, Western-style innovation, Beijing sets a precedent for other states to adopt similar governance-first approaches to AI. Washington must respond in kind if it is to meet this challenge.