Baidu, a leading Chinese internet search provider, has updated its Baike service, a Wikipedia-like online encyclopedia, to block Google and Microsoft Bing from scraping its content. The move underscores the growing value of large datasets for training artificial intelligence (AI) models and applications, and it is part of a broader trend in which technology companies are reevaluating their data-sharing policies to safeguard valuable digital resources. Industry observers have noted similar actions from other companies seeking to control how their information is accessed and used by third-party platforms.
In 2023, Microsoft reportedly considered restricting access to its internet-search data for rival search engine operators, particularly those using it to develop chatbots and generative AI services, a sign of the growing emphasis on data security and proprietary content management among tech giants. In parallel, Reddit blocked most search engines from indexing its content, making an exception for Google, which had struck a financial agreement for data access. Such measures point to a shift towards monetizing data access and controlling its distribution across platforms.
Updated robots.txt File
The latest update to Baidu Baike’s robots.txt file denies access to both the Googlebot and Bingbot crawlers; snapshots preserved by the Wayback Machine show the change took effect on August 8. Previously, these search engines were permitted to index Baidu Baike’s repository of nearly 30 million entries, though some subdomains were already restricted.
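The article does not quote the file itself, but a robots.txt that shuts out specific crawlers while leaving others unaffected typically looks like the minimal sketch below. The directives and paths here are illustrative, not Baidu’s actual rules:

```
# Illustrative robots.txt - not Baidu Baike's actual file.
# Each User-agent block applies to the named crawler.

User-agent: Googlebot
Disallow: /        # deny Google's crawler access to all paths

User-agent: Bingbot
Disallow: /        # deny Microsoft Bing's crawler access to all paths

User-agent: *
Allow: /           # all other crawlers remain permitted
```

It is worth noting that robots.txt is a convention that well-behaved crawlers honor, not an enforcement mechanism, which is one reason previously indexed pages can linger in search results after a block takes effect.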
Implications for AI Development
Baidu’s decision aligns with an industry-wide trend where AI developers seek high-quality content for training their models. Companies like OpenAI have established agreements with content publishers, such as Time magazine and the Financial Times, to gain access to extensive archives for their AI projects. This practice highlights the competitive landscape for securing valuable datasets critical for advancing AI capabilities.
For now, the Chinese-language Wikipedia, with its roughly 1.43 million entries, remains available to search engine crawlers. And despite Baidu’s restrictions, entries from Baidu Baike still appear in Google and Bing search results, likely because of older cached content. The situation underscores the persistent demand for comprehensive data from both AI developers and search engines.
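Readers who want to check what a site’s published rules currently say can use Python’s standard-library urllib.robotparser, which answers whether a given crawler may fetch a given URL. This is a sketch for illustration; the sample page URL below is an assumption, not one cited in this article:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (sample page URL assumed for illustration).
parser = RobotFileParser("https://baike.baidu.com/robots.txt")
parser.read()

# Ask whether each crawler may fetch a sample page under the published rules.
page = "https://baike.baidu.com/item/example"
for agent in ("Googlebot", "Bingbot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, page)
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Because the parser reads the live file, its answers reflect the current rules rather than whatever a search engine has already cached, which is exactly the gap that keeps blocked pages visible in results for a while.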
Baidu’s restrictions on its Baike content reflect a broader industry trend towards controlling and monetizing valuable online data. As AI technology advances, access to extensive, curated datasets becomes increasingly critical. This move may prompt other companies to reassess their data-sharing practices, leading to more restricted access or commercial arrangements for data usage.