
Join the pushback against online censorship, cancel culture, and surveillance.

Google Says It Still Uses Blocked Website Content to Train Search AI, Even If Publishers Say No

Google sidesteps AI opt-outs by classifying web data as fair game once it's funneled into search.


If you’re tired of censorship and surveillance, subscribe to Reclaim The Net.

Under questioning in a federal antitrust trial, a Google executive disclosed that the company can still use web content to train AI features within its search engine, even when website publishers have explicitly opted out of such use for broader AI model development.

Eli Collins, vice president at Google DeepMind, testified that while Google’s AI lab respects publisher restrictions on data use, those limitations don’t necessarily bind other parts of the company. According to Collins, if a generative AI model like Gemini is transferred to the search division, it may then be refined using web data that was originally excluded from DeepMind’s training corpus, so long as it’s for the purpose of search.

This admission came during cross-examination by Department of Justice attorney Diana Aguilar. “Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?” she asked. “Correct — for use in search,” Collins replied.

Google’s AI-generated summaries, which appear above traditional search results, are at the core of publisher concerns. Many say that these features discourage users from clicking through to their sites, eroding web traffic and ad revenue, while simultaneously feeding Google’s AI with data derived from those same sites.

The tech giant maintains that website owners who wish to prevent this use must go beyond standard AI opt-outs and block Google Search indexing altogether, an approach governed by the well-established robots.txt protocol. “Google has a separate way for publishers to manage their content in Search via the well-established robots.txt web standard,” a spokesperson noted.
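To make the distinction concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. It rests on assumptions the article does not state: a hypothetical publisher at example.com, the AI opt-out expressed through Google's Google-Extended crawler token, and Search crawling governed by the Googlebot token. It only checks what each robots.txt file permits; it says nothing about how Google applies these rules internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts out of AI training only.
# Assumption: the AI opt-out is expressed via the Google-Extended token.
AI_OPT_OUT_ONLY = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

# Hypothetical robots.txt: also blocks the Search crawler entirely.
FULL_SEARCH_BLOCK = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Disallow: /
"""

# Hypothetical article URL used for illustration.
URL = "https://example.com/news/story.html"


def crawl_permissions(robots_txt, url, agents=("Google-Extended", "Googlebot")):
    """Return a dict mapping each crawler token to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


print(crawl_permissions(AI_OPT_OUT_ONLY, URL))
print(crawl_permissions(FULL_SEARCH_BLOCK, URL))
```

Under the first file the page remains open to the Search crawler, which, going by Collins's testimony, would leave its content available for training search-specific AI features. Only the second file closes that path, at the cost of removing the site from Google's search crawl.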

These revelations surfaced during a three-week trial in Washington, D.C., where Judge Amit Mehta is reviewing proposals aimed at curbing Google’s dominance in search. The proceedings follow a 2024 decision in which the court found that the company had unlawfully monopolized the market.

Government lawyers are pushing for sweeping remedies, including forcing Google to divest its Chrome browser, prohibiting it from paying to be the default search provider on devices, and restricting the integration of AI services like Gemini into its broader ecosystem.

As part of their argument, DOJ attorneys pointed to a document dated August 26, 2024, titled “Search GenAI <> Gemini v3,” which outlined how Google filtered out 80 billion of 160 billion tokens (segments of online content) after applying publisher opt-out preferences. It also identified search sessions and YouTube videos as additional sources of training data.

Judge Mehta questioned Collins directly about these figures. “The 80 billion out of 160 billion tokens, 50% is removed by publishers opting out?” he asked. “That is correct,” Collins responded.

Later in the hearing, Google’s defense team emphasized that the company’s search strength doesn’t preclude rivals from building robust AI models. Collins gave the example of a chatbot retrieving accurate sports scores through commercial agreements with data providers, rather than scraping the open web.

Still, court exhibits revealed that Google has seriously considered how its treasure trove of search data might give its AI tools an edge. One internal briefing intended for DeepMind CEO Demis Hassabis mentioned experiments involving search queries and ranking data to test their effect on AI performance.

Aguilar pressed Collins on whether such a model had ever been created. “Not that I’m aware,” he responded. When asked whether Hassabis had considered it a worthwhile direction, Collins confirmed: “Yes.”
