Media companies’ lawsuit against OpenAI latest in growing number of challenges to AI data scraping

At least two other lawsuits have been filed against AI-driven companies since September

Sana Halwani, Aaron Wenner

A lawsuit that Canadian media organizations filed against OpenAI last week is the latest in a string of legal challenges claiming that artificial intelligence-driven companies are unlawfully scraping copyrighted data. It likely won’t be the last, experts say, because the courts’ stance on the issue remains unclear.

“As long as companies need data in order to run machine-learning tools and make AI offerings, and as long as they continue to take that data without compensating other companies for it, then you're going to see lawsuits like this,” says Aaron Wenner, chief strategy officer of legal research AI startup Jurisage.

“OpenAI is going to make the case that this is in the public domain, and that's genuinely new,” Wenner adds. “Those are open questions.”

Toronto Star Newspapers Limited, the Canadian Broadcasting Corporation, the Globe and Mail Inc., and Canadian Press Enterprises Inc. are among the plaintiffs that filed a statement of claim against OpenAI on Thursday. They allege that the company infringed on their copyright when it scraped content from the news organizations’ websites to train its ChatGPT service.

The news organizations allege that by scraping content from their websites, OpenAI circumvented the sites’ protective measures – like user accounts or subscription-only access – to develop products for profit. The plaintiffs also claim the scraping violated their websites’ terms of use, which authorize use of the content only for personal, non-commercial purposes.

At least two other lawsuits alleging similar claims have been filed since September. That month, British Columbia artist and illustrator Michael Dean Jackson filed a proposed class action against OpenAI and Microsoft, alleging the companies reproduced or scraped copyrighted works owned by himself and other class members to train their generative AI models. Nonprofit legal database CanLII sued Caseway AI in November, alleging the company unlawfully scraped court decisions, legislation, and secondary sources that CanLII had analyzed, curated, catalogued, annotated, and otherwise enhanced.

In response to the news organizations’ lawsuit, an OpenAI spokesperson told Canadian Lawyer on Friday, “Our models are trained on publicly available data, grounded in fair use and related international copyright principles that are fair for creators and support innovation.

“We collaborate closely with news publishers, including in the display, attribution, and links to their content in ChatGPT search, and offer them easy ways to opt out should they so desire,” the spokesperson added.

In November, Caseway founder Alistair Vigier expressed a similar sentiment in response to CanLII’s lawsuit, stating that “court documents are public record, not owned by any organization, including CanLII.”

Allegations of companies scraping copyrighted content for profit are not new, predating the explosion of generative AI tools like ChatGPT in recent years, Wenner says. “As long as there's been data on the internet, there have been people scraping it.”

Litigation challenging scraping efforts is not new, either. For example, Wenner points to a lawsuit the Toronto Real Estate Board filed against Mongohouse in 2018, alleging that the property listings website’s for-profit business was based on unauthorized access to TREB data. A federal court issued a permanent injunction against Mongohouse in 2019.

The same year, the Ninth Circuit – a federal appellate court in the US – ruled that data analytics company hiQ Labs had the right to scrape content from public LinkedIn profiles. The US Supreme Court later vacated the ruling.

Still, the recent spate of lawsuits is set to chart new legal territory. If scraping is an age-old concept, Wenner says, “What’s new is doing it on the absolutely massive scale that companies like OpenAI have done – like internet scale as opposed to website scale.”

No Canadian court has determined whether scraping on this scale is acceptable or whether the companies engaged in the scraping have a legitimate defence, notes Sana Halwani, a partner at Lenczner Slaght and one of the lawyers representing the media organizations in last week’s lawsuit against OpenAI.

“We are in precedent-setting territory with this case,” Halwani says, adding that she anticipates seeing more claims with similar allegations.

In the US, “we've seen a number of lawsuits that have been launched in the last couple of years against OpenAI and other AI companies,” she says. “If the US is the leading edge sometimes on these things, then I would expect that we would start seeing more and more in Canada.”

Of course, one factor that would fuel more of these challenges is companies continuing to scrape copyrighted material. As generative AI tools like large language models continue to proliferate, so will the need for “a massive corpus of data,” says Wenner, citing the oft-repeated adage that AI tools are only as good as the data that trains them.

“The profit motives are so high across the board. I don't know if we're going to see a stop to it,” Wenner says. Still, he says, “What’s going on is we're redefining the terms of what's acceptable and what isn't acceptable. So, what I would expect to see is things will shake out, and there will be a set of rules of the game that are hopefully well-understood by all the parties involved.”

One possible route forward for AI-driven companies? They can pay for the data they need.

“We know from public reporting that OpenAI has entered into a number of partnerships and licensing agreements with other media and news media organizations, where those companies are making content available under a license,” Halwani says. “Which tells us they understand that they're supposed to pay for this content, that they're not supposed to just take it without paying for it.”