AI bots that scrape the internet for training data are hammering the servers of libraries, archives, museums, and galleries, in some cases knocking their collections offline, according to a new survey published today. While the impact of AI bots on open collections has been reported anecdotally, the survey is the first attempt to measure the problem, which in the worst cases makes valuable public resources unavailable to humans because the servers hosting them are swamped by scrapers.
“I'm confident in saying that this problem is widespread, and there are a lot of people and institutions who are worried about it and trying to think about what it means for the sustainability of these resources,” the author of the report, Michael Weinberg, told me. “A lot of people have invested a lot of time not only in making these resources available online, but building the community around institutions that do it. And this is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem.”
The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures.
Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase to AI training data bots, and an additional seven said the AI bots could be contributing to it.
“Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. As one respondent noted, ‘If they wanted us dead, we’d be dead.’”
One respondent estimated that their collection experienced one DDoS-style incident every day that lasted about three minutes, saying this was highly disruptive but not fatal for the collection.
“The impact of bots on the collections can also be uneven. Sometimes, bot traffic knocks entire collections offline,” the report said. “Other times, it impacts smaller portions of the collection. For example, one respondent’s online collection included a semi-private archive that normally received a handful of visitors per day. That archive was discovered by bots and immediately overwhelmed by the traffic, even though other parts of the system were able to handle similar volumes of traffic.”
Thirty-two respondents said they are taking active measures to prevent bots. Seven indicated that they are not taking measures at this time, and four were either unsure or currently reviewing potential options.
The report makes clear that it can’t provide a comprehensive picture of the AI scraping bot issue, but the problem is clearly widespread, though not universal. The report notes that one inherent difficulty in measuring the problem is that organizations are often unaware bots are scraping their collections until they are flooded with enough traffic to degrade the performance of their sites.
“In practice, this meant that many respondents woke up one morning to an unexpected stream of emails from users that the collection was suddenly, fully offline, or alerts that their servers had been overloaded,” the report said. “For many respondents, especially those that started experiencing bot traffic earlier, this system failure was their first indication that something had changed about the online environment.”
Just last week, the University of North Carolina at Chapel Hill (UNC) published a blog post that described how it handled this exact scenario, which it attributed to AI bot scrapers. On December 2, 2024, the University Libraries’ online catalog “was receiving so much traffic that it was periodically shutting out students, faculty and staff, including the head of User Experience,” according to the school. “It took a team of seven people and more working almost a full week to figure out how to stop this stuff in the first instance,” Tim Shearer, an associate university librarian for Digital Strategies & Information Technology, said. “There are lots of institutions that do not have the dedicated and brilliant staff that we have, and a lot of them are much more vulnerable.”
According to the report, one major problem is that AI scraping bots ignore robots.txt, a voluntary compliance protocol that sites can use to tell automated tools, like these bots, not to scrape them.
“The protocol has not proven to be as effective in the context of bots building AI training datasets,” the report said. “Respondents reported that robots.txt is being ignored by many (although not necessarily all) AI scraping bots. This was widely viewed as breaking the norms of the internet, and not playing fair online.”
We’ve previously reported that robots.txt is not a perfect method for stopping bots, despite more sites than ever using the tool because of AI scraping. UNC, for example, said it deployed a new, “AI-based” firewall to handle the scrapers.
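For context, robots.txt is just a plain-text file of directives, and honoring it is entirely up to the crawler. A minimal sketch in Python, using the standard library’s urllib.robotparser, shows what a well-behaved bot is supposed to do before fetching a page; the collection URLs and the bot name here are hypothetical, not taken from the report.

```python
# Minimal sketch of a polite crawler checking robots.txt before fetching.
# The archive URL and user-agent string are hypothetical examples.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example-archive.org/robots.txt")  # hypothetical collection
rp.read()  # fetch and parse directives like "User-agent: *" / "Disallow: /search/"

page = "https://example-archive.org/collections/item/123"
if rp.can_fetch("ExampleScraperBot/1.0", page):
    print("robots.txt permits fetching", page)
else:
    print("robots.txt disallows fetching", page, "- a polite bot stops here")
```

The catch, as the report describes, is that nothing enforces this check: a scraper that skips it entirely sees no technical barrier, which is why institutions are turning to firewalls and other countermeasures instead.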
Making this problem worse is that many of the organizations being swamped by bot traffic are reluctant to require users to log in or complete CAPTCHA tests to prove they’re human before accessing resources, because that added friction makes people less likely to use the materials. In other cases, even if an institution did want to implement some kind of friction, it might not have the resources to do so.
“I don't think that people appreciate how few people are working to keep these collections online, even at huge institutions,” Weinberg told me. “It's usually an incredibly small team, one person, half a person, half a person, plus, like their web person who is sympathetic to what's going on. GLAM-E Lab's mission is to work with small and medium sized institutions to get this stuff online, but as people start raising concerns about scraping on the infrastructure, it's another reason that an institution can say no to this.”