Technology


2 authors say OpenAI 'ingested' their books to train ChatGPT. Now they're suing, and a 'wave' of similar court cases may follow. (www.businessinsider.com)

submitted 2 years ago by [email protected] to c/[email protected]

137 comments

Two authors sued OpenAI, accusing the company of violating copyright law. They say OpenAI used their work to train ChatGPT without their consent.


[–] [email protected] 1 points 2 years ago* (last edited 2 years ago) (10 children)

Again, that's not comprehension, that's mixing in yet more data that was put into the model. If you ask an AI to do something that is outside of the dataset it was trained on, it will massively miss the mark. At best, it will produce something that is close to what you asked, but not quite right. It's why an AI model that could beat the world's best Go players was beaten by a simple strategy that even amateur Go players could catch and defeat--the AI never came across that strategy while it was training against itself, so it had no idea what was going on.

And fair use isn't the bulletproof defense you think it is. Countless fan games, such as AM2R, have been shut down over the decades, most of them far more transformative than my hypothetical example. You bet your ass that if I tried to profit off of that hypothetical crossover roguelike, using sprites, models, and textures directly ripped from their respective games, it would be shut down immediately.

EDIT: I also want to address the assertion that AI isn't trained to recreate existing works; in my view, that's wholly irrelevant. If I made a program that took all the Harry Potter books, ran each word through a thesaurus, and sold it for profit, that would still be infringing, even if no meaningful words were identical to the original source material. Granted, if I curated the output and made a few of the more humorous excerpts available for free through a Mastodon or Lemmy post, that would likely qualify as fair use. However, that would be because a human mind is parsing the output and filtering out the 99% of meaningless gibberish that a thesaurus-ized Harry Potter would result in.

The only human input to an AI given with consent to being part of its output is the minuscule prompt supplied by the user, which does not clear the de minimis threshold required for copyright protection under law. The rest of the input--the countless terabytes of data scraped from the internet and fed into the AI's training model--was all taken without the authors' consent, and their contribution vastly outweighs that of the prompt author and OpenAI's own transformative efforts via the LLM.

[–] [email protected] 5 points 2 years ago* (last edited 2 years ago) (9 children)

You seem to misunderstand what an LLM does. It doesn't generate "right" text. It generates "probable" text. There's no right or wrong, since it only generates a single word ahead of where it currently is. That's why it can generate information that's complete bullshit. I don't know the details of this Go AI you're talking about, but it's pretty safe to say it's not an LLM and doesn't use a similar technique, since Go is a game and not a creative work. There are many techniques for creating algorithms that fall under the "AI" umbrella.
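
To make the "probable, not right" point concrete, here's a minimal sketch of one-word-at-a-time generation. It's a toy bigram counter, not how ChatGPT or any real LLM is implemented (those use neural networks over tokens rather than raw word counts). The corpus and the names `train_bigram` and `generate` are made up purely for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word follows each other word in the text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Emit one word at a time, sampled by how often it followed the previous word.

    Nothing here checks whether the result is true or sensible;
    each step only asks which word is likely to come next.
    """
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:  # the last word never had a successor in the corpus
            break
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(out)


# Hypothetical toy corpus, an assumption for this example only.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
print(generate(model, "the", 5, random.Random(0)))
```

Every adjacent pair of words in the output did occur somewhere in the corpus, but the sentence as a whole doesn't have to, which is the same reason a model can produce fluent text that is simply wrong.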

Your second point is a whole different topic. I was referring to a "derivative work", which is not the same as "fair use". Derivative works are quite literally everywhere. https://en.wikipedia.org/wiki/Derivative_work A derivative work doesn't require fair use, as it no longer falls under the same copyright as the original. Fair use, by contrast, is an exception under which copyrighted work can be used without infringing.

And also, those projects most of the time do not get shut down because they are actually illegal, but they get shut down because companies with tons of money can send threatening letters all day and have a team of high quality lawyers to send them. A cease and desist isn't a legal enforcement from a judge, it's a "recommendation for us not to (attempt to) sue you". And that works on most small projects. It very very rarely goes to court over these things. And sometimes it's because it's totally warranted. Especially for fan projects it's extremely hard to completely erase all protected copyrightable work, since they are specifically made to at least imitate or expand upon what they're a fan project of.

EDIT: Minor clarification

[–] [email protected] 2 points 2 years ago* (last edited 2 years ago) (3 children)

Also, it should be mentioned that pretty much all games are in some form derivative works. Let's take Undertale, since I'm most familiar with it. It's well known that Undertale takes a lot of elements from other games. RPG mechanics from Mother and Earthbound. Bullet hell mechanics from games like Touhou Project. And more from games like Yume Nikki, Moon: Remix RPG Adventure, Cave Story. And funnily enough, the creator has even cited Mario & Luigi as a potential inspiration.

So why was it allowed to exist without being struck down? Because it fits the definition of a derivative work to the letter. You can find individual elements which are taken almost directly from other games, but it doesn't try to be the same as what it was created after.

[–] [email protected] -1 points 2 years ago (2 children)

Undertale was allowed to exist because none of the elements it took inspiration from were eligible for copyright protection. Everything that could have qualified for copyright protection--the dialogue, plot, graphical assets, music, source code--was either manually reproduced directly by Toby Fox and Temmie Chang, or used under permissive licenses that allowed reproduction (e.g. the GameMaker Studio engine). Meanwhile, the vast majority of content OpenAI used to feed its AI models were not produced by OpenAI directly, nor were they obtained under permissive license.

So... thanks for proving my point?

[–] [email protected] 2 points 2 years ago* (last edited 2 years ago)

The AI models (not specifically OpenAI's models) do not contain the original material they were trained on. Just as the creators of Undertale took in the games they were inspired by and learned from them, the AI learned from the material it was trained on how to make similar yet distinctly different output. You do not need a permissive license to learn from something once it has been made public.

You can't just put your artwork up on a wall, let people look at it, and then demand that none of them learn from it because you have a license that says learning from it is not allowed. That's insane, which is why (as far as I know) no legal system acknowledges that as a legal defense.

[–] [email protected] 2 points 2 years ago

Meanwhile, the vast majority of content OpenAI used to feed its AI models were not produced by OpenAI directly, nor were they obtained under permissive license.

That's input, not output, so not relevant to copyright law. If your arguments focused on the times that ChatGPT reproduced copyrighted works, then we could talk about some kind of ContentID system for preventing that before it happens, or compensating the creators if it does. I think we can all acknowledge that it feels iffy that these models are trained on copyrighted works, but this is a brand new technology. There's almost certainly a win-win outcome here.
