People will really yell at you if you suggest AI models should pay for a license to all the works they ingest and cite their sources in their answers. I think that’s partly because they believe in the magic of machine intelligence. But we see more and more undeniable examples of straight-up copying.
There’s no such thing as magic, or creative machines. Creativity is based on the lived experience of being born, living, and dying in a natural world – which machines will lack for a long time to come.
It's called overfitting. The more specific your question, the more likely it'll find a single reference that answers it, and it'll parrot that reference verbatim.
Normally an LLM will blend together all it knows about a topic, but if it only knows one thing, that's what you'll get.
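The degenerate case described above can be sketched with a toy greedy bigram "language model" (my own illustration, not anything an actual LLM vendor publishes): trained on a single document where every continuation is unambiguous, the most likely next word at every step is the source itself, so generation parrots the training text verbatim.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count word -> next-word transitions in the training text."""
    words = text.split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def generate(model, start, max_len):
    """Greedily follow the most likely next word at each step."""
    out = [start]
    for _ in range(max_len):
        successors = model.get(out[-1])
        if not successors:
            break  # no known continuation
        out.append(successors.most_common(1)[0][0])
    return " ".join(out)

# One document, each word unique -> every transition is unambiguous.
doc = "a single specific question often matches exactly one reference in the training data"
model = train_bigrams(doc)
print(generate(model, "a", 20))
# -> "a single specific question often matches exactly one reference in the training data"
```

With only one source in the "training set," greedy decoding has nothing to blend, so the output is a verbatim copy of the input. Real LLMs are vastly more complex, but the intuition — a sufficiently narrow context leaves one dominant continuation — is the same.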
None of it is A.I., of course. It's machine learning, which is, by definition, copying: "When I see this, put that next to it." Of course they should pay the creator of "that".
Do we charge human authors a license to all the works they ingest? And require them to cite their sources?
Not, I think, for fiction, and the latter mainly for verification, not credit.
Flagrant plagiarism is dealt with when it happens; not in case it might.
At this point, my immediate reaction to LLM engines scraping my publication's stuff is "fuck you, pay me." I never gave consent, and I refuse to grant it to simpering tech bros who sneer that they're "going to put us out of business."
I feel like actually producing true and accurate references is still extremely hard.
I mean, they can fully produce copies of things, but from its point of view it's not referencing any particular thing.
(And you may have seen what happens when you ask for references *in* the output.)
As far as I can tell, Google didn't copy there; it's just a link to Quora. This could be a presentation issue -- I did the experiment in Safari on macOS, and it certainly did show the link.
This has been my experience with AI-generated source code. The AI-generated solution is usually a direct copy of the top Stack Exchange hit. (When it isn't, it's often unusable and will hallucinate things like nonexistent packages or else it mixes up languages.)
The fact that this is how it operates and this is literally all it can do, despite the obvious total loss of context, should be a warning to the people buying in on it. And yet!
OpenAI argues in their copyright defense that getting it to spit out rote copies of training data requires the user to “hack” the platform in violation of the terms, but it happens a lot and it’s trivially easy to induce.
I'm positive lawyers are working on how they can get AI classified as a person so all the scraping can be called education so they don't have to pay anyone shit.
It's because aside from the labor-saving push from the top, most enthusiasm for AI comes from a fundamental disdain for expertise.
Citing sources, providing compensation, such gestures of considering the people who actually produce things ruins the experience of mindless consumption for them.
my understanding has been that fair use ultimately permits this. though feel free to argue with Masnick for the next 48 hours about it, I think I might actually learn something compared to the usual interlocutors telling him that §230 makes Facebook a publisher