OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

IndustryStandard@lemmy.world · 5 months ago

leftzero@lemmynsfw.com · 5 months ago

If the models are random then we shouldn’t be trusting them to do anything, let alone serious applications.

That’s not the reason we shouldn’t be using them for anything other than generating lorem ipsum-style text or dialogue for non-quest-critical NPCs in games.

The reason is that, paraphrasing Neil Gaiman, LLMs don’t generate information, they generate information-shaped sentences.

Specifically, an LLM takes a sequence of characters (not a word or text; LLMs have no concept of words, or text, or anything else for that matter; they’re just an application of statistics on large volumes of sequences of characters; no meaning or intelligence involved, artificial or not)… as I was saying, an LLM takes a sequence of characters, pushes it through its model, and outputs the sequence of characters most likely to follow it in the texts its model has been trained on (or rather, the most likely after discarding the ones its creators have labelled as politically incorrect).
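To make the "most likely continuation" idea concrete, here is a minimal sketch. It is a toy, not how any real LLM is built: it assumes a character-level n-gram counter in place of a neural network (production models work on sub-word tokens and usually sample rather than always taking the top choice), and the names `CORPUS`, `train`, `complete`, and `blocked` are made up for illustration. The `blocked` set stands in for the filtering described above.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for "the texts the model has been trained on".
CORPUS = "the cat sat on the mat. the cat ate the rat. "


def train(text, order=3):
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def complete(counts, prompt, length=20, order=3, blocked=frozenset()):
    """Greedily append the most frequent next character that is not blocked."""
    out = prompt
    for _ in range(length):
        context = out[-order:]
        candidates = [
            (ch, n)
            for ch, n in counts.get(context, {}).items()
            if ch not in blocked
        ]
        if not candidates:
            break  # context never seen in training: nothing to continue with
        # Highest count wins; ties are broken by character so output is stable.
        out += max(candidates, key=lambda c: (c[1], c[0]))[0]
    return out


if __name__ == "__main__":
    model = train(CORPUS)
    print(repr(complete(model, "the", length=5)))
    print(repr(complete(model, "the", length=5, blocked={"c"})))
```

On that tiny corpus the prompt "the" continues to "the cat ", and with "c" blocked it continues to "the rat." instead. Nothing in the loop checks whether a cat or a rat exists; it only replays what was frequent in the training text.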

That’s all they do, and they’re excellent at it (or would be if it weren’t for the aforementioned filters), but that’ll never give you a cure for cancer unless there already was one in their training data.

They take texts written by humans, shred them, and give you their badly put-back-together, desiccated corpses, drained of any and all meaning or information, but looking very convincingly (until you fact check them) like actually meaningful or informative texts.

That is what makes them dangerous. That and the fact that the bastards selling them are marketing them for the jobs they’re least capable of doing, that is, providing reliable information.

(And that’s while they can still be trained on meaningful and informative texts written by humans, inasmuch as anything found on reddit, facebook, or xitter can be considered meaningful or informative. But a higher and higher percentage of the text on the internet is being generated by LLMs, so soon enough it’ll be impossible to train new models on anything but 99% LLM-generated garbage, at which point the whole bubble will implode, as anyone who’s wasted time, paper, and toner playing with a photocopier, or anyone familiar with the phrase “garbage in, garbage out”, will already have realised. That is probably why the LLM peddlers are ignoring robots.txt and copyright laws in a desperate effort to scrape whatever’s left of the bottom of the barrel.)
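For anyone unsure what "ignoring robots.txt" means in practice: the file is only a request, and it is the crawler that has to check it before fetching. A rough sketch of that check using Python's standard `urllib.robotparser` follows. The rules, the `ExampleAIBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all invented for illustration and are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for path in ("/articles/1", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

A well-behaved crawler runs something like this and skips the disallowed URLs. A crawler that never makes the check is not stopped by anything in the file itself.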

lemmyvore@feddit.nl · 5 months ago

LLMs don’t generate information, they generate information-shaped sentences

That is beside the point. A random number generator is more or less random but it still has applications.

The problem is not them being random, it’s hiding that they are being random so they can be used for applications where randomness is not a feature.

leftzero@lemmynsfw.com · 5 months ago

The problem is not them being random.

They are not random, that’s the point. They’re entirely deterministic and very precise, and they aren’t hiding anything; they will give you the most likely (not blacklisted) sequence of characters to follow your input according to their model. What they won’t give you is information, except by accident.

If they were random (hidden or not) they’d be harmless, no one would trust them any more than one of those eight ball toys, or your average horoscope.

The issue is that they’re very not random, so much that there’s no way to know if what they are saying bears any accidental semblance to the truth without fact checking… and that very soon they’ll have replaced any feasible way to fact check them, since all the supposed “facts” we’ll have access to will have been generated by LLMs trained on LLM-generated garbage.