Continue reading with 20% off? Not today thanks
Just get GoodLinks and “share” it ;)
“Open Source” is mostly the right term. AI isn’t code, so there’s no source code to open up. If you provide the dataset you trained on, and open up the code used to train the model, that’s pretty close.
Otherwise, we need to consider “open weights” and “free use” to be more accurate terms.
For example, ChatGPT 3+ is undeniably closed/proprietary. You can’t download the model and run it on your own hardware. The dataset used to train it is a trade secret. You have to agree to all of OpenAI’s terms to use it.
LLaMa is way more open. The dataset is largely known (though no public master copy exists). The code used to train is open source. You can download the model for local use, and train new models based off of the weights of the base model. The license allows all of this.
It’s just not a 1:1 equivalent to open source software. It’s basically the equivalent of royalty free media, but with big collections of conceptual weights.
“AI isn’t code”
Yes it is. It defines a function from input to output. It’s not x86 or Arm code. It’s code that runs on a different type of machine. It’s a type of code that you may not be able to read, but it’s still code.
The problem is that data is code, and code is data. An algorithm to compute prime numbers is equivalent to a list of prime numbers (see also homoiconicity and interpretation, though they’re not relevant to this discussion). Yet we still want to make a distinction.
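To make the primes point concrete, here is a minimal Python sketch. The function names and the cutoff of 30 are made up for illustration: one function computes the primes, the other just reads them from a stored list, and a caller asking for primes below some bound in range can’t tell which one answered.

```python
def primes_below(n):
    """Compute every prime below n with a sieve of Eratosthenes (the 'code' version)."""
    if n < 2:
        return []
    is_prime = [True] * n
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # Cross out every multiple of i, starting at i*i.
            for j in range(i * i, n, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]


# The 'data' version: the same answers written down ahead of time.
# The cutoff of 30 is an arbitrary choice for this sketch.
PRIMES_BELOW_30 = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]


def lookup_primes_below(n):
    """Answer the same question from the stored list; only valid up to the cutoff."""
    if n > 30:
        raise ValueError("the stored list only covers primes below 30")
    return [p for p in PRIMES_BELOW_30 if p < n]
```

The two only diverge past the cutoff: the sieve keeps going, the list runs out. That’s roughly the distinction people still want to draw.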
Is a PAQ-compressed copy of the Hitchhiker’s Guide code? Technically, yes; practically, no, because the code is just a fancy representation of data (PAQ is basically an exercise in finding algorithms that produce particular data to save space). Is a sorting algorithm code? Most definitely: it can’t even spit out data without being given an equally sized amount of data. On that scale, from code to code representing data, AI models are at least three-quarters of the way towards code representing data.
As such, I’d say that AI models are data in the same sense that holograms are photographs. Do they represent a particular image? No, but they represent a related, indexable set of images. What they definitely aren’t is rendering pipelines. Or, and that’s a whole other possible line of argument: requiring Turing-complete interpretation.
I think it comes down to how it’s used.
An LLM is nothing unless it’s used to process some other things. It does something. It predicts the likelihood of words following a sequence of other words. It has no other purpose. You can’t take the model, analyse it in a different way and extract different conclusions. It is singular in function. It is a program.
Data has no function. It is just data.
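A toy illustration of “predicts the likelihood of words following other words”, as a sketch only. This is a bigram counter, nowhere near a real LLM, and every name in it is hypothetical. The “model” is a plain table of counts; it only predicts anything once a function reads it.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it. The result is just a table."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(model, word):
    """Turn the stored counts into probabilities for the word that follows `word`."""
    followers = model.get(word)
    if not followers:
        return {}  # word never seen with a follower during training
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


# Tiny made-up training text, purely for illustration.
model = train_bigram("the cat sat on the mat and the cat slept")
probs = next_word_probs(model, "the")
```

Here “the” is followed by “cat” twice and “mat” once, so `probs` gives “cat” two-thirds and “mat” one-third. Whether you call the table of counts a program or data is exactly the question being argued here.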
The “battle” is the result of copyright people trying to use open source people for their ends.
In the past, for software, the focus was completely on the terms of the license. If you look at OSI’s new definition, you will find no mention of that, despite the fact that common licenses in the AI world are not in line with traditional standards. The big focus is data, because that is what copyright people care about. AI trainers are supposed to provide extensive documentation on training data. That’s exactly the same demand that the copyright lobby managed to get into the European AI Act. They will use that to sue people for piracy.
Of course, what the copyright people really want is free money. They’re spreading the myth that training data is like source code and training like compiling. That may seem like a harmless, flawed analogy. But the implication is that the people who work and pay to do open source AI have actually done nothing except piracy. If they can convince judges or politicians who don’t understand the implications then this may cause a lot of damage.