I’ve been writing a blog post about AI sovereignty this week, and as I did I felt I was likely to get responses that ‘open source is the answer’, so I thought it would be useful to delve into this as well.
The particular issue I want to explore is that some of the concepts behind open-source software don’t translate directly into AI, and, I think, are leading to some misunderstandings of the benefits and limitations of ‘open-source’ AI. You’ll sometimes see ‘open-weight’ AI used as an alternative name to ‘open-source’ AI, which is perhaps more accurate (we’ll see why later), but ‘open-source AI’ is becoming commonplace without a good, shared understanding of what it actually means.
There’s a lot of work going on to try to solve this and to define what the term open-source AI should be used for (see work by OSI). However, there’s not necessarily full agreement (see the MIT Tech Review article ‘The tech industry can’t agree on what open-source AI means. That’s a problem’). In my view (and that of others: see this TechCrunch article), it’s never going to be directly equivalent to open-source software, giving the same benefits and freedoms. That’s not to say it’s not useful, but it’s important in many cases to understand the difference.
Why are open-source AI and open-source software different?
The basic reason for this is that open-source software is made up of lines of source code (hence the name!) that humans can understand, and an AI model isn’t.
Furthermore, as there is no commonly used definition of open-source AI, many providers of ‘open-source AI’ don’t do everything they could to make their AI model usefully open in most senses – something sometimes termed ‘open-source washing’.
First, let’s have a closer look at open-source software. The internet and many of the tools you use every day are built on open-source code. Even if an application itself is proprietary, it almost certainly uses open-source code libraries and components. If you want to see the extent of this, here, as an example, is the list of components used in Microsoft Word, most of which are open-source. The Open Source Initiative gives a full definition of open-source software; I’ll just pull out and paraphrase the bits I’m most interested in for this discussion:
- Free Redistribution – there are no restrictions on giving the software away for free.
- Source code – this must be freely available and understandable.
- Derivative works – you must be able to modify and create your own versions and be able to distribute these under the same license.
End users probably focus on the first point – that the software can be used for free. But this only works if development of the software thrives, and that is enabled by the other two points: any developer can take the code, improve it and, if they don’t like the direction of travel, create their own version of the software.
It’s the last two points that I think don’t translate to AI. First, though, let’s explore why they are important.
Two of the big benefits of open-source software are as follows:
- Anyone can read the code of the software and make sure it’s fit for their purpose, that it’s not doing anything they don’t want it to, and that it meets their needs, for example around security and privacy. This is one of the reasons open-source software is trusted to power the bulk of the internet.
- No single company can control open-source software. A company can exert a lot of influence, for example by heavily funding the development, but in the end, if the community doesn’t like the direction of travel it can simply fork the code and form its own community. This happened, for example, with the database MySQL when it was acquired by Oracle, leading to the MariaDB fork.
So why exactly is AI different? Isn’t it just code too?
Traditional computer software, at its basic level, is lines of computer code. There might be a lot of them, and the software might require other applications such as databases to run, but in the end it’s just lines of code designed to be understandable by humans.
Here’s some code. Even if you aren’t a developer you can probably read it and get the gist of what’s happening (adding two numbers together and displaying the result).
```python
# Doing a simple calculation
a = 5
b = 3
sum = a + b
print("The sum of", a, "and", b, "is", sum)
```
Obviously, things get a lot more complex, but the idea remains the same. To give an idea of scale, Moodle contains about 800,000 lines of code.
Now let’s look at a generative AI model. Again, we’ll simplify.
There are two stages:
1) The training stage – which creates the models.
2) The inference stage – where we use the model.
Training Stage:
The training stage takes the training data, encodes it into a format useful for training, and then runs a training process to create the model, which is used in the inference stage. So the main ingredients are the training source code and the training data, and the output is the AI model, which is essentially a big (very big!) set of numbers representing the inner workings of the model (weights, biases and other parameters).
The training stage of a large AI model typically costs a few million pounds in compute, and takes many, many processors and hours to complete.
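To make that concrete, here’s a deliberately tiny sketch of what ‘training’ means, nothing like a real large-model training run: we repeatedly nudge two weights until a toy model fits some example data. Everything in it (the data, the learning rate, the number of steps) is invented purely for illustration.

```python
# A toy 'training stage': learn weights w and b so that y is roughly w * x + b.
# Real generative AI training adjusts billions of weights, not two, but the
# principle (repeatedly nudge numbers to reduce error) is the same.

data = [(1, 3), (2, 5), (3, 7), (4, 9)]  # made-up (input, output) training data
w, b = 0.0, 0.0                          # the 'model' is just these two numbers
learning_rate = 0.01

for step in range(5000):
    for x, y in data:
        error = (w * x + b) - y
        # Gradient descent: adjust each weight slightly to reduce the error
        w -= learning_rate * error * x
        b -= learning_rate * error

print(f"Trained model: w = {w:.2f}, b = {b:.2f}")  # ends up close to w = 2, b = 1
```

The output of the training stage is nothing more than the final values of w and b. Scale that up to billions of numbers, and a vastly more complicated nudging process, and you have a generative AI model.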
Inference Stage:
At the inference stage, we use the model to create responses to inputs. When people talk about running a GenAI model, it’s almost always the inference stage.
The inference stage can often be run on a decent desktop or laptop computer, at least for a small number of users.
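As an illustration of the inference stage, here’s a rough sketch of running an open-weight model on your own machine. It assumes you’ve installed the llama-cpp-python library and downloaded a model file in the GGUF format (more on that format below); the file name is a placeholder.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a downloaded open-weight model file (placeholder file name)
llm = Llama(model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf")

# The inference stage: the model turns a prompt into a response
output = llm("Explain open-source software in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```

Nothing leaves your machine here, which is part of the appeal of running open-weight models yourself.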
Exploring the model in more detail
We’ve probably all seen pictures of the neural nets that look something like this:
The model is essentially a list of weights – numbers representing the strength of the connections between the ‘neurons’ (the circles) – together with other parameters such as biases (in the mathematical rather than the societal sense). A generative AI model will have several billion of these: for example, the open-source Llama 3.1 models have between 8 billion and 405 billion parameters. ChatGPT’s numbers are secret, but are estimated to be around a trillion parameters. There is a standard file format for these weights (GGUF – GPT-Generated Unified Format), which allows models to be easily shared. A typical file for a large language model is a few gigabytes in size: big, but it will fit on a standard computer.
These weights and parameters are, in many ways, the equivalent of the lines of code in traditional computer software – they are the important information that makes the AI work. Unlike computer code, though, these weights cannot be understood by any human. That’s not just because of the sheer number of them, but because they don’t have any humanly understandable meaning beyond being a number.
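To see what ‘just numbers’ means in practice, here’s a small sketch. It uses a randomly generated matrix as a stand-in for one layer of a real model (loading actual Llama weights would mean a multi-gigabyte download), plus some back-of-envelope arithmetic on file size; the 8 billion figure is the Llama 3.1 example above, and two bytes per weight assumes 16-bit precision before any compression.

```python
import numpy as np

# A stand-in for one tiny 'layer' of weights. Real models have many layers,
# each vastly bigger; the random values here are purely illustrative.
layer_weights = np.random.randn(4, 4).astype(np.float16)
print(layer_weights)
# Prints a grid of numbers like 0.1377 or -1.208: nothing a human can read
# meaning into, unlike a line of source code.

# Back-of-envelope size of an 8-billion-parameter model at 2 bytes per weight
parameters = 8_000_000_000
bytes_per_weight = 2
print(f"Roughly {parameters * bytes_per_weight / 1e9:.0f} GB of weights")  # ~16 GB
```

Quantised GGUF files compress this considerably, which is how an 8-billion-parameter model can end up at only a few gigabytes on disk.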
All of this means that open AI models can’t be modified in the same way as open-source software. You can’t simply change the code to change the behaviour.
You can ‘fine-tune’ the model (essentially training it further on new data), but this isn’t really a point of difference between open-source and closed-source AI models. For example, OpenAI allow you to fine-tune their models even though they are closed. There isn’t really an equivalent concept to this in the traditional open-source world.
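To illustrate that fine-tuning isn’t something unique to open models, here’s a rough sketch of starting a fine-tuning job against a closed OpenAI model using their Python SDK. The file name and model name are placeholders, and it assumes you have an API key and a JSONL file of example conversations in the format OpenAI expects.

```python
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

# Upload a (placeholder) JSONL file of example prompts and ideal responses
training_file = client.files.create(
    file=open("my_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Ask OpenAI to fine-tune one of its closed models on those examples.
# You never see or edit the underlying weights; the provider keeps control.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder model name
)
print(job.id, job.status)
```

Fine-tuning an open-weight model on your own hardware looks different mechanically, but conceptually it’s the same move: you nudge the existing weights with new examples rather than editing any ‘source’.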
The Importance of Training Data
It’s also important to understand that almost all ‘open-source’ generative AI models don’t release perhaps the most important component – the training data. There isn’t really an equivalent to this in open-source software, but it would be a bit like releasing software that was impossible to recreate. You might argue this is irrelevant as you don’t have the computing power or money to run the training stage yourself, but having access to the training data helps you understand the workings and outputs of the model.
So let’s recap:
Earlier on, we noted that open-source software includes the following characteristics: free redistribution, available source code, and freedom to create derivative works. Let’s now see how these translate from open-source software to open-source AI.
Free Redistribution – there are no restrictions on giving the software away for free.
- This concept works for both traditional open-source software and AI.
Source code – this must be freely available and understandable.
- This concept does not, in my opinion, work for open-source AI. The equivalent of the source code – the model weights and other parameters – is not understandable.
Derivative works – you must be able to modify and create your own versions and be able to distribute these under the same license.
- This, I think, probably does translate to a degree, in that you can potentially fine-tune the model.
So from an education perspective, you might ask, why does this matter?
If open-source AI models are ‘free’ for us to use and run ourselves, why does this matter? After all, running them ourselves does solve some problems, like concerns about data security and age restrictions on the use of software.
I mentioned at the beginning that I was writing a separate blog post on AI sovereignty, and that’s what prompted this one. The main thing for me is that with open-source AI, at the moment, control still fundamentally sits with the big company creating the model. They might be sharing the trained model, but they aren’t sharing the training data or the means to train the model; they’re only sharing the means to run it, and tune it slightly.
OSI are working on a definition of open-source AI. It’s a thorough and well-thought-through definition, and it perhaps highlights just how different open-source software and open-source AI are. I’m perhaps a bit disappointed that they suggest a model can be classed as open-source AI even if the training data isn’t released.
So to cut to the chase:
Open-source Gen AI is like open-source software because:
- You can use it without paying a fee apart from compute costs.
- You can run it on your own equipment and be in full control of what happens with the data.
Open-source Gen AI is different to open-source software because:
- Control of the product remains with the technology provider, as they control the training stage.
- You cannot inspect all the code and understand its workings in the same way.
- You can’t fundamentally modify it to suit your needs, only fine-tune it.
- There is no agreed definition of what constitutes open-source AI, leading to huge differences between approaches and what is actually released.
On the last point, if you are interested in exploring more, Andreas Liesenfeld and Mark Dingemanse maintain a GitHub repository keeping track of the openness of various ‘open-source’ models. You might be surprised how ‘unopen’ some heavily touted open-source AI models from large companies actually are. Meta’s Llama, for example, ranks very close to the bottom.
From an education perspective, that doesn’t mean you shouldn’t use open-source AI, just that I think it’s important to understand the differences between open-source AI and open-source software, and to make sure any use of open-source AI is well informed, particularly around the openness (or not) of any model you might be looking to use.
Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.
For regular updates from the team sign up to our mailing list.
Get in touch with the team directly at AI@jisc.ac.uk