There is, rightly, a lot of concern around AI and data security and privacy. However, it’s important we worry about the right things, and ideally for the right reasons.
In particular, we are concerned that too much consideration is being given to the relatively low risks around AI training data obtained through the use of AI tools. Not enough attention is being paid to other more substantial risks, which we’ll discuss here.
At the extreme end, we see a fear that any data put into an AI tool like ChatGPT is somehow ‘learned’ and exposed to the world. On the less dramatic side, some believe that OpenAI collects all the information used by ChatGPT to train the model, and that it will eventually become public. Neither of these is true in any meaningful way.
This is leading some institutions to consider developing their own AI tools and solutions, which, whilst potentially useful for other reasons, is neither necessary nor particularly helpful as a solution to data privacy.
To be clear, we must prevent our private data from being used for purposes other than those necessary and contractually agreed, and that includes for AI training. However, the risks most likely to impact us lie elsewhere – not with the actual AI tools. In particular we think the focus should be on:
- Users understanding that they should only ever put personal data into tools the college or university has a contract with, because of standard cyber security and data risks, rather than because of the use of data for training.
- Data security issues around AI systems exposing data that is incorrectly secured, both internally and externally – not as training data, but through use of the AI tool.
- A broad review of all systems, not just AI tools, as any system could be a source of AI training data. In fact, other systems are often more attractive sources of training data than AI tools are.
AI and Training Data
To help us focus on where the most pressing risks lie, it’s worth thinking a bit more about where AI training data comes from, and how it’s used. LLMs (Large Language Models) are language models, not knowledge models, and they need data that contributes to their handling of language: good, well-written text.
In the first phase of training, the models are trained on vast amounts of text (around 10 trillion words!) to enable them to predict the next word. If information is frequently repeated in the training set, the model may get some facts right simply through word prediction. If not, it will get things wrong (hallucinations).
In a separate phase of the training, chat models are also trained to be helpful in answering questions. This is done by giving lots of examples of good question and answer pairs, and by humans providing feedback.
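To make the ‘predict the next word’ idea concrete, here is a deliberately toy sketch in Python using simple word-pair counts. It is purely illustrative (real models use neural networks trained over trillions of words, not counters), but it shows why facts repeated many times in the training text tend to be reproduced, while a one-off fragment has essentially no influence.

```python
from collections import Counter, defaultdict

# Toy illustration of next-word prediction: count which word follows each
# word in a tiny corpus, then pick the most frequent continuation.
# Real LLMs learn these statistics with neural networks over trillions of
# words, but the underlying objective is the same: predict the next word.
corpus = (
    "paris is the capital of france . "
    "paris is the capital of france . "
    "berlin is the capital of germany ."
).split()

next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the most frequently observed continuation of `word`."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("of"))  # 'france' – the repeated fact wins the prediction
```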
Let’s think about what makes good data. In the first training phase, high-quality text is the most important training data set. It’s typically scraped from the internet, or includes books obtained by potentially unethical means. The companies do not want any personal data in this set (it would be bad for business!). They go to great lengths to review it for personal data, even though it’s all publicly available data and so shouldn’t contain any. Why do they need to do this? Because data security isn’t perfect, and private data does find its way onto the public web.
The things we type into ChatGPT and similar tools aren’t generally good initial training data. They are random chats, fragments of text, and so on.
Why do the companies ask for permission to use them, then? It’s because they are sources of conversations and of our interactions with the tool. They can be analysed to see if the tool is working well, and sources of conversation are helpful in the second training phase. However, our discussions with ChatGPT and the like are not the best sources of conversation, as one side is an AI. Forums, social media, etc. are much better sources of this data – see, for example, Mumsnet and how it was used.
But what happens if the data is used for training the model?
Just because something does somehow find its way into the training data doesn’t mean it will appear in the outputs. In reality, a single document among the 10 trillion words has no significant impact on the outputs. Is it absolutely impossible that personal information will be output? The answer is no (it might even be randomly generated correctly!).
But where does it rank in our concerns?
In reality it rarely, if ever, happens with normal use of the applications – we just don’t see lots of examples of personal data being exposed by ChatGPT. Some researchers have managed to extract raw(ish) fragments of training data via very specialised techniques, and this includes some personal information. But, and this is the important part, this information was almost certainly on the open web in the first place, and repeated multiple times. It didn’t come from users using the AI tool.
So, We Shouldn’t Worry at All?
As mentioned earlier, we shouldn’t allow private data to be used for training, even if the practical risk of it ever appearing as output is close to non-existent. It’s not what the data was collected for.
The AI companies want good quality data for training, and any system that holds good data at scale is potentially useful for AI model training, whether it’s for an LLM or some other form of AI. It’s got nothing to do with whether the system itself is an AI system. LinkedIn is a good example of this – a non-AI tool that is a good source of AI training data, at scale, and which, at one point, required users to opt out if they did not want their data used for training.
So we should review all systems equally as potential sources of AI training data, and review and check all contracts, not just ‘AI tools’. Ideally this should be done each time a company issues a new contract or set of terms and conditions.
A More Pressing AI Data Risk?
Very few Gen AI systems and tools produce a response directly from the large language model. ChatGPT did this in the early days, but now almost all tools that process or produce information do so via some sort of more traditional search. They are really good at this too! One of the main techniques is known as Retrieval Augmented Generation (RAG) – this blog post goes into more detail.
This introduces a new risk: data that was inadvertently shared, or left more open than intended, is now much easier to find. Microsoft calls this oversharing, and it’s a real issue with Microsoft 365 Copilot and other similar tools that search intranets. They will surface information which has technically always been available, but perhaps just hard to find (is anyone good at SharePoint search?). The same is potentially true of the open web. There has always been private information available that shouldn’t be, but GenAI is going to make it easier to find.
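As a purely illustrative sketch of why this happens, the Python below mimics a RAG-style tool: it retrieves whatever documents it can read, then builds a prompt around them for the model. The document store, keyword scoring and prompt format here are hypothetical stand-ins, not any vendor’s implementation, but they show that anything the retriever can access can surface in an answer – which is why permissions matter so much.

```python
# Minimal, hypothetical sketch of Retrieval Augmented Generation (RAG):
# retrieve relevant internal documents, then pass them to the model as context.
documents = {
    "holiday-policy.docx": "Staff are entitled to 30 days annual leave per year.",
    "salary-review.xlsx": "Confidential: proposed salary bands for 2025 are...",  # an overshared file
}

def retrieve(query, top_k=1):
    """Naive keyword scoring: count how many query words appear in each document."""
    scored = [
        (sum(word.lower() in text.lower() for word in query.split()), name, text)
        for name, text in documents.items()
    ]
    return sorted(scored, reverse=True)[:top_k]

def build_prompt(query):
    """Build the prompt a real tool would send to the LLM."""
    context = "\n".join(text for _, _, text in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# The overshared spreadsheet is retrieved and would be summarised in the answer,
# even though it was never part of any training data.
print(build_prompt("What are the proposed salary bands?"))
```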
So, ensuring any data that AI systems access has correct permissions is vitally important, but not because it will form part of the training data.
And What’s the Biggest Risk?
Our headline recommendation is, and always has been:
- You should never put personal or private data into any IT system for which you don’t have a robust contract in place, and this includes tools incorporating AI.
The primary reason for this is the same as for any other IT system: if it’s holding your personal data, you need it to be secure, safe from hackers, and processed in compliance with UK law, including GDPR.
For example, you need to make sure that data isn’t used for other purposes or sold on to third parties. Your existing data protection and legal review of tools will already cover this.
This is the core practical reason we say you mustn’t put personal or private data into AI tools, not because of fears around training data.
So Why Shouldn’t We Build Our Own AI Tools to Ensure Data Privacy?
We started by noting that we sometimes hear institutions advocate building their own AI tools to keep data secure.
Hopefully we’ve shown that this doesn’t solve any of the main risks. Arguably it increases some of them. Producing secure software is not simple, and producing something as secure as a commercial solution is a significant challenge.
And don’t forget, our word processors and spreadsheets, for example, handle our most sensitive data, but we don’t build our own. We manage the risks through contracts, user training, policies and technical controls.
We aren’t saying you should never build your own tools – it may be your institution’s policy to build rather than buy, or you may want a tool with a very specific feature and have done a cost-benefit analysis that shows it’s worth building.
Finally, we know open-source models will be mentioned. It’s true that running an open-source LLM on your laptop is generally secure, but it’s not the user experience most people want, and managing security updates for locally run software is always a challenge.
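For illustration only, this is roughly what running an open model locally looks like in Python, using the Hugging Face transformers library with a small, freely downloadable model (distilgpt2 is used here purely as a stand-in, and a local PyTorch install is assumed). The prompt never leaves the machine, but the model files and the libraries around them still need patching and updating, which is the maintenance burden mentioned above.

```python
# Sketch of running a small open model entirely on a local machine using the
# Hugging Face transformers library. No prompt text is sent to a remote service,
# but the model and libraries must be kept up to date locally.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")  # downloads the model once, then runs locally
result = generator("Data protection in universities is", max_new_tokens=30)
print(result[0]["generated_text"])
```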
Summary
We’ve talked about the various risks around data and AI systems. We advocate taking a broad view, and not just focusing on training data from AI tools. Our main recommendations are:
- Never put personal or private data into any IT system for which you don’t have a robust contract in place, and this includes tools incorporating AI.
- Be mindful of how data in all your systems, not just AI tools, might be used for AI training, but also understand the limited impact of single instances of data in large language models.
- Focus on overall data security – the risks of incorrectly secured data being surfaced by AI are more of a real-world issue than the more theoretical risks around training data.
Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.
For regular updates from the team sign up to our mailing list.
Get in touch with the team directly at AI@jisc.ac.uk