What is Autoregression-based Image Generation and how will it impact Document Fraud?

We recently saw a new methodology for image generation integrated into OpenAI’s GPT-4o, which has significantly improved the quality of images that Generative AI can create. For a more in-depth comparison, please read Catherine Barker’s blog post if you haven’t already. You may have seen the viral trend of people using this new technology to turn pictures of their pets into people, but these new capabilities, such as improved text generation within images, have raised some concerns within the cyber security space. But before we get into that, what has caused this considerable improvement in image generation, and how does it work?

How does Image Generation Work?

In short, this has to do with the utilisation of autoregression. This isn’t a new concept in Generative AI: autoregressive modelling is used predominantly to generate text. It has also been used within image generation before, but not with the level of multi-modality we are witnessing now, built into GPT-4o and readily available to ChatGPT users. Historically, the AI image generation integrated into this service utilised diffusion models.

Diffusion models are trained by taking an image and progressively adding “noise” to it until it is an unrecognisable collection of pixels. The model then learns to reverse this process, iteratively removing the noise to recover the original image. So when a user asks the Generative AI model for an image, it applies this learned de-noising process, turning a sample of pure noise into an image guided by the user prompt, with varying degrees of success.
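To make the noising half of this process more concrete, here is a minimal sketch in Python. It assumes a linear noise schedule and the commonly used closed-form shortcut for jumping to any noise level; the function names, the schedule values and the toy 8 x 8 "image" are all illustrative assumptions, not details of any particular product.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: share of the original signal variance left after step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Mix the original pixels with Gaussian noise at a given noise level."""
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]

rng = random.Random(0)
image = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # toy 8 x 8 "image", flattened
alpha_bars = cumulative_signal(linear_beta_schedule(1000))

for t in (0, 250, 500, 999):
    noisy = add_noise(image, alpha_bars[t], rng)
    print(f"step {t:4d}: share of original signal kept = {math.sqrt(alpha_bars[t]):.3f}")
```

By the final step almost none of the original image survives, which is the "pure noise" starting point that generation later works backwards from. The learned reverse process, which is the hard part, is not shown here.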

A series of six images containing a woman writing on a clear board in an office setting that depicts noise being added until reaching a distribution of pure noise.

Autoregression models work differently from diffusion models. They still require observation of a training data set but instead learn through determining the probabilistic correlation of an observed token to previous tokens in the sequence. For instance, in text generation, the sequence type could be a sentence, and a token could be a word, part of a word, or punctuation. Considering the user prompt and following a start-of-sequence (SOS) token, the model will then start predicting the word that would most likely appear next in the sentence based on the previously observed correlations from the training dataset. It does the same thing for images, but instead of words, it generates parts of an image. In previous iterations of Autoregression-based image generation this has happened from left to right, row by row, from top to bottom; the same way we read and write in English.
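The next-token idea can be sketched with a deliberately tiny example. This toy Python model only looks at the single previous token and always picks the most frequent follower, whereas real models condition on the whole preceding sequence and the prompt; the corpus, token names and helper functions are made up for illustration.

```python
from collections import Counter, defaultdict

SOS, EOS = "<sos>", "<eos>"

def train_bigram(sentences):
    """Count how often each token follows the previous one in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [SOS] + sentence.split() + [EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, max_tokens=10):
    """Greedy decoding: repeatedly append the most likely next token."""
    sequence = [SOS]
    for _ in range(max_tokens):
        followers = counts[sequence[-1]]
        if not followers:
            break
        next_token = followers.most_common(1)[0][0]
        if next_token == EOS:
            break
        sequence.append(next_token)
    return sequence[1:]  # drop the start-of-sequence marker

corpus = [
    "tokens are predicted one at a time",
    "tokens are predicted in order",
    "tokens are sampled in order",
    "tokens are predicted in sequence",
]
model = train_bigram(corpus)
print(" ".join(generate(model)))
```

For images, the same loop would run over image tokens rather than words, with each new token chosen in light of what has already been generated.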

A series of four images, screen-captured from OpenAI's GPT-4o image generation process, showing a woman writing on a whiteboard about Autoregression Models. The image appears to be unveiled from the top left, with blur being progressively removed. The four images are titled "Getting started", "Creating image. May take a moment", "Adding details" and "Image created".

Historically, it was found that this method of image generation was slow, inefficient and computationally expensive. We can reasonably infer that this has been addressed in recent iterations of autoregression-based image generation through the use of patches. Patches group individual pixels into small blocks. For example, an image made up of 256 x 256 pixels could be broken down into a 16 x 16 grid of patches, each made up of 16 x 16 pixels, reducing the sequence from 65,536 pixels to 256 patches. This likely allows the model to carry out some planning into what the image will contain in each patch and work on generating them simultaneously, improving its efficiency.
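The arithmetic in the 256 x 256 example can be checked with a short Python sketch. The `split_into_patches` helper and the dummy pixel values are assumptions for illustration; how GPT-4o actually tokenises images has not been published.

```python
def split_into_patches(image, patch_size):
    """Split a 2D list of pixels into a row-major list of square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

size, patch_size = 256, 16
image = [[(x + y) % 256 for x in range(size)] for y in range(size)]  # dummy pixel values
patches = split_into_patches(image, patch_size)

print("pixels to predict one by one:", size * size)
print("patches to predict instead:  ", len(patches))
```

Working with a few hundred patches rather than tens of thousands of individual pixels is one plausible reason the sequential approach has become practical.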

We can also logically conclude that OpenAI has unified different token types, e.g. image, audio and text, into one model architecture in GPT-4o. Essentially, the model treats images as sequences of tokens, allowing them to be processed by the same transformer layers used for text and audio. Additionally, this means that it can refer to interaction context throughout text and image generation. You may find that in your interactions with ChatGPT since this update, the model will offer to create a visual representation to aid your understanding. This also allows for better quality iterative interactions with the model, as it has a unified context for the content of the image and the prompts and generated responses within your interaction.

Diffusion VS Autoregression

Diffusion models tend to struggle with accuracy in certain areas, predominantly when generating text within an image. There are several possible reasons for this limitation: for example, text featured within images in the training dataset may be out of focus, in the background, or bear little relevance to the image’s content. Additionally, diffusion models carry out each stage of the de-noising process on every pixel in parallel, whereas autoregression models generate an image sequentially. This allows autoregression models to take into account everything generated so far when producing the next part of the image.

The autoregression methodology does result in slower, less efficient generation but, from what we can see, better accuracy. Diffusion models do not appear to carry out any semantic planning or reasoning regarding the pixels in the generated images. Again, this is likely where OpenAI’s new GPT-4o model differs. Whilst OpenAI has not released the specific details of how image generation in GPT-4o works, it is likely that the model’s unified token architecture, and the context-aware generation it enables, contributes to the improved accuracy of text in its images.

How will Autoregression-based Image Generation affect Document Fraud?

Following the release of this new methodology, we have seen that the model is alarmingly capable of creating fraudulent documents such as forms of identification. There are many authentication processes built into everyday society that involve image verification, and AI is again lowering the barrier to entry for bad actors to create malicious resources.

We decided to try our hand at creating something to test its accuracy. For this exploration, we tried to create a fraudulent receipt. On our first attempt, after several iterative prompts attempting to tweak the generated images, including providing a reference image of a real receipt, it did not create something that we believe would result in a successful fraudulent transaction. The images lacked a degree of realism, confused the content of the receipt and featured inaccurate calculations.

A series of four AI-generated receipts, all for a restaurant called "Fakey's Diner", each featuring a caveat at the bottom indicating that it is not a real receipt and is for demonstrative purposes only. The receipts do not appear to be very realistic and are unlikely to be believed to be images of genuine receipts.

However, we decided to give it one more chance and, surprisingly, it performed much better, creating the following image based on this initial prompt: “Can you please create me a realistic image of a receipt for a restaurant called Fakey’s Cafe, that shows the order for an omelette and side salad, with a cup of tea? The currency should be in pounds and show that the customer paid in cash and left a 10% tip. The receipt should have a note at the bottom that says this is not a real receipt and it was created for demonstrative purposes. Please make this look like a realistic image of a receipt taken with a phone camera”.

A fake receipt from “Fakey’s Cafe” generated using GPT-4o. A disclaimer at the bottom states, “This is not a real receipt, it was created for demonstrative purposes.” The receipt is set against a wooden tabletop background, designed to realistically mimic a printed paper format.

AI image generators attempt to ensure users are aware of the origin of an image by including watermarks or a signature in the metadata. Whilst this is a hurdle for bad actors to overcome, it is not a foolproof solution. This provenance data can be removed, or changed to imitate authenticity. Whilst a signature validator could detect an attempt at removal, it may not be able to detect how the information has been altered. Another simple workaround is taking a screenshot or photograph of the image, which results in all of the original metadata being lost. The success of such attempts to verify an image’s authenticity relies on a lot of trust: trusting the original inputter of the metadata, trusting each new signer of the metadata and trusting the validation tools assessing the data. It leaves a lot of the onus on the person checking to trust what they are being presented with but, as we well know, not everyone on the internet is trustworthy.
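As a rough illustration of what a validator can and cannot tell us, here is a simplified Python sketch. It uses a shared-secret HMAC as a stand-in for the public-key signatures that real provenance schemes generally rely on; the key, the manifest fields and the function names are all hypothetical.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-signing-key"  # hypothetical key held by the image generator

def sign_manifest(image_bytes, manifest):
    """Bind provenance claims to the image content with an HMAC tag."""
    payload = hashlib.sha256(image_bytes).hexdigest() + json.dumps(manifest, sort_keys=True)
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()

def verify(image_bytes, manifest, signature):
    """Return 'valid', 'tampered' or 'no provenance' for an image."""
    if manifest is None or signature is None:
        return "no provenance"  # e.g. a screenshot with all metadata lost
    expected = sign_manifest(image_bytes, manifest)
    return "valid" if hmac.compare_digest(expected, signature) else "tampered"

image = b"fake image bytes for the demo"
manifest = {"generator": "example-image-model", "ai_generated": True}
signature = sign_manifest(image, manifest)

print(verify(image, manifest, signature))                                # untouched
print(verify(image, {**manifest, "ai_generated": False}, signature))     # edited claim
print(verify(image, None, None))                                         # metadata stripped
```

Note that the "tampered" result only says that something no longer matches; it does not say which field was changed or what it said originally. The stripped case is worse still, because there is nothing left to check and the image simply looks like any other unsigned picture.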

Whilst it is easy to get excited by such tangible improvements to the quality of images Generative AI can create, we must remember that these capabilities also become available to those who wish to use them maliciously. Vigilance in finding ways to combat attacks of this variety is imperative as it becomes harder to spot AI-generated images at first glance. Taking a unified approach to our openness around AI usage and creating trustworthy provenance data about the origin of images will be essential if we are to discount images that lack it.

Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.

Join our AI in Education communities to stay up to date and engage with other members.

Get in touch with the team directly at AI@jisc.ac.uk
