
ChatGPT image generation: What’s changed and why it matters

OpenAI have built their most advanced image generator yet into GPT-4o, which ‘excels at accurately rendering text, precisely following prompts, and leveraging 4o’s inherent knowledge base and chat context’. This was of immediate interest, as we had recently wrapped up our image generation pilot. The pilot revealed promising use cases; however, a few recurring limitations emerged, most notably around text accuracy and photorealism. We’re currently finalising the report, which will dig deeper into these findings, but for now, an update from OpenAI appears to address some of these challenges.

This blog compares a selection of images I created during the pilot using ChatGPT’s DALL-E 3 model with new images created using GPT-4o. It also looks at images containing text and reflects on my experience of the new model, considering what this could mean for both education and the environment.

What’s new

OpenAI are now using an autoregressive model rather than a diffusion model. Some information on how this works can be found in Michael’s blog, a few recent LLM Trends, and we’ll shortly be publishing a blog that explores the technology in more detail. 4o image generation is available to all users, although those on paid plans benefit from faster generation and higher usage limits.

Image generation is now native to GPT-4o, which means you can refine images through natural conversation and ‘build upon images and text in chat context, ensuring consistency throughout’. Further details are available in full on OpenAI’s website.
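
If you want to experiment outside the ChatGPT interface, the sketch below shows roughly how an image can be requested through OpenAI’s Images API using the official Python SDK. It is a minimal illustration rather than a recipe: it uses the DALL-E 3 model our pilot relied on, since at the time of writing 4o-native generation was available through ChatGPT, with OpenAI indicating that API access would follow.

  # Minimal sketch: requesting an image via OpenAI's Images API (Python SDK).
  # Assumes the OPENAI_API_KEY environment variable is set.
  from openai import OpenAI

  client = OpenAI()

  result = client.images.generate(
      model="dall-e-3",  # the model our pilot used
      prompt=(
          "A high-fashion editorial image of a clothing design "
          "inspired by Victorian era mourning dress"
      ),
      size="1024x1024",
      n=1,
  )

  print(result.data[0].url)  # DALL-E 3 returns a URL to the generated image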

Side-by-side comparison

Fashion editorial images

I used ChatGPT (paid) to create images for our image generation pilot and had been pleased with the results. I explored some use cases that I thought would be useful for a textiles/fashion student, having studied the subject myself back at school. I began with the following prompt:

“Create a high-fashion editorial image of a clothing design inspired by Victorian era mourning dress. The garment is crafted from lace and velvet. The model poses with elegance, set against a minimalist backdrop with dramatic lighting that highlights texture and ambience. The setting evokes a sense of sophistication, with muted tones that allow the clothing to be the focal point. The style and quality should match a fashion magazine aesthetic, with attention to detail in every element, from makeup to lighting to fabric texture, capturing an aura of luxury and refinement.”

For reference, here are the images generated using DALL-E 3:

A high-fashion editorial image featuring a woman in a Victorian-inspired black gown. The dress is made of intricate lace and velvet, with puffed sleeves, a high ruffled collar, and flared lace cuffs. The left side shows a close-up of the model’s face and bodice; the right shows the full-length view, with the model standing in a dimly lit, elegant setting. Produced by DALL-E 3.

A high-fashion editorial image of a man wearing an elaborate Victorian-inspired black outfit. The design features rich black velvet, intricate lace detailing, puffed shoulders, and flared lace cuffs. The model is shown in both a close-up and full-body view, highlighting the ornate textures and structured tailoring of the garment. Produced by DALL-E 3.

I thought these were great. The fabric and detail were exactly what I had imagined, and the lighting and setting were captured perfectly. The focus had been on the clothing, so I wasn’t overly concerned about the people; however, you can see that some of the features (eyes, ears, hands) didn’t look quite right. The images were also quite clearly produced by generative AI, but that didn’t surprise me.

Fast forward a few months: I input the exact same prompt into GPT-4o, and this was the output:

A woman in a black Victorian-style dress poses against a dark, neutral background. The dress features lace detailing, puffed sleeves, and a high ruffled collar. She stands with eyes closed, exuding a calm and composed presence. Produced by GPT-4o.

A man in a dark, formal Victorian-style suit poses against a muted backdrop. He wears a high-collared shirt with a cravat and a fitted velvet coat, looking off to the side with a composed expression. Produced by GPT-4o.

I was genuinely impressed by the progress and thought the level of photorealism was remarkable. Many of the usual giveaways, such as facial features and hands, had been significantly improved. The poses also felt much more natural, and the overall image quality, including lighting, shadows, clothing, make-up and textures, was noticeably enhanced.

I wanted to test the tool’s contextual understanding next, so I asked two follow-up questions, each changing the setting and lighting, with the second also adding new accessories. These were my prompts:

  1. Can you now put these two examples (the man and the woman in Victorian inspired dress) into a new location? I’d like them to be walking down a street in Rome at sunset
  2. Can you now have them wearing black sunglasses, give the woman a light pink handbag, and change the setting so they are walking in Sorrento, Italy at midday?

A man and woman dressed in formal black attire inspired by Victorian era mourning dress walk down a cobblestone street at sunset in Rome, with the dome of St. Peter’s Basilica in the background. The scene is moody and elegant, with warm golden light casting long shadows. Produced by GPT-4o.

A fashionable man and woman dressed in elegant black outfits inspired by Victorian era mourning dress walk confidently down a cobblestone street lined with colourful buildings in Sorrento. Both wear dark sunglasses and the woman carries a pale pink handbag. Produced by GPT-4o.

I was so impressed with the results. I thought it had captured the setting perfectly and I really liked how the ‘sunset’ image had a much softer tone to it. The lighting at midday was spot on in the ‘Sorrento’ image, and I particularly liked the skin texture it had given the models. I hadn’t been expecting such attention to detail, but I think that was a great feature. You’ll also notice that the hands and facial features are much more lifelike.

A limitation noted during our pilot was that multiple iterations often led to image distortion. This no longer appears to be an issue, as the quality now remains consistent throughout.

Text in images

Text in AI-generated images has often been a weak point, with frequent spelling errors or completely warped letters. I provided a prompt about teaching idioms to students learning English as a foreign language:

Create an educational infographic for students learning English as a second language. It should show four common English idioms with simple definitions and illustrations. The idioms should include: ‘break the ice’, ‘under the weather’, ‘piece of cake’, and ‘let the cat out of the bag’. Use clear, readable fonts and a colourful, student-friendly design.

A colourful poster titled “English Idioms” featuring four idioms with illustrations and meanings: “Break the Ice” (to start a conversation), “Under the Weather” (sick or unwell), “Piece of Cake” (very easy), and “Let the Cat Out of the Bag” (to reveal a secret). Produced by GPT-4o.

I thought the output was fantastic. The spelling was correct, the design was indeed colourful, and it had followed my instructions perfectly.

I also wanted to see how it dealt with another language, so I created an educational poster with basic food vocabulary in both English and Italian:

Create an educational poster with basic food vocabulary in English and Italian side by side. Include ten words (e.g., “Coffee – Caffè” “Strawberry – Fragola,” “Cake – Torta,” etc.) in a grid layout with icons. Title: “Food Vocabulary – English & Italian”.

A food vocabulary chart showing English and Italian translations with icons. Items include: Coffee (Caffè), Strawberry (Fragola), Cake (Torta), Bread (Pane), Pizza (Pizza), Carrot (Carota), Apple (Mela), and Wine (Vino). Produced by GPT-4o.

It didn’t quite produce the ten words I requested, and the bottom of the image was slightly cut off. However, this is a known limitation: OpenAI have stated that GPT-4o can occasionally crop longer images, such as posters, too tightly, especially near the bottom. That said, it handled all the translations and spellings successfully, and the images it generated were both appropriate and relevant. While there are already numerous infographics available, this feature gives educators the opportunity to personalise teaching content, which is incredibly valuable.

Finally, I asked:

Now generate a POV of a person reading this diagram in their notebook, while sat at a table, with an espresso, looking over the Colosseum in Rome

A person holding an English and Italian food vocabulary guide (the same mentioned previously) while sitting at an outdoor café table with a cup of espresso. In the background, the Colosseum in Rome is visible. The text is now distorted and there are some spelling/translation mistakes (for example, ‘Fragola’ has become ‘Fradola’ and ‘Apple’ has been incorrectly translated as ‘Patata’). Produced by GPT-4o.

As you can see, this encountered a number of problems. Although the setting was correct, the notebook seemingly blends into the table and, more importantly, the model struggled with the text. The spelling is wrong, and it has given the wrong Italian word for ‘apple’. Considering this was the first time I’d encountered errors with the new model, I still found it impressive overall. The improvement in quality over just a few months is excellent and opens up exciting possibilities for image generation.

Education

This advancement in image generation has the potential to support and enhance education, and there are excellent use cases for students and staff alike. For educators, it can reduce the time spent sourcing suitable images or designing their own materials, something I often found time-consuming when I was teaching. It can also encourage greater creativity in storytelling, creative writing prompts, infographic creation, and media for marketing, to give a few examples.

However, it’s important to ensure that learners are prepared to navigate these technological developments with the right skills and knowledge, including an understanding of the ethical and practical challenges. As AI-generated images become harder to distinguish from real photographs, concerns around misuse, such as deepfakes, must be addressed.

Environmental impact

While I understand that many will be interested in other ethical issues, particularly around bias and intellectual property, I don’t have the space here to explore them in full. However, I will briefly touch on the environmental impact of advanced image generation, as this is an area I’m currently exploring in more depth.

While this technology brings exciting creative potential, it comes with a real environmental cost. Image generation, especially at high quality, relies on powerful GPUs, which consume large amounts of energy and water. GPT-4o takes longer to generate images, which uses more resources and consequently increases emissions.

Limits

During testing, I experienced image generation rate limits, which started with just a two-minute cool-down period and extended to twenty minutes. Limits are in place to ensure fair usage, keep systems stable, and manage GPU strain. This hints at the sheer scale of resources required and poses the question: as these systems continue to improve, will the energy demand required to run them continue to rise?
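
If you hit similar limits when generating images through the API, the usual pattern is simply to back off and retry. The sketch below is a minimal, illustrative example of that pattern using the official Python SDK, not a recommendation to generate more images; it again assumes the DALL-E 3 model and an OPENAI_API_KEY set in the environment.

  # Minimal sketch: retrying an Images API request with exponential backoff
  # when a rate limit is hit.
  import random
  import time

  from openai import OpenAI, RateLimitError

  client = OpenAI()

  def generate_with_backoff(prompt: str, max_attempts: int = 5) -> str:
      delay = 2.0  # seconds; doubled after each rate-limited attempt
      for attempt in range(1, max_attempts + 1):
          try:
              result = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
              return result.data[0].url
          except RateLimitError:
              if attempt == max_attempts:
                  raise
              time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronised retries
              delay *= 2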

Tech companies are moving towards renewable-powered data centres, but it’s still fair to suggest that this level of image generation, especially at scale, is highly energy-intensive. If you’re interested in reading more about the environmental impact of AI, we have a couple of blog posts available, including one on taking a responsible approach and, more recently, a round-up of the current landscape.

Final thoughts

Overall, the latest developments in image generation show a promising step towards making this technology a practical and creative tool, especially within education.

That being said, it’s important to remain aware of the limitations and risks. The case for taking a mindful approach to these tools still stands: as with everything, they need to be used in moderation, alongside an ongoing conversation about the real-world costs of the technology. This is an exciting time for developments, and it’ll be interesting to see what’s next.


Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.

For regular updates from the team, sign up to our mailing list.

Get in touch with the team directly at AI@jisc.ac.uk
