OpenAI have built their most advanced image generator yet into GPT-4o, which ‘excels at accurately rendering text, precisely following prompts, and leveraging 4o’s inherent knowledge base and chat context’. This was of immediate interest, as we recently wrapped up our image generation pilot. The pilot revealed promising use cases; however, a few recurring limitations emerged, most notably around text accuracy and photorealism. We’re currently finalising the report, which will dig deeper into these findings, but for now, an update from OpenAI appears to address some of these challenges.
This blog compares a selection of images I created during the pilot using ChatGPT’s DALL·E 3 model with new images created with GPT-4o. It also covers images containing text, reflects on my experience of using the new model, and considers what this could mean for both education and the environment.
What’s new
OpenAI are now using an autoregressive model rather than a diffusion model. Some information on how this works can be found in Michael’s blog, ‘A few recent LLM trends’. We’ll also be publishing a blog that explores the technology in more detail shortly. 4o image generation is available to all users, although paid subscribers benefit from faster generation and higher usage limits.
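To give a rough intuition for the difference, here is a deliberately toy Python sketch (it has nothing to do with OpenAI’s actual implementation): an autoregressive model commits to an image one token at a time, with each choice conditioned on everything produced so far, whereas a diffusion model starts from noise and refines the whole image over several passes. The ‘model’ below is just random choice; only the control flow is meant to be illustrative.

```python
import random

random.seed(0)
VOCAB = list("░▒▓█")  # a tiny 'codebook' of image tokens
W, H = 8, 4           # toy image: 4 rows of 8 tokens

def autoregressive_generate():
    """Build the image one token at a time, left to right;
    a real model conditions each choice on `tokens` so far."""
    tokens = []
    for _ in range(W * H):
        # Here we just pick randomly in place of a learned model.
        tokens.append(random.choice(VOCAB))
    return tokens

def diffusion_generate(steps=4):
    """Start from pure noise and revise the WHOLE image at
    every step, rather than committing token by token."""
    image = [random.choice(VOCAB) for _ in range(W * H)]
    for _ in range(steps):
        # A real model would denoise every position a little each pass.
        image = [random.choice(VOCAB) for _ in image]
    return image

def show(tokens):
    for row in range(H):
        print("".join(tokens[row * W:(row + 1) * W]))

print("Autoregressive (token by token):")
show(autoregressive_generate())
print("\nDiffusion (whole image, iteratively refined):")
show(diffusion_generate())
```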
Image generation is now native to GPT-4o, which means you can refine images through natural conversation and ‘build upon images and text in chat context, ensuring consistency throughout’. Further details are available in full on OpenAI’s website.
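For anyone who would rather experiment programmatically than through ChatGPT, OpenAI also exposes image generation through its Images API. Below is a minimal sketch, assuming the official openai Python package and an API key in the OPENAI_API_KEY environment variable; gpt-image-1 is, at the time of writing, the API-facing model for this style of image generation, though names and defaults may change. Note that a one-off call like this won’t carry the chat context described above.

```python
import base64

from openai import OpenAI  # pip install openai

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

response = client.images.generate(
    model="gpt-image-1",  # assumed API-side model name
    prompt=(
        "Create a high-fashion editorial image of a clothing design "
        "inspired by Victorian era mourning dress."
    ),
    size="1024x1024",
)

# The response carries base64-encoded image data.
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("editorial.png", "wb") as f:
    f.write(image_bytes)
```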
Side-by-side comparison
Fashion editorial images
I used ChatGPT (paid) to create images for our image generation pilot and had been pleased with the results. I explored some use cases that I thought would be useful for a textiles/fashion student, having studied the subject back at school. I began by creating the following:
“Create a high-fashion editorial image of a clothing design inspired by Victorian era mourning dress. The garment is crafted from lace and velvet. The model poses with elegance, set against a minimalist backdrop with dramatic lighting that highlights texture and ambience. The setting evokes a sense of sophistication, with muted tones that allow the clothing to be the focal point. The style and quality should match a fashion magazine aesthetic, with attention to detail in every element, from makeup to lighting to fabric texture, capturing an aura of luxury and refinement.”
For reference, here are the images generated using DALL·E 3:
I thought these were great. The fabric and detail were exactly what I had imagined, and I thought it captured the lighting and setting perfectly. The focus had been on the clothing, so I wasn’t overly concerned about the people; however, you can see that some of the features (eyes, ears, hands) didn’t look quite right. I also thought it was quite clearly produced by generative AI, but I wasn’t surprised by this.
Fast forward a few months: I input the exact same prompt into GPT-4o, and this was the output:
I was genuinely impressed by the progress and thought the level of photorealism was remarkable. Many of the usual giveaways, such as facial features and hands, have been significantly improved. The poses also felt much more natural, and the overall image quality (including lighting, shadows, clothing, make-up and textures) was noticeably enhanced.
Next, I wanted to test the tool’s contextual understanding, so I asked two separate follow-up questions, each changing the setting and lighting and even adding new accessories. These were my prompts (a sketch of how similar refinement might look via the API follows them):
- Can you now put these two examples (the man and the woman in Victorian inspired dress) into a new location? I’d like them to be walking down a street in Rome at sunset
- Can you now have them wearing black sunglasses, give the woman a light pink handbag, and change the setting so they are walking in Sorrento, Italy at midday?
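In ChatGPT this kind of refinement happens conversationally, with the chat context carrying the earlier images forward. For developers, the nearest equivalent is the Images API’s edit endpoint. A minimal, hypothetical sketch follows, reusing the editorial.png file saved in the earlier example; as before, gpt-image-1 and the exact response shape are assumptions that may change.

```python
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Start from a previously generated image and describe the
# desired change in plain language, as you would in chat.
with open("editorial.png", "rb") as source:
    response = client.images.edit(
        model="gpt-image-1",  # assumed API-side model name
        image=source,
        prompt=(
            "Put the models on a street in Rome at sunset, "
            "keeping the clothing and poses consistent."
        ),
    )

with open("editorial_rome.png", "wb") as f:
    f.write(base64.b64decode(response.data[0].b64_json))
```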
I was so impressed with the results. I thought it had captured the settings perfectly, and I really liked how the ‘sunset’ image had a much softer tone. The midday lighting was spot on in the ‘Sorrento’ image, and I particularly liked the skin texture it had given the models. I hadn’t been expecting such attention to detail, but I thought it was a great feature. You’ll also notice that the hands and facial features are much more lifelike.
A limitation noted during our pilot was that multiple iterations often led to image distortion. This no longer appears to be an issue, as the quality now remains consistent throughout.
Text in images
Text in AI-generated images has often been problematic, with frequent spelling errors or completely warped letters. I provided my prompt, which focused on teaching idioms to students learning English as a foreign language:
Create an educational infographic for students learning English as a second language. It should show four common English idioms with simple definitions and illustrations. The idioms should include: ‘break the ice’, ‘under the weather’, ‘piece of cake’, and ‘let the cat out of the bag’. Use clear, readable fonts and a colourful, student-friendly design.
I thought the output was fantastic. The spelling was correct, the design was indeed colourful, and it had followed my instructions perfectly.
I also wanted to see how it dealt with another language, so I asked it to create an educational poster with basic food vocabulary in both English and Italian:
Create an educational poster with basic food vocabulary in English and Italian side by side. Include ten words (e.g., “Coffee – Caffè” “Strawberry – Fragola,” “Cake – Torta,” etc.) in a grid layout with icons. Title: “Food Vocabulary – English & Italian”.
It didn’t quite give me the ten words I requested, and the bottom of the image was slightly cut off. However, this is a known limitation: OpenAI has stated that GPT‑4o can occasionally crop longer images, like posters, too tightly, especially near the bottom. That being said, it successfully handled all the translations and spellings, and it generated images that were both appropriate and relevant. While there are already numerous infographics available, this feature gives educators the opportunity to personalise teaching content, which makes it an incredibly valuable tool.
Finally, I asked:
Now generate a POV of a person reading this diagram in their notebook, while sat at a table, with an espresso, looking over the Colosseum in Rome
As you can see, this encountered a number of problems. Although the setting was correct, the notebook seemingly blends into the table and, more importantly, it struggled with the text. The spelling is wrong, and it has given the wrong Italian word for ‘apple’. Given that this was the first time I’d encountered errors with the new model, I still found it impressive overall. The improvement in quality over just a few months is excellent and opens up exciting possibilities for image generation.
Education
This advancement in image generation has the potential to support and enhance education, and there are excellent use cases for students and staff alike. For educators, it can reduce the time spent sourcing suitable images or designing their own materials, something I often found time-consuming when I was teaching. It can also encourage greater creativity in storytelling, creative writing prompts, infographic creation and media for marketing, to give a few examples.
However, it’s important to ensure that learners are prepared to navigate these technological developments with the right skills and knowledge, including an understanding of the ethical and practical challenges. As AI-generated images become harder to distinguish from real photographs, concerns around misuse, such as deepfakes, must be addressed.
Environmental impact
While I understand that many will be interested in other ethical issues, particularly around bias and intellectual property, I don’t have the space here to explore them in full. However, I will briefly touch on the environmental impact of advanced image generation, as this is an area I’m currently exploring in more depth.
While this technology brings exciting creative potential, it comes with a real environmental cost. Image generation (especially high-quality generation) relies on powerful GPUs, which consume a large amount of energy and water. GPT-4o takes longer to generate images, which consumes more resources and consequently increases emissions.
Limits
During testing, I experienced image generation rate limits, starting with just a two-minute cool-down period and extending to twenty minutes. Limits are in place to ensure fair usage, keep systems stable and manage GPU strain. This hints at the sheer scale of resources required and raises the question: as these systems continue to improve, will the energy demand required to run them continue to rise?
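For developers who hit the equivalent limits through the API, the standard pattern is to back off and retry rather than hammer the endpoint. Here is a minimal sketch, again assuming the official openai Python package and the hypothetical gpt-image-1 setup from the earlier examples:

```python
import time

from openai import OpenAI, RateLimitError  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_with_backoff(prompt, max_retries=5):
    """Retry image generation with exponential backoff when
    the API signals that a rate limit has been hit."""
    delay = 2.0  # seconds; doubles on each retry
    for attempt in range(max_retries):
        try:
            return client.images.generate(
                model="gpt-image-1",  # assumed API-side model name
                prompt=prompt,
                size="1024x1024",
            )
        except RateLimitError:
            print(f"Rate limited; waiting {delay:.0f}s before retrying...")
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("Gave up after repeated rate limits")
```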
Tech companies are moving towards renewable-powered data centres, but it’s still fair to say that this level of image generation, especially at scale, is highly energy-intensive. If you’re interested in reading more about the environmental impact of AI, we have a couple of blog posts available, including taking a responsible approach and, more recently, a round-up of the current landscape.
Final thoughts
Overall, the latest developments in image generation show a promising step towards making this technology a practical and creative tool, especially within education.
That being said, it’s important to remain aware of the limitations and risks. The case for taking a mindful approach to the tool still stands: as with everything, it needs to be used in moderation, alongside an ongoing conversation about the real-world costs of these technologies. This is an exciting time for developments, and it’ll be interesting to see what’s next.
Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.
For regular updates from the team, sign up to our mailing list.
Get in touch with the team directly at AI@jisc.ac.uk