How to Choose the Right Model: LLMs in Education Explained

Introduction

With new models launching almost weekly, it is increasingly difficult to choose the right Large Language Model (LLM) for a given task. As model providers push to improve the performance of their LLMs, we are seeing greater divergence between the types of tasks models perform well at. Most people who use AI likely open their chosen AI provider and interact with the default model, consequently not always receiving the highest quality generated response. LLMs often have names that bear little correlation to the type of task they excel at, and so this article aims to provide clarity on which models to use, when, and why, given the currently available options. We will clarify the technological differences between models and the current options available to measure model performance, helping to inform future decisions as the AI model landscape continues to evolve. The information included in this blog post is accurate at the time of publishing but is subject to change. You can find an overview on model performance at the end of this post.

The Release of GPT-5

We recently saw the release of GPT-5 and the retirement of OpenAI’s legacy models as part of the rollout, resulting in some of the information relating to model performance within this blog becoming instantly outdated. Whilst waiting for the dust to settle, OpenAI have, following user complaints, temporarily reinstated access to some of the models mentioned in this blog post. As access to previous models is currently still available, we will still include them in this blog, but will add GPT-5 and GPT-5 Thinking as options throughout. We have released a more in-depth blog about GPT-5, how it works and users’ initial impressions, but for the context of this blog here is a brief explanation.

Benchmarks tend to be automated processes tested against the Application Programming Interface (API) versions of LLMs. For most other providers, this is a similar version of a model you would likely interact with within the AI provider’s chat user interface. This differs slightly with GPT-5, in that the version hosted on ChatGPT includes model switching technology. Without OpenAI releasing the specific details of how this works, we can only assume that it automatically assigns reasoning power to most appropriately respond to your request. This will include switching between multiple models collected under GPT-5 and GPT-5 Thinking titles, based on required reasoning capabilities. GPT-5’s performance will vary, based on your prompt, the model and the amount of reasoning allocated to your request. In most instances, GPT-5 benchmarking data utilise the API version of GPT-5 with reasoning capabilities categorised as minimal, low, medium or high.

Model Overview and Cost

Several AI providers offer a range of LLMs through their platforms at varying costs to users. We understand that people are likely tied to one provider and will therefore cover which models are best for different tasks from three prominent AI providers: OpenAI, Google and Anthropic. For those who have the flexibility to choose between providers, here is some information that may be worth considering:

For each of the providers we discuss in this blog post, there are three main levels of plan on offer. The first is the free plan, which normally includes access to one reliable model for daily tasks and limited access to more specialised, high-performance models. This level requires you to sign up for an account so that any usage limits can be tracked. It will also typically provide the lowest level of security and privacy for the information you input. Your data will often be used to train future models or fine-tune the performance of existing models, so it may help to treat any interactions had with free-level accounts as being in the public domain.

The next step up are the lowest cost paid plans, OpenAI’s ChatGPT Plus, Google’s AI Pro and Anthropic’s Claude Pro, which will provide increased access to more specialised models. This is normally a similar level of access to education, team and enterprise accounts and is likely sufficient for most educational use cases. These plans normally cost between £13-£20 per user, per month, varying based on monthly or annual payment schedules.

Finally, each of the providers offers high-level accounts, OpenAI’s ChatGPT Pro, Google’s AI Ultra and Anthropic’s Claude Max, which include the most model usage and access to new and cutting-edge AI capabilities. These plans can cost anywhere from £90-£250 per user per month and will likely go beyond the average user’s requirements within education, and so we haven’t included further details about those plans within this article.

Below, we have created a table that roughly summarises each AI provider’s offering into general categories: daily tasks like summarising information and document planning, creative writing tasks, complex problem-solving like programming, research, and finally, multi-modal queries like image generation. Models with usage limits within free tier plans are indicated with a star (*), and those that are only included within a paid plan are indicated with a pound sign (£). All models included with limited usage in the free mode are available with increased usage limits within paid plans.

Overview of recommended activities of available Large Language Models (LLMs) from main AI providers OpenAI, Google and Anthropic and their availability on free or paid plans. This information is accurate at the time of writing this blog post 05/08/2025 but is constantly subject to change. Updated 15/08/2025 following the release of GPT-5.
AI Provider	Daily Tasks	Creativity, Writing, Emotional Intelligence	Problem Solving	Research	Multi-modality
OpenAI	GPT-5 (*) GPT-4o (£)	GPT-5 (*) GPT-5 Thinking (£) o3 (£)	GPT-5 (*) o4-mini (£)	Deep Research (*)	GPT-5 (*) GPT-4o (£)
Google	Gemini 2.5 Flash (*)	Gemini 2.5 Pro (*)	Gemini 2.5 Pro (*)	Deep Research (*) NotebookLM (*)	Gemini 2.5 Flash Gemini 2.5 Pro
Anthropic	Claude Sonnet 4 (*)	Claude Opus 4 (£) (Creative / Longform Writing) Claude Sonnet 4 (*) (Emotional Intelligence)	Claude Opus 4 (£) (Thinking)	Claude Research (£)	All Models: Can process but not generate images. Cannot process video or audio.

Which Model should I use and why?

We know that most users will commit to one of the above providers and therefore utilise the models within their chosen subscription. However, hypothetically, let’s say you have a paid account for all the above AI providers. Which tool should you use for which task?

Benchmarks

In some areas, we will utilise AI benchmark leaderboards to compare the models against one another. AI benchmarks are standardised tests that allow us to assess the performance of LLMs within different domains, such as multitask performance, maths, and agentic reasoning. We are aware that benchmarks are by no means a foolproof way of judging the performance of LLMs, but assessing the quality of LLMs and their associated risks is one of the most challenging areas in the development of AI.

Some of the limitations of AI benchmarking include over-promising capabilities, the ability to be gamed, measuring the wrong thing and not being well suited to practical, real-world applications. AI benchmarks are also generally judged by other LLMs, which have their own built-in perception of quality and biases when measuring the performance of other models. The benchmark creators often attempt to control for these biases, but in some instances, this is not possible. Whilst better methods of evaluating LLMs are in development, we have used some benchmark leaderboard data, which we ask that you view with some scepticism. Some benchmarks are more trustworthy than others, so we have aimed to use those deemed more reputable, but a large quantity of benchmarking data is self-declared by AI providers who have an economic interest in their models performing better than others. Benchmarks can be used to guide your model choice, but should not be used to dictate it.

Context Windows

A characteristic that will likely play a role within the decision-making process for many tasks is the context size or token limit associated with an LLM. A token is a bite-sized chunk of information that is processed by an LLM. This may take the form of a word, part of a word, a punctuation mark or, in multimodal models, a small patch of an image. So, a user prompt or AI-generated response will consist of a collection of tokens. The context window of an LLM is the number of tokens a model will consider in its response: essentially, the quantity of information it will “remember” in your current interaction.

From a user perspective, context window size may be important if you wish to process a large quantity of information within one interaction, for example, taking a first pass at grading a cohort’s submitted assessments while also providing the corresponding rubrics and mark schemes to the LLM. For some LLMs the context window size or token limit will likely not be sufficient to carry out certain tasks. Below are the current context window values for our three providers’ models when accessing them via the web-based chat interface. In this table and others within this article, the highest performing models in each area will be highlighted in bold and followed by a * symbol.

Context window size per Large Language Model (LLM). This information is accurate at the time of writing this blog post 05/08/2025 but is subject to change. Updated 15/08/2025 following the release of GPT-5.
AI Provider	Model	Context Window
OpenAI	GPT-4o	128,000
OpenAI	o3	200,000
OpenAI	o4-mini	200,000
OpenAI	o4-mini-high	200,000
OpenAI	GPT-5 (main)	256,000
OpenAI	GPT-5 Thinking	196,000
Google	Gemini-2.5-Flash*	1,000,000
Google	Gemini-2.5-Pro*	1,000,000
Anthropic	Claude Sonnet 4	200,000
Anthropic	Claude Opus 4	200,000
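
As a rough illustration of how the context window values above might inform a task such as a first pass at marking, the sketch below estimates whether a batch of documents would fit within each model’s window. The four-characters-per-token ratio is a common rule of thumb for English text rather than a figure from any provider’s tokeniser, and the function names, the reply allowance and the example workload are all assumptions made for this illustration.

```python
# Context window sizes (in tokens) copied from the table above.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "o3": 200_000,
    "o4-mini": 200_000,
    "o4-mini-high": 200_000,
    "GPT-5 (main)": 256_000,
    "GPT-5 Thinking": 196_000,
    "Gemini 2.5 Flash": 1_000_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Claude Sonnet 4": 200_000,
    "Claude Opus 4": 200_000,
}

# Assumption: roughly 4 characters per token for English text.
# Real tokenisers differ between providers, so treat this as an estimate only.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for a piece of text (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def models_that_fit(documents: list[str], reply_allowance: int = 4_000) -> list[str]:
    """List the models whose window holds every document plus room for a reply."""
    needed = sum(estimate_tokens(doc) for doc in documents) + reply_allowance
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


if __name__ == "__main__":
    # Hypothetical workload: 30 essays of about 20,000 characters plus a rubric.
    essays = ["x" * 20_000 for _ in range(30)]
    rubric = "y" * 8_000
    print(models_that_fit(essays + [rubric]))
```

With this hypothetical workload the estimate comes to about 156,000 tokens including the reply allowance, so every model in the table except GPT-4o would be listed. For anything close to a limit, a provider’s own token counter would be a safer guide than this heuristic.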

Reasoning Capabilities

There is another large factor that determines LLM usage, and that is the presence of reasoning or chain of thought functionality. This was a feature that emerged in September 2024, which gives an LLM the ability to “think” more deeply about a problem. You would look to use a reasoning model for any tasks that require complex decision making, large and multi-factorial problems, exploration of novel concepts and tasks requiring deductive or inductive reasoning, like solving complex maths problems or restructuring programmatic code bases. Reasoning models often take longer to return answers and are more computationally expensive, so they should only be utilised for the correct use cases.

Each of the providers discussed offers access to reasoning capabilities within their paid plans, and some with limited use within their free plans. Until the release of GPT-5, OpenAI utilised standalone reasoning models like o3, o3-mini and o3-pro. Interactions with GPT-5 within ChatGPT will result in queries overflowing to a GPT-5 Thinking model when more reasoning is required. In contrast, Anthropic’s Claude Sonnet 4 and Opus 4 have opted for a hybrid approach, allowing you to toggle “extended thinking” on and off during your usage. Google has implemented reasoning into the fabric of their Gemini 2.5 base models, so no intervention is required by the user to activate a reasoning mode or select a reasoning model; it will automatically assign more “thinking” power to a request when required.

Honourable Mentions

Google offers an additional tool called NotebookLM, and whilst not exactly an LLM right at its source, it does come included within Google’s Pro plan and offers some free usage. This tool utilises the latest Gemini models and, similarly to the deep research tools, allows you to select from the internet or upload 10 sources, which it will then use to create a chat interface for you to query the information. It can then take this information and generate an audio overview in the style of a podcast narrated by two people, which you can ask questions of in real time. It also offers the ability to create a mind map or study guide of the information in the sources to help boost understanding around a topic. Any of these materials can then be shared publicly. This application offers several use cases across the areas mentioned below, from student support to content creation to administrative efficiency, so we thought it was worth an honourable mention.

Another feature with useful applications across most educational use cases is ChatGPT’s Memory feature. This allows ChatGPT to reference and utilise past conversations to gain a better idea of the types of tasks you carry out most frequently and, therefore, how to create more relevant responses for you. This removes the need to continually explain context, for example, your course, what modules you are studying or teaching and any current assessments. ChatGPT might also become aware of your preferred working style and tools and make personalised recommendations to boost performance accordingly. If you have memory switched on, you can view “Saved Memories” within the settings of your ChatGPT account and remove any or all of them.

Application within Education

Below are some specific use cases of how these models might be used within education, to try and offer further distinction between these models and their capabilities. It is worth noting that the rest of this blog post only discusses the use of LLMs at their source, not those that are integrated into products and services, like those being evaluated in our AI in Assessment pilot.

Assessment and Feedback

We are currently exploring the best way to use AI to help with marking and feedback in our pilots, to explore both accuracy and the right way to use them to meet the needs of both staff and students. LLMs should not be used for autonomous assessment grading and feedback but can assist educators under supervision. When looking for LLMs that will excel at assisting in grading and providing feedback for students, we want to ensure it has a good grasp of natural language to perform accurate judgment. We also require competency in the process of teaching, allowing models to more suitably assist educators in carrying out assessment and feedback tasks.

This prompted the inclusion of the two following AI benchmarks. The first is an extension of the Massive Multitask Language Understanding (MMLU) benchmark. MMLU measures how well an LLM understands the language in the materials it is presented with or generates. This benchmark score is created by testing LLMs against multiple-choice questions relating to 57 subjects from high school to expert professional levels. We have opted for the MMLU-Pro benchmark here, as it aims to be more robust and challenging for LLMs and is a better metric for testing models with reasoning capability. The second is the Pedagogy benchmark created by FabInc, which tests LLMs’ ability to pass teacher exams, measuring their efficacy at learning-related information and their ability to assess students.

MMLU-Pro and Pedagogy benchmark data per Large Language Model (LLM). This information is accurate at the time of writing this blog post 05/08/2025 but is subject to change. Updated 15/08/2025 following the release of GPT-5.
AI Provider	Large Language Model (LLM)	MMLU-Pro Benchmark	The Pedagogy Benchmark
Google	Gemini 2.5 Pro*	86.2%	88.8%
Google	Gemini 2.5 Flash	83.2%	85.5%
OpenAI	o3*	85.3%	87.9%
OpenAI	o4-mini	83.2%	82.0%
OpenAI	GPT-4o	80.3% (March ‘25)	78.3%
OpenAI	GPT-5	80.6-87.1%	80.3%
Anthropic	Claude Opus 4 (Thinking)*	87.3%	87.4%
Anthropic	Claude Opus 4	86.0%	86.3%
Anthropic	Claude Sonnet 4 (Thinking)	84.2%	86.7%
Anthropic	Claude Sonnet 4	83.7%	84.8%

Gemini 2.5 Pro, o3 and Claude Opus 4 (Thinking) perform the best in the aforementioned benchmarks, offering an insight into how they may perform at assessment and feedback tasks. A common feature of these LLMs is their classification as reasoning models, as discussed earlier. Their reasoning capabilities mean they could be utilised to plan more complex assignments that require a deeper level of thinking, provide higher quality feedback and support the assessment and moderation of student work.
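
To make the comparison above concrete, the short sketch below ranks the models by the unweighted mean of their two benchmark scores. Averaging the two percentages is an assumption made purely for illustration, not a method used by either benchmark, and the function name is hypothetical. GPT-5 is left out because its MMLU-Pro score is reported as a range rather than a single value.

```python
# Scores copied from the table above as (MMLU-Pro %, Pedagogy %).
# GPT-5 is omitted because its MMLU-Pro score is given as a range.
# The GPT-4o MMLU-Pro figure is the March 2025 value shown in the table.
SCORES = {
    "Gemini 2.5 Pro": (86.2, 88.8),
    "Gemini 2.5 Flash": (83.2, 85.5),
    "o3": (85.3, 87.9),
    "o4-mini": (83.2, 82.0),
    "GPT-4o": (80.3, 78.3),
    "Claude Opus 4 (Thinking)": (87.3, 87.4),
    "Claude Opus 4": (86.0, 86.3),
    "Claude Sonnet 4 (Thinking)": (84.2, 86.7),
    "Claude Sonnet 4": (83.7, 84.8),
}


def rank_by_mean(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Return model names ordered from highest to lowest mean score."""
    return sorted(
        scores,
        key=lambda name: sum(scores[name]) / len(scores[name]),
        reverse=True,
    )


if __name__ == "__main__":
    for position, name in enumerate(rank_by_mean(SCORES), start=1):
        mean = sum(SCORES[name]) / len(SCORES[name])
        print(f"{position}. {name}: {mean:.2f}%")
```

Under this simple averaging, the top three are Gemini 2.5 Pro, Claude Opus 4 (Thinking) and o3, which matches the models singled out above. A different weighting of the two benchmarks could change the order, so this is a guide rather than a verdict.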

OpenAI and Google offer a way to create customised iterations of their LLMs, known as GPTs and Gems respectively. This allows institutions to create custom assistants for individual assignments, with all relevant documentation present to support feedback or assessment effectively. The ability to create these assistants is only available on the mid-tier paid subscription levels and they use the latest models from their respective providers. Based on the above considerations on model performance, for an agent focused on student assessment and feedback, you would likely opt for a GPT with GPT-5 or a Gem with Gemini 2.5 Pro.

Personalised Learning and Student Support

A model well-suited for personalised learning and student support will need to have the capacity to gain an understanding of a student’s learning needs, which will require data or prolonged observation. The first, we know, can be achieved with a large context window and the provision of some relevant data, like previous assignment feedback. For an educator, this would also be a useful feature should you wish to personalise your learning materials for multiple students in one interaction.

A useful feature for personalisation is OpenAI’s Memory feature. As described earlier, it remains a consistent and automated method for creating situational awareness in your content generation. All the AI providers discussed here also have methods for you to provide contextual information manually within your account settings. This allows you to input guidance on how you would like these LLMs to respond to you or the context around your usage, like your role within your institution or the course you are enrolled on. Whilst this is not an automated feature, it allows you to dictate your usage preferences and context from the outset and on your terms, without needing to wait for the AI to learn this information about you.

People are also creating benchmarks attempting to measure the Emotional Intelligence of a model, which is important if any student support is being delegated to LLMs. The EQ-Bench 3 aims to assess the performance of LLMs in challenging roleplays, measuring factors like safety, assertiveness, warmth, empathy and pragmatism. An overall score is then calculated based on the following metrics: Demonstrated Empathy, Pragmatic Emotional Intelligence, Depth of Insight, Social Dexterity, Emotional Reasoning and Message Tailoring. EQ-Bench 3 is an experimental benchmark and has not been independently validated, so it is not an established standard in the way that MMLU is. A blog post by Freedom2Hear highlights additional limitations specific to emotional intelligence benchmarking, which arise from the lack of universally agreed-upon answers in emotional reasoning. Again, we can use benchmark data to guide our usage but not govern it. Out of the models we are covering, OpenAI’s o3 scores the highest, closely followed by Google’s Gemini 2.5 Pro; their overall leaderboard ranks are shown in the table below.

EQ-Benchmark data and overall ranking per Large Language Model (LLM) included in this blog post. This information is accurate at the time of writing this blog post 05/08/2025 but is subject to change. Updated 15/08/2025 following the release of GPT-5.
AI Provider	LLM	EQ-Benchmark Score	EQ-Benchmark Rank
OpenAI	o3*	1500	3
Google	Gemini 2.5 Pro (5.6.2025)	1468.1	4
OpenAI	ChatGPT 4o (27.3.2025)	1371.8	5
OpenAI	GPT-5 (gpt-5-chat-latest-2025-08-07)	1357.1	6
OpenAI	ChatGPT 4o (25.4.2025)	1318.7	7
Anthropic	Claude Opus 4	1295.0	8

Content Creation and Lesson Planning

Currently, not all models offer native multimodal capabilities built into their LLMs. This means they cannot create images, video or audio within your standard chat interaction. For users, this could restrict the range of content they can create, for example, limiting the capability to add AI generated visual learning aids or convert materials into alternative formats, like podcasts. The four models that offer these capabilities are OpenAI’s GPT-4o, GPT-5 and Google’s Gemini 2.5 Flash and Gemini 2.5 Pro. These multimodal models allow images to be created utilising all the content within the context window, such as learning content and curriculum information.

Of course, when creating learning content, there is more to consider than multimodality. Again, the MMLU-Pro and Pedagogy benchmarks are useful metrics here. We want to ensure that a model that helps us create learning content has a good understanding of language and the process of learning. Referring back to Table 3, we know that Claude Opus 4, OpenAI’s o3 and Gemini 2.5 Pro all perform well at both benchmark tests.

People have been attempting to measure the creativity of large language models for some time now, in the form of research papers and benchmarks. The Creative Writing (v3) Benchmark measures the creativity of a piece of writing by evaluating the following metrics: character authenticity and insight, interestingness and originality, coherence, instruction following, world and atmosphere, and the ability to avoid clichés, verbosity, gratuitous metaphors, and poetic overload. Whilst not all these metrics are relevant to the creation of lesson content, the model we select for these types of tasks must be able to demonstrate a level of creativity. Again, this is an experimental benchmark that has not been independently validated. The Creative Writing (v3) benchmark finds OpenAI’s o3 the most creative at writing, followed by Claude Opus 4 and GPT-5.

For most educators creating learning content, Gemini 2.5 Pro’s comprehension of language and pedagogy, multimodality and large context window make it a strong choice.

Accessibility and Inclusivity

Several assistive products are being created that utilise LLMs, but again, we will only discuss the accessibility and inclusivity of LLMs natively here. Several techniques can be utilised to make an LLM perform better for users with accessibility requirements that aren’t satisfied out of the box. Whilst LLMs struggle to recognise neurodivergent language patterns, it is possible to create system prompts that explain how a user wishes to have their responses structured and the type of language used. Again, this might be an area where personalised customisation features are beneficial. For example, if you were choosing to manually customise your chosen AI provider profile, you could include information explaining that you have dyslexia, so you would like any responses featuring larger chunks of text broken down into smaller, more readable chunks. OpenAI’s Memory feature would also prove useful in providing further learned customisation and altering performance accordingly in future interactions.
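
As a minimal sketch of the manual customisation described above, the snippet below assembles a short block of plain-text instructions that could be pasted into a provider’s personalisation settings or used as a system prompt. The function name, the wording of the instructions and the example preferences are all hypothetical; none of the providers prescribes this format.

```python
def build_custom_instructions(role: str, needs: list[str]) -> str:
    """Assemble plain-text instructions describing who you are and how to respond.

    The layout is an illustrative assumption, not a format required by any provider.
    """
    lines = [f"About me: {role}."]
    if needs:
        lines.append("When responding, please:")
        # One bullet per stated preference keeps the instructions easy to edit.
        lines.extend(f"- {need}" for need in needs)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example preferences for a student with dyslexia.
    print(
        build_custom_instructions(
            "I am an undergraduate student with dyslexia",
            [
                "Break long passages into short paragraphs or bullet points",
                "Use plain language and define technical terms",
            ],
        )
    )
```

Keeping the preferences in a list makes it straightforward to adjust them over time or to maintain different versions for different courses.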

Some models have multilingual capabilities, which can assist with creating content in multiple languages, supporting inclusion. We have mentioned a couple of times the MMLU, which measures general language understanding. Multilingual Massive Multitask Language Understanding (MMMLU) by OpenAI extends this benchmark to test LLM performance across 14 different languages. This has been extended even further in the AI Language Proficiency Monitor by Fair Forward, which tests LLMs across multiple benchmark datasets, measuring multilingual capabilities and collating this information into one place. This dashboard compares model performance across 150 languages. This is a computationally expensive task and the dashboard is under active development at the time of writing this blog with plans to complete more extensive evaluations later this year. It refreshes every 24 hours to display the most popular models, so to get the most up to date information we recommend checking the leaderboard.

We have previously considered the Pedagogy benchmark by FabInc, which offers another benchmark leaderboard focusing on teacher training questions related to teaching students with Special Education Needs and Disabilities (SEND). The models that score the highest on the SEND Benchmark are Gemini 2.5 Pro at 86%, Claude Opus 4 Thinking at 84% and o3 and GPT-5 both scoring 82%.

Another accessibility consideration is the methods by which you can interact with an LLM. OpenAI’s ChatGPT interface allows you to use voice dictation (speech-to-text) with no limits or spoken conversation in voice mode for a limited time each month. Their voice mode utilises the Memory feature mentioned previously, so it will implement any specified interaction needs you may have already established. OpenAI have also recently integrated a “Record Mode” into ChatGPT that allows you to record interactions like meetings or course content, with participants’ permission, and it will convert this into a canvas containing a structured summary which you can then edit or ask ChatGPT to refactor into another format, like a project plan.

Google AI offer an option for vocal interaction called Gemini Live, which is in preview and free to all Gemini users. It currently offers interaction with Gemini 2.5 Flash, or alternative versions of Gemini 2.5 Flash named Native Audio Dialog or Native Audio Thinking Dialog. Users can talk, share their screen or use photos and video to provide context to the model whilst receiving live text and audio responses. Gemini Live also allows users to specify system instructions to provide details around the tone and format of generated responses. There is a toggleable proactive audio feature that gives the user the option for Gemini to ignore audio that is not relevant to the interaction at hand, which may be helpful within a busy learning environment.

Anthropic also offers a voice mode, currently in beta, on their mobile phone app that utilises their Claude Sonnet 4 Model. Whilst free users can expect to get 20-30 conversations per month with Claude within their allocated usage, paid users receive more based on their plan. The modality of the Claude models means it offers less than is provided by Gemini Live, as it can only accept images and documents uploaded manually within the chat panel, compared with the continual livestreaming options of Gemini Live.

For Research

LLM providers offer specific research tools built into their offerings, again varying in methodology from provider to provider. Three tools share a similar methodology: OpenAI Deep Research, Google Gemini Deep Research and Anthropic Claude Research. These are all multi-stage agentic processes that search the internet for information relating to a query, carry out synthesis and analysis on the sources found and collate this information into a report, which is claimed by some to be at the level of a research analyst or assistant.

OpenAI’s Deep Research feature, before the release of GPT-5, used its o4-mini model, which maintains a similar quality to its predecessor, o3, but is more cost-efficient. Plus, Team, Enterprise and Edu users are allocated 25 queries per month, compared with free users, who get 5 per month. Since the release of GPT-5, it appears that toggling on the “Research” tool within ChatGPT does not result in a change in model, so it is able to utilise GPT-5.

Google Gemini’s Deep Research also offers you the ability to upload your own resources, and you are able to allocate this task to either Gemini 2.5 Flash or Gemini 2.5 Pro. A Google AI Pro account gains users 20 Deep Research report generations per day, whereas a free account limits the user to 5 per month.

Anthropic’s Claude Research mode is currently only available to its Pro level users and above, with no free usage available. Anthropic’s usage limits are a little more complex and are implemented via sessions and timeslots. On a Pro plan, you will be allowed roughly 45 messages per 5-hour timeslot, beyond which your usage will be throttled. Claude Research usage is included within your standard Claude allowance, which varies based on your plan, but it may use more allowance due to gathering resources and providing more comprehensive responses.

For Administrative Efficiency

Most of the default models preselected for you when you interact with an LLM provider will perform more than well enough for most administrative tasks, but there are a couple of things that may affect efficiency when using an LLM. Some of the interesting things to mention here are context window size, response latency and cost per token. Vellum AI offers a useful leaderboard that compares all these stats in one place. The general workhorse models, those that will be the most readily available with the fewest usage limits across both free and paid-for plans, are Gemini 2.5 Flash, GPT-4o (ChatGPT), GPT-5 and Claude Sonnet 4. Google Gemini 2.5 Flash has the largest context window, the lowest cost for both input and output tokens, the highest token generation speed, and also the lowest latency at 0.35 seconds, so it seems like a clear winner as a general daily use model.

Model Performance Summary

The article has explained why you might opt for certain LLMs for a variety of educational use cases based on what types of tasks models excel at and other additional features that may further boost their efficacy. To assist reflection we have collated this information into one place. The table below details model strengths and suitability at a quick glance.

An overview of model performance, features and subscription level availability. This information is accurate at the time of writing this blog post 05/08/2025 but is subject to change.
AI Provider	Model Name	Required Plan	Model Overview
OpenAI	GPT-4o	Free	– Fast, low-latency for daily tasks – Multimodal – Utilises OpenAI’s Memory feature – Voice mode
OpenAI	GPT-5	Free	– Fast, low-latency for daily tasks – Scalable reasoning and compute for more complex tasks – Multimodal – Utilises OpenAI’s Memory feature – Voice mode
OpenAI	o3	Plus	– Reasoning model – Scores highly in Language, Pedagogy, SEND, Creativity and Emotional Intelligence Benchmarks – Utilises OpenAI’s Memory feature – Voice mode
Google	Gemini 2.5 Flash	Free	– Fastest, lowest-latency, cheapest model for daily tasks – Reasoning capabilities – 1 Million token context window – Multimodal – Voice mode
Google	Gemini 2.5 Pro	Free	– Scores highly in Language, Pedagogy, Emotional Intelligence benchmarks – Reasoning Capabilities – 1 Million token context window – Multimodal – Voice mode
Anthropic	Claude Sonnet 4	Free	– Fast, low-latency for daily tasks – Scores highly on Creativity benchmarks – Voice mode
Anthropic	Claude Sonnet 4 (Extended Thinking)	Pro	– Extension of the model above with toggleable reasoning capabilities
Anthropic	Claude Opus 4	Pro	– Scores highly on Language, Pedagogy, SEND and Creativity Benchmarks – Voice mode
Anthropic	Claude Opus 4 (Extended Thinking)	Pro	– Extension of the model above with toggleable reasoning capabilities

In the evolving landscape of AI in education, selecting the right large language model (LLM) for the right task is valuable for effective and efficient use. To help educators and learners make informed decisions, we’ve compared leading LLMs by task, from lesson planning to research. Our findings are:

Feedback and assessment support: Claude Opus 4, Gemini 2.5 Pro, and OpenAI’s o3 stand out, thanks to their reasoning capabilities and strong performance on language understanding benchmarks.

Personalised learning and student support: o3 and Gemini 2.5 Pro lead, with high emotional intelligence scores and memory-like features that enhance long-term interactions.

Content creation and lesson planning: Gemini 2.5 Pro proves most versatile, with multimodal capabilities, strong creative and pedagogical performance, and a large context window.

Accessibility and inclusivity: Gemini Live offers the most advanced interaction options through screen sharing, video and voice modes.

Research tasks: all providers offer specialist tools, with OpenAI’s Deep Research, Gemini Deep Research, and Claude Research delivering structured, high-quality synthesis from multiple sources.

Administrative tasks: Google’s Gemini 2.5 Flash offers the best performance, with low latency, high speed and low cost.

Taken together, these insights aim to help educators make more informed decisions about which tools are most appropriate for their specific teaching and learning contexts.

Find out more by visiting our Artificial Intelligence page to view publications and resources, join us for events and discover what AI has to offer through our range of interactive online demos.

For regular updates from the team sign up to our mailing list.

Get in touch with the team directly at AI@jisc.ac.uk