We know marketers need a benchmark that’s relevant to what they do. One that shows you which AI model is best suited to the content you need to create.
So that’s exactly what we’ve done. We’ve taken the guesswork out of benchmarking, putting the latest AI models through their paces for the content that matters. Our AI Leaderboard highlights which tool will get you the best results (and we’ve shown our working below, so you can see exactly how we got there).
The AI Leaderboard
Here’s how seven of the most popular AI models fare with writing blogs, opinion pieces, marketing emails and organic LinkedIn posts. (We’ll keep updating the leaderboard so it covers more content and models.)
We’ve given them all a percentage score – the higher, the better.
(Interested in finding out more about how we scored them? Jump down to our methodology section.)
Foreword from our AI Director
“Model releases come thick and fast from AI vendors and often get little more than a post on X about the notable performance differences. Take a look at this from the OpenAI team for example. What does “more natural, engaging, and tailored writing” mean in the cold light of day? Does it mean it’s the best AI for a CMO that wants to write their next email marketing campaign?
“Only the major version releases get benchmark scores, but even the benchmarks used (e.g. graduate level reasoning, multilingual math capability etc.) are pretty much irrelevant to everyone outside of the development and deep learning worlds.
“We want to change that. We have known for a long time that different models excel at different things, because our teams have told us so. We started to blend different models in our private AI suite for this very reason. We also built our prompt library to reflect this reality (it automatically runs prompts with specific models).
“This AI Leaderboard is our way of making the confusing world of AI more understandable for marketing teams. It is also the first of many AI benchmarks for Definition, and the genuine beginning of our move into model consultancy. The remaining part of the puzzle is to make all the best models, as ranked by our human experts, available in a secure environment – watch this space!”
Luke Budka, AI Director at Definition.
The breakdown
Here’s how the models get on with the different types of content. For each type, we’ve highlighted the best (and worst!) performers and offered some insights into what they do well – and less well.
And we open each section with the full data, so you can see exactly which areas we reviewed each model on, and how we arrived at our overall judgements.
Blogs
The top performers
The top three models for writing blogs are:
- Claude 3.5 Sonnet: 69.64%
- GPT-4o: 66.61%
- NeMo: 65.89%
3.5 Sonnet’s main weak spot is how it structures the article copy – it doesn’t flow quite as well as the other two. Neil Taylor (Definition’s Chief of Brand) thinks the tone is inconsistent:
“It’s a funny old mixture, this. There’s some nice stuff, but the tone veers. Paragraph three is judgy, four is strident, and the bit about ‘a cohesive, multi-platform presence’ sounds like it’s swallowed a marketing textbook. If it carried on as it started, it would’ve kept me reading. A good prompt here, much more specific about the tone, would probably have helped.”
GPT-4o is too verbose. NeMo falls short on creativity, and its copy is less engaging than that of the OpenAI and Anthropic models.
The weakest link
The worst model for writing blogs is:
- Large 2: 52.50%
The model really struggles with turning a prompt into something engaging and creative. Even in the next category (opinion-led articles), where it wins, it still performs poorly in those two areas.
Neil: “What a yawn. This is tediously ‘fine’. Most of the sentences in isolation are OK, but because it’s got nothing original or surprising to say, it adds up to much less than the sum of its parts. AI will often help you craft your opinion into a nice readable form – but you need an interesting take to start with.”
Opinion-led articles
The top performers
The top three models for writing opinion-led articles are:
- Large 2: 43.00%
- GPT-4o: 40.14%
- Claude 3.5 Sonnet: 40.00%
3.5 Sonnet makes the best use of clear language (an area it performs well in across all content types). Large 2 and GPT-4o do a better job of building a strong opinion within a clear structure.
Neil says this about the Mistral Large 2 piece: “What’s good about this is that it actually presents a coherent argument in easy-to-understand language. It doesn’t do anything flash or surprising, but gets the job done, and sounds like someone with a real opinion, not someone writing for an essay competition.”
Tom Pallot (Definition’s Head of Marketing) says: “All the models struggled with creativity in this round. No matter how good they are at writing and structuring an argument, they still need to be prompted with something worth saying.”
The weakest link
The worst model for writing opinion-led articles is:
- GPT-4 T: 25.43%
The performance difference between Mistral Large 2 and the bottom-ranking GPT-4 T is stark. Large 2 outperforms it in every area for writing opinion-led pieces.
This disparity is reflected in our qualitative insight too.
Neil says about the copy written by GPT-4 T: “Blimey, this is exhausting. Like a first-year undergraduate trying to show off their wide vocabulary and metaphorical skills while their sentences totally get away from them. ‘The narrative of technology as fashion’s green savior is woven meticulously’ – aren’t you clever! And the academic stuff like ‘it is posited that’ makes it sound very pompous indeed. I gave up.”
Tom says: “Despite Large 2 winning, I’d be tempted to choose Claude 3.5 Sonnet and invest more in the prompt. Sonnet’s strong language clarity suggests that with more detailed prompting – including unique subject matter expertise – it could match Large 2’s persuasiveness and insight depth.”
Marketing emails
The top performers
The top three models for writing marketing emails are:
- Gemini 1.5 Flash: 61.43%
- Large 2: 49.46%
- Claude 3.5 Sonnet: 49.46%
Gemini 1.5 Flash is the clear winner here, leading on every measure. That’s somewhat surprising, because it doesn’t rank well in any of the other content categories we tested. It excels at emails though, with fresh copy and standout CTAs.
Neil says: “This does lots of things you’d expect a pretty good writer to do: it’s got a clear structure; simple conversational words; it has some nice changes of pace so it doesn’t get too predictable. It still slips into cliche, though: ‘delivering speeches that get results’; ‘take your communication to the next level.’ And for my British taste, there’s a Lot of Shouty Title Case.”
The weakest link
The worst models for writing marketing emails are:
- GPT-4 T: 31.43%
- Claude 3 Sonnet: 31.43%
There’s a massive difference here between these models and Gemini 1.5 Flash. GPT-4 T and Claude 3 Sonnet are outdated models, having been superseded by their younger siblings 4o and 3.5, so it’s perhaps no great surprise to see them consistently at the bottom of the pile.
Our qualitative data reflects the gap between first and last.
On Claude 3 Sonnet’s marketing email, Neil says: “Eesh. This starts out like a North Korean public service announcement. It uses the same hackneyed ‘say goodbye to… and say hello to…’ TWICE. And is full of salesy adjectives like ‘cutting-edge’ and ‘seamless, efficient and impactful’. If this were a person, you wouldn’t want to meet them.”
Organic LinkedIn posts
The top performers
The top three models for writing LinkedIn posts are:
- NeMo: 57.02%
- Claude 3.5 Sonnet: 53.69%
- Large 2: 48.10%
Another strong performance from the Mistral and Anthropic models, with NeMo and 3.5 Sonnet vying for the crown. Our panel think that 3.5 Sonnet does a better job of tailoring its copy specifically to the request – an organic LinkedIn post. But, as in the marketing email category, it falls short when it comes to writing compelling CTAs.
And all of the top three still had low creativity scores, with none going higher than Large 2’s 41.67%.
Neil’s take on the NeMo post: “This is totally competent, but a bit drab. If someone in our team wrote this, I’d tell them they had a good first draft but now it needs some zhuzh. An invitation like this is in a competition for attention, and this wouldn’t make it onto the podium.”
The weakest link
The worst model for writing organic LinkedIn posts is:
- Claude 3 Sonnet: 30.00%
Claude 3 Sonnet is not the one for writing your organic LinkedIn posts. Its performance doesn’t come close in any department and it’s particularly poor at creativity and writing LinkedIn post copy that’s not boring. Again though, this is arguably to be expected given that Claude 3 Sonnet has been superseded as a model.
Neil’s view: “Yikes. This is what everyone fears AI writing is going to be: a bad impression of the most annoyingly corporate colleague you know. Anything that starts by writing ‘In today’s competitive landscape’ should be set to self-destruct. Glad it told me it was an ‘insightful’ session; I was worried it was going to be shallow and tedious. Like this writing.”
Our methodology: the thinking behind the AI Leaderboard
- We built our AI Leaderboard so marketing teams have a trustworthy way to see which model is best suited for the task at hand.
- We’ll update scores as existing models evolve, and new models are released. We’ll also expand the benchmarks into other categories of marketing content like video, design, and research/analysis.
- As the leaderboard is updated, any changes, additions and adjustments will be reflected in this methodology.
- This methodology is designed and updated by our AI and marketing teams, alongside our project advisory board.
Content creation
We ask a selection of AI models to create commonly used types of marketing content.
The initial models we’ve tested are:
- OpenAI’s
  - GPT-4 T
  - GPT-4o
- Anthropic’s
  - Claude 3 Sonnet
  - Claude 3.5 Sonnet
- Mistral’s
  - Large 2
  - NeMo
- Google’s
  - Gemini 1.5 Flash
(We’ll add more models on an ongoing basis.)
To create the content, every model gets the exact same prompt for each content category. For our written content, that’s a simply built ‘one-shot’ prompt – a prompting technique that teaches the model by giving it one example of what a good output looks like.
And this is the prompt structure we use:
“Please write a [insert content type] about [topic].
Use the <example></example> below to guide the tone and quality of your output.
<example>
[insert content example]
</example>”
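To make that concrete, here’s a minimal sketch of how a template like this could be filled in and sent to one of the models, using the OpenAI Python SDK as an example. The topic and example copy are hypothetical placeholders, not our actual inputs or tooling:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The one-shot template from above, with slots to fill in.
PROMPT_TEMPLATE = """Please write a {content_type} about {topic}.

Use the <example></example> below to guide the tone and quality of your output.

<example>
{example_copy}
</example>"""

# Hypothetical inputs - in practice, the example would be a standout
# piece of past copy picked by the judging panel.
prompt = PROMPT_TEMPLATE.format(
    content_type="marketing email",
    topic="a webinar on public speaking",
    example_copy="[a representative piece of copy goes here]",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The same filled-in prompt then goes to every model in the category, so any difference in the output comes down to the model, not the prompt.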
The examples we give to the models are representative bits of Definition copy, picked by our judging panel as being standouts in their field.
Content scoring
The content created is scored against a set of measures by our panel of experts (who also came up with the criteria).
For every content type, we’ve graded the copy against its own set of measures – because what makes a great blog isn’t the same as what makes a great social media post, for example.
And for each measure, a model can score between 0 and 10 from each judge, with the judges’ scores added together. (So if there are four judges, the maximum score for a measure is 40.) The measure totals are then added up to give each model one overall score.
All scores are turned into a percentage so you can compare all the results, with 100% being perfect and 0% being abysmal.
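As a rough illustration, here’s that arithmetic in a few lines of Python. The measures, judges and scores are all made up, purely to show how the percentage falls out:

```python
# judge_scores[judge][measure] is a 0-10 score; all values are invented.
judge_scores = {
    "judge_1": {"clarity": 8, "creativity": 6, "structure": 7},
    "judge_2": {"clarity": 7, "creativity": 5, "structure": 8},
    "judge_3": {"clarity": 9, "creativity": 6, "structure": 7},
    "judge_4": {"clarity": 8, "creativity": 7, "structure": 6},
}
measures = ["clarity", "creativity", "structure"]

# Add up every judge's score for each measure (max per measure = 10 x judges).
measure_totals = {
    m: sum(scores[m] for scores in judge_scores.values()) for m in measures
}

# Sum the measure totals into one overall score, then express it as a
# percentage of the maximum possible score.
overall = sum(measure_totals.values())
max_possible = 10 * len(judge_scores) * len(measures)
print(f"{100 * overall / max_possible:.2f}%")  # 70.00% for these numbers
```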
To ensure unbiased scoring:
- All content created by the models is anonymised
- Model outputs are presented to judges in a random order (there’s a quick sketch of this below)
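Here’s a minimal sketch of what that anonymise-and-shuffle step might look like – the outputs are placeholders, and this is an illustration rather than our actual process:

```python
import random

# Placeholder outputs keyed by model name.
outputs = {
    "GPT-4o": "First draft of the marketing email...",
    "Claude 3.5 Sonnet": "Another draft...",
    "NeMo": "A third draft...",
}

# Drop the model names so judges only ever see numbered entries.
entries = [
    {"entry_id": i + 1, "copy": text}
    for i, text in enumerate(outputs.values())
]

# Shuffle, so each judge can be shown the entries in a different order.
random.shuffle(entries)

for entry in entries:
    print(f"Entry {entry['entry_id']}: {entry['copy']}")
```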
Obviously, the scoring for each measure is subjective. But by using several judges (all of whom know what they’re talking about) any outlier opinions should be flattened out.
Our panel
We use our own in-house experts, from across the marketing spectrum, to judge what the AI models come up with.
For example, the panel scoring LinkedIn posts is made up of both social media and language specialists. A future panel, scoring text-to-image models, will be made up of designers and creative marketers.
As well as our judging panel, the AI Leaderboard also has an advisory board, who:
- consult on the methodology, to make sure the benchmark results are consistent, unbiased and actually useful
- ratify the results
- provide analysis and insight on the performance of the different models
The panel in full:
Nick Padmore, Head of Language
Padders heads up the plucky band of writer-consultants we call our Language team. He’s been writing for businesses for nearly 20 years, working with the likes of Monzo, Specsavers and Disney+. Along with his team, he leads Definition’s projects on tone of voice, naming, brand storytelling and, unsurprisingly, writing – as well as working with our AI team to craft prompts that actually work. He studied English Language at university, and is a co-founder of greeting card company Deadpan Cards, which you’ll probably find in a shop near you, and which still somehow manages to remain resolutely unprofitable.
Hannah Moffatt, Creative Director
Hannah Moffatt is one of our Creative Directors, a children’s author and a word nerd through and through. After studying French and Spanish at Cambridge University and Creative Advertising in Falmouth she’s spent the last 16 years helping global businesses find their brand voice and training them to use it. She’s written everything from charity appeals to whitepapers. And when Hannah’s not writing for clients, she’s writing for children too. Her debut, SMALL! was a Sunday Times Children’s Book of the Week and shortlisted for the 2023 Waterstones Children’s Book Prize.
Nick Banks, Senior Writer
Nick writes for our language and PR teams. From tone of voice and naming to thought leadership for niche industry publications and nationals, he’s done it all. He’s also a key part of our AI team, specialising in prompt engineering and AI training for big-name brands like PepsiCo. Nick studied English Literature and History at Goldsmiths College, University of London, before earning an NCT Seniority Certificate in Journalism.
Tom Pallot, Head of Marketing
Tom leads our marketing team. He’s spent the last decade building brand awareness and generating leads (with a focus on PR, SEO and content) for B2B brands. He’s certified in data journalism by Google, generative AI by Microsoft and prompt engineering by Vanderbilt University.
Louise Watson-Dowell, Head of Digital PR and Social Media
Lou’s been making digital PR and social media magic for ten years. She’s the boss of our digital PR and social team. Lou cut her teeth studying film and photography at the University of Leeds. That’s where she first learned to tell digital stories that grab people’s attention. Before she stepped into the comms world she spent a while writing scripts and assistant producing on shoots. Lou takes new and established brands to market with clever content and channel strategies and runs PR campaigns that make people sit up and take notice. And she’s done it for some massive names like GE, EY, KPMG and Mastercard.
Isabel Pitts, Social Media and Content Consultant
With 12 years under her belt in charity, sustainability, tech, and education, Izzy’s our go-to for all things social. She studied broadcast journalism at the University of Huddersfield, where she wrote a dissertation on the power of social media and user-generated content in journalism – clearly a sign of things to come.
Since then, she’s produced for major TV stations and led social campaigns that have generated millions in revenue. If it’s social, Izzy’s all over it: organic and paid strategies, advertising campaigns, crisis comms, community management, personal branding – you name it. She proves daily that emotion-driven content with a strong narrative is the best way to genuinely engage an audience.
Katie Chodosh, Head of Media Relations
Katie’s our Head of Media Relations. She’s worked in PR and communications for more than 10 years. After studying speech and language therapy, she started her career working exclusively with cyber security companies (at Eskenzi PR) and has since stayed in B2B.
Katie loves working on CEO and company profile pitching because it gives her the opportunity to interview really interesting people. Over the course of her career, her approach has landed interviews for clients with BBC, BBC Click, Wired, The Next Web, The Times, The FT, and more. She also spends a lot of time working on the exact wording of survey questions, article pitches and press releases, which have landed with nationals and a whole host of trade publications over the years.
Our advisory board
Neil Taylor, Chief of Brand
Neil looks after our team of clever researchers, strategists, writers, designers, film-makers, trainers, and CX and UX experts.
Since studying linguistics and French at the University of Cambridge he’s gone from coming up with names (his tombstone will say ‘He named Ocado’), to writing whole sentences, to defining tones of voice and whole brands. He wrote Brilliant Business Writing, and has helped everyone from Cabinet ministers to call-centre workers think about how they express their organisation’s brand. And, he co-founded Deadpan Cards with our Head of Language, Padders.
Luke Budka, AI Director
Luke builds, experiments with, and shapes our AI solutions (including our award-nominated private AI environment). His impactful work for us and our clients led to him being named one of the UK’s top AI Innovators in 2024. He brings 16 years of linguistic, content and tech agency experience to the table. He’s just as comfortable speaking at conferences as he is judging tech awards (which he’s done for over a decade), and he helps our clients use AI to get stuff done.
Interested in access to the best AI models in one secure place? Get in touch.