There’s no doubt AI models are getting better with words – which is quite a mad thing to be saying about large language models. After all, words are their currency. They’re literally made of words.

But until now, there’s still been a noticeable lag between the output of a real human writer and that of a bot just pretending to be one. As though – who’d have thought it – writing is more than just slapping some words together on a page, and instead involves a complex mix of creativity, real-world experience, imagination and focus.

So when OpenAI released its GPT-4.1 model in April, with a declaration that it was for coding and agentic workflows, we naturally weren’t expecting too much from a creative writing perspective.

As writers though, that only piqued our interest. Because if, as we’ve seen, AI models in general are getting better at writing, was OpenAI really releasing one that was going to buck that trend?

The only way to find out was to test it ourselves. So we lined up eight copywriting, creativity and branding tasks – the kind that real teams wrestle with every day – to see what was what.

And here’s what happened.

Test 1: Name a new fizzy drink

Invent 20 names for a dark, fizzy, sweet drink, steering well clear of “Cola” clones.

Score: 6/10

How it went:

Although favouring names that were just two words slammed into each other, GPT-4.1 nevertheless delivered a list that was pretty inventive – so inventive, in fact, that it veered into the nonsensical at times. Who’s up for a glass of Choco Spritz, ToastyTwist or Ember Pop? Yeah nah. And it seemed weirdly stuck on the word ‘sable’ – PopSable, Sable Sizzle – which made us lose our thirst altogether. But ideas like SizzleDusk and SparkleRush were genuinely nice, and we loved the self-awareness in caveating its ColaCraze suggestion with “(borderline, but different enough)”.

Test 2: Come up with a two-word headline

Create a two-word headline for an inspiring health and wellbeing product.

Score: 7/10

How it went:

The headline? Thrive Daily. Clean, upbeat, unpretentious. Nothing groundbreaking, but as a header, it ticks the box.

Test 3: Rewrite an old song

Rewrite the lyrics to all-time classic I Heard It Through the Grapevine as a ballad dedicated to the new GPT-4.1.

Score: 5/10

How it went:

We’re sorry Marvin, we’ve done it again. And, as last time, this AI model made quick work of rewriting a classic, producing lyrics that were cogent and even vaguely witty.

With opening lines like “Ooh, I bet you’re wondering how I learned / ‘Bout the latest thing that’s gonna turn / All our words and thoughts into gold,” it was rhythmically on point and even managed a wink at itself: “It’s answering questions, writing songs all night / Making techies everywhere blush with delight.” A shame then that at other points some of the lines were outright garbage, such as “I know that human touch, it can’t replace, But with 4.1, you can’t help but chase—” Chase what exactly?

Test 4: Summarise some dense text

Turn the corporate editorial code from moneysavingexpert.com into a sharp, 500-word summary.

Score: 8/10

How it went: 

This is a test we’ve done on multiple models with mixed results. We won’t go into detail for 4.1 – because frankly the source material is pretty dull – but it did exactly as prompted, breaking down the editorial code into a readable snack. Bang on the word count too. Good stuff.

Test 5: Deliver bad news to customers

Write an email to customers telling them their TV streaming service subscription price is going up.

Score: 4/10

How it went:

As an agency that specialises in this kind of thing, we wanted to see how the AI would handle writing a letter no customer ever wants to read: one that informs them of a price rise. Ideally, we’d want to see nuance, authenticity, understanding and that perfect middle ground between warmth and apology.

Instead we got stilted language and perfect boardroom speak. “This adjustment enables us to continue expanding our library with even more high-quality content…” Consider my subscription cancelled.

Test 6: Devise an engaging strapline

Write one catchy line for a launch event about a secure, human-reviewed tech product, Definition AI.

Score: 3/10

How it went:

Rather than talking too much about it, here’s 4.1’s effort straight up: “Unlock Genius, Guarded: Experience Definition AI – Where Human-Curated Brilliance Meets Unmatched Security.” And yep, like you we’re still trying to work out what ‘Unlock Genius, Guarded’ means. Anyone?

Test 7: Rewrite a classic text

Rewrite the opening to the Kafka classic The Metamorphosis, but as if you’re announcing a tech update.

Score: 7/10

How it went:

Not necessarily an easy task to rewrite one of the weirdest and most brilliant openings to any book in history as though from the point of view of a tech company, but GPT-4.1 did well here. The mimicry of style – of both the original text and tech writing – was great, and the way it mashed these two styles together worked pretty well. And the symbolism of debugging your ‘workflow’ after a bad night’s sleep? We’ve all been there.

Test 8: Plan a whitepaper

Outline a detailed, value-rich whitepaper about how AI’s shaking up the creative world.

Score: 8/10

How it went:

There’s a strong argument to be made that for a task like this one – which is more about planning than execution – the current generation of AI reasoning models (from OpenAI’s o-series models through Google’s Gemini and Anthropic’s latest Claude models) would be a better bet. But nevertheless, 4.1 performed admirably here, with a 12-section plan that covered everything from “generative image models” to “copyright and intellectual property: authorship dilemmas”, with sector deep-dives that make it, as a framework, genuinely fit for purpose.

Final score: 48/80

So what did we really think of GPT-4.1?

We’ll admit, with the testing we’ve done of 4.1 over the last few months, it’s proven a bit of an enigma. And that’s because some of the writing it’s produced for us has been squarely on the money – as good as any other model out there – and some, well, hasn’t.

That’s been borne out in our benchmarking tests, too. As you can see, the final score wasn’t great – but not because 4.1 was merely alright across the board. Rather, it was especially poor in a few of the rounds, like its confusing strapline and its business-language bad-news letter, while in others it did really well.

So why is it such a mixed bag? Just a thought: what we might be seeing with 4.1 – more so perhaps than with other models – is how sensitive it is to good, clear prompting. Because while a lot of these tests were based around short, often zero- or one-shot prompts, when we’ve prompted 4.1 more fully at other times, the writing produced has been bang on. And this tallies because OpenAI says 4.1 ‘follows instructions more reliably’ and is good at ‘outputting content that includes certain information’.

Our conclusion: with GPT-5 getting all the recent headlines, don’t write 4.1 off just yet.

Get in touch


Written by Nick Banks, Senior Writer and AI Specialist at Definition.