Not all AI image generation models produce results of the same quality. Some look convincing at first glance, then fall apart the closer you look – warped anatomy, or spatial logic that doesn’t make sense.

For brand and marketing teams integrating AI into their workflows, knowing which tools can genuinely deliver good results matters. So we tested six leading text-to-image models: Nano Banana, Flux.2 [max], GPT Image 1.5, Luma Photon 1, Imagen 4 Ultra, and Stability Ultra (all available in one secure place via Definition AI).

Jump straight to the results or keep reading for the methodology.

Here’s what we did

The criteria

We worked with our creative team to establish five criteria that reflect what matters in professional image work.

Accuracy: Anatomical realism – correct finger counts, natural joint positions and realistic skin texture
Compositional quality: Framing, rule of thirds, balanced negative space and clear focal hierarchy
Lighting and shadow: Consistent light direction, accurate shadow casting and realistic interaction with materials
Background and perspective: Spatial logic, correct scale relationships, accurate linear perspective and no warped proportions
Understanding: Did the AI interpret the prompt correctly? Could it handle conceptual nuance and conflicting instructions?


The prompts

We created five prompts, one per criterion, each written to expose weaknesses. The accuracy test, for example, focused on hands – a known weak spot for AI. The understanding test asked models to combine conflicting concepts (a formal business meeting in a forest) to see whether they could maintain coherence when instructions pulled in two directions.

The blind test

To eliminate bias, we presented all generated images to our creative team without revealing which model produced which image. Each image was scored 1–5 across all five criteria.

We know some tools respond better to certain prompting styles, but for a fair test, the same prompt was used across all six models – no model-specific optimisation.
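
If you want to run a similar blind test yourself, the only fiddly part is anonymising the outputs. Here’s a minimal Python sketch of one way to do it (the folder and filename conventions are hypothetical, not a record of our exact process): shuffle the generated images, copy them under anonymous IDs, and keep a private answer key so scores can be matched back to models after judging.

import csv
import random
import shutil
from pathlib import Path

# Hypothetical layout: one folder of outputs named "<model>_<test>.png"
source = Path("outputs")
blind = Path("blind_review")
blind.mkdir(exist_ok=True)

images = sorted(source.glob("*.png"))
random.shuffle(images)

# Copy each image under an anonymous ID, keeping the mapping in a
# private answer key for un-blinding once the scoring is done
with open("answer_key.csv", "w", newline="") as key:
    writer = csv.writer(key)
    writer.writerow(["blind_id", "original_file"])
    for i, image in enumerate(images, start=1):
        blind_name = f"image_{i:02d}.png"
        shutil.copy(image, blind / blind_name)
        writer.writerow([blind_name, image.name])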

The judges

Red Howell, Senior Designer

Gen Reichel, Designer

According to Red: “Rather than using a narrative format for prompts, better results appear when [you] use instructional lists.” So that’s what we did.
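
To make that concrete, here’s a rough Python illustration of the difference, using the conflicting-concepts brief from our understanding test (the wording is hypothetical – these aren’t our exact prompts):

# A narrative prompt packs the whole brief into one sentence:
narrative = (
    "A photorealistic image of four executives in business suits holding "
    "a formal meeting around a conference table in the middle of a forest."
)

# An instructional list breaks the same brief into explicit constraints,
# which, in Red's experience, models tend to follow more reliably:
instructional = "\n".join([
    "Photorealistic image.",
    "Subject: four executives in formal business suits, mid-meeting.",
    "Setting: a conference table standing on the floor of a dense forest.",
    "Lighting: soft daylight filtering through the tree canopy.",
    "Tone: played completely straight, like ordinary corporate photography.",
])

# For a fair comparison, the exact same string goes to every model.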

And these were the results:

Accuracy test

Focus: Anatomical realism – hands, fingers, facial features and body proportions

Scores

Model Gen Red
Nano Banana 2 2
Flux.2 [max] 3 4
GPT Image 1.5 5 5
Luma Photon 1 2 1
Imagen 4 Ultra 4 3
Stability Ultra 1 1

Winner: GPT Image 1.5

What Gen thinks:

In the lead, we’ve got GPT Image 1.5. Gen says, “the anatomy, skin texture and proportions are all aligned with one another.”

But the other models scored much lower because “the proportions of the anatomy become skewed and feel uncanny…”. For example, in Stability Ultra’s image, the rings have become fused into the skin of the hands, leaving Gen “feeling a bit unsettled.”

What Red thinks:

Nano Banana failed on a fundamental level. As Red puts it: “Unfortunately, when you’re dealing with human anatomy, it has to be spot on. The moment an extra finger appears, the realism is shattered.”

Flux.2 [max] fared better but wasn’t without issues. “Proportions are great, but the rendering becomes a little plastic when you look hard enough. It has that classic ‘over-sharp’ look.” The same could be said for Imagen 4 Ultra.

GPT Image 1.5 performed the best, with Red saying, “Other than the way the bottle is being held with just a pinch, no comments.”

As for Luma Photon 1 and Stability Ultra – let’s just say the words “horror” and “all kinds of wrong” were used.

Compositional quality test

Focus: Framing, rule of thirds, visual hierarchy, negative space and focal point

Scores

Model Gen Red
Nano Banana 4.5 3
Flux.2 [max] 4 4
GPT Image 1.5 5 4
Luma Photon 1 5 5
Imagen 4 Ultra 3 2
Stability Ultra 4 4

Winner: Luma Photon 1

What Gen thinks:

Luma Photon 1 and GPT Image 1.5 took the lead. Gen puts it down to the way “the vibrant colours, depth of field and rule of thirds composition follow through into the images beautifully.” But it was the orange ripples of sunlight reflected against the tall grass that made GPT Image 1.5 the personal winner for Gen.

Flux.2 [max] follows closely behind but falls slightly short, because “the scene feels a little more sparse – from the tree branches to the rolling hills, it’s missing the little gritty details that make up a realistic scene.”

What Red thinks:

Though it needs more negative space, Stability Ultra came so close “to being really good that it becomes annoying.”

Nano Banana “understood the task”, and it’s also “the most accurate recreation of an amateur photographer trying some clever framing for the first time.”

The depth blur on the overhanging oak helps Flux.2 [max]’s image look a little more considered, but as with GPT Image 1.5, Red had suggestions on what they’d change: “the tree in the focal point should be framed a few steps to the right to take advantage of that curve in the branch framing above.”

Luma Photon 1 “uses the most basic framing, but it works for a reason. The subject is fully focused in the centre, the sun is obstructed just enough to backlight without distracting us, and the falloff of the hills behind are really well framed.”

Imagen 4 Ultra “plays closest to the rule of thirds – but there are situations where following the rule works against you. This one, with the heaviness of the shaded overhang and the dark grass, feels claustrophobic.”

Lighting and shadow test

Focus: Light direction, shadow accuracy, multiple light sources and shadow consistency

Scores

Model Gen Red
Nano Banana 4 5
Flux.2 [max] 5 5
GPT Image 1.5 4 4
Luma Photon 1 3 3
Imagen 4 Ultra 2 3
Stability Ultra 3 3

Winner: Flux.2 [max]

What Gen thinks:

Nano Banana and Flux.2 [max] took the lead, with “a clear sense of direction, vibrant shadows, and well-lit textures.” The water ripples and slight discrepancies within the water shadow of Flux.2 [max]’s image set it apart from the rest, making it a clear winner for this category.

Imagen 4 Ultra scored the lowest because “the textures seem dull and lifeless, the shadows don’t feel as though they quite match up with the objects (the left and right shadows feel a little too angled for me), and the light source feels too staged for my liking. The overall feel is more like the BTS of a photoshoot, rather than the final results.”

What Red thinks:

Red agrees that Nano Banana “hits the mark” and also scored Flux.2 [max] high, saying: “a harsher light means harder shadows and it seems to honour that. Extra points for the hotspots of focused light – this is difficult to critique.”

There’s some hesitation with the other models. For GPT Image 1.5, Red was “not wholly convinced by the succulent’s shadow”. On Imagen 4 Ultra: “despite the contact shadows and edge diffusion, the apple and glass seem to float a little.”

Background and perspective test

Focus: Perspective, depth, scale relationships, spatial logic and proportional accuracy

Scores

Model Gen Red
Nano Banana 4 5
Flux.2 [max] 4.5 3
GPT Image 1.5 5 1
Luma Photon 1 1 1
Imagen 4 Ultra 2 2
Stability Ultra 1 1

Winner: Nano Banana

What Gen thinks:

Gen found that a few of the models struggled with this test, with Luma Photon 1, Imagen 4 Ultra and Stability Ultra receiving the lowest scores she’d given so far, “due to their smoothed-over skin textures, slightly off body proportions and unclear background compositions.”

What Red thinks:

Nano Banana is “A pretty convincing UK street, right down to the moss and dried gum.”

Flux.2 [max] also does alright, but loses points because “There is a strange plastic nature to the streets behind – they seem overly generic and geographically unplaceable, while also feeling too clean.”

Imagen 4 Ultra isn’t terrible either, “but it looks far too polished to hold any credibility.” It’s trying to “mimic a European city street, but the cleanliness, the parking situation, the style of lampposts – it all makes it a bit confusing.”

And Stability Ultra? “Sir, you can’t park there.”

Understanding test

Focus: interpreting complex/conflicting concepts, following nuanced instructions and conceptual coherence

Scores

Model Gen Red
Nano Banana 5 5
Flux.2 [max] 3 4
GPT Image 1.5 5 4
Luma Photon 1 1 2
Imagen 4 Ultra 4 3
Stability Ultra 1 2

Winner: Nano Banana

What Gen thinks:

“When testing conceptual coherence, it’s important to push the programmes to the limit when creating ideas that AI can’t pull from photoshoot examples.”

Nano Banana and GPT Image 1.5 rose to the challenge. Gen says, “the lack of symmetry and reflections in the table really sell this as a genuine image, the background feels appropriately lit, and the characters feel warm and human.”

But the rest didn’t quite make the cut. Luma Photon 1 and Stability Ultra both scored a 1, because of “the inaccuracy of the image”. Both had the concept down, but “they haven’t been able to pull together the details that make AI imagery feel realistic. The images feel flat and overworked, the plug sockets in the middle of the table are misaligned, the bottom of the table is missing.”

What Red thinks:

Red agrees and scores Nano Banana highly, saying, “There isn’t anything technically wrong here”, just that “those black lace-up Oxfords would never survive the forest dirt.”

On GPT Image 1.5: “Maybe it’s the choice of angle, but this feels the most natural – as if it has been shot as a real stock image. I just don’t know if I’m understanding how grey-haired dude’s legs are interacting with the table legs.”

The overall verdict*

Model Accuracy Composition Lighting Perspective Understanding Total
Nano Banana 2.0 3.75 4.5 4.5 5.0 19.75
Flux.2 [max] 3.5 4.0 5.0 3.75 3.5 19.75
GPT Image 1.5 5.0 4.5 4.0 3.0 4.5 21.0
Luma Photon 1 1.5 5.0 3.0 1.0 1.5 12.0
Imagen 4 Ultra 3.5 2.5 2.5 2.0 3.5 14.0
Stability Ultra 1.0 4.0 3.0 1.0 1.5 10.5

*Based on the average scores from our creative specialists.
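
For transparency, the maths is nothing fancy: each cell above is the mean of Gen’s and Red’s scores for that category, and the total is the sum of the five means. A quick Python sketch that reproduces the table from the raw scores:

# Each judge scored every image 1–5; the verdict averages the
# (Gen, Red) pair per category and sums the five averages per model.
raw = {
    "Nano Banana":     {"Accuracy": (2, 2), "Composition": (4.5, 3), "Lighting": (4, 5), "Perspective": (4, 5), "Understanding": (5, 5)},
    "Flux.2 [max]":    {"Accuracy": (3, 4), "Composition": (4, 4), "Lighting": (5, 5), "Perspective": (4.5, 3), "Understanding": (3, 4)},
    "GPT Image 1.5":   {"Accuracy": (5, 5), "Composition": (5, 4), "Lighting": (4, 4), "Perspective": (5, 1), "Understanding": (5, 4)},
    "Luma Photon 1":   {"Accuracy": (2, 1), "Composition": (5, 5), "Lighting": (3, 3), "Perspective": (1, 1), "Understanding": (1, 2)},
    "Imagen 4 Ultra":  {"Accuracy": (4, 3), "Composition": (3, 2), "Lighting": (2, 3), "Perspective": (2, 2), "Understanding": (4, 3)},
    "Stability Ultra": {"Accuracy": (1, 1), "Composition": (4, 4), "Lighting": (3, 3), "Perspective": (1, 1), "Understanding": (1, 2)},
}

for model, categories in raw.items():
    averages = {cat: (gen + red) / 2 for cat, (gen, red) in categories.items()}
    total = sum(averages.values())
    print(f"{model}: total {total:g}")  # e.g. "GPT Image 1.5: total 21"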

The winner is: GPT Image 1.5

GPT Image 1.5 is the standout performer for accuracy – if anatomical realism and getting the details right matter most to your workflow, it’s the most reliable choice.

Nano Banana wins on perspective and conceptual coherence, delivering the most spatially convincing scenes and the strongest results when prompts get complex.

Flux.2 [max] wins on lighting – and it’s the most consistent performer across the board, with no score lower than a 3 and not a single catastrophic failure.

At the other end, Stability Ultra and Luma Photon 1 struggled most, particularly on accuracy and understanding – the areas where mistakes are hardest to fix in a professional workflow.

No single model dominates every category. The best tool depends on what you’re generating – and how much you’re willing to prompt, test and refine.


Written by Roxanne Relusco, Senior Marketing Executive. Models reviewed by Senior Designer Red Howell and Designer Gen Reichel.