We gave 14 models on OpenART AI the same detailed DSLR café portrait prompt. Every output was scored using DeepEval across 5 dimensions. The rankings tell a very different story from our prompt accuracy test, and that difference is the whole point.
We asked the same two questions as in our accuracy test, but the answers are completely different. Here is what the data shows before the full breakdown.
This prompt was designed to test pure photorealism — no object counts, no exact text requirements, no clock times. Just one detailed creative prompt that any skilled photographer would understand. Every model received this verbatim.
Create a candid DSLR photograph of a woman sitting by a large window in a modern café during golden hour. Natural sunlight illuminates her face with realistic soft shadows. REQUIRED VISUAL ELEMENTS Visible skin pores, natural hair strands, realistic eyes with catchlights, authentic facial expression, detailed fabric textures on clothing, ceramic coffee cup on the table, subtle reflections in the window, shallow depth of field, softly blurred background customers, professional photography composition, high dynamic range, realistic color grading, ultra-sharp focus on the subject, physically accurate lighting, magazine-quality lifestyle photography. STYLE REQUIREMENTS Natural appearance only. NEGATIVE CONSTRAINTS No plastic skin, no beauty filters, no oversaturation, no CGI, no illustration, no cartoon style, no artificial AI look, no excessive bokeh, no distorted features.
Visual Quality = average of Stylistic and Perceptual, rounded to the nearest whole number (max 100). Overall = reported benchmark score across all 5 dimensions. Pass threshold = 70 per dimension.
| # | Model | Alignment | Consistency | Stylistic | Perceptual | Integrity | Visual Q | Overall | Verdict |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Midjourney v8.1 | 100 ✓ | 90 ✓ | 97 ✓ | 70 ✓ | 100 ✓ | 84 | 93 | 🎯 Overall |
| 1 | Kling 3.0 Omni | 100 ✓ | 80 ✓ | 94 ✓ | 90 ✓ | 94 ✓ | 92 🎨 | 93 | 🎨 Visual |
| 3 | Seedream 5.0 | 100 ✓ | 80 ✓ | 97 ✓ | 70 ✓ | 100 ✓ | 84 | 92 | ✓ Pass |
| 4 | Grok Imagine | 95 ✓ | 80 ✓ | 94 ✓ | 90 ✓ | 88 ✓ | 92 🎨 | 91 | 🎨 Visual |
| 5 | GPT Image 1.5 | 85 ✓ | 90 ✓ | 100 ✓ | 70 ✓ | 100 ✓ | 85 | 90 | ✓ Pass |
| 5 | GPT Image 2.0 | 100 ✓ | 80 ✓ | 89 ✓ | 70 ✓ | 100 ✓ | 80 | 90 | ✓ Pass |
| 5 | Imagen 4.0 | 98 ✓ | 80 ✓ | 92 ✓ | 70 ✓ | 100 ✓ | 81 | 90 | ✓ Pass |
| 8 | Flux 2.0 Pro | 100 ✓ | 85 ✓ | 78 ✓ | 70 ✓ | 100 ✓ | 74 | 88 | Mostly |
| 8 | OpenArt Photo | 84 ✓ | 85 ✓ | 97 ✓ | 70 ✓ | 100 ✓ | 84 | 88 | Mostly |
| 10 | Qwen Image 2.0 | 96 ✓ | 80 ✓ | 84 ✓ | 70 ✓ | 100 ✓ | 77 | 87 | Mostly |
| 10 | Juggernaut Flux | 89 ✓ | 80 ✓ | 89 ✓ | 70 ✓ | 100 ✓ | 80 | 87 | Mostly |
| 12 | Nano Banana Pro | 100 ✓ | 90 ✓ | 76 ✓ | 80 ✓ | 82 ✓ | 78 | 86 | Mostly |
| 13 | Flux Kontext Max | 88 ✓ | 70 ✓ | 94 ✓ | 70 ✓ | 68 ✗ | 82 | 81 | Partial |
| 14 | Auto ❌ | 21 ✗ | 60 ✗ | 38 ✗ | 70 ✓ | 38 ✗ | 54 | 41 | Non-Compliant |
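The Visual Quality column and the pass marks in the table can be re-derived from the per-dimension scores. The sketch below is only an illustration: it copies four rows from the table, and the function names `visual_quality` and `failed_dimensions` are hypothetical helpers, not part of DeepEval or of the benchmark's own tooling. Rounding halves upward is an assumption that happens to match every row in the table.

```python
PASS_THRESHOLD = 70  # per-dimension pass mark stated in the scoring note

DIMENSIONS = ("alignment", "consistency", "stylistic", "perceptual", "integrity")

# Scores copied from four rows of the results table, in DIMENSIONS order.
SCORES = {
    "Midjourney v8.1": (100, 90, 97, 70, 100),
    "Kling 3.0 Omni": (100, 80, 94, 90, 94),
    "Flux Kontext Max": (88, 70, 94, 70, 68),
    "Auto": (21, 60, 38, 70, 38),
}


def visual_quality(stylistic: int, perceptual: int) -> int:
    """Average of the two scores, with .5 rounded up (assumed rounding rule)."""
    return (stylistic + perceptual + 1) // 2


def failed_dimensions(scores: tuple) -> list:
    """Names of the dimensions that score below the pass threshold."""
    return [name for name, value in zip(DIMENSIONS, scores) if value < PASS_THRESHOLD]


for model, scores in SCORES.items():
    stylistic, perceptual = scores[2], scores[3]
    failed = failed_dimensions(scores)
    print(f"{model}: visual quality {visual_quality(stylistic, perceptual)}, "
          f"failed: {', '.join(failed) if failed else 'none'}")
```

A score of exactly 70 counts as a pass here, which is why Flux Kontext Max fails only on integrity. The Overall column is not recomputed because the article reports it directly and does not state how the five dimensions are weighted.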
Every card shows the actual generated image, the dimension scores, and a detailed analysis of what made the output photorealistic — or what gave it away as AI-generated.
Midjourney v8.1 produced what is arguably the most photographically convincing face in the entire test. The skin rendering is exceptional: visible freckles, natural pore texture, and fine lines around the eyes that are typically the first detail AI models flatten out. The golden hour lighting hits the face at a physically accurate angle, creating genuine soft shadows under the chin and along the nose that match real directional sunlight. The hair strands are loose and wind-tousled rather than perfectly arranged, a subtle but critical detail that separates candid photography from generated portraits. The expression is the standout: the slight upward gaze with a faint, unposed smile reads as a genuine moment caught rather than a face constructed for a camera. All five dimensions passed, and the 93/100 overall score ties Midjourney with Kling for first place in the test.
Despite the exceptional face rendering, Midjourney's photorealism weakens at the edges of the frame. The large window requested by the prompt is present but functions more as a dark background element than a prominent architectural feature with visible street scene beyond. The bokeh in the background has a slight cinematic quality — more artistic than the optical blur a DSLR lens would produce at that focal length. The overall image has a faint "cinematic perfection" quality that, on close inspection, reveals it as generated — the kind of image that would pass a quick glance but not a careful forensic review. Perceptual score of 70 — the minimum pass — reflects this limitation.
Kling 3.0 Omni is the only model to top both categories simultaneously: 93/100 overall (tied with Midjourney) and 92/100 visual quality (tied with Grok Imagine). What makes this output stand out is a combination of technical accuracy and environmental realism that few models achieve together. The golden hour rim lighting on the subject's hair is physically precise, with warm directional sun catching individual strands at the correct angle for late-afternoon window light. The large window is prominently featured with a clearly visible street scene through the glass, and the glass itself has a subtle dirty texture with surface imperfections that makes it feel genuinely photographed rather than rendered. The café setting behind the subject contains multiple naturally blurred customers, correctly positioned tables, and overhead pendant lighting that reads as a real interior space. The 90/100 perceptual score, the joint highest in the test alongside Grok Imagine, reflects the fact that this image holds up under close inspection in a way that most others do not.
Skin texture, while excellent at first glance, lacks the micro-detail visible in the best human photography. Visible pores are present but minimal — the skin reads as slightly smoothed compared to a true DSLR photograph of a person in direct sunlight. The hair strands, while individually rendered, have a slight AI-generation pattern in their highlight distribution that becomes visible on close inspection. The consistency score of 80 reflects minor prompt deviations — the image leans toward a clean, editorial aesthetic rather than a purely candid moment.
Seedream 5.0 produced the most atmospherically compelling image in the test. The golden hour lighting is the strongest and warmest of all 14 models — sunlight floods the frame from the right side, creating a glowing halo effect around the subject's hair that photographers spend considerable effort recreating artificially in post-production. A subtle but remarkable detail is the visible steam rising from the coffee cup — an element that no other model included and that immediately elevates the sense of a real, lived moment. The composition is dynamic, with the subject turned slightly toward the camera in a way that feels genuinely caught rather than posed. All five dimensions passed with a perfect integrity score — no plastic skin, no beauty filters, no oversaturation detected.
The atmospheric strength of Seedream's output comes at a cost to clinical realism. The lighting is so warm and so cinematic that it crosses from golden hour photography into something closer to a film still — beautiful but slightly over-produced for a "candid DSLR photograph" requirement. Skin texture, while not filtered, lacks the micro-detail visible pores that the prompt specifically requested. The subject's face has an idealized quality — proportions and features that are slightly too symmetrical to read as a casual snapshot. The perceptual score of 70 — the minimum pass — reflects these subtle but real deviations from strict photographic realism.
Grok Imagine — xAI's Aurora-powered model — produced one of the most technically precise environmental setups in the test. The large window is a dominant compositional element with a clear street reflection visible, including parked cars and road markings that give the scene genuine geographic grounding. The golden hour lighting enters from the left at a low angle, creating hard directional shadows on the subject's face that are physically accurate for late-afternoon sun through glass. The background customers are visible at naturally blurred café tables, and the overall interior space — wooden surfaces, modern café architecture, natural light — reads as a real location rather than a constructed set. The 90/100 perceptual score ties it with Kling as the joint best on pure naturalness and artifact-free rendering.
Despite the excellent environmental detail, the subject's face is the weak link. Skin texture is noticeably smoother than the top performers — pores are largely absent and the complexion has a polished quality that nudges toward beauty photography rather than candid documentary. The eyes have a slight over-sharpness typical of AI generation — catchlights are present but slightly too perfectly placed. The overall image also lacks the shallow depth of field precision of the top models — the transition from sharp subject to blurred background is slightly abrupt rather than the gradual optical fade a real DSLR lens produces.
GPT Image 1.5 achieved the highest stylistic score of any model in the test — a perfect 100/100 — and it earns it. The composition is the most genuinely candid of all 14 outputs: close-cropped, slightly asymmetric framing with the subject's gaze directed slightly off-camera, resting her chin on her hand in a way that feels caught rather than constructed. The skin texture is excellent — fine lines around the eyes, natural lip texture, and a complexion that reads as real without being dramatically imperfect. The golden hour backlighting halos the hair with warm rim light at exactly the right intensity for late afternoon sun through a café window. The sweater fabric shows realistic knit texture with natural compression folds. Both integrity and stylistic scores are perfect — zero forbidden elements detected, full photorealistic style compliance confirmed.
The large window requested by the prompt is not prominently featured — the composition focuses tightly on the subject with the window visible only as soft background light rather than as an architectural element with visible reflections or street scene beyond. This alignment gap (85 vs the 100 scored by top models) reflects the tighter crop. Window reflections — explicitly requested in the prompt — are not clearly visible. The background blur, while natural-looking, does not clearly show background customers as the prompt specified. These omissions keep it from the very top despite its exceptional facial and stylistic quality.
GPT Image 2.0 is the most consistent model across both benchmarks — 82/100 on the structured accuracy test and 90/100 here. The output is a genuinely photorealistic café portrait with natural skin texture, excellent golden hour lighting through the window, and a warm, contemplative expression that reads as authentic. The subject is positioned correctly by the window with soft sunlight illuminating her face from the side, creating realistic shadows. Background customers are visible and naturally blurred. The sweater fabric shows detailed knit texture. Crucially, GPT Image 2.0 produced this without any plastic skin, beauty filter effect, or CGI aesthetic — a clean pass on all integrity criteria.
The image, while excellent, has a slightly produced quality that prevents it reaching the top tier. The skin, while natural, is a touch cleaner than a true DSLR photograph — visible pores are present but not as pronounced as in real photography under direct window light. The hair near the shoulder shows slight smoothing typical of AI generation. The background blur, while convincing, has a marginally synthetic quality on close inspection. These are subtle deductions — this is comfortably an A-tier output — but they explain the gap between 90 and 93.
Imagen 4.0 — Google's high-fidelity photorealism model — produced what many reviewers described as the most camera-like facial output in the test. The skin texture is genuinely convincing: natural pores visible, realistic complexion variation, and absolutely no beauty filter smoothing. The golden hour lighting creates strong directional shadows that behave physically correctly — harsh illumination on the lit side of the face transitioning to natural shadow on the other, exactly as a large window light source would produce. The grey knit sweater shows excellent fabric texture with realistic compression folds. All entities requested — woman, window, café, coffee cup, background customers — are present and correctly positioned. Perfect integrity score — zero forbidden elements.
The hand position is the weakest element — fingers resting under the chin show slight anatomical stiffness that is a common AI generation tell. Background customers, while present, appear somewhat simplified — the faces of background figures lack the natural blur graduation a real lens would produce. The jawline transition to the background shows minor edge smoothing. These are relatively minor deductions on an otherwise excellent output — the perceptual score of 70 reflects these subtle tells rather than any major flaw.
Flux 2.0 Pro produced a technically strong café portrait with several standout elements. The window reflection is impressively handled — the subject's reflection appears in the glass with correct lighting and perspective, a detail that requires genuine understanding of how reflective surfaces behave in real photography. The golden hour lighting is warm and directional, creating realistic shadows on the subject's face and forearms. The cardigan fabric shows excellent knit texture detail. The café setting is modern and believable with visible background customers correctly blurred. All prompt elements are present and correctly positioned.
The fundamental issue with Flux 2.0 Pro's output is that it reads more like a professional stock photograph than a candid DSLR snapshot. The subject's pose — arms folded on the table, gaze directed just off-camera with a composed expression — is the kind of pose a model holds for a commercial shoot, not the kind of moment a street photographer catches. The skin texture, while not overtly filtered, is marginally waxier than the top performers — lacking the micro-imperfections that make skin read as genuinely photographed. The window reflection, while technically impressive, appears slightly too perfect — real window reflections have distortion and surface imperfections that this one lacks.
OpenArt Photorealistic produced the most technically raw and unprocessed-looking image in the entire test. The harsh direct sunlight — stronger and less filtered than any other model's interpretation of golden hour — creates the kind of bright, slightly blown-out highlights on the skin that a real photographer shooting at a window seat on a sunny afternoon would capture. This is paradoxically more authentic than many of the softer golden hour interpretations — real sunlight through glass is often harsher than the warm filmic glow other models produce. The composition is a three-quarter profile angle with the subject looking away from camera, which is genuinely candid in a way that forward-facing poses are not. The large window with a bright outdoor street scene is the most prominently featured window element of all 14 models.
The harsh lighting that makes this image distinctive also creates its main weakness — the skin in the brightly lit areas appears uneven and slightly synthetic under the strong exposure. The alignment score of 84 reflects that background customers are not clearly visible as the prompt specified — the café interior behind the subject is mostly empty counter space. A takeaway paper cup appears on the counter alongside the ceramic cup, which is a minor prompt deviation. The overall composition, while authentic in angle, is less technically polished than the top tier models.
Qwen Image 2.0 produced the most documentary-style output in the test — and its standout detail is genuinely remarkable. The ceramic coffee cup has a visible lipstick mark on the rim — a piece of incidental storytelling that no other model included and that immediately elevates the image's sense of a real captured moment. Skin texture is excellent with visible freckles, natural pore detail, and zero beauty filter smoothing. The linen shirt fabric texture shows realistic weave and natural compression folds. The subject's expression — slightly guarded, gaze directed upward and away — reads as genuinely unposed. Background customers are present and naturally blurred with a visible street scene through the window.
The lighting is this image's main weakness relative to the prompt. The prompt specified golden hour — warm, directional late-afternoon sunlight. Qwen's output shows neutral daylight rather than the warm amber tones of golden hour, which reduces the stylistic alignment score. The hand touching the ear shows slight anatomical irregularities in finger positioning — a common AI generation tell that is more visible here than in the top performers. The background separation has an algorithmic quality that a real DSLR lens would not produce.
Juggernaut Flux Pro produced a warm, atmospherically appealing café portrait with strong golden hour lighting and excellent background bokeh. The cardigan fabric texture is detailed with realistic knit weave and natural folds. The subject's expression is warm and genuine-looking. Background customers are visible and naturally blurred. The overall colour grading — warm amber tones with soft shadows — creates a compelling lifestyle aesthetic that would perform well in commercial contexts.
Juggernaut Flux Pro shows the clearest AI beauty treatment of the B-tier models. The face has a smoothed, slightly idealised quality — features are symmetrical to a degree that real faces are not, and the skin lacks the micro-imperfections that make portraits read as photographed. The large window requested by the prompt is barely visible — the composition focuses tightly on the subject with the background mostly out of frame, which reduces the alignment score. The bokeh, while aesthetically pleasing, has an overly regular pattern that a real lens would not produce. The overall image feels cinematic rather than photographic.
Nano Banana Pro — Google's premium Gemini model — produced the most authentic café environment of any model tested. The background crowd scene is genuinely convincing with multiple customers at tables, natural body language, and realistic spatial depth that reads as a real busy café rather than a constructed backdrop. The hanging plants, exposed ceiling, and wooden interior details all contribute to a scene that feels photographically grounded in a real location. The subject's natural smile and relaxed posture are among the most genuinely candid-feeling expressions in the test. The consistency score of 90 — among the highest — reflects how well the overall scene composition matches the prompt requirements.
The lighting is this image's most significant weakness relative to the prompt. The golden hour specification of warm, directional late-afternoon sun is not convincingly rendered. The light reads more as bright neutral daylight than warm golden hour, missing the amber tones and directional quality that the prompt and the top-scoring models captured. Skin detail is adequate but not exceptional, with pores less visible than in the top performers. The stylistic score of 76, the lowest passing stylistic score in the test, directly reflects this lighting misalignment. The shallow depth of field is also less pronounced than the prompt specified, with the background only moderately blurred rather than the creamy bokeh a DSLR would produce at close focus distance.
Flux Kontext Max produced one of the most visually striking images in the test — a strong 94/100 stylistic score reflects the quality of the golden hour lighting, colour grading, and overall composition. The warm amber tones of the light through the window are beautifully rendered, the café environment is modern and convincing, and the background crowd scene through the window and reflected in the glass is detailed and realistic. The fabric texture on the clothing is excellent and the hair rendering is natural.
Flux Kontext Max is the only model other than the non-compliant Auto output to fail the integrity dimension, scoring 68/100, just below the 70 pass threshold. The failure is clear: the output shows visible signs of beauty enhancement that the prompt explicitly prohibits. The eyes are slightly oversaturated with an unnaturally vibrant green-yellow colour that no real eye produces under window light. The lips appear artificially enhanced, fuller and more precisely shaped than a candid photograph would show. The skin, while not obviously filtered, has a polished quality that crosses into beauty photography territory. The expression also feels more posed than candid: the direct gaze with slightly parted lips reads as a fashion shoot rather than a moment caught. The prompt said "no beauty filters" and "natural appearance only", and this output violates both.
Despite the fundamental failure, the Auto output demonstrates strong compositional understanding — the woman is correctly positioned by a large window, the café setting is present with background customers visible, the coffee cup is on the table, and the golden hour lighting direction is correctly understood. The scene layout follows the prompt accurately. If this were a CGI render brief, it would score significantly higher.
This image is immediately identifiable as a 3D computer-generated render rather than a photograph. The skin shows exaggerated subsurface scattering, a rendering technique used in games and CGI, which here produces an unrealistic translucent glow. The eyes are unnaturally large and perfectly shaped, with proportions that no human face has. The clothing texture, while detailed, has the quality of a game engine material rather than photographed fabric. The lighting, while beautiful in a cinematic sense, uses a physically-based rendering approach that produces perfect, noiseless illumination no real camera captures. Four of five dimensions failed. Only the perceptual score passed, at 70, likely because the composition and spatial relationships are correct even if the visual style is completely wrong.
Compare these results against our 14-model prompt accuracy test where GPT Image 2.0 won with 82/100. The two studies together give you the complete picture — which model to choose for which type of work. For the full OpenART AI platform review, see our complete OpenART AI review.
Tied at 93/100. Midjourney wins on facial realism. Kling wins on environmental authenticity. Both produce images that hold up under close inspection — the strongest photorealism available on OpenART AI.
Tied at 92/100 visual quality. Both scored 90/100 on perceptual naturalness — the highest in the test. Kling wins on complete scene realism. Grok wins on environmental detail and window handling.
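Both ties above, and the shared ranks in the table, can be checked directly from the Overall and Visual Q columns. The following is a minimal sketch using only the top five rows. The helper names `leaders` and `competition_ranks` are hypothetical, and the "1, 1, 3" ranking scheme is inferred from the table's rank column rather than stated by the benchmark.

```python
# Overall and Visual Q values copied from the top five rows of the results table.
RESULTS = {
    "Midjourney v8.1": {"overall": 93, "visual": 84},
    "Kling 3.0 Omni": {"overall": 93, "visual": 92},
    "Seedream 5.0": {"overall": 92, "visual": 84},
    "Grok Imagine": {"overall": 91, "visual": 92},
    "GPT Image 1.5": {"overall": 90, "visual": 85},
}


def leaders(results: dict, key: str) -> list:
    """All models that share the highest value for `key`, sorted by name."""
    best = max(row[key] for row in results.values())
    return sorted(name for name, row in results.items() if row[key] == best)


def competition_ranks(results: dict, key: str) -> dict:
    """Tied scores share a rank and the following rank is skipped (1, 1, 3, ...)."""
    ordered = sorted(results.items(), key=lambda item: item[1][key], reverse=True)
    ranks = {}
    previous_score, previous_rank = None, 0
    for position, (name, row) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise use the 1-based position.
        rank = previous_rank if row[key] == previous_score else position
        ranks[name] = rank
        previous_score, previous_rank = row[key], rank
    return ranks


overall_leaders = leaders(RESULTS, "overall")
visual_leaders = leaders(RESULTS, "visual")
print("Overall leaders:", overall_leaders)
print("Visual quality leaders:", visual_leaders)
print("Leads both:", sorted(set(overall_leaders) & set(visual_leaders)))
print("Overall ranks:", competition_ranks(RESULTS, "overall"))
```

Intersecting the two leader lists leaves only Kling 3.0 Omni, which is the basis for calling it the one model at the top of both categories.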