▲ Grok 4.1 runs away with Memes · Claude Opus 4.5 sweeps Email, Copy & Vibe Coding · ▲ Gemini 3 Pro owns Travel & Fitness · Qwen3 Max takes Logos & Packaging · ▼ Safety-tuned models score low on Red Team, by design · WEEK 24: the taste rankings, updated weekly
[ the taste benchmark for everyday AI ]

The best AI
for coding, slides, research, marketing, design, ads.

We benchmark coding and research like everyone else — then keep going into the work most leaderboards ignore: the deck, the campaign, the landing page, the ad. The full picture of what a model can actually do for you, graded by people with taste. Independent, opinionated, updated weekly.

[ what's on the test ]

Sixteen kinds of work. One taste test.

From coding and research to logos, memes, and women's health — the full spread of what people actually ask AI to make. Tap any one to jump to its ranking.

Coding · leading · Claude Opus 4.5
Research · leading · Gemini 3 Pro
Design · leading · Claude Opus 4.5
Slides & Decks · leading · GPT-5.1
Vibe Coding · leading · Claude Opus 4.5
Email Writing · leading · Claude Opus 4.5
Copy Edits · leading · Claude Opus 4.5
Landing Pages · leading · Claude Opus 4.5
CPG Packaging · leading · Qwen3 Max
Logos · leading · Qwen3 Max
Memes · leading · Grok 4.1
Creative Ads · leading · GPT-5.1
Travel · leading · Gemini 3 Pro
Fitness · leading · Gemini 3 Pro
Women's Health · leading · Claude Opus 4.5
Red Team · leading · Grok 4.1
[ why we exist ]

Most benchmarks stop at math, code, and PhD exams. We keep going.

We score the technical work too — coding, research, reasoning. But the work that actually fills your week is also the deck, the campaign, the landing page, the ad, and yes, the meme. That work is subjective, and most leaderboards pretend it doesn't exist. Vibecutter measures both — including the taste that goes in front of real humans, judged by real humans.

Hard + human

Coding and research alongside decks, emails, ads, and packaging — the full range of what you actually ask a model to do.

Judged by taste

Blind, head-to-head human voting from people who do this for a living. Vibes — but measured rigorously.

Culturally live

New models drop constantly. The full taste suite re-runs weekly so the rankings stay current — and quotable.

READ BY GROWTH & BRAND TEAMS AT
Northwind Labs · Forge AI · Quanta · Helio Research · Brightside · Cohort
[ the leaderboard ]

Who wins at what

Pick the work you actually do. The ranking re-sorts to the models with the most taste for it.
★ EDITORS' PICK
Best Overall
Claude Opus 4.5
Anthropic

The best all-rounder. Strong on the technical work and polished on the human-facing work — the safe default if you pick one.

Across every task, hard and subjective alike.
01 · Claude Opus 4.5 · Anthropic
02 · GPT-5.1 · OpenAI
03 · Gemini 3 Pro · Google
04 · Grok 4.1 · xAI
05 · Qwen3 Max · Alibaba
06 · DeepSeek V4 · DeepSeek
07 · Llama 4 Maverick · Meta
08 · Mistral Large 3 · Mistral
↑ rising · → steady · ↓ falling — vs. last week
[ the taste test ]

Could you tell which one has taste?

Same brief, two models, names hidden. Pick the better one — then see who wrote it and how the crowd voted. This is the test, in miniature.

BRIEF
Email Writing
Write the opening line of a Black Friday email for a small-batch coffee brand.
OUTPUT A
“We don’t do doorbusters. We do the best cup you’ll have all year — and for 48 hours, it’s 25% off.”
OUTPUT B
“Black Friday is HERE! Don’t miss our BIGGEST coffee sale EVER — shop now and save 25% on everything!!!”
Read both. Pick the one with better taste.
[ our picks ]

If you only read one thing

The verdicts we'd give a friend. No fence-sitting, no "it depends" — just the model we'd reach for.

BEST OVERALL TASTE
Claude Opus 4.5
Anthropic

The most consistently tasteful across decks, copy, and UI. Polished, on-brief, and the least "AI-looking" of the bunch.

BEST FOR MEMES
Grok 4.1
xAI

Format-literate and genuinely funny. Knows the reference, reads the room, and keeps the caption tight.

BEST FOR SLIDES
GPT-5.1
OpenAI

Builds a narrative, not a bullet dump. Real hierarchy, sane pacing, and a cover slide you’d actually present.

BEST FOR TRAVEL
Gemini 3 Pro
Google

Itineraries you’d actually follow — realistic pacing, real places, bookable detail. Not a generic “visit the old town” list.

BEST FOR LOGOS
Qwen3 Max
Alibaba

Marks with an actual idea behind them — not a clipart globe. Holds up shrunk to a favicon or blown up on a wall.

BEST FOR CODING
Claude Opus 4.5
Anthropic

Cleaner diffs and far fewer invented APIs — whether it’s a refactor or turning a one-line vibe into a working UI.

[ week 24 · the shifts ]

Who moved this week

The talkable part. Every Monday, the biggest jumps and drops in taste.
▲ CLIMBING
Grok 4.1 · Memes · +6
Gemini 3 Pro · Travel · +4
Claude Opus 4.5 · Email Writing · +3
▼ SLIPPING
Llama 4 Maverick · Coding · −5
Mistral Large 3 · Creative Ads · −4
Qwen3 Max · Copy Edits · −3
[ how we grade taste ]

Real briefs. Blind taste tests. Graded by people who ship.

Every model gets the same briefs sourced from working PMs, brand marketers, and founders — make the deck, write the email, design the label, cut the ad. Outputs go into blind, head-to-head votes judged by practitioners and a panel of taste-calibrated models. We re-run the whole suite every week and take no money from model makers.
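
For the curious, here is roughly what a blind, head-to-head setup can look like in Python. This is a sketch under our own assumptions, not Vibecutter's actual pipeline: the function name `build_blind_matchups`, the ballot layout, and the placeholder model names are all hypothetical. It pairs every two models on one brief, randomises which output lands in slot A, and keeps the answer key away from the voters.

```python
import itertools
import random


def build_blind_matchups(outputs, seed=0):
    """Pair every two models on one brief and hide who wrote which output.

    `outputs` maps a model name to its output text for a single brief.
    Returns (ballots, answer_key): ballots carry only the texts, labelled
    A and B; answer_key maps each ballot id back to the models behind them.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    ballots, answer_key = [], {}
    pairs = itertools.combinations(sorted(outputs), 2)
    for ballot_id, pair in enumerate(pairs):
        # Shuffle the pair so slot A is not always the same model.
        first, second = rng.sample(pair, 2)
        ballots.append({"id": ballot_id, "A": outputs[first], "B": outputs[second]})
        answer_key[ballot_id] = {"A": first, "B": second}
    return ballots, answer_key


# Hypothetical outputs for one brief; the model names are placeholders.
outputs = {
    "model_x": "We don't do doorbusters.",
    "model_y": "Black Friday is HERE!",
    "model_z": "Fresh roast, fair price.",
}
ballots, answer_key = build_blind_matchups(outputs, seed=24)
```

Voters only ever see `ballots`. The `answer_key` is joined back in after the votes are counted, which is what keeps the test blind.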

Real briefs from working PMs & marketers
16 domains, technical + everyday
Human taste votes every week
$0 from model makers, fully independent
[ word of mouth ]

The tab everyone keeps open

We stopped arguing about which model to use in Slack and just pinned the Vibecutter link.

Priya N.
Senior PM, fintech

Finally a leaderboard that knows the difference between “correct” and “good.”

Marcus L.
Brand Director, CPG

I check the movers every Monday. It’s the only AI newsletter I actually open.

Dana R.
Founder, design studio
[ the fine print ]

Questions, answered

Is this rigorous, or just vibes?

Vibes — measured. Every brief runs blind, head-to-head, scored by hundreds of human votes plus a calibrated judge-model panel. We report win-rates, not a number we made up.
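
One way to keep a win-rate honest is to report it with an uncertainty band. The sketch below uses the standard Wilson score interval; whether Vibecutter computes intervals this way is an assumption on our part, and the vote counts in the example are made up for illustration.

```python
import math


def win_rate_with_interval(wins, votes, z=1.96):
    """Return (win_rate, low, high) using the Wilson score interval.

    `wins` is how many head-to-head votes a model won out of `votes` total.
    z = 1.96 corresponds to a roughly 95% interval.
    """
    if votes <= 0:
        raise ValueError("votes must be positive")
    p = wins / votes
    denom = 1 + z * z / votes
    centre = (p + z * z / (2 * votes)) / denom
    half = z * math.sqrt(p * (1 - p) / votes + z * z / (4 * votes * votes)) / denom
    return p, centre - half, centre + half


# Hypothetical matchup: 312 wins out of 500 blind votes.
rate, low, high = win_rate_with_interval(312, 500)
```

If the low end of the band still sits above 0.5, the lead is unlikely to be noise. With only a handful of votes the band widens, which is the point of collecting hundreds.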

Do you take money from model makers?

No. Never have. The rankings are the product, and the moment they’re for sale they’re worthless.

How do you score something subjective, like a logo?

Pairwise. Models never get an absolute grade — they compete two at a time and we rank by who wins more often, the same way taste actually works.
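
As a rough illustration of ranking by who wins more often, the sketch below turns a list of head-to-head results into a table sorted by win-rate. The function name `rank_by_win_rate`, the `(winner, loser)` vote format, and the model names are hypothetical, and this is the simplest possible tally rather than a description of the production scoring.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by the share of head-to-head votes they won.

    `votes` is an iterable of (winner, loser) pairs, one per vote.
    Returns a list of (model, win_rate), best first; ties break by name.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    table = [(model, wins[model] / games[model]) for model in games]
    return sorted(table, key=lambda row: (-row[1], row[0]))


# Hypothetical votes on one logo brief.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
    ("model_c", "model_b"),
]
ranking = rank_by_win_rate(votes)
# model_a won 3 of 3, model_c 1 of 3, model_b 1 of 4
```

A plain tally like this treats every opponent as equally strong. Rating systems such as Elo or Bradley-Terry adjust for opponent strength, and would be a natural next step if matchups are uneven.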

How often do the rankings change?

Weekly. The full suite re-runs every Monday, and new models are added within days of release.

Who is doing the grading?

Working PMs, designers, marketers, and editors — people who ship this work — plus a panel of taste-calibrated models for scale.

[ the weekly taste report ]

One email. Who's got taste this week.

Who can suddenly write a subject line, who lost the plot on memes, and what to switch to — every Monday. No spam, unsubscribe anytime.