▲ Grok 4.1 runs away with Memes·Claude Opus 4.5 sweeps Email, Copy & Vibe Coding·▲ Gemini 3 Pro owns Travel & Fitness·Qwen3 Max takes Logos & Packaging·▼ Safety-tuned models score low on Red Team — by design·WEEK 24 — the taste rankings, updated weekly·

[ the taste benchmark for everyday AI ]

The best AI
for coding, slides, research, marketing, design, and ads.

We benchmark coding and research like everyone else — then keep going into the work most leaderboards ignore: the deck, the campaign, the landing page, the ad. The full picture of what a model can actually do for you, graded by people with taste. Independent, opinionated, updated weekly.

see the rankings · how we grade taste

[ what's on the test ]

Sixteen kinds of work. One taste test.

From coding and research to logos, memes, and women's health — the full spread of what people actually ask AI to make. Tap any one to jump to its ranking.

Coding

leading · Claude Opus 4.5

Research

leading · Gemini 3 Pro

Design

leading · Claude Opus 4.5

Slides & Decks

leading · GPT-5.1

Vibe Coding

leading · Claude Opus 4.5

Email Writing

leading · Claude Opus 4.5

Copy Edits

leading · Claude Opus 4.5

Landing Pages

leading · Claude Opus 4.5

CPG Packaging

leading · Qwen3 Max

Logos

leading · Qwen3 Max

Memes

leading · Grok 4.1

Creative Ads

leading · GPT-5.1

Travel

leading · Gemini 3 Pro

Fitness

leading · Gemini 3 Pro

Women's Health

leading · Claude Opus 4.5

Red Team

leading · Grok 4.1

[ why we exist ]

Most benchmarks stop at math, code, and PhD exams. We keep going.

We score the technical work too — coding, research, reasoning. But the work that actually fills your week is also the deck, the campaign, the landing page, the ad, and yes, the meme. That work is subjective, and most leaderboards pretend it doesn't exist. Vibecutter measures both — including the taste that goes in front of real humans, judged by real humans.

Hard + human

Coding and research alongside decks, emails, ads, and packaging — the full range of what you actually ask a model to do.

Judged by taste

Blind, head-to-head human voting from people who do this for a living. Vibes — but measured rigorously.

Culturally live

New models drop constantly. The full taste suite re-runs weekly so the rankings stay current — and quotable.

READ BY GROWTH & BRAND TEAMS AT

Northwind Labs·Forge AI·Quanta·Helio Research·Brightside·Cohort

[ the leaderboard ]

Who wins at what

Pick the work you actually do. The ranking re-sorts to the models with the most taste for it.

★ EDITORS' PICK

Best Overall

Claude Opus 4.5

Anthropic

The best all-rounder. Strong on the technical work and polished on the human-facing work — the safe default if you pick one.

Across every task, hard and subjective alike.

01 · Claude Opus 4.5 · Anthropic · ↑
02 · GPT-5.1 · OpenAI · ↑
03 · Gemini 3 Pro · Google · ↑
04 · Grok 4.1 · xAI · ↑
05 · Qwen3 Max · Alibaba · ↑
06 · DeepSeek V4 · DeepSeek · ↑
07 · Llama 4 Maverick · Meta · ↓
08 · Mistral Large 3 · Mistral · ↓

↑ rising · → steady · ↓ falling — vs. last week

[ the taste test ]

Could you tell which one has taste?

Same brief, two models, names hidden. Pick the better one — then see who wrote it and how the crowd voted. This is the test, in miniature.

BRIEF

Email Writing

Write the opening line of a Black Friday email for a small-batch coffee brand.

OUTPUT A

“We don’t do doorbusters. We do the best cup you’ll have all year — and for 48 hours, it’s 25% off.”

OUTPUT B

“Black Friday is HERE! Don’t miss our BIGGEST coffee sale EVER — shop now and save 25% on everything!!!”

Read both. Pick the one with better taste.

[ our picks ]

If you only read one thing

The verdicts we'd give a friend. No fence-sitting, no "it depends" — just the model we'd reach for.

BEST OVERALL TASTE

Claude Opus 4.5

Anthropic

The most consistently tasteful across decks, copy, and UI. Polished, on-brief, and the least "AI-looking" of the bunch.

BEST FOR MEMES

Grok 4.1

xAI

Format-literate and genuinely funny. Knows the reference, reads the room, and keeps the caption tight.

BEST FOR SLIDES

GPT-5.1

OpenAI

Builds a narrative, not a bullet dump. Real hierarchy, sane pacing, and a cover slide you’d actually present.

BEST FOR TRAVEL

Gemini 3 Pro

Google

Itineraries you’d actually follow — realistic pacing, real places, bookable detail. Not a generic “visit the old town” list.

BEST FOR LOGOS

Qwen3 Max

Alibaba

Marks with an actual idea behind them — not a clipart globe. Holds up shrunk to a favicon or blown up on a wall.

BEST FOR CODING

Claude Opus 4.5

Anthropic

Cleaner diffs and far fewer invented APIs — whether it’s a refactor or turning a one-line vibe into a working UI.

[ week 24 · the shifts ]

Who moved this week

The talkable part. Every Monday, the biggest jumps and drops in taste.

▲ CLIMBING

Grok 4.1 · Memes · +6
Gemini 3 Pro · Travel · +4
Claude Opus 4.5 · Email Writing · +3

▼ SLIPPING

Llama 4 Maverick · Coding · −5
Mistral Large 3 · Creative Ads · −4
Qwen3 Max · Copy Edits · −3

[ how we grade taste ]

Real briefs. Blind taste tests. Graded by people who ship.

Every model gets the same briefs sourced from working PMs, brand marketers, and founders — make the deck, write the email, design the label, cut the ad. Outputs go into blind, head-to-head votes judged by practitioners and a panel of taste-calibrated models. We re-run the whole suite every week and take no money from model makers.

Real briefs from working PMs & marketers · 16 domains, technical + everyday · human taste votes every week · no money from model makers, fully independent

[ word of mouth ]

The tab everyone keeps open

We stopped arguing about which model to use in Slack and just pinned the Vibecutter link.

Priya N.

Senior PM, fintech

Finally a leaderboard that knows the difference between “correct” and “good.”

Marcus L.

Brand Director, CPG

I check the movers every Monday. It’s the only AI newsletter I actually open.

Dana R.

Founder, design studio

[ the fine print ]

Questions, answered

Is this rigorous, or just vibes?

Vibes — measured. Every brief runs blind, head-to-head, scored by hundreds of human votes plus a calibrated judge-model panel. We report win-rates, not a number we made up.

Do you take money from model makers?

No. Never have. The rankings are the product, and the moment they’re for sale they’re worthless.

How do you score something subjective, like a logo?

Pairwise. Models never get an absolute grade — they compete two at a time and we rank by who wins more often, the same way taste actually works.
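
For the curious, here is a minimal sketch of what ranking by "who wins more often" could look like. It is an illustration under stated assumptions, not Vibecutter's actual pipeline: the function name `rank_by_win_rate`, the `(winner, loser)` vote format, and the placeholder model names are all hypothetical, and the votes are invented for the example.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by the share of head-to-head votes they won.

    `votes` is an iterable of (winner, loser) pairs, one pair per blind vote.
    Returns a list of (model, win_rate) tuples, best first.
    """
    wins = defaultdict(int)   # votes won per model
    games = defaultdict(int)  # votes a model took part in

    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1

    rates = {model: wins[model] / games[model] for model in games}

    # Highest win rate first; break ties by name so the order is stable.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical votes, invented for illustration only.
example_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ranking = rank_by_win_rate(example_votes)
for model, rate in ranking:
    print(f"{model}: {rate:.0%} of votes won")
```

A plain win rate is the simplest reading of "who wins more often." It ignores how strong each opponent was, so a production leaderboard might well use something more careful; this sketch only shows the pairwise idea.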

How often do the rankings change?

Weekly. The full suite re-runs every Monday, and new models are added within days of release.

Who is doing the grading?

Working PMs, designers, marketers, and editors — people who ship this work — plus a panel of taste-calibrated models for scale.

One email. Who's got taste this week.