Benchmarking 5 Local LLMs for Content Generation. Only One Survived.
Local LLMs are practical for content generation, legal document processing, and internal knowledge bases. I benchmarked five Qwen models on my MacBook Pro. Qwen 3 14B scored 91/100 avg vs 62 for Qwen 2.5 14B -- same size, dramatically better. Newer models performed worse.
There are good reasons to run language models on your own hardware instead of calling a cloud API. Maybe you are processing confidential legal documents and cannot send them to a third party. Maybe you are building an internal knowledge base from proprietary company data and need a model you can fine-tune without uploading trade secrets. Maybe you are generating content at scale and the API bill is starting to look like a mortgage payment.
For me, it started with tool reviews. At modern-datatools.com, I use local LLMs as part of my AI-assisted methodology for enhancing and validating structured content. The models run on my laptop via Ollama — no cloud APIs, no recurring costs, just whatever fits in 32 gigs of RAM. The same approach works for any repeatable text generation task: summarizing research papers, drafting product descriptions, generating documentation from code, or creating training materials from internal wikis.
The catch is that not all local models are equal, and the difference between "good enough" and "not even close" can be enormous. Last week I generated content for 19 new tools. Exactly one passed my quality threshold. That is a 5% success rate, which is the kind of number that makes you question your life choices.
So I ran a proper benchmark. What I found changed not just which model I use, but how I think about model selection entirely.
The Problem
The model I had been using was Qwen 2.5 14B, released by Alibaba in September 2024. It had been producing passable output for months when working with well-known subjects that have plenty of public information available. But "passable" turned out to mean "barely passing" — and when it started processing less common topics with thinner source material, "barely passing" became "not passing at all."
I score every piece of generated content on a 100-point scale across four dimensions:
| Sub-score | Max Points | What It Measures |
|---|---|---|
| Content Depth | 30 | Word count (need 1,200+), section completeness, detail in each section |
| Accuracy | 25 | Concrete facts and numbers, absence of hedging like "may offer" or "pricing not disclosed" |
| SEO & Structure | 20 | Required H2 sections, keyword placement, proper formatting |
| Specificity | 25 | Real details like dollar amounts, plan names, concrete comparisons |
Content scoring below 60 gets rejected. The single piece that passed scored 70, which is like getting a C-minus and celebrating because you did not fail. I needed a better model.
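As a minimal sketch of that gate (the depth heuristic here is a hypothetical stand-in, and the other three sub-scores are assumed to come from their own checkers, not shown):

```python
PASS_THRESHOLD = 60

def score_depth(text: str) -> int:
    # Hypothetical stand-in: full 30 points at 1,200+ words, scaled below that.
    words = len(text.split())
    return min(30, round(30 * words / 1200))

def score_article(text: str, accuracy: int, seo: int, specificity: int) -> dict:
    # accuracy / seo / specificity come from their own checkers (not shown).
    sub_scores = {
        "content_depth": score_depth(text),  # max 30
        "accuracy": accuracy,                # max 25
        "seo_structure": seo,                # max 20
        "specificity": specificity,          # max 25
    }
    total = sum(sub_scores.values())
    return {"total": total, "passed": total >= PASS_THRESHOLD, **sub_scores}
```

The useful property is that the same function gates both the benchmark and production output, so a benchmark score means exactly what a production score means.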
The Benchmark
Rather than blindly swapping models and regenerating everything each time (which takes 85 minutes and a lot of patience), I wrote a small benchmark script. It takes three representative topics with rich source data, generates a full article for each with each model, and scores them all with the same quality checker that gates my production output.
Three topics, multiple models, about 45 minutes of compute time. Much cheaper than trial and error.
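The benchmark loop itself is simple. A sketch, assuming Ollama's default HTTP endpoint and treating the prompt builder and scorer as injected functions (the real ones are specific to my pipeline):

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def generate(model: str, prompt: str) -> str:
    # Non-streaming generation via Ollama's HTTP API.
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=1800) as resp:
        return json.loads(resp.read())["response"]

def run_benchmark(models, topics, make_prompt, score, generate=generate):
    # Returns {model: {"per_topic": {...}, "avg": ...}}. `score` should be
    # the same quality checker that gates production output.
    results = {}
    for model in models:
        per_topic = {}
        for topic in topics:
            start = time.time()
            article = generate(model, make_prompt(topic))
            per_topic[topic] = {
                "score": score(article),
                "seconds": round(time.time() - start),
            }
        results[model] = {
            "per_topic": per_topic,
            "avg": sum(t["score"] for t in per_topic.values()) / len(per_topic),
        }
    return results
```

Passing `generate` in as a parameter also makes the loop trivial to test without a running Ollama server.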
I ran everything on my MacBook Pro with an M1 Pro chip, 10 cores, and 32 GB of RAM. No external GPU, no cloud instance. If a model cannot produce good output on a laptop, it is not practical for a workflow that needs to run regularly without babysitting.
Here are the five Qwen models I tested:
| Model | Parameters | Disk Size | Released |
|---|---|---|---|
| Qwen 2.5 14B | 14B | 9.0 GB | September 2024 |
| Qwen 2.5 32B | 32B | 19 GB | September 2024 |
| Qwen 3 14B | 14B | 9.3 GB | April 2025 |
| Qwen 3.5 9B | 9B | 6.6 GB | February 2026 |
| Qwen 3.5 27B | 27B | 18 GB | February 2026 |
The Results
Quality Scores
| Topic | Qwen 2.5 14B | Qwen 3 14B | Qwen 3.5 9B | Qwen 3.5 27B |
|---|---|---|---|---|
| Apache Superset | 60 | 92 | 59 | 38 |
| Temporal | 68 | 92 | 15 | killed |
| Structly | 58 | 89 | 87 | killed |
| Average | 62 | 91 | 54 | 38 |
Qwen 2.5 32B never finished a single article. After watching my laptop struggle for 25 minutes with no output, I killed it and deleted the model. Some things are not meant to run on consumer hardware.
Qwen 3.5 27B managed to complete one article in 24 minutes — and scored 38. That is worse than the model half its size. I killed it after the first attempt.
The real surprise was Qwen 3.5 9B. It scored 87 on one topic (impressive for a 9B model), then 59 on another, and then produced literally zero usable words on the third. Zero. The output was empty. I am still not sure what happened there, but "works great one third of the time" is not a quality I look for in production systems. I nearly deleted it from my machine right then — but more on that later.
Sub-Score Breakdown
The interesting part is where Qwen 3 14B wins. Here is the average across all three topics:
| Sub-score (max) | Qwen 2.5 14B | Qwen 3 14B |
|---|---|---|
| Content Depth (30) | ~15 | 30 |
| Accuracy (25) | ~17 | 21 |
| SEO & Structure (20) | ~15 | 20 |
| Specificity (25) | ~15 | 20 |
Qwen 3 14B hit a perfect 30/30 on content depth across all three topics. It consistently generated 1,600 to 1,900 words compared to 800 to 1,070 from Qwen 2.5 14B. The older model simply could not produce enough detail to fill out all required sections, and the quality checker penalized it accordingly.
Speed
| Model | Avg Time per Article | Practical? |
|---|---|---|
| Qwen 2.5 14B | ~3 min | Fast, but quality too low |
| Qwen 3 14B | ~6 min | Yes |
| Qwen 3.5 9B | ~8.5 min | Slower and inconsistent |
| Qwen 2.5 32B | 25+ min (incomplete) | No |
| Qwen 3.5 27B | 24+ min (incomplete) | No |
Qwen 3 14B takes about twice as long as Qwen 2.5 14B because it uses internal chain-of-thought reasoning before generating the actual output. Think of it as the model planning the article before writing it. For a batch of 20 articles, the difference is 60 minutes versus 120 minutes. I will happily trade an extra hour for output that actually passes quality.
After switching to Qwen 3 14B, the next production batch saw pass rates jump from 5% to roughly 90%. One config change, same hardware, same parameter count — the model was the only variable.
Why Newer Is Not Always Better
The Qwen 3.5 models were released in February 2026, ten months after Qwen 3. You would expect them to be strictly better. They are not, at least not for long-form content generation.
The issue is that Qwen 3.5 models allocate too many tokens to internal reasoning and not enough to the actual output. They "think" extensively but produce shorter, less detailed text. This is great for coding and math benchmarks where the answer is a few lines, but terrible for structured prose where you need 1,500+ words with concrete details in every section.
It is a useful reminder that LLM benchmarks like MMLU and HumanEval measure something, but they do not measure your specific task. A model that tops the leaderboard for code generation might produce mediocre articles. A model that writes brilliant essays might hallucinate numbers. The only benchmark that matters is the one you run on your own workload.
But this cuts both ways. A model that fails at one task might be exactly what you need for another.
The Right Model for the Right Job
Remember Qwen 3.5 9B — the model I nearly deleted? After the benchmark, I had a thought: the property that makes it bad at generating content — reasoning deeply and producing compact output — might make it excellent at a very different task.
The site has an AI agent that runs as part of a broader site audit pipeline. Its job is to act as an LLM judge: read every published article, evaluate it against a structured rubric, flag specific issues, and output a verdict. Here is what a typical output looks like:
```json
{
  "sub_scores": {
    "specificity": 22,
    "accuracy": 20,
    "completeness": 30,
    "readability": 13
  },
  "issues": [
    {
      "type": "FACTUAL_ISSUES",
      "message": "Pricing mentions $29/mo but current price is $49/mo"
    }
  ]
}
```
The auditor scores each page across four dimensions — specificity, accuracy, completeness, and readability — and flags concrete issues like placeholder text, factual errors, or thin sections. It processes hundreds of pages in batch, checkpointing results per page so it can resume if interrupted.
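A sketch of that checkpointing, with hypothetical names (one JSON file per page; an interrupted run picks up where it left off instead of re-judging finished pages):

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("audit_checkpoints")  # hypothetical location

def audit_pages(pages, judge):
    # `pages` maps page id -> article text; `judge` returns a verdict dict
    # like the JSON shown above. Finished pages are loaded from disk and
    # skipped, so the batch is resumable.
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    verdicts = {}
    for page_id, text in pages.items():
        ckpt = CHECKPOINT_DIR / f"{page_id}.json"
        if ckpt.exists():
            verdicts[page_id] = json.loads(ckpt.read_text())
            continue
        verdict = judge(text)
        ckpt.write_text(json.dumps(verdict))
        verdicts[page_id] = verdict
    return verdicts
```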
This is a fundamentally different task from content generation. The model does not need to produce 1,500 words of detailed prose. It needs to:
- Read and comprehend a full article (input-heavy)
- Evaluate it against a detailed rubric (reasoning-heavy)
- Identify specific issues with evidence (analysis-heavy)
- Output a structured verdict — under 600 tokens of JSON (output-light)
Heavy on reasoning, light on generation — the exact opposite of writing articles.
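Because the verdict is a small JSON object, the one bit of plumbing worth getting right is parsing it defensively. A sketch, assuming the four rubric fields shown earlier (local models occasionally wrap the JSON in prose):

```python
import json

REQUIRED_SUB_SCORES = {"specificity", "accuracy", "completeness", "readability"}

def parse_verdict(raw: str) -> dict:
    # Grab the outermost braces before parsing, then check that every
    # rubric field is present so a malformed verdict fails loudly.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in model output")
    verdict = json.loads(raw[start:end + 1])
    missing = REQUIRED_SUB_SCORES - set(verdict.get("sub_scores", {}))
    if missing:
        raise ValueError(f"verdict missing sub-scores: {sorted(missing)}")
    return verdict
```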
Here is how both models fit into the actual pipeline:
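As a hedged sketch of that split (the model tags are illustrative Ollama-style names, and `generate`, `audit`, and `score` stand in for the real pipeline functions):

```python
GENERATOR_MODEL = "qwen3:14b"   # long-form article generation
AUDITOR_MODEL = "qwen3.5:9b"    # hypothetical tag; compact JSON verdicts

def run_pipeline(topic, generate, audit, score, threshold=60):
    # Generate with the writer model, gate with the production checker,
    # then hand published pages to the judge model for auditing.
    article = generate(GENERATOR_MODEL, topic)
    if score(article) < threshold:
        return {"status": "rejected", "article": article}
    verdict = audit(AUDITOR_MODEL, article)
    return {"status": "published", "article": article, "verdict": verdict}
```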
Qwen 3.5 9B handles this well. Its tendency to reason extensively before producing output is an advantage when the job is to carefully analyze content and make judgment calls. The output is a compact JSON object, so the model's habit of producing shorter text does not matter. And at 6.6 GB versus 9.3 GB for Qwen 3 14B, it leaves more memory headroom during batch audits of hundreds of pages.
The same model that scored 15 on a content generation benchmark works reliably as a content judge. The task changed, and the model's fit changed with it.
This pattern — using one model for generation and a different one for validation — applies beyond my specific use case. Code review, compliance checking, document summarization, data extraction: any task where you need an LLM to reason about existing text rather than produce new text is a candidate for a reasoning-optimized model. If your use case is to generate content, you need one type of model. If your use case is to validate content and use an LLM as a judge, you need another.
Takeaways
Whether you are generating legal summaries from case files, building documentation from internal code, creating marketing copy from product specs, or writing structured reviews like I do, the conclusions are the same:
Model generation matters more than parameter count. Qwen 3 14B crushes Qwen 2.5 32B despite being less than half the size. Architecture improvements across model generations compound in ways that raw parameter scaling does not.
Match the model to the task, not the leaderboard. Qwen 3 14B excels at content generation because it produces long, detailed output. Qwen 3.5 9B excels at content auditing because it reasons deeply and outputs a compact verdict. The same architectural trait — heavy internal reasoning — is a liability for one task and an asset for another. Understand what your task demands before picking a model.
Benchmark on your actual workload. I could have stared at leaderboard scores all day and learned nothing useful. Forty-five minutes of benchmarking on my own tasks told me everything I needed to know — and a reusable benchmark script makes testing future models just as fast.
Consistency beats peak performance. Qwen 3.5 9B hit 87 on one topic and 15 on another for content generation. For automated pipelines, you need a model that delivers reliable results every time, not one that occasionally produces a masterpiece between disasters.
Local LLMs are production-ready. Six minutes per article at 91 average quality for generation. Reliable structured verdicts for auditing. Both running on a laptop with no API costs. The economics of local inference keep getting better, and the quality gap with cloud models keeps shrinking.
The bottleneck was never just the model. It was understanding what each model is actually good at.
Written by Egor Burlakov
Engineering and Science Leader with experience building scalable data infrastructure, data pipelines and science applications. Sharing insights about data tools, architecture patterns, and best practices.