Evaluation Report · with control group

What does the DeepSeek-V4 distill actually add to a 9B Qwen?

eval by Kyle Hessling · model by Jackrong

To answer that question fairly, I ran the same eval suite twice on the same RTX 5090, same llama.cpp build, same Q5_K_M quant: once on Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash (the distill), and once on Qwen/Qwen3.5-9B (the official post-trained model that the distill was built from). Same parameter count, same architecture, same chat template family, same prompts. Two clear wins for the distill emerge: (1) it thinks tighter on long-horizon agentic tasks — the base hits the 8K thinking cap on 3 of 5 prompts, the distill finishes all 5 cleanly. (2) On open-ended creative front-end prompts, the distill ships output that punches above its parameter count — the base produces credible work on templated dashboards but struggles visibly on the designer portfolio and the mobile app marketing prompt: animation issues, overlay bugs, visually flatter design language. Tool-calling and throughput tie. Disk and VRAM are the same. There is no tax I could find.

0 / 5 · distill cap hits
3 / 5 · base cap hits
2.2× · longer agentic wall time on base
≈ tied · tool calling (5/6 each)
≈ tied · design quality
~143 tok/s · both models

Why this comparison is the right one

Reasoning distills usually carry a tax: you transplant a frontier model's chain-of-thought style into a smaller base, and you can buy reasoning polish at the cost of throughput, instruction-following, or quality at the seams. The right way to look for that tax is to compare against the same base the distill came from, at the same parameter count, on the same hardware. Comparing a 9B distill against a 27B abliterated finetune is the kind of comparison that gets you applause but not insight. This one is meant to surface signal.

Both runs used the same Hermes-style harness — agentic prompts at 8K cap with thinking on, design prompts at 32K cap, real OpenAI-style tools: [...] for the tool-calling section. Both at Q5_K_M, both at FP16 KV cache, both at 40K context. The only thing different is the model file.
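The tool list itself is elided above; for reference, a single OpenAI-style tool definition in that format looks roughly like this. This is a sketch: get_weather is a hypothetical tool for illustration, not one of the suite's actual tools.

```python
# Hypothetical OpenAI-style tool definition, in the JSON-schema shape the
# harness passes to the model. "get_weather" is illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {"type": "string", "enum": ["metric", "imperial"]},
            },
            "required": ["city"],
        },
    },
}
```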

Agentic reasoning: where the distill earns its keep

5 prompts, 8K thinking cap each. A cap hit means the model spent its entire token budget on thinking and never reached a final answer.

| Prompt | Base · tok | Base · s | Distill · tok | Distill · s | Notes |
| --- | --- | --- | --- | --- | --- |
| multi_step_planning | 8,000 | 54.9 | 2,899 | 20.3 | Base looped on the same plan; distill committed. |
| self_critique | 8,000 | 55.0 | 1,969 | 13.8 | Base regenerated the rewrite repeatedly. |
| structured_extraction | 8,000 | 55.0 | 4,353 | 30.5 | Base second-guessed JSON shapes; distill emitted it. |
| code_debug | 6,386 | 43.7 | 3,170 | 22.1 | Both found the bugs; distill in half the tokens. |
| tool_use_json | 756 | 5.3 | 1,415 | 10.0 | The one prompt where the base was tighter. |
| Totals | 31,142 | 213.9 s | 13,806 | 96.7 s | Distill: 2.2× faster wall time, < half the tokens, zero cap hits. |
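The totals row falls straight out of the per-prompt rows. A minimal sketch of the aggregation, with the row values copied from the table and everything else illustrative:

```python
# (base_tok, base_s, distill_tok, distill_s) per agentic prompt,
# copied from the table above.
rows = {
    "multi_step_planning":   (8000, 54.9, 2899, 20.3),
    "self_critique":         (8000, 55.0, 1969, 13.8),
    "structured_extraction": (8000, 55.0, 4353, 30.5),
    "code_debug":            (6386, 43.7, 3170, 22.1),
    "tool_use_json":         (756,   5.3, 1415, 10.0),
}
CAP = 8000  # thinking-token budget per prompt

base_tok    = sum(r[0] for r in rows.values())  # 31,142
distill_tok = sum(r[2] for r in rows.values())  # 13,806
base_s      = sum(r[1] for r in rows.values())  # 213.9 s
distill_s   = sum(r[3] for r in rows.values())  # 96.7 s

# A prompt whose thinking tokens reached the budget counts as a cap hit.
base_cap_hits    = sum(r[0] >= CAP for r in rows.values())  # 3
distill_cap_hits = sum(r[2] >= CAP for r in rows.values())  # 0
speedup = base_s / distill_s                                # ≈ 2.2×
```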

Three of the five agentic prompts hit the 8K thinking cap on the base model. None did on the distill. That's not a quality preference — that's the base model running out of budget before producing a final answer at all, in 60% of the agentic suite. The DeepSeek-V4 distill teaches a more decisive thinking style: less "wait, let me reconsider" padding, faster commit to a working answer.

One honest counter-example: tool_use_json went the other direction — base 756 tokens, distill 1,415. So the distill isn't categorically tighter; it's tighter on the open-ended tasks where the base spirals. On a prompt with a clear single-shape answer, both produce concise output.

Front-end design: the distill clearly pulls ahead on creative prompts

5 prompts, 32K cap. Both models produced valid self-contained HTML/CSS/JS. The numbers in the table look close — but the actual eye review of side-by-side outputs tells a sharper story than the token counts suggest. Open the A/B buttons on the design cards from the overview page and judge for yourself; this section captures what I saw.

| Prompt | Base · tok | Base · s | Distill · tok | Distill · s | Eye-review verdict |
| --- | --- | --- | --- | --- | --- |
| saas_landing | 9,849 | 67.9 | 15,347 | 109.1 | Distill richer; base credible. |
| analytics_dashboard | 13,187 | 91.4 | 13,032 | 92.6 | Close — both clean templated dashboards. |
| designer_portfolio | 5,930 | 40.7 | 6,213 | 43.7 | Distill clearly wins. Base flatter, less polish. |
| pricing_page | 9,503 | 65.4 | 8,367 | 58.9 | Close — a pricing page is a templated form. |
| mobile_app_marketing | 32,000 | 228.3 | 10,161 | 71.7 | Distill clearly wins. Base hit cap; output has animation + overlay bugs. |
| Totals | 70,469 | 493.8 s | 53,120 | 376.0 s | Distill: 24% fewer tokens, 24% less wall time, better creative output. |
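The percentages in the totals row can be checked in two lines, with the four totals copied from the table:

```python
# Token and wall-time reduction of the distill relative to the base,
# across the five design prompts. Totals copied from the table above.
base_tok, base_s       = 70469, 493.8
distill_tok, distill_s = 53120, 376.0

tok_saving  = 1 - distill_tok / base_tok  # ≈ 0.25 → "24% fewer tokens"
time_saving = 1 - distill_s / base_s      # ≈ 0.24 → "24% less wall time"
```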

The pattern: templated vs creative

The two prompts where the base holds its own — analytics_dashboard and pricing_page — are well-templated forms. Dashboards have a known shape: sidebar, KPI cards, chart, table. Pricing pages have three tiers and a toggle. The base 9B has clearly seen enough of those that it produces clean output without needing creative judgment.

The two prompts where the base falls short — designer_portfolio and mobile_app_marketing — ask the model to make taste-driven decisions. What does "kinetic typography" look like rendered? How do you draw a CSS-only iPhone with screen content inside it that doesn't look like a kid's homework? On both, the base produces something visually flat with overlay glitches and unfinished animation states. The distill produces output that, side-by-side, looks like it came out of a much larger model. One-shot, same prompt, same hardware.

The mobile app marketing case is the most striking. The base burned all 32K tokens (228 seconds), and the output it landed on still has overlapping z-index issues and broken hover transitions. The distill rendered a cohesive Stillwater landing page with a clean CSS phone mock in 10K tokens / 72 seconds, on less than a third of the budget, with the polish a much bigger model would produce.

It's the same pattern as agentic: when the task has a clear shape, both models do well. When the task is open-ended and requires committing to an aesthetic decision, the base dithers and the distill commits. The DeepSeek-V4 distill is teaching the model to have taste, not just to follow templates.

Tool calling: a dead tie

Both models scored 5 PASS, 1 PARTIAL on the same six prompts, and both were PARTIAL on the same prompt (multi_tool_sequence) for the same reason: the Hermes one-tool-per-turn protocol expects tool results back before the model continues the sequence. Tool calling is unchanged by the distill.

| Slug | Base | Distill |
| --- | --- | --- |
| single_tool_simple | PASS | PASS |
| tool_selection | PASS | PASS |
| multi_tool_sequence | PARTIAL | PARTIAL |
| no_tool_needed | PASS | PASS |
| complex_args | PASS | PASS |
| structured_email | PASS | PASS |

This is a useful negative result: the distill doesn't break tool calling. A common failure mode of reasoning distills is degraded instruction-following and structured output — that didn't happen here. The model still correctly emits <tool_call> blocks with the right argument shapes when asked.
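For context, a Hermes-style tool call is a block of JSON between <tool_call> tags in the completion. A minimal sketch of how a harness might extract and validate one; the report doesn't show the harness's actual parser, and get_weather is a hypothetical tool name:

```python
import json
import re

# Capture the JSON payload between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    # Parse every tool-call block; malformed JSON raises json.JSONDecodeError.
    return [json.loads(m) for m in TOOL_CALL_RE.findall(completion)]

reply = (
    "Let me check the weather first.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(reply)
```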

Throughput is identical

Same engine, same architecture, same quant — speed comes out the same. Both averaged 142–146 tok/s flat across all 16 main runs, standard deviation under 1 tok/s on each. The distill costs nothing in throughput.
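As a rough cross-check, dividing each model's agentic totals (tokens by wall time, from the agentic table above) lands both in that same band:

```python
# Aggregate tok/s implied by the agentic-suite totals; a coarse check,
# since per-run rates were measured separately.
base_rate    = 31142 / 213.9  # ≈ 145.6 tok/s
distill_rate = 13806 / 96.7   # ≈ 142.8 tok/s
```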

What the distill looks like on the inside

Aggregating reasoning trace size across the agentic suite tells a clean story:

The base writes 2.6× more thinking for the same problems. Skim a base agentic .thinking.txt next to a distill one (open agentic_*.txt from the index) and you can see the difference — the base often re-states the problem two or three times before working on it, double-checks intermediate steps that don't need it, and runs simulations of edge cases that won't fire. The distill goes more directly: state, plan, execute, verify, done.

Caveats

Setup

| Setting | Value |
| --- | --- |
| Distill | Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash · Q5_K_M (locally quantized, 6.1 GB) |
| Base | Qwen/Qwen3.5-9B · Q5_K_M from bartowski (6.4 GB) |
| Runtime | llama.cpp cuda-12.8 (b8708), --flash-attn on, --jinja, embedded chat templates |
| Context | 40,960 tokens, FP16 KV cache |
| Hardware | RTX 5090 (32 GB), Intel Core Ultra 7 265K, 125 GB RAM |
| Thinking | Always on for both models |

Bottom line

The DeepSeek-V4-Flash distill is a real upgrade over plain Qwen 3.5 9B in two specific places: long-horizon thinking (agentic / planning / self-critique / structured extraction) and creative one-shot front-end design. Tool-calling, throughput, disk, VRAM are all the same. Templated UI work is a wash. There's no tax I could find on the things the base is already good at.

If you were running base Qwen 3.5 9B as a local default, this distill is a free upgrade in the literal sense — same size, same speed, more reliable on the prompts where the base spirals, and visibly better one-shot output when the prompt asks for taste.
