Evaluation Report · with control group

What does the DeepSeek-V4 distill actually add to a 9B Qwen?

eval by Kyle Hessling · model by Jackrong

To answer that question fairly, I ran the same eval suite twice on the same RTX 5090, same llama.cpp build, same Q5_K_M quant: once on Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash (the distill), and once on Qwen/Qwen3.5-9B (the official post-trained model that the distill was built from). Same parameter count, same architecture, same chat template family, same prompts. Two clear wins for the distill emerge: (1) it thinks tighter on long-horizon agentic tasks — the base hits the 8K thinking cap on 3 of 5 prompts, the distill finishes all 5 cleanly. (2) On open-ended creative front-end prompts, the distill ships output that punches above its parameter count — the base produces credible work on templated dashboards but struggles visibly on the designer portfolio and the mobile app marketing prompt: animation issues, overlay bugs, visually flatter design language. Tool-calling and throughput tie. Disk and VRAM are the same. There is no tax I could find.

0 / 5 · distill cap hits
3 / 5 · base cap hits
2.2× · longer agentic wall time on base
≈ tied · tool calling (5/6 each)
≈ tied · design quality
~143 tok/s · both models

Why this comparison is the right one

Reasoning distills usually carry a tax: you transplant a frontier model's chain-of-thought style into a smaller base, and you can buy reasoning polish at the cost of throughput, instruction-following, or quality at the seams. The right way to look for that tax is to compare against the same base the distill came from, at the same parameter count, on the same hardware. Comparing a 9B distill against a 27B abliterated finetune is the kind of comparison that gets you applause but not insight. This one is meant to surface signal.

Both runs used the same Hermes-style harness — agentic prompts at 8K cap with thinking on, design prompts at 32K cap, real OpenAI-style tools: [...] for the tool-calling section. Both at Q5_K_M, both at FP16 KV cache, both at 40K context. The only thing different is the model file.
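The tool list itself is elided above; for reference, a single OpenAI-style tool definition in that format looks roughly like this. This is a sketch: get_weather is a hypothetical tool for illustration, not one of the suite's actual tools.

```python
# Hypothetical OpenAI-style tool definition, in the JSON-schema shape the
# harness passes to the model. "get_weather" is illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {"type": "string", "enum": ["metric", "imperial"]},
            },
            "required": ["city"],
        },
    },
}
```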

Agentic reasoning: where the distill earns its keep

5 prompts, 8K thinking cap each. A cap hit means the model spent its entire token budget on thinking and never reached a final answer.

| Prompt | Base · tok | Base · s | Distill · tok | Distill · s | Notes |
| --- | --- | --- | --- | --- | --- |
| multi_step_planning | 8,000 | 54.9 | 2,899 | 20.3 | Base looped on the same plan; distill committed. |
| self_critique | 8,000 | 55.0 | 1,969 | 13.8 | Base regenerated the rewrite repeatedly. |
| structured_extraction | 8,000 | 55.0 | 4,353 | 30.5 | Base second-guessed JSON shapes; distill emitted it. |
| code_debug | 6,386 | 43.7 | 3,170 | 22.1 | Both found the bugs; distill in half the tokens. |
| tool_use_json | 756 | 5.3 | 1,415 | 10.0 | The one prompt where the base was tighter. |
| Totals | 31,142 | 213.9 s | 13,806 | 96.7 s | Distill: 2.2× faster wall time, < half the tokens, zero cap hits. |
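The totals row falls straight out of the per-prompt rows. A minimal sketch of the aggregation, with the row values copied from the table and everything else illustrative:

```python
# (base_tok, base_s, distill_tok, distill_s) per agentic prompt,
# copied from the table above.
rows = {
    "multi_step_planning":   (8000, 54.9, 2899, 20.3),
    "self_critique":         (8000, 55.0, 1969, 13.8),
    "structured_extraction": (8000, 55.0, 4353, 30.5),
    "code_debug":            (6386, 43.7, 3170, 22.1),
    "tool_use_json":         (756,   5.3, 1415, 10.0),
}
CAP = 8000  # thinking-token budget per prompt

base_tok    = sum(r[0] for r in rows.values())  # 31,142
distill_tok = sum(r[2] for r in rows.values())  # 13,806
base_s      = sum(r[1] for r in rows.values())  # 213.9 s
distill_s   = sum(r[3] for r in rows.values())  # 96.7 s

# A prompt whose thinking tokens reached the budget counts as a cap hit.
base_cap_hits    = sum(r[0] >= CAP for r in rows.values())  # 3
distill_cap_hits = sum(r[2] >= CAP for r in rows.values())  # 0
speedup = base_s / distill_s                                # ≈ 2.2×
```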

Three of the five agentic prompts hit the 8K thinking cap on the base model. None did on the distill. That's not a quality preference — that's the base model running out of budget before producing a final answer at all, in 60% of the agentic suite. The DeepSeek-V4 distill teaches a more decisive thinking style: less "wait, let me reconsider" padding, faster commit to a working answer.

One honest counter-example: tool_use_json went the other direction — base 756 tokens, distill 1,415. So the distill isn't categorically tighter; it's tighter on the open-ended tasks where the base spirals. On a prompt with a clear single-shape answer, both produce concise output.

Front-end design: the distill clearly pulls ahead on creative prompts

5 prompts, 32K cap. Both models produced valid self-contained HTML/CSS/JS. The numbers in the table look close — but the actual eye review of side-by-side outputs tells a sharper story than the token counts suggest. Open the A/B buttons on the design cards from the overview page and judge for yourself; this section captures what I saw.

| Prompt | Base · tok | Base · s | Distill · tok | Distill · s | Eye-review verdict |
| --- | --- | --- | --- | --- | --- |
| saas_landing | 9,849 | 67.9 | 15,347 | 109.1 | Distill richer; base credible. |
| analytics_dashboard | 13,187 | 91.4 | 13,032 | 92.6 | Close — both clean templated dashboards. |
| designer_portfolio | 5,930 | 40.7 | 6,213 | 43.7 | Distill clearly wins. Base flatter, less polish. |
| pricing_page | 9,503 | 65.4 | 8,367 | 58.9 | Close — a pricing page is a templated form. |
| mobile_app_marketing | 32,000 | 228.3 | 10,161 | 71.7 | Distill clearly wins. Base hit cap; output has animation + overlay bugs. |
| Totals | 70,469 | 493.8 s | 53,120 | 376.0 s | Distill: 24% fewer tokens, 24% less wall time, better creative output. |
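The percentages in the totals row can be checked in two lines, with the four totals copied from the table:

```python
# Token and wall-time reduction of the distill relative to the base,
# across the five design prompts. Totals copied from the table above.
base_tok, base_s       = 70469, 493.8
distill_tok, distill_s = 53120, 376.0

tok_saving  = 1 - distill_tok / base_tok  # ≈ 0.25 → "24% fewer tokens"
time_saving = 1 - distill_s / base_s      # ≈ 0.24 → "24% less wall time"
```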

The pattern: templated vs creative

The two prompts where the base holds its own — analytics_dashboard and pricing_page — are well-templated forms. Dashboards have a known shape: sidebar, KPI cards, chart, table. Pricing pages have three tiers and a toggle. The base 9B has clearly seen enough of those that it produces clean output without needing creative judgment.

The two prompts where the base falls short — designer_portfolio and mobile_app_marketing — ask the model to make taste-driven decisions. What does "kinetic typography" look like rendered? How do you draw a CSS-only iPhone with screen content inside it that doesn't look like a kid's homework? On both, the base produces something visually flat with overlay glitches and unfinished animation states. The distill produces output that, side-by-side, looks like it came out of a much larger model. One-shot, same prompt, same hardware.

The mobile app marketing case is the most striking. The base burned all 32K tokens (228 seconds), and the output it landed on still has overlapping z-index issues and broken hover transitions. The distill rendered a cohesive Stillwater landing page with a clean CSS phone mock in 10K tokens / 72 seconds, on less than a third of the budget, with the polish a much bigger model would produce.

It's the same pattern as agentic: when the task has a clear shape, both models do well. When the task is open-ended and requires committing to an aesthetic decision, the base dithers and the distill commits. The DeepSeek-V4 distill is teaching the model to have taste, not just to follow templates.

Tool calling: a dead tie

Both models scored 5 PASS, 1 PARTIAL on the same six prompts, and both were PARTIAL on the same prompt (multi_tool_sequence) for the same reason: the Hermes one-tool-per-turn protocol expects tool results back before the model continues the sequence. Tool calling is unchanged by the distill.

| Slug | Base | Distill |
| --- | --- | --- |
| single_tool_simple | PASS | PASS |
| tool_selection | PASS | PASS |
| multi_tool_sequence | PARTIAL | PARTIAL |
| no_tool_needed | PASS | PASS |
| complex_args | PASS | PASS |
| structured_email | PASS | PASS |

This is a useful negative result: the distill doesn't break tool calling. A common failure mode of reasoning distills is degraded instruction-following and structured output — that didn't happen here. The model still correctly emits <tool_call> blocks with the right argument shapes when asked.
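For context, a Hermes-style tool call is a block of JSON between <tool_call> tags in the completion. A minimal sketch of how a harness might extract and validate one; the report doesn't show the harness's actual parser, and get_weather is a hypothetical tool name:

```python
import json
import re

# Capture the JSON payload between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    # Parse every tool-call block; malformed JSON raises json.JSONDecodeError.
    return [json.loads(m) for m in TOOL_CALL_RE.findall(completion)]

reply = (
    "Let me check the weather first.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(reply)
```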

Throughput is identical

Same engine, same architecture, same quant — speed comes out the same. Both averaged 142–146 tok/s flat across all 16 main runs, standard deviation under 1 tok/s on each. The distill costs nothing in throughput.
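As a rough cross-check, dividing each model's agentic totals (tokens by wall time, from the agentic table above) lands both in that same band:

```python
# Aggregate tok/s implied by the agentic-suite totals; a coarse check,
# since per-run rates were measured separately.
base_rate    = 31142 / 213.9  # ≈ 145.6 tok/s
distill_rate = 13806 / 96.7   # ≈ 142.8 tok/s
```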

What the distill looks like on the inside

Aggregating reasoning trace size across the agentic suite tells a clean story:

The base writes 2.6× more thinking for the same problems. Skim a base agentic .thinking.txt next to a distill one (open agentic_*.txt from the index) and you can see the difference — the base often re-states the problem two or three times before working on it, double-checks intermediate steps that don't need it, and runs simulations of edge cases that won't fire. The distill goes more directly: state, plan, execute, verify, done.

Caveats

Setup

| Setting | Value |
| --- | --- |
| Distill | Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash · Q5_K_M (locally quantized, 6.1 GB) |
| Base | Qwen/Qwen3.5-9B · Q5_K_M from bartowski (6.4 GB) |
| Runtime | llama.cpp cuda-12.8 (b8708), --flash-attn on, --jinja, embedded chat templates |
| Context | 40,960 tokens, FP16 KV cache |
| Hardware | RTX 5090 (32 GB), Intel Core Ultra 7 265K, 125 GB RAM |
| Thinking | Always on for both models |

Bottom line

The DeepSeek-V4-Flash distill is a real upgrade over plain Qwen 3.5 9B in two specific places: long-horizon thinking (agentic / planning / self-critique / structured extraction) and creative one-shot front-end design. Tool-calling, throughput, disk, VRAM are all the same. Templated UI work is a wash. There's no tax I could find on the things the base is already good at.

If you were running base Qwen 3.5 9B as a local default, this distill is a free upgrade in the literal sense — same size, same speed, more reliable on the prompts where the base spirals, and visibly better one-shot output when the prompt asks for taste.
