/* global React, ReactDOM, parseModelResponse, formatThink, highlightSQL, DAGView, ResultTable, Icon, Step, useTweaks, TweaksPanel, TweakSection, TweakRadio, TweakToggle, lucide */
const { useState, useEffect, useMemo, useRef, useCallback } = React;
// ─── Backend API ──────────────────────────────────────────────
const API_BASE = ''; // same origin (FastAPI serves UI)
// Fetch the cached demo cases from the backend (same-origin FastAPI).
// Resolves with the parsed JSON payload; rejects on any non-2xx status.
async function apiFetchCases() {
  const resp = await fetch(`${API_BASE}/api/cases`);
  if (!resp.ok) {
    throw new Error(`/api/cases failed: ${resp.status}`);
  }
  return resp.json();
}
// Trigger a live model run on the backend for one case.
// model is forwarded as-is (default 'both'). Resolves with the refreshed
// case payload; rejects with the server's error text on non-2xx status.
async function apiRunLive(caseId, model = 'both') {
  const payload = { case_id: caseId, model };
  const resp = await fetch(`${API_BASE}/api/run`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (resp.ok) {
    return resp.json();
  }
  const detail = await resp.text();
  throw new Error(`run failed (${resp.status}): ${detail}`);
}
// ─── Stage states for run orchestration ───────────────────────
// Hook that drives the staged "run" animation for both model panes.
// Exposes the overall stage plus a per-model stage, and controls to
// start, reset, stop, or immediately finish the sequence.
// FIX: the original never cleared pending setTimeouts on unmount, so the
// scheduled setState calls could fire against an unmounted component and
// the timer handles leaked; a cleanup effect now cancels them.
function useRunOrchestrator() {
  const [stage, setStage] = useState('idle');           // overall run stage
  const [helixStage, setHelixStage] = useState('idle'); // per-pane stages
  const [sonnetStage, setSonnetStage] = useState('idle');
  // Pending setTimeout handles for the current animation, so they can be cancelled.
  const timersRef = useRef([]);
  // Cancel all pending stage transitions without touching current state.
  const stop = useCallback(() => {
    timersRef.current.forEach(clearTimeout);
    timersRef.current = [];
  }, []);
  // Clear any pending timers on unmount so no setState fires afterwards.
  useEffect(() => stop, [stop]);
  // Cancel timers and return every stage to 'idle'.
  const reset = useCallback(() => {
    stop();
    setStage('idle');
    setHelixStage('idle');
    setSonnetStage('idle');
  }, [stop]);
  // Start the animation. Latencies are in seconds; each pane's playback is
  // clamped to a watchable window and scaled by speedMul.
  const start = useCallback((helixLatency, sonnetLatency, speedMul = 1) => {
    stop();
    // Faster pacing — was 4.2-8.5s, now ~2.0-4.0s
    const playH = Math.min(4.0, Math.max(2.0, helixLatency / 8)) * speedMul;
    const playS = Math.min(3.4, Math.max(1.8, sonnetLatency / 8)) * speedMul;
    // Schedule one pane's stage sequence spread across durSec seconds.
    const seq = (durSec, setter) => {
      const ms = durSec * 1000;
      const t = (frac, val) => timersRef.current.push(setTimeout(() => setter(val), frac * ms));
      setter('reasoning');
      t(0.42, 'planning');
      t(0.60, 'coding');
      t(0.85, 'executing');
      t(1.0, 'done');
    };
    setStage('running');
    seq(playH, setHelixStage);
    seq(playS, setSonnetStage);
    // Overall stage flips to 'done' just after the slower pane finishes.
    const max = Math.max(playH, playS) * 1000 + 100;
    timersRef.current.push(setTimeout(() => setStage('done'), max));
  }, [stop]);
  // Jump straight to the finished state (used when a live API call returns).
  const finishImmediately = useCallback(() => {
    stop();
    setHelixStage('done');
    setSonnetStage('done');
    setStage('done');
  }, [stop]);
  return { stage, helixStage, sonnetStage, start, reset, stop, finishImmediately };
}
// Map the overall run stage to the visual state of one pipeline step.
// Returns 'idle' | 'done' | 'active' | 'pending'. Unknown stages (and
// 'idle') leave every step idle.
function stepStateFromStage(runStage, stepIdx) {
  const stageOrder = { idle: -1, reasoning: 0, planning: 1, coding: 2, executing: 3, done: 4 };
  const current = stageOrder[runStage] ?? -1;
  if (current < 0) return 'idle';
  if (stepIdx < current) return 'done';
  return stepIdx === current ? 'active' : 'pending';
}
// ─── Pane: per-model output ───────────────────────────────────
// One side of the side-by-side comparison: a single model's label/badge,
// live status line, and four collapsible steps (reasoning, DAG, SQL, result).
// NOTE(review): the JSX element tags in this function's return appear
// stripped/garbled in this view of the file — only the JS expressions are
// annotated; the markup structure could not be verified.
function ModelPane({ label, badgeKind, model, parsed, runStage, isWinner, openSteps, toggleStep, latencyShown }) {
// Visual state ('idle'|'done'|'active'|'pending') for step i at this run stage.
const stateFor = (i) => stepStateFromStage(runStage, i);
const finished = runStage === 'done';
const correct = model.correct;
// Verdict label is only shown once the run has finished.
const verdictText = !finished ? null : (correct ? 'Match' : 'Wrong');
return (
{label}
{badgeKind === 'fine-tune' ? 'Fine-tune' : 'Baseline'}
{runStage === 'idle' && Idle }
{(runStage !== 'idle' && runStage !== 'done') && (
<>
{runStage === 'reasoning' ? 'Reasoning' : runStage === 'planning' ? 'Planning' : runStage === 'coding' ? 'Generating SQL' : 'Executing'}
>
)}
{runStage === 'done' && (
<>
{correct
?
: }
{verdictText} · {latencyShown}s
>
)}
toggleStep(0)}>
{parsed.think
? {formatThink(parsed.think)}
:
— model skipped this block —
}
toggleStep(1)}>
{parsed.dag
?
: DAG block missing or unparseable.
}
toggleStep(2)}>
{highlightSQL(parsed.code)}
toggleStep(3)}>
{!finished
? Awaiting completion…
: (
<>
{correct ? 'Matches gold result' : 'Differs from gold'}
set-equal · {model.rows?.length ?? 0} rows
{model.exec_error && (
SQL error: {model.exec_error}
)}
>
)}
);
}
// ─── Hero ───────────────────────────────────────────────────
// Static hero/intro section: headline copy plus the three benchmark stat
// cards (Helix SFT, Sonnet baseline, Helix+RL) and the benchmark footnote.
// NOTE(review): the JSX element tags appear stripped/garbled in this view of
// the file; the text content below is preserved byte-for-byte.
function Hero() {
return (
Helix CodeDAG · Live model showcase
Plain English in. Working data pipelines out.
Ask a database question in plain English. Helix returns
three things :
the reasoning, a typed pipeline DAG, and CTE SQL ready to drop into DBT, Dagster, or Foundry.
Below, run our 32B specialist side by side with Sonnet 4.6 — real questions,
real SQLite, set-based row comparison. No mocks.
Helix · SFT (deployed)
50.1% EX +9.0
What you're running in this demo. A 32B specialist on a couple of H100s, on-prem.
Sonnet 4.6 · CTE+DAG
41.1% EX
Frontier baseline, zero-shot, same prompt.
Helix · +RL stage
53.3% EX +12.2
Same model after reinforcement learning with execution rewards.
Benchmark · 1,432 BIRD-dev samples · 11 SQLite databases · set-based row comparison.
);
}
// Collapsible schema viewer: shows a table count summary and toggles the
// full DDL text on click.
// NOTE(review): the JSX element tags appear stripped/garbled in this view of
// the file; only the JS logic is annotated.
function SchemaBlock({ ddl }) {
const [open, setOpen] = useState(false);
// Count CREATE TABLE statements (case-insensitive) to summarize schema size.
const tableCount = (ddl.match(/CREATE TABLE/gi) || []).length;
return (
setOpen(o => !o)}>
Schema
{tableCount} tables · click to {open ? 'hide' : 'inspect'}
{open &&
{ddl} }
);
}
// Post-run summary card: one-line narrative for the four correctness
// combinations, plus each model's ✓/✗, latency, and row count.
// NOTE(review): the JSX element tags appear stripped/garbled in this view of
// the file; only the JS logic is annotated.
function VerdictCard({ caseData }) {
const helixOK = caseData.helix.correct;
const sonnetOK = caseData.sonnet.correct;
// Pick the narrative matching this (helixOK, sonnetOK) combination.
let summary;
if (helixOK && !sonnetOK) summary = <>Helix got it right on a question Sonnet missed.>;
else if (!helixOK && sonnetOK) summary = <>Sonnet won this round. We keep the failing cases honest.>;
else if (helixOK && sonnetOK) summary = <>Both correct . Both pipelines match the gold rows.>;
else summary = <>Both missed this one — uncommon edge case.>;
return (
{summary}
Helix · Result
{helixOK ? '✓' : '✗'}
{caseData.helix.latency_s}s · {caseData.helix.rows?.length ?? 0} rows
Sonnet · Result
{sonnetOK ? '✓' : '✗'}
{caseData.sonnet.latency_s}s · {caseData.sonnet.rows?.length ?? 0} rows
);
}
// ─── Main App ───────────────────────────────────────────────
// Root component: loads the cached cases from the backend, orchestrates the
// staged replay animation (or a real live run against Foundry), and renders
// the question picker, the side-by-side comparison, and the tweaks panel.
// NOTE(review): the JSX element tags in the return appear stripped/garbled in
// this view of the file; only the JS logic is annotated.
function App() {
const [tweaks, setTweak] = useTweaks(/*EDITMODE-BEGIN*/{
"speed": "normal",
"showSchema": true,
"openAllSteps": false,
"liveApi": false,
"theme": "light"
}/*EDITMODE-END*/);
// sync html[data-theme] whenever the tweak changes
useEffect(() => {
document.documentElement.setAttribute('data-theme', tweaks.theme === 'dark' ? 'dark' : 'light');
}, [tweaks.theme]);
// Case payloads from /api/cases; null until the initial fetch resolves.
const [cases, setCases] = useState(null);
const [loadError, setLoadError] = useState(null);
const [caseIdx, setCaseIdx] = useState(0); // default to the first ⭐ Helix-wins case
const [runError, setRunError] = useState(null);
const [runningLive, setRunningLive] = useState(false);
const orch = useRunOrchestrator();
const { helixStage, sonnetStage, stage, start, reset, finishImmediately } = orch;
// Open/closed state for each pane's four collapsible steps.
const initialOpen = [true, true, true, true];
const [helixOpen, setHelixOpen] = useState(initialOpen);
const [sonnetOpen, setSonnetOpen] = useState(initialOpen);
// Toggle a single step in one pane (returns a fresh array each time).
const toggleH = (i) => setHelixOpen(s => s.map((v, j) => j === i ? !v : v));
const toggleS = (i) => setSonnetOpen(s => s.map((v, j) => j === i ? !v : v));
// initial load
useEffect(() => {
apiFetchCases()
.then(setCases)
.catch(e => setLoadError(e.message));
}, []);
// reflow lucide icons whenever DOM updates (must run on every render — no early-return above)
useEffect(() => { if (window.lucide) window.lucide.createIcons(); });
if (loadError) {
return Failed to load cases: {loadError}
;
}
if (!cases) {
return Loading…
;
}
const caseData = cases[caseIdx];
// Pre-parsed response sections for each pane (think / DAG / SQL).
const helixParsed = { think: caseData.helix.think, dag: caseData.helix.dag, code: caseData.helix.code };
const sonnetParsed = { think: caseData.sonnet.think, dag: caseData.sonnet.dag, code: caseData.sonnet.code };
// Select a different case: clear errors, reset the animation, re-open steps.
const switchCase = (i) => {
setCaseIdx(i);
setRunError(null);
reset();
setHelixOpen(initialOpen);
setSonnetOpen(initialOpen);
};
// Animation speed multiplier driven by the 'speed' tweak.
const speedMul = tweaks.speed === 'fast' ? 0.5 : tweaks.speed === 'slow' ? 1.6 : 1;
// Run button handler: acts as Reset when a run is in progress/finished,
// otherwise starts either a live API run or a cached replay.
// NOTE(review): `[...cases]` below captures `cases` from this render's
// closure; if cases changed while the live request was in flight, this
// write could clobber it — appears safe while only one run can be active,
// but confirm before allowing concurrent runs.
const handleRun = async () => {
if (stage !== 'idle') {
reset();
setRunError(null);
return;
}
setRunError(null);
if (tweaks.liveApi) {
setRunningLive(true);
start(caseData.helix.latency_s || 12, caseData.sonnet.latency_s || 9, speedMul);
try {
const fresh = await apiRunLive(caseData.id, 'both');
const next = [...cases];
next[caseIdx] = fresh;
setCases(next);
finishImmediately();
} catch (e) {
setRunError(e.message);
reset();
} finally {
setRunningLive(false);
}
} else {
start(caseData.helix.latency_s, caseData.sonnet.latency_s, speedMul);
}
};
// Winner banner: only when exactly one model is correct, after completion.
const winner = stage === 'done' ?
(caseData.helix.correct && !caseData.sonnet.correct ? 'helix'
: caseData.sonnet.correct && !caseData.helix.correct ? 'sonnet' : null)
: null;
return (
<>
Helix · CodeDAG
{tweaks.liveApi && LIVE }
01 · Pick a question
Four real questions, three real databases.
Each case is a natural-language question against a real SQLite database.
The two starred ⭐ cases are where Helix gets it right and Sonnet doesn't.
On the other two, both models land the gold rows.
Live model: helix-codedag-sft-32b-v0 · SFT release at 50.1% EX .
{cases.map((c, i) => {
const star = c.highlight && c.highlight.startsWith('⭐');
return (
switchCase(i)}>
{c.id}
{star && Helix wins }
{c.name}
);
})}
Selected · {caseData.id}
{caseData.query}
{caseData.db_id}.sqlite
expects {caseData.gold_row_count} {caseData.gold_row_count === 1 ? 'row' : 'rows'}
{caseData.evidence && hint provided }
{caseData.evidence && (
Hint: {caseData.evidence}
)}
{runError && (
{runError}
)}
{stage === 'idle' && <> {tweaks.liveApi ? 'Run live on Foundry' : 'Replay cached run'}>}
{stage === 'running' && <> Running…>}
{stage === 'done' && <> Reset & re-run>}
{tweaks.showSchema && }
02 · Live comparison
Same prompt. Same database. Two models.
{stage === 'idle' ? (
Pick a question above, then hit Run .
Both models will generate a pipeline DAG, write CTE SQL, and execute against the real database side by side.
) : (
<>
{stage === 'done' && }
{stage === 'done' && (
Gold reference SQL
{caseData.gold_sql}
)}
>
)}
03 · Why this works
A specialist, verified by execution.
Helix isn't a generalist that happens to write SQL. It's tuned for one job — turning questions
into deployable data pipelines — and every training sample was proven correct against
a real database before training began.
i.
Built for pipelines, not chat.
One job, tuned end-to-end: think → DAG → CTE SQL . Every DAG node maps 1:1 to a CTE — drop the output straight into DBT, Dagster, or Foundry with lineage built in.
ii.
Every sample, proven correct.
11,086 training samples from BIRD and Spider. Each one's SQL was run against a real database and only kept if the rows matched the gold result. No "looks reasonable" data ever reached training.
iii.
Two training stages.
The deployed model is the SFT stage — 50.1% EX, already +9 over Sonnet. Layering reinforcement learning on top, with an execution-based reward, adds another +3.2 points (53.3% ). Same prompt. Same on-prem footprint. Better correctness.
setTweak('liveApi', v)}/>
setTweak('speed', v)}
options={[
{ value: 'fast', label: 'Fast' },
{ value: 'normal', label: 'Normal' },
{ value: 'slow', label: 'Slow' },
]}/>
setTweak('theme', v)}
options={[
{ value: 'light', label: 'Light' },
{ value: 'dark', label: 'Dark' },
]}/>
setTweak('showSchema', v)}/>
{
setTweak('openAllSteps', v);
setHelixOpen([v, v, v, v]);
setSonnetOpen([v, v, v, v]);
}}/>
>
);
}
// Mount the app into #root (UMD React 18 API, no bundler).
// NOTE(review): the element argument to render() appears stripped in this
// view of the file — presumably the <App /> element; confirm against source.
ReactDOM.createRoot(document.getElementById('root')).render( );