Show HN: Rocky – Rust SQL engine with branches, replay, column lineage
Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing. The governance waveplan — column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies — landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.

The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG. Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph — dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.

A few things I think are interesting:

- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.
- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.
- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.
- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.
- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.
- Schema-grounded AI. Generated SQL goes through the compiler — AI suggestions type-check before they can land.

What Rocky isn't:

- Not a warehouse — it's the control plane on top.
- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.
- Not dbt Cloud — no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.

Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.

I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.
65 points by hugocorreia90 - 9 comments
The code is what it is. `cargo test --workspace` runs across 19 crates. CI on 5 platforms (macOS ARM/Intel, Linux x86/ARM, Windows). JSON output schemas are codegen-checked in CI so docs can't drift from the binary.
If you want to skip the marketing copy and look at engine reasoning instead: PR #240 (audit trail), #241 (column classification + masking), #270 (failed-source surfacing in discover).
I'd rather hear "the code is bad" than "the post sounds AI-written".
Workflows knows task A runs before task B. Rocky knows `dim_customer.email` flows from `raw_users.email_address` through three CTEs in `stg_customers`. Different layer, same word.
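The difference is easy to show with a toy lineage map: each model records which upstream (table, column) an output column was derived from, and tracing is just walking those hops. Only the `raw_users.email_address` to `dim_customer.email` endpoints come from the example above; the intermediate models and the whole data structure are invented for illustration:

```python
# Toy column-level lineage: each (table, column) maps to the upstream
# (table, column) it was derived from. The intermediate hops below are
# made up; only the two endpoints come from the example in the thread.
LINEAGE = {
    ("stg_customers", "email"): ("raw_users", "email_address"),
    ("int_customers", "email"): ("stg_customers", "email"),
    ("dim_customer", "email"): ("int_customers", "email"),
}

def trace(table: str, column: str) -> list[tuple[str, str]]:
    """Walk a column back to its root source, one hop per model/CTE.

    A task-level orchestrator only knows the model ordering; this is
    the extra per-column edge information a compiler can carry.
    """
    path = [(table, column)]
    while (table, column) in LINEAGE:
        table, column = LINEAGE[(table, column)]
        path.append((table, column))
    return path

print(trace("dim_customer", "email"))
```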
I'll be more careful with that framing.
- dbt-style semantic-layer versions (v1/v2 of a model)
- schema migration history
- branch-based (Rocky already has branches + replay)
Different design choice for each, so it helps to know which problem you're trying to solve.
ClickHouse is tractable through the Adapter SDK without engine patching. If you can share roughly your model count and workload shape, I can put a real timeline on it. Open to community PRs too.
https://news.ycombinator.com/item?id=47340079
Not saying yours are, but them -- dashes certainly look like it ;)
The schema-grounded AI feature is worth expanding on. The gap in most AI-generated SQL today is not syntax correctness - modern LLMs are surprisingly good at producing valid SQL - it's semantic correctness. The query parses and runs, but it joins on the wrong key or aggregates at the wrong grain because the model hallucinated a relationship that doesn't exist in the schema. Having a compiler that type-checks generated SQL against the actual DAG catches exactly this class of error.
This is relevant for the broader NL-to-SQL space too. Tools like ai2sql.io (https://ai2sql.io) that convert natural language to SQL produce structurally correct queries, but without schema validation the output can silently reference stale columns or missed foreign keys. A compile step that validates against the live graph would close that gap. Disclosure: I work on ai2sql.io.
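The validation step being described can be tiny in principle. Here's a toy sketch: check the tables and columns a generated query references against the actual schema before it lands. Everything here is invented for illustration (a real tool would parse the SQL itself rather than take a pre-extracted reference map, and these table names are not from any real system):

```python
# Toy schema check: validate the column references of an AI-generated
# query against the actual schema before it can land. The schema,
# table names, and query representation are all illustrative.
SCHEMA: dict[str, set[str]] = {
    "orders": {"order_id", "customer_id", "total"},
    "customers": {"customer_id", "email"},
}

def validate(refs: dict[str, set[str]]) -> list[str]:
    """Return errors for unknown tables or columns.

    This is exactly the class of error described above: the SQL parses
    fine, but it references a stale column or a hallucinated join key.
    """
    errors = []
    for table, cols in refs.items():
        if table not in SCHEMA:
            errors.append(f"unknown table: {table}")
            continue
        for col in sorted(cols - SCHEMA[table]):
            errors.append(f"unknown column: {table}.{col}")
    return errors

# A query that joins on a hallucinated key fails the check:
print(validate({"orders": {"order_id", "user_id"}, "customers": {"email"}}))
# -> ['unknown column: orders.user_id']
```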
Two questions on the branch/replay design: How does branch isolation work for stateful models (incremental, snapshot)? And for cost attribution - are you tracking warehouse compute cost only, or also cost of data scanned? On Snowflake the credit cost and scan volume can tell very different stories about query efficiency.
On the schema-grounded AI angle: agreed. The failure mode you describe — structurally valid SQL that joins on the wrong key or aggregates at the wrong grain because the model hallucinated a relationship — is exactly what the compiler is positioned to catch. AI-generated SQL runs through the type checker before it can land, so suggestions that don't validate against the actual DAG never reach the user. NL-to-SQL tools that integrate a compile step would close exactly the gap you're pointing at.
On your two questions:
1. Branch isolation for stateful models — mixed answer, and worth being honest about:
2. Cost attribution. Both bytes scanned and duration are captured per-model in the run record (`bytes_scanned` and duration on `RunRecord`). Budget gating today is on cost (USD) and duration — `max_usd` and `max_duration_ms` in `[budget]` blocks in `rocky.toml`, as independent thresholds. A direct bytes-scanned budget threshold isn't gateable today; the bytes are in the run record for analysis but you can't currently fail CI on "this run scanned more than N TB". Reasonable extension if there's demand.
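In the meantime, that analysis is scriptable from the run records themselves. A rough sketch of a CI-side stand-in for the missing bytes gate; only the `bytes_scanned` field name comes from the run record described above, while the list shape, the `duration_ms` spelling, and the numbers are assumptions:

```python
TB = 10**12

# Toy post-run analysis over per-model run records. Only bytes_scanned
# (and the idea of a per-model duration) come from the thread; the
# record shape and values here are made up for illustration.
runs = [
    {"model": "fct_orders", "bytes_scanned": 3 * TB, "duration_ms": 90_000},
    {"model": "dim_customer", "bytes_scanned": 40 * 10**9, "duration_ms": 240_000},
]

def over_scan(runs: list[dict], limit_bytes: int) -> list[str]:
    """Models whose run scanned more than limit_bytes.

    A script like this can fail CI on scan volume today, even though
    a native bytes-scanned budget threshold isn't gateable yet.
    """
    return [r["model"] for r in runs if r["bytes_scanned"] > limit_bytes]

print(over_scan(runs, 1 * TB))
# -> ['fct_orders']
```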