-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- If the files don't have the same schema, union_by_name is amazing (see the sketch after this list).
-- The .csv parser is amazing. It auto-assigns types well.
-- It's small! The WebAssembly version is 2mb! The CLI is 16mb.
-- Because it is small you can add duckdb directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
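To make the glob and union_by_name points concrete, here's a minimal sketch using DuckDB's Python package (the tsa20*.csv pattern is just the placeholder file name from the list above):

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Glob read: hundreds of CSV files queried as if they were one table.
con.sql("SELECT count(*) AS n FROM 'tsa20*.csv'").show()

# If the files don't all share the same schema, align them by column name instead of position.
con.sql("""
    SELECT *
    FROM read_csv('tsa20*.csv', union_by_name = true)
    LIMIT 10
""").show()
```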
skeeter2020 [3 hidden]5 mins ago
>> The .csv parser is amazing
Their csv support coupled with lots of functions and fast & easy iterative data discovery has totally changed how I approach investigation problems. I used to focus a significant amount of time on understanding the underlying schema of the problem space first, and often there really wasn't one - but you didn't find out easily. Now I start with pulling in data, writing exploratory queries to validate my assumptions, then cleaning & transforming data and creating new tables from that state; rinse and repeat. Aside from getting much deeper much quicker, you also hit dead ends sooner, saving a lot of otherwise wasted time.
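Roughly what that rinse-and-repeat loop looks like with DuckDB's Python API (a sketch; the file and column names here are invented):

```python
import duckdb

con = duckdb.connect("explore.duckdb")

# Pull the raw data in first, without worrying about the schema.
con.sql("CREATE OR REPLACE TABLE raw AS SELECT * FROM read_csv('dump.csv', all_varchar = true)")

# Exploratory queries to validate assumptions about the data.
con.sql("SUMMARIZE raw").show()
con.sql("SELECT status, count(*) AS n FROM raw GROUP BY status ORDER BY n DESC").show()

# Clean and transform into a new table; rinse and repeat.
con.sql("""
    CREATE OR REPLACE TABLE cleaned AS
    SELECT trim(id) AS id, try_cast(amount AS DECIMAL(12, 2)) AS amount
    FROM raw
    WHERE id IS NOT NULL
""")
```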
There's an interesting paper out there on how the CSV parser works, and some ideas for future enhancements. I couldn't seem to find it but maybe someone else can?
Been playing around with Clickhouse a lot recently and have had a great experience particularly because it hits many of these same points. In my case the "local files" hasn't been a huge fixture but the Parquet and JSON ingestion have been very convenient and I think CH intends for `clickhouse-local` to be some sort of analog to the "add duckdb" point.
One of my favorite features is `SELECT ... FROM s3Cluster('<ch cluster>', 'https://...<s3 url>.../data//.json', ..., 'JSON')`[0] which lets you wildcard ingest from an S3 bucket and distributes the processing across nodes in your configured cluster. Also, I think it works with `schema_inference_mode` (mentioned below) though I haven't tried it. Very cool time for databases / DB tooling.
(I actually wasn't familiar with `union_by_name` but it looks like ClickHouse has implemented that as well [1,2]. Neat feature in either case!)
[0] https://clickhouse.com/docs/sql-reference/table-functions/s3...
[1] https://clickhouse.com/docs/interfaces/schema-inference
[2] https://github.com/ClickHouse/ClickHouse/pull/55892
This is a great sell. I have this annoyingly manual approach with a SQLite import and so on. This is great. Thank you!
exographicskip [3 hidden]5 mins ago
Thanks for the excellent comment! Now excuse me while I go export my spotify history to play around with duckdb <3
mrtimo [3 hidden]5 mins ago
Spotify says it will take 30 days for the export... it really only takes about 48 hours if I remember correctly.
While you wait for the download here is an example listening history exploration in malloy - I converted the listening history to .parquet: https://github.com/mrtimo/spotify-listening-history
It is also more difficult to customize compared to sqlite, so for example if you want to use your own parser for csv then it becomes hard.
But yes it provides a lot of convenience out of the box, as you have already listed.
nylonstrung [3 hidden]5 mins ago
DuckDB is awesome.
If you want its power as a query engine but like to write Python instead of SQL, I highly recommend using it as a backend for the Ibis dataframe library.
It lets you interchange Pythonic dataframe syntax (like Pandas and Polars) with SQL, and it 'compiles' down to SQL in the DuckDB dialect.
And you can use those queries interchangeably against postgres, sqlite, polars, spark, etc.
https://ibis-project.org/
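For example, a small sketch with Ibis on a DuckDB backend (table, column, and file names invented):

```python
import ibis

con = ibis.duckdb.connect()                  # DuckDB as the execution engine
orders = con.read_parquet("orders.parquet")  # invented file

expr = (
    orders.filter(orders.status == "paid")
          .group_by("customer_id")
          .agg(total=orders.amount.sum())
)

print(ibis.to_sql(expr))   # the DuckDB-dialect SQL it compiles to
print(expr.to_pandas())    # or execute and get a dataframe back
```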
I work with scientists who research BC's coastal environment, from airborne observation of glaciers to autonomous drones in the deep sea. We've got heaps of data.
A while back I took a leap of faith with DuckDB as the data-processing engine for a new tool we're using to transform and validate biodiversity data. The goal is to take heaps of existing datasets and convert them to valid Darwin Core data. Keyword being valid.
DuckDB is such an incredible tool in this context. Essentially I dynamically build duckdb tables from schemas describing the data, then import it into the tables. If it fails, it explains why on a row-by-row basis (as far as it's able to, at least). Once the raw data is in, transformations can occur. This is accomplished entirely in DuckDB as well. Finally, validations are performed using application-layer logic if the transformation alone isn't assurance enough.
I've managed to build an application that's way faster, way more capable, and much easier to build than I expected. And it's portable! I think I can get the entire core running in a browser. Field researchers could run this on an iPad in a browser, offline!
This is incredible to me. I've had so much fun learning to use DuckDB better. It's probably my favourite discovery in a couple of years.
And yeah, this totally could have been done any number of different ways. I had prototypes which took much different routes. But the cool part here is I can trust DuckDB to do a ton of heavy lifting. It comes with the cost of some things happening in SQL that I'd prefer it didn't sometimes, but I'm content with that tradeoff. In cases where I'm missing application-layer type safety, I use parsing and tests to ensure my DB abstractions are doing what I expect. It works really well!
edit: For anyone curious, the point of this project is to allow scientists to analyze biodiversity and genomic data more easily using common rather than bespoke tools, as well as publish it to public repositories. Publishing is a major pain point because people in the field typically work very far from the Darwin Core spec :) I'm very excited to polish it a bit and get it in the hands of other organizations.
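Not the actual code, but a hypothetical sketch of that "build the table from a schema, import, and let DuckDB explain failures" flow, assuming the Python client:

```python
import duckdb

# Invented schema; in practice this would come from the dataset's schema description.
schema = {"scientific_name": "VARCHAR", "individual_count": "INTEGER", "event_date": "DATE"}

con = duckdb.connect("darwin_core.duckdb")
columns = ", ".join(f"{name} {dtype}" for name, dtype in schema.items())
con.sql(f"CREATE OR REPLACE TABLE occurrences ({columns})")

try:
    # Cast failures raise errors that point at the offending value and row.
    con.sql("INSERT INTO occurrences SELECT * FROM read_csv('raw_occurrences.csv')")
except duckdb.Error as e:
    print(f"Import failed: {e}")
```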
owlstuffing [3 hidden]5 mins ago
100% agree.
> Writing SQL code
Language integration is paramount for med/lg projects. There's an experimental Java lang project, manifold-sql [1], that does the impossible: inline native DuckDB SQL + type-safety.
"""
[.sql/] SELECT station_name, count(*) AS num_services
FROM 'http://blobs.duckdb.org/train_services.parquet'
WHERE monthname(date) = 'May'
GROUP BY ALL
ORDER BY num_services DESC
LIMIT 3
"""
.fetch()
.forEach(row -> out.println(row.stationName + ": " + row.numServices));
1. https://github.com/manifold-systems/manifold/blob/master/doc...
To get fast access to the query results we use the Apache Arrow interface and generate the code directly from DuckDB SQL queries using the SQG tool ( https://sqg.dev/generators/java-duckdb-arrow/)
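For comparison, the same Arrow handoff from DuckDB's Python client looks roughly like this (a sketch, not the SQG-generated Java; it assumes the httpfs extension can be autoloaded to read the remote parquet file):

```python
import duckdb

rel = duckdb.sql("""
    SELECT station_name, count(*) AS num_services
    FROM 'http://blobs.duckdb.org/train_services.parquet'
    WHERE monthname(date) = 'May'
    GROUP BY ALL
    ORDER BY num_services DESC
    LIMIT 3
""")

arrow_table = rel.arrow()        # results handed over as a pyarrow Table
print(arrow_table.to_pylist())
```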
udkl [3 hidden]5 mins ago
I'm very interested to see a behind-the-scenes of your tech - do you guys have a writeup somewhere? Your HN tool is also promising!
majkinetor [3 hidden]5 mins ago
Anybody with experience using duckdb to quickly select a page of filtered transactions from a single table with a couple of billion records and, let's say, 30 columns, where each can be filtered using a simple WHERE clause? Let's say 10 years of payment order data. I am wondering since this is not an analytical scenario.
Doing that in postgres takes some time, and even a simple count(*) takes a lot of time (with all columns indexed).
morganherlocker [3 hidden]5 mins ago
I've used duckdb a lot at this scale, and I would not expect something like this to take more than a few seconds, if that. The only slow duckdb queries I have encountered either involve complex joins or glob across many files.
cess11 [3 hidden]5 mins ago
I'm not so sure the common index algorithms would work to speed up a count. How often is the table updated? If it's often, and it's also queried often for the count, then run the count somewhat often on a schedule and store the result separately; if it isn't queried often, do it less frequently.
From what you describe I'd expect a list of column-value pairs under a WHERE to resolve pretty fast if it uses indices and doesn't fish out large amounts of data at once.
noo_u [3 hidden]5 mins ago
I'd say the author's thoughts are valid for basic data processing. Outside of that, most of the claims in this article, such as:
"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."
become very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
RobinL [3 hidden]5 mins ago
Author here. Re: 'SQL should be the first option considered', there are certainly advantages to other dataframe APIs like pandas or polars, and arguably any one of them is better in the moment than SQL. At the moment Polars is ascendant and it's a high quality API.
But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.
I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.
I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.
noo_u [3 hidden]5 mins ago
I think both of us are ultimately wary of using the wrong tool for the job.
I see your point, even though my experience has been somewhat the opposite. E.g. a pipeline that used to work fast enough/at all up until some point in time because the scale of the data or requirements allowed it. Then some subset of these conditions changes, the pipeline cannot meet them, and one has to reverse engineer obscure SQL views/stored procedures/plugins, and migrate the whole thing to python or some compiled language.
I work with high density signal data now, and my SQL knowledge occupies the "temporary solution" part of my brain for the most part.
chaps [3 hidden]5 mins ago
Yeah, my experiences match yours and I very, very much work with messy data (FOIA data), though I use postgres instead of duckdb.
Most of the datasets I work with are indeed <10GB but the ones that are much larger follow the same ETL and analysis flows. It helps that I've built a lot of tooling to help with types and memory-efficient inserts. Having to rewrite pipelines because of "that one dataframe API" is exactly what solidified my thoughts around SQL over everything else. So much of my lifetime has been lost trying to get dataframe and non-dataframe libraries to work together.
Thing about SQL is that it can be taken just about anywhere, so the time spent improving your SQL skills is almost always well worth it. R and pandas much less so.
gofreddygo [3 hidden]5 mins ago
I advocated for a SQL solution at work this week and it seems to have worked. My boss is wary of old-school SQL databases, with their constraints and just being a pain to work with. As a developer, these pains aren't too hard to get used to, or to automate or document away, and I never understood the undue dislike of SQL.
The fact that I can use sqlite / a local sql db for all kinds of development and reliably use the same code (with minor updates) in the cloud hosted solution is such a huge benefit that it undermines anything else that any other solution has to offer. I'm excited about the sql stuff I learned over 10 years ago being of great use to me in the coming months.
The last product I worked heavily on used a nosql database and it worked fine until you started to tweak it just a little bit - split entities, convert data types or update ids. Most of the data access layer logic dealt with conversion between data coming in from the database and the guardrails to keep the data integrity in check while interacting with the application models. To me this is something so obviously solved years ago with a few lines of constraints. Moving over to sql was totally impossible. Learned my lesson, advocated hard for sql. Hoping for better outcomes.
chaps [3 hidden]5 mins ago
I totally understand the apprehension towards SQL. It's esoteric as all hell and it tends to be managed by DBAs who you really only go to whenever there's a problem. Organizationally it's simpler to just do stuff in-memory without having to fiddle around with databases.
__mharrison__ [3 hidden]5 mins ago
My experience is that if you want to do ML, viz, or advanced analytics, dataframes give a better experience.
If you are shuffling data around in pipelines, sure, go for SQL.
Readability is in the eye of the beholder. I much prefer dataframes for that, though a good chunk of the internet claims to throw up in their mouths upon seeing it...
alastairr [3 hidden]5 mins ago
I'm running duckdb over 500gb of parquet on a largish desktop (50gb ram) and it's been smooth & fast. I guess OOM issues will matter at some point, but I think it's going to be in the top 1% of real world use cases.
jauntywundrkind [3 hidden]5 mins ago
With a decent SSD (or eight), spilling to disk is really not bad these days! Yes!
And if that's still not enough, if you just need to crunch data a couple times a week, it's not unreasonable to get a massive cloud box with ridiculous amounts of RAM or RAM+SSD - I7i or I8g boxes. Alas, we have cheap older-gen Epycs & some amazing cheap motherboards, but RAM prices to DIY are off the charts, so be it.
hnthrowaway0315 [3 hidden]5 mins ago
SQL is popular because everyone can learn and start using it after a while. I agree that Python sometimes is a better tool but I don't see SQL going away anytime soon.
From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.
physicsguy [3 hidden]5 mins ago
You can get instances with 32 TiB of RAM on AWS these days.
mr_toad [3 hidden]5 mins ago
Which is a lot for a single user, but when you have dozens or hundreds of analysts who all want to run their own jobs on your hundred-terabyte data warehouse, then even the largest single machine won't cut it.
RobinL [3 hidden]5 mins ago
Exactly - these huge machines are surely eating a lot into the need for distributed systems like Spark. So much less of a headache to run as well
jeffbee [3 hidden]5 mins ago
That sounds damned near useless for typical data analysis purposes and I would very much prefer a distributed system to a system that would take an hour to fill main memory over its tiny network port. Also, those cost $400/hr and are specifically designed for businesses where they have backed themselves into a corner of needing to run a huge SAP HANA instance. I doubt they would even sell you one before you prove you have an SAP license.
For a tiny fraction of the cost you can get numerous nodes with 600gbps ethernet ports that can fill their memory in seconds.
theLiminator [3 hidden]5 mins ago
Yeah, I'm also similarly confused.
> "SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
SQL has nothing to do with fast. Not sure what makes it any more testable than polars? Future-proof in what way? I guess they mean your SQL dialect won't have breaking changes?
wood_spirit [3 hidden]5 mins ago
I’m also a duckdb convert. All my notebooks have moved from Pandas and polars to Duckdb. It is faster to write and faster to read (after you return to a notebook after time away) and often faster to run. Certainly not slower to run.
My current habit is to suck down big datasets to parquet shards and then just query them with a wildcard in duckdb. I move to bigquery when doing true “big data” but a few GB of extract from BQ to a notebook VM disk and duckdb is super ergonomic and performant most of the time.
It’s the sql that I like. Being a veteran of when the world went mad for nosql it is just so nice to experience the revenge of sql.
theLiminator [3 hidden]5 mins ago
I personally find polars easier to read/write than sql. Especially when you start doing UDFs with numpy et al. I think for me, duckdb's clear edge is the CLI experience.
> It is faster to write and faster to read
At least on clickbench, polars and duckdb are roughly comparable (with polars edging out duckdb).
erikcw [3 hidden]5 mins ago
I use them both depending on which feels more natural for the task, often within the same project. The interop is easy and very high performance thanks to Apache Arrow: `df = duckdb.sql(sql).pl()` and `result = duckdb.sql("SELECT * FROM df")`.
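Expanded into a slightly longer sketch (the file and column names are invented):

```python
import duckdb
import polars as pl

# DuckDB query -> Polars DataFrame (handed over via Apache Arrow).
df = duckdb.sql("SELECT * FROM read_csv('events.csv')").pl()

# Do some work in Polars...
df = df.with_columns(pl.col("amount").fill_null(0))

# ...then query the Polars DataFrame from DuckDB again, just by referencing the variable.
result = duckdb.sql("SELECT sum(amount) AS total FROM df")
print(result)
```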
RobinL [3 hidden]5 mins ago
Author here. I wouldn't argue SQL or duckdb is _more_ testable than polars. But I think historically people have criticised SQL as being hard to test. Duckdb changes that.
I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelization for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.
In terms of future proof - yes in the sense that the code will still be easy to run in 20 years time.
theLiminator [3 hidden]5 mins ago
> I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.
Yeah, but SQL isn't really portable between all query engines. You always have to be speaking the same dialect. Also, SQL isn't the only "declarative" DSL; polars's lazyframe API is similarly declarative. Technically Ibis's dataframe DSL also works as a multi-frontend declarative query language. Or even Substrait.
Anyway, my point is that SQL is not inherently a faster paradigm than "dataframes", but that you're conflating declarative query planning with SQL.
n_u [3 hidden]5 mins ago
Question for folks in data science / ML space: Has DuckDB been replacing Pandas and NumPy for basic data processing?
lz400 [3 hidden]5 mins ago
I do a lot of data processing and my tool of choice is polars. It's blazing fast and has (like pandas) a lot of very useful functions that aren't in SQL or are awkward to emulate in SQL. I can also just do Python functions if I want something that's not offered.
Please sell DuckDB to me. I don't know it very well but my (possibly wrong) intuition is that even giving equal performance, it's going to drop me to the awkwardness of SQL for data processing.
film42 [3 hidden]5 mins ago
Just 10 minutes ago I was working with a very large semi-malformed excel file generated by a mainframe. DuckDB was able to load it with all_varchar (just keep everything a string) in under a second.
I'm still waiting for Excel to load the file.
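Roughly what that load looks like (file name invented; all_varchar keeps every column as text so nothing trips type inference):

```python
import duckdb

rel = duckdb.sql("SELECT * FROM read_csv('mainframe_export.csv', all_varchar = true)")
print(rel.limit(5))

# For an actual .xlsx file, the excel extension's read_xlsx reportedly accepts the same
# option -- worth checking against the docs for your DuckDB version:
# duckdb.sql("SELECT * FROM read_xlsx('mainframe_export.xlsx', all_varchar = true)")
```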
efromvt [3 hidden]5 mins ago
DuckDB can't get enough love! Beyond being a really performant database, being so friendly [sql and devx] is really underrated and hard to pull off well and is key to it being so fun - you get a compounding ecosystem because it's so easy to get started. Hoping that they can manage to keep it vibrant without it slowing down the pace of innovation at all.
The web/WASM integration is also fabulous. Looking forward to more "small engines" getting into that space to provide some competition and keep pushing it forward.
netcraft [3 hidden]5 mins ago
I love duckdb, I use it as much as I can.
I just wish that the support for node/bun was as good as python.
And I wish that they would bundle it differently for node/bun - the way it is now it depends on a dynamic link to a library, which means I can't bundle it into a bun executable.
DangitBobby [3 hidden]5 mins ago
Being able to use SQL on CSV and json/jsonl files is pretty sweet. Of course it does much more than that, but that's what I do most often with it. Love duckdb.
samuell [3 hidden]5 mins ago
Indeed! I generally like awk a lot for simpler CSV/TSV processing, but when it comes to cases where you need things like combining/joining multiple CSV files or aggregating for certain columns, SQL really shines IME.
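For instance, a join plus aggregation across two CSVs is a single query (file and column names invented):

```python
import duckdb

duckdb.sql("""
    SELECT o.region,
           count(*)      AS n_orders,
           sum(o.amount) AS revenue
    FROM 'orders.csv'    AS o
    JOIN 'customers.csv' AS c ON o.customer_id = c.id
    WHERE c.active
    GROUP BY o.region
    ORDER BY revenue DESC
""").show()
```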
yakkomajuri [3 hidden]5 mins ago
Been quite a fan of DuckDB and we actually even use it in production.
But coincidentally today I was exploring memory usage and I believe I'm finding memory leaks. Anybody have similar experiences?
Still debugging more deeply but looking reasonably conclusive atm.
https://github.com/duckdb/duckdb/issues/20569
If someone can explain this or has a fix for it I'd love to hear it!
tjchear [3 hidden]5 mins ago
I’ve not used duckdb before nor do I do much data analysis so I am curious about this one aspect of processing medium sized json/csv with it: the data are not indexed, so any non-trivial query would require a full scan. Is duckdb so fast that this is never really a problem for most folks?
mr_toad [3 hidden]5 mins ago
Many analytical queries require full scans of fact tables anyway, so indexes are less useful. Joins are usually to dimensional tables, which are quite small. Snowflake doesn’t use indexes at all, and it’s built for handling the largest volumes of data.
However, you wouldn’t want to use either for transaction processing, the lack of indexes would really hurt.
RobinL [3 hidden]5 mins ago
It is true that for json and csv you need a full scan but there are several mitigations.
The first is simply that it's fast - for example, DuckDB has one of the best csv readers around, and it's parallelised.
Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
In addition, engines like DuckDB have predicate pushdown so if your data is stored in parquet format, then you do not need to scan every row because the parquet files themselves hold metadata about the values contained within the file.
Finally, formats like parquet are columnar, so a query only needs to scan the data for the columns it references, rather than processing whole rows even though you may only be interested in one or two columns.
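A small sketch of those last two points (file and column names invented): only the referenced columns are read, and parquet row groups whose min/max statistics fall outside the filter can be skipped entirely:

```python
import duckdb

duckdb.sql("""
    SELECT date_trunc('month', event_date) AS month,
           sum(amount) AS total
    FROM 'events/*.parquet'
    WHERE event_date >= DATE '2024-01-01'   -- predicate pushed down to the parquet scan
    GROUP BY month
    ORDER BY month
""").show()
```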
ayhanfuat [3 hidden]5 mins ago
If you are going to query it frequently then json/csv might become an issue. I think the reason it doesn't become a problem for duckdb/polars users is that we generally convert them to parquet after first read.
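The conversion itself is a one-liner (file names invented):

```python
import duckdb

# Parse the CSV once; later queries hit the parquet file instead.
duckdb.sql("COPY (FROM read_csv('big_dump.csv')) TO 'big_dump.parquet' (FORMAT parquet)")
```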
biophysboy [3 hidden]5 mins ago
Zonemaps are created for columns automatically. I process somewhat large tables w/ duckdb regularly (100M rows) and never have any problems.
riku_iki [3 hidden]5 mins ago
that's true for duckdb native tables, but the question was about json.
biophysboy [3 hidden]5 mins ago
They said json and csv - it handles both!
riku_iki [3 hidden]5 mins ago
Whether it handles them depends on size. But what I tried to say is that there are no zonemaps for json.
0cf8612b2e1e [3 hidden]5 mins ago
Visidata is in Python and has offered “real time” analytics of fixed files for a long time. Computers are stupidly fast. You can do a lot of operations within a few seconds time window.
simlevesque [3 hidden]5 mins ago
But when you store your json or csv in a columnar format, if you have say 10 columns, each column is stored separately on disk instead of all together. So a scan for one column only needs to read a tenth of the disk space used for the data. Obviously this depends on the columns' content.
gdulli [3 hidden]5 mins ago
But you can have a surprisingly large amount of data before the inefficiency you're talking about becomes untenable.
akhundelar [3 hidden]5 mins ago
Not a duckdb user, but I use polars a lot (mentioned in the article).
Depends on your definition of medium sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near instantly even on laptop CPUs.
mpalmer [3 hidden]5 mins ago
I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.
davidtm [3 hidden]5 mins ago
It is the darling of laptop-based analysis. I’m always surprised that duckdb doesn’t get used more to enhance existing, and stubbornly bedded-in data pipelines. We use it in browser and on server at Count (https://count.co) to make data warehouses more responsive for multiuser and agentic work.
wswin [3 hidden]5 mins ago
Comparison to Spark in terms of speed and simplicity is pretty unfair since Spark is designed for multi-node use and horizontal scaling isn't going anywhere.
PLenz [3 hidden]5 mins ago
Duckdb is wonderful. I have several multi-TB pipelines that I moved over from spark and dask. It's so much easier to think in terms of one machine that gets its resources used efficiently instead of distributed/cloud processes. It even let me take some big data things back to on-prem from aws.
oulu2006 [3 hidden]5 mins ago
That's really interesting, I love the idea of being able to use columnar support directly within postgresql.
I was thinking of using Citus for this, but possibly using duckdb is a better way to do it. Citus comes with a lot more out of the box but duckdb could be a good stepping stone.
biophysboy [3 hidden]5 mins ago
It's a really handy tool. I've queried basically everything you can w/ duckdb - csv, json, s3 buckets, MS SQL servers, excel sheets, pandas dataframes, etc - and have had very few issues.
biophysboy [3 hidden]5 mins ago
I think my favorite part of duckdb is its flexibility. It's such a handy little Swiss Army knife for doing analytical processing in scientific environments (messy data w/ many formats).
smithclay [3 hidden]5 mins ago
Agree with the author, will add: duckdb is an extremely compelling choice if you’re a developer and want to embed analytics in your app (which can also run in a web browser with wasm!)
Think this opens up a lot of interesting possibilities like more powerful analytics notebooks like marimo (https://marimo.io/) … and that’s just one example of many.
dimitrieh [3 hidden]5 mins ago
100%
We recently created a survey website for the Node-RED community survey results, making it completely dynamic and segment-able. It creates lots of value and allows everyone to look at the data through their own lens. It uses DuckDB with WASM under the hood. Awesome technologies.
https://nodered.org/community-survey/
The wasm is pretty heavy data-wise tho, I’m hoping eventually it’ll be lighter for easier loading on not so good devices.
efromvt [3 hidden]5 mins ago
I'd really love a minimalist version, I'm not sure how small it's feasible for them to shrink it. As long as it doesn't get bigger and devices keep getting faster, I suppose?
countrymile [3 hidden]5 mins ago
It's worth noting that R has a great duckdb API as well. Saved me a lot of time when dealing with a 29GB CSV file and splitting it into separate parquet files on a low-RAM ONS server a few months back.
https://r.duckdb.org/
It was mentioned that the performance of DuckDB is similar to that of Polars (among others). In that case why would one choose DuckDB over Polars? The only differentiator seems to be that you do the querying with standard SQL instead of the library specific APIs.
dkdcio [3 hidden]5 mins ago
- performance is often better, especially on “out of core” (“streaming”, spill to disk data sizes). Polars has done a ton of work on their streaming engine but they’re still catching up
- you don’t need to use Python (but Pythonic wrappers like Ibis exist; disclaimer I worked on Ibis, you can find my blogs on performance comparisons easily there); CLI, WASM, etc. w/o Python
- governance: DuckDB as OSS is set up in a more sustainable way (DuckDB Labs + DuckDB Foundation). while there is a VC-backed company (MotherDuck), it doesn't employ the primary developers/control the project in the same way the Polars company does
- overall just simplicity and focus. tends to break less, solely focused on single-node, easy to extend, etc. — not trying to do a cloud product, distributed computing, or GPU execution
s-a-p [3 hidden]5 mins ago
"making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data" > I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduce DuckLake.
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up s3 + Iceberg + duckDB for my company (startup) and was wondering what to pick between Iceberg and DuckLake.
nchagnet [3 hidden]5 mins ago
We're using ducklake with data storage on Google cloud storage and the catalog inside a postgres database and it's a breeze! It may not be the most mature product, but it's definitely a good setup for small to medium applications which still require a data lake.
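Roughly what that setup looks like from the DuckDB side with the ducklake extension, as far as I understand it (the connection string and paths are illustrative, not exact; check the ducklake docs and your credentials setup):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Catalog lives in Postgres, data files land in a GCS bucket (illustrative values).
con.sql("""
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=10.0.0.5' AS lake
        (DATA_PATH 'gs://my-bucket/lake/')
""")
con.sql("CREATE TABLE lake.events AS SELECT * FROM 'local_events.parquet'")
```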
biophysboy [3 hidden]5 mins ago
DuckLake is pretty new, so I guess it would depend on if you need a more mature, fully-featured app.
pattar [3 hidden]5 mins ago
I went to a talk by the MotherDuck team about why they built DuckLake instead of leaning more in on Iceberg. The key takeaway is that instead of storing all the table metadata inside files on s3 and dealing with latency and file IO, they store all of that info inside a duckdb table. Seems like a good idea and it worked smoothly when I tried it; however, it is not quite in a stable production state, it is still <1.0. They have a nice talk about it on youtube: https://youtu.be/hrTjvvwhHEQ?si=WaT-rclQHBxnc9qV
willvarfar [3 hidden]5 mins ago
(I work a lot with BigQuery's BigLake adaptor and it's basically caching the metadata of the iceberg manifest and parquet footers in Bigtable (this is Google) so query planning is super fast etc. Really helps)
tobilg [3 hidden]5 mins ago
Built with DuckDB-Wasm: SQL Terminal in the browser
https://terminal.sql-workbench.com
ctrl + f clickhouse doesn't return anything, surprising
rustyconover [3 hidden]5 mins ago
DuckDB is awesome and Robin is too!
clumsysmurf [3 hidden]5 mins ago
DuckDB has experimental builds for Android ... I'm wondering how much work it would take to implement a Java API for it similar to sqlite (Cursor, etc).