Hacker News

Postgres data stored in Parquet on S3: LTAP architecture explained

125 points by andrenotgiant - 41 comments

dsauerbrun [3 hidden]5 mins ago

Maybe I'm too stupid to understand the article... How does this achieve performant querying for olap and oltp purposes?

Based on my understanding, olap queries will go to the parquet files which are stored in a columnar fashion and oltp style queries will go to a caching layer that sits on top of those parquet files?

What's the special sauce here? Seems like they're just caching the data which, for all intents and purposes, seems like the same solution of storing another copy of the data which is what they say they're avoiding.

conradludgate [3 hidden]5 mins ago

Hi, I work on Lakebase (but not on storage), here's how I understand it.

For Lakebase and Neon, our architecture needs the caching layer regardless (what we call Pageservers). Performing reads from S3 directly is too slow so we reconstruct pages and keep them on an nvme server for faster querying. Changing the format on S3 to be Parquet effectively introduces no additional copies over our existing architecture

dsauerbrun [3 hidden]5 mins ago

Hmm, if the caching layer doesn't change(I assume it was optimized for olap style queries), and the new parquet format is better for olap... I'm still not understanding how it performs well for oltp reads.

I'll give the article another read... Maybe I missed something. Thank you for the response! Really nice to be able to get info straight from people who work on the product

nikita [3 hidden]5 mins ago

Recent data plus working set is always in Postgres page format.

Historical data when pushed to s3 is in parquet. This happens async - not on the transaction hot path.

So older data below certain LSN is on s3 in parquet available to all analytics processing. Hot data is on page servers in page format for OLTP.

You can be smart in querying both representations for real time analytical queries

viccis [3 hidden]5 mins ago

From what I have seen, it's basically a Lambda architecture.

saisrirampur [3 hidden]5 mins ago

But why? I’m skeptical of the idea of unifying storage just because it sounds “elegant” or “cool”. It’s not obvious to me how a single storage engine can compete with purpose-built OLTP and OLAP systems like Postgres and ClickHouse, without significant tradeoffs.

You also mention removing CDC pipelines. I’m curious if the materialization (conversion across formats) can catchup to an OLTP workload that is heavy (50K+ tps), which is pretty common these days. Also CDC if done right and with care can be magical for users and stays native to the OLTP/OLAP data-store.

Third, data Lakes and open formats are suitable for Data Warehousing / Data analyst use-cases than real-time customer facing apps. Sure, you might work on changing that, which is what you are upto, but you’ll always run into tradeoffs, which will make it hard to unleash the best performance, much needed for the latter category.

nikita [3 hidden]5 mins ago

Conversion is async. The whole point is to never deal with CDC which is error prone and taxing Postgres with occupying a replication slot and burning memory and cpu in the OLTP system.

saisrirampur [3 hidden]5 mins ago

Taxing Postgres is one thing, which can be overcome with ways like using standbys. There could other more native ways (than unifying storage), which you’ll hear about in a few weeks. Also I don’t fully agree on logical replication taxing Postgres, if the client is built with care and precision.

In regards to error prone and speed (lag, latency at real-world scale), I wish the blog went into more detail and gave evidence than talk theory.

ronfriedhaber [3 hidden]5 mins ago

> CDC which is error prone

Just have superior CDC :)

saisrirampur [3 hidden]5 mins ago

Exactly! Why unifying storage, which opens up a can of trade-offs.

nikita [3 hidden]5 mins ago

If you product is CDC based (peerdb) you don’t want storage to support this :)

This architecture is better for OLTP because all maintenance operations are moved to storage AND it has all other benefits such as LTAP that emerge from having a scalable storage.

saisrirampur [3 hidden]5 mins ago

;) sounding good on paper vs how it works in practice (supporting demanding real-time OLTP/OLAP workloads) are completely different ball games.

Separately, I understand taking care of it at storage level, but still don’t get “unifying storage” or “zero copy”.

Anyways, I’ll stop now. Good to see all the innovation happening on converging OLTP/OLAP front. Each with a different approach and perspective. :)

creeksai [3 hidden]5 mins ago

Did you even read the blog? Or are you just throwing shade because you are working on a competitive product based on CDC?

saisrirampur [3 hidden]5 mins ago

Oh no, I did read the blog. Not throwing shade at anyone here—the blog is great. It just doesn’t provide real-world evidence, and it opens up a bunch of technical questions that I’m trying to understand. that’s exactly what HN is for. :)

hasyimibhar [3 hidden]5 mins ago

How does LTAP architecture deals with major Postgres upgrade? Is it truly zero-downtime for both upstream and downstream?

Avalaxy [3 hidden]5 mins ago

Super cool stuff. Being able to combine your analytical platform and transactional database into one storage layer without having to set up ETL pipelines in between is really a game changer. Especially since it's just postgres, instead of some proprietary database.

scritty-dev [3 hidden]5 mins ago

So then would LTAP sit to both the left and the right of the medallion architecture? Meaning would you on the left of Bronze use it as an OLTP and to the right of Gold use it as an OLAP? Currently we've been mainly utilizing it to the right of Gold to develop analytic PERN applications that allow us to reuse the RBAC/ACLs set in Unity Catalog, but from this article it seems like that's only half of its utility?

andrenotgiant [3 hidden]5 mins ago

Here's what I don't understand:

Part of the value of doing an ETL pipeline via streaming replication is you get the full history of data in a table. An SCD type 2 table where each row also has a valid_from and valid_to timestamp column.

How would someone do the same thing with this architecture?

khurs [3 hidden]5 mins ago

Both Iceberg and Delta Lake support 'time travel' so you can query data as it was at a certain date.

SELECT count * FROM my_table AS OF "2025-01-01"

https://delta.io/blog/2023-02-01-delta-lake-time-travel/

https://iceberg.apache.org/docs/latest/spark-queries/#spark-...

_zoltan_ [3 hidden]5 mins ago

up to a limited number of snapshots. it's not arbitrary times, it's when there was an atomic snapshot created in the catalog.

hasyimibhar [3 hidden]5 mins ago

It wouldn't be possible to do this with LTAP architecture since (I'm assuming) the individual logical changes are not visible. But honestly I've always seen SCD type 2 table as a workaround due to lack of data modeling experience in the source database. If you design your tables correctly, you shouldn't need SCD type 2 downstream.

For example, if you know your user can change emails, and there might be events from another source that is keyed by user email (e.g. marketing-related events), then naturally you will need some sort of email_history table that has historical mapping of user id to email (you probably need it for audit purposes too). Then in this case there is no need to build SCD type 2 table of user from CDC, it's already there.

TheTaytay [3 hidden]5 mins ago

I think you have a point, and SCD type 2 feels like a workaround, but there is also something to be said for the ability to query every row as it was at any given version. I’m not saying that SCD type 2 is the best solution given there might be a more domain-specific way to do it, but I see it a lot like file-based version control. It’s convenient to be able to examine all files as they existed at any point in time, without having to “model” the ways in which those files might change directly into the domain of the individual files.

If you have something like dolt (not affiliated), a version controlled database, you wouldn’t have to slap change dates on anything OR create your historical table. The changes would be implicit in the version history.

hasyimibhar [3 hidden]5 mins ago

> If you have something like dolt (not affiliated), a version controlled database, you wouldn’t have to slap change dates on anything OR create your historical table. The changes would be implicit in the version history.

Nope, even if I have the ability to see the exact changes of each row, I would still add timestamps everywhere, because timestamp of row change does not equal event timestamp. For example, if I have an order table with status column, and I see a CDC event where status changed from in_progress to completed, I cannot simply assume that the CDC timestamp is the timestamp when order was completed. It's possible that the source database received the event late a few minutes late due to delay upstream, or it's backfilling some missed orders a few days ago. Having a completed_at timestamp (and a bunch of other timestamps for each order lifecycle) would eliminate any ambiguities, and your data analyst will thank you for it.

It's the same thing with row history. You cannot simply assume that your row changes are aligned with the logical history of your entity.

eveningtree [3 hidden]5 mins ago

Rather than answering directly, I'm thinking about this problem from the other end altogether ever since I saw the dbricks rt demo. Apologies for the rambling response, as I haven't yet finished thinking about this problem...

We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.

(1) Limited by computation - there's only so much data that we can store on disks and nvme

(2) Limited by wallet - disks and nvme are EXPENSIVE

Also, the tight coupling of compute and data didn't help. It limited the size of databases on the individual expensive compute nodes.

So, another question will be -

What's currently stopping me from keeping the scd history tables right in my oltp db? what's forcing me to copy state into my etl/elt pipeline and the process it into scd into a dedicated olap db?

To some extent,the answer is still the same - the oltp cannot scale for the storage size required for keeping historical data. So, I've had to take out the 'cold' historical data and keep it in my olap freezer.

Now, if oltp itself is scaling, I'm not gonna bother with the copying step. I'll just prefer to store the history in oltp itself.

In my perspective (majorly from handling IoT systems), I need olap for 2 reasons - (1) storage scalability, and (2) analytical processing speed

I now consider (1) to be a solved problem

As for (2), I'm still not sure how this architecture ends up matching the query processing speeds of column-oriented storages. But again, I need to study more.

The SCD pipeline still remains in some form. Either in the form of (1) scd rows that we currently keep (etl pipeline) , or (2) as older lsn rows that simply don't get deleted (existing db engine).

I've done quite a lot of experimentation with (2), and it is a pretty solid concept to work with.

I've spent quite a lot of years hammering my brain at databases and datastores in general. And I've now got a feeling that this is it. Finally.

ignoreusernames [3 hidden]5 mins ago

If safe keeper exposes the changes to the tables somehow, a type2 scd is just a windowed lag over the primary key sorted by the timestamp

nikita [3 hidden]5 mins ago

Safekeepers keep a window of WAL in Postgres WAL format and doesn’t have an external API.

It streams WAL to pageservers and Postgres read replicas

PunchyHamster [3 hidden]5 mins ago

I don't wanna see that S3 bandwidth bill after running some big query

otterley [3 hidden]5 mins ago

There’s no S3 bandwidth bill for traffic to and from EC2 in the same region.

khurs [3 hidden]5 mins ago

There are self hosted object stores which use the same protocol as S3. One example: https://github.com/minio/minio

Parquet files are smaller than row based storage in a database (but not those databases with focus on strong compression).

And for backup - the files are probably easier to just copy to multiple disks for redundancy, as opposed to database dumps and incremental backups which at the Petabyte scale will be a pain.

dockerd [3 hidden]5 mins ago

Minio is no longer maintained.

khurs [3 hidden]5 mins ago

oh wow, another rug pull!!

So Ceph/SeaweedFS/RustFS/Garage are the alternatives I think

scritty-dev [3 hidden]5 mins ago

Correct, but RustFS is the only drop in replacement (just migrated) Garage and Seaweed are nice (didn't look into Ceph) but you have to re-ingest. RustFS was just plug and play albeit a few minor API differences.

Also Apache licensing gives some peace of mind after the musical chair license game before they finally landed on only paid AIStor offering.

_zoltan_ [3 hidden]5 mins ago

your reply makes zero sense.

there is a reason why people develop for S3: a lot of enterprise data is there. people ingest there from various sources. and it's not just parquet usually, it's multivendor sources writing to an iceberg catalog.

nobody will run minio on AWS other than hobby projects and small demos.

I regularly work with iceberg datasets in the double digit TB range per dataset. keep that in mind when you think about sizes. databricks, snowflake, large enterprise vendors: they are targeting these sizes.

khurs [3 hidden]5 mins ago

> nobody will run minio on AWS other than hobby projects and small demos.

You realise not every company uses AWS for any/all its needs?

There are datacenters around the world owned by individual companies or co-located. And many companies still have servers on prem.

Compute and disks are getting more dense & liquid cooled, so less rack space is needed for same power.

And Minio and others can handle Petabytes+

https://www.cisco.com/c/en/us/products/collateral/servers-un...

Backblaze, Cloudflare R2 and other cheaper S3 compatible competitors also exist.

_zoltan_ [3 hidden]5 mins ago

they exist, sure. And I'm sure it can handle PB+. on prem is an existing market, however, if you reread my comment I talked about running minio on AWS because S3 is too expensive - just doesn't make sense to do.

I've yet to met a Fortune 100 who isn't mostly using either on prem or a large hyperscaler (S3/Azure/GCS).

tux3 [3 hidden]5 mins ago

The large enterprise vendors are not prise-sensitive. They're on AWS because you never get fired for picking AWS, and there isn't really any other choice for these vendors regardless of AWS ripping you off.

At this point S3 is a standard interface. All sorts of cloud providers and open-source projects provide S3. If you're on AWS, price isn't the reason. You pick AWS because you don't see your company taking a risk with anything else.

S3 doesn't mean expensive. AWS does. But AWS users are fully locked-in, they'll pay whatever the price is.

khurs [3 hidden]5 mins ago

"The large enterprise vendors are not price-sensitive."

Have you ever spoken to a CTO? They most certainly are.

Also many are Microsoft houses so using Azure blob plus one of the reasons for Kubernetes/Openshift adoption was to be cloud neutral

tux3 [3 hidden]5 mins ago

There is a scale between prise-sensitivity and risk-averseness, from my point of reference large companies are much more risk-averse than they are price sensitive. Of course this will vary, CTOs exist in all sort of different environments.

Price is not the reason people chose AWS. Some companies use Azure. The current startup at $WORK uses yet another smaller Cloud. And yet AWS sill has the clear lead in market share. That's because price is far from the only factor, and not even the main factor.

re-thc [3 hidden]5 mins ago

> There is a scale between prise-sensitivity and risk-averseness, from my point of reference large companies are much more risk-averse than they are price sensitive.

That's not true. It's just the way things work "saving money" isn't part of the KPI. Enterprise teams get a budget. If you "saved" you don't get it back. So unless there's a legit need it's ALWAYS easier increasing than cutting it.

It's not about risk. It's about power. They are price sensitive but in a way that doesn't matter to the bottom line i.e. if I can cut my AWS storage bill by 10% and then spend it on random tokens I'd do it.

xtracto [3 hidden]5 mins ago

They definitely havent. Tech side of companies is a Cost Center. And the main question the CEO/CFO makes to the CTO every week is "how can we reduce our AWS bill?" , even before the how was your weekend ? One.

_zoltan_ [3 hidden]5 mins ago

when people talk about S3 they mostly mean AWS, but I do agree S3 is a standard interface. in non-AWS cases they'd say S3-compatible.