This is the industry we're[0] in. Owning is at one end of the spectrum, cloud at the other, and broadly a couple of options in between:
1 - Cloud – This is minimising cap-ex, hiring, and risk, while largely maximising operational costs (it's expensive) and cost variability (usage-based).
2 - Managed Private Cloud - What we do. Still minimal-to-no cap-ex, hiring, and risk, with a medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open-source, etc. Only really makes sense above a €/$5k/month spend.
3 - Rented Bare Metal – Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling and risk. Around 90% cheaper than AWS et al (plus time).
4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.
A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3-year mark, after which I assume the hardware is either still running with a client or goes into their server auction system.
Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!
I think the issue with this formulation is that what drives the cost at cloud providers isn't necessarily that their hardware is too expensive (which it is), but that they push you towards overcomplicated and inefficient architectures that cost too much to run.
At the core of this are all the 'managed' services - if you have a server box, it's in your financial interest to squeeze as much performance out of it as possible. If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in fewer billed infrastructure hours.
This 'microservices' push usually means that instead of having an on-server session where you can serve stuff from a temporary cache, all the data that persists between requests needs to be stored in a db somewhere, all the auth logic needs to re-check your credentials on every request, something needs to direct the traffic and load balance these endpoints, and all of this costs money.
I think if you have 4 Java boxes as servers with a redundant DB with read replicas on EC2, your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.
These crazy AWS bills usually come from using every service under the sun.
bojangleslover [3 hidden]5 mins ago
The complexity is what gets you. One of AWS's favorite situations is
1) Senior engineer starts on AWS
2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)
3) New engineer comes in and panics
4) Ends up using a "managed service" to relieve the panic
5) New engineer leaves
6) Second new engineer comes in and not only panics but outright needs help
7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)
Calling it ransomware is obviously hyperbolic, but there are definitely some parallels one could draw.
On top of it all, AWS pricing is about to massively go up due to the RAM price increase. There's no way it can't since AWS is over half of Amazon's profit while only around 15% of its revenue.
Aurornis [3 hidden]5 mins ago
One of the biggest problems with the self-hosted situations I’ve seen is when the senior engineers who set it up leave and the next generation has to figure out how to run it all.
In theory with perfect documentation they’d have a good head start to learn it, but there is always a lot of unwritten knowledge involved in managing an inherited setup.
With AWS the knowledge is at least transferable and you can find people who have worked with that exact thing before.
Engineers also leave for a lot of reasons. Even highly paid engineers go off and retire, change to a job for more novelty, or decide to try starting their own business.
strobe [3 hidden]5 mins ago
>With AWS the knowledge is at least transferable
Unfortunately, there are a lot of things in AWS that can also be messed up in ways that make it really hard to work out what is going on. For example, you could have hundreds of Lambdas running with no idea where the original sources are or how they connect to each other, or complex VPC network routing where rules and security groups are shared seemingly at random between services, so a small change can degrade a completely different service (you were hired to help with service X, but after your change some service Y went down that you weren't even aware existed).
Hikikomori [3 hidden]5 mins ago
Not much different from how it worked in companies I used to work for, except the situation was even worse because we had no API or UI to probe for information.
ethbr1 [3 hidden]5 mins ago
There are many great developers who are not also SREs. Building and operating/maintaining have their different mindsets.
coliveira [3 hidden]5 mins ago
The end result of all this is that the percentage of people who know how to implement systems without AWS/Azure will be in the single digits. From that point on, this will be the only "economic" way, no matter what the prices are.
couscouspie [3 hidden]5 mins ago
That's not a factual statement about reality, but more of a normative judgement used to justify resignation. Yes, professionals who know how to actually do these things are not abundantly available, but they are available enough to achieve the transition. The talent exists and is absolutely passionate about software freedom, and hence highly intrinsically motivated to work on it. The only thing lacking so far is demand; the talent available will skyrocket when the market starts demanding it.
eitally [3 hidden]5 mins ago
They actually are abundantly available and many are looking for work. The volume of "enterprise IT" sysadmin labor dwarfs that of the population of "big tech" employees and cloud architects.
organsnyder [3 hidden]5 mins ago
I've worked with many "enterprise IT" sysadmins (in healthcare, specifically). Some are very proficient generalists, but most (in my experience) are fluent in only their specific platforms, no different than the typical AWS engineer.
toomuchtodo [3 hidden]5 mins ago
Perhaps we need bootcamps for on prem stacks if we are concerned about a skills gap. This is no different imho from the trades skills shortage many developed countries face. The muscle must be flexed. Otherwise, you will be held captive by a provider "who does it all for you".
"Today, we are going to calculate the power requirements for this rack, rack the equipment, wire power and network up, and learn how to use PXE and iLO to get from zero to operational."
organsnyder [3 hidden]5 mins ago
This might be my own ego talking (I see myself as a generalist), but IMHO what we need are people that are comfortable jumping into unfamiliar systems and learning on-the-fly, applying their existing knowledge to new domains (while recognizing the assumptions their existing knowledge is causing them to make). That seems much harder to teach, especially in a boot camp format.
toomuchtodo [3 hidden]5 mins ago
As a very curious autodidact, I strongly agree, but this talent is rare and can punch its own ticket (broadly speaking). These people innovate and build systems for others to maintain, in my experience. But, to your point, we should figure out the sorting hat for folks who want to radically own these on-prem systems [1] if they are needed.
Yeah, anyone who has >10 years experience with servers/backend dev has almost certainly managed dedicated infra.
friendzis [3 hidden]5 mins ago
> and the talent available will skyrocket, when the market starts demanding it.
Part of what clouds are selling is experience. A "cloud admin" bootcamp graduate can be a useful "cloud engineer", but it takes some serious years of experience to become a talented on-prem SRE. So it becomes an ouroboros: moving towards the cloud makes it ever easier to keep moving towards the cloud.
phil21 [3 hidden]5 mins ago
> A "cloud admin" bootcamp graduate can be a useful "cloud engineer",
If by useful you mean "useful at generating revenue for AWS or GCP" then sure, I agree.
These certificates and bootcamps are roughly equivalent to the Cisco CCNA certificate and training courses back in the 90's. That certificate existed to sell more Cisco gear - and Cisco outright admitted this at the time.
SahAssar [3 hidden]5 mins ago
> A "cloud admin" bootcamp graduate can be a useful "cloud engineer"
That is not true. It takes a lot more than a bootcamp to be useful in this space, unless your definition is to copy-paste some CDK without knowing what it does.
bix6 [3 hidden]5 mins ago
> The only thing that is lacking so far is the demand and the talent available will skyrocket, when the market starts demanding it.
But will the market demand it? AWS just continues to grow.
bluGill [3 hidden]5 mins ago
Only time will tell. It depends on when someone with an MBA starts asking questions about cloud spending and runs the real numbers. People promoting self-hosting often are not counting all the costs of self-hosting (AWS has people working 24x7 so that if something fails, someone is there to take action).
cheema33 [3 hidden]5 mins ago
> AWS has people working 24x7 so that if something fails someone is there to take action..
The number of things that these 24x7 people from AWS will cover for you is small. If your application craps out for any number of reasons that don't have anything to do with AWS, that is on you. If your app needs to run 24x7 and it is critical, then you need your own 24x7 person anyway.
bluGill [3 hidden]5 mins ago
All the hardware and network issues are on them. I agree that you still need your own people to support your applications, but that is only part of the problem.
iso1631 [3 hidden]5 mins ago
I've got thousands of devices over hundreds of sites in dozens of countries. The number of hardware failures is tiny, and they certainly don't need 24/7 response.
Meanwhile AWS breaks once or twice a year.
misir [3 hidden]5 mins ago
From what I've seen, if you're depending on AWS and something fails, you too need someone available 24x7 so you can take action. Sometimes magic happens and systems recover after AWS restarts their DNS, but usually the combination of events puts the application into an unrecoverable state that needs manual action. It doesn't always happen, but you need someone to be there if it ever does. At a bare minimum you need someone to evaluate whether the underlying issue is really caused by AWS, or whether something else has to be done on top of waiting for them to fix it.
bluGill [3 hidden]5 mins ago
How many problems is AWS able to handle for you that you are never aware of though?
Symbiote [3 hidden]5 mins ago
How many problems do you think there are?
I've only had one outage I could attribute to running on-prem. Meanwhile, it's a bit of a joke with the non-IT staff in the office that when "The Internet" (i.e. Cloudflare, Amazon) goes down with news reports etc., our own services are all running fine.
infecto [3 hidden]5 mins ago
It’s all anecdotal, but in my experience it’s usually the opposite. A bored senior engineer wants to use something new and picks a bespoke AWS service for a new project.
I am sure it happens a multitude of ways but I have never seen the case you are describing.
alpinisme [3 hidden]5 mins ago
I’ve seen your case more than the ransom scenario too. But also even more often: early-to-mid-career dev saw a cloud pattern trending online, heard it was a new “best practice,” and so needed to find a way to move their company to using it.
walt_grata [3 hidden]5 mins ago
Is that what I should be doing? I'm just encouraging the devs on my team to read designing data intensive apps and setting up time for group discussions. Aside from coding and meetings that is.
antonvs [3 hidden]5 mins ago
> One of AWS's favorite situations
I'll give you an alternative scenario, which IME is more realistic.
I'm a software developer, and I've worked at several companies, big and small and in-between, with poor to abysmal IT/operations. I've introduced and/or advocated cloud at all of them.
The idea that it's "more expensive" is nonsense in these situations. Calculate the cost of the IT/operations incompetence, and the cost of the slowness of getting anything done, and cloud is cheap.
Extremely cheap.
Not only that, it can increase shipping velocity, and enable all kinds of important capabilities that the business otherwise just wouldn't have, or would struggle to implement.
Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers.
acdha [3 hidden]5 mins ago
> Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers
This has been my experience as well. There are legitimate points of criticism but every time I’ve seen someone try to make that argument it’s been comparing significantly different levels of service (e.g. a storage comparison equating S3 with tape) or leaving out entire categories of cost like the time someone tried to say their bare metal costs for a two server database cluster was comparable to RDS despite not even having things like power or backups.
antonvs [3 hidden]5 mins ago
> 3) New engineer comes in and panics
> 4) Ends up using a "managed service" to relieve the panic
It's not as though this is unique to cloud.
I've seen multiple managers come in and introduce some SaaS because it fills a gap in their own understanding and abilities. Then when they leave, everyone stops using it and the account is cancelled.
The difference with cloud is that it tends to be more central to the operation, so can't just be canceled when an advocate leaves.
mrweasel [3 hidden]5 mins ago
Just this week a friend of mine was spinning up some AWS managed service, complaining about the complexity and how any reconfiguration took 45 minutes to reload. It's a service you can just install with apt; the default configuration is fine. Not only are many services no longer cheaper in the cloud, the management overhead also exceeds that of on-prem.
mystifyingpoi [3 hidden]5 mins ago
I'd gladly use (and maybe even pay for!) an open-source reimplementation of AWS RDS Aurora. All the bells and whistles with failover, clustering, volume-based snaps, cross-region replication, metrics etc.
As far as I know, nothing comes close to Aurora functionality. Even in vibecoding world. No, 'apt-get install postgres' is not enough.
SOLAR_FIELDS [3 hidden]5 mins ago
Serverless v2 is one of the products I was skeptical about, but it is genuinely one of the most robust solutions out there in that space. It has its warts, but I usually default to it for fresh installs because you get so much out of the box with it.
sgarland [3 hidden]5 mins ago
Nitpick (I blame Amazon for their horrible naming): Aurora and RDS are separate products.
What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.
Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed (a rough sketch of such a watchdog is at the end of this comment). Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you.
Clustering: you mean read replicas?
Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).
Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).
Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.
What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly designing your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.
tl;dr you can fairly closely replicate the experience of Aurora, but you’ll need to know what you’re doing. And frankly, if you don’t, then even if someone built an OSS product that does all of this, you shouldn’t be running it in prod - how will you fix issues when they crop up?
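For the failover item above, here is a rough, non-production sketch of that heartbeat-and-promote watchdog in Python with psycopg2. The hostnames, data directory, and thresholds are placeholder assumptions, and it deliberately skips fencing, which a real setup (or pg_auto_failover) has to handle.
    # Toy watchdog: ping the primary, promote the local standby after repeated misses.
    # DSN, data directory and thresholds are assumptions for illustration only.
    import subprocess
    import time

    import psycopg2

    PRIMARY_DSN = "host=pg-primary.internal dbname=postgres user=monitor connect_timeout=3"
    STANDBY_PGDATA = "/var/lib/postgresql/16/main"
    MISSES_BEFORE_PROMOTE = 5

    def primary_alive() -> bool:
        try:
            with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
        except psycopg2.OperationalError:
            return False

    def promote_standby() -> None:
        # Runs on the standby host; assumes pg_ctl is on PATH for this user.
        subprocess.run(["pg_ctl", "promote", "-D", STANDBY_PGDATA], check=True)

    def main() -> None:
        misses = 0
        while True:
            misses = 0 if primary_alive() else misses + 1
            if misses >= MISSES_BEFORE_PROMOTE:
                promote_standby()
                break  # something else (pgbouncer, a floating IP) must repoint clients
            time.sleep(5)

    if __name__ == "__main__":
        main()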
vel0city [3 hidden]5 mins ago
> you can fairly closely replicate the experience of Aurora
Nobody doubts one could build something similar to Aurora given enough budget, time, and skills.
But that's not replicating the experience of Aurora. The experience of Aurora is I can have all of that, in like 30 lines of terraform and a few minutes. And then I don't need to worry about managing the zpools, I don't need to ensure the heartbeats are working fine, I don't need to worry about hardware failures (to a large extent), I don't need to drive to multiple different physical locations to set up the hardware, I don't need to worry about handling patching, etc.
You might replicate the features, but you're not replicating the experience.
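To make the "few lines and a few minutes" point concrete, the provisioning side is roughly this shape. This is a hedged boto3 sketch (rather than the Terraform mentioned above); the identifiers, region, and instance class are placeholders, and the password handling is deliberately not production-grade.
    # Minimal Aurora PostgreSQL cluster plus one writer instance via boto3.
    # Identifiers, region and instance class are placeholders for illustration.
    import boto3

    rds = boto3.client("rds", region_name="eu-west-1")

    rds.create_db_cluster(
        DBClusterIdentifier="example-aurora",
        Engine="aurora-postgresql",
        MasterUsername="app_admin",
        MasterUserPassword="change-me-use-a-secrets-manager",
    )

    # A cluster needs at least one instance attached to act as the writer.
    rds.create_db_instance(
        DBInstanceIdentifier="example-aurora-writer",
        DBClusterIdentifier="example-aurora",
        DBInstanceClass="db.r6g.large",
        Engine="aurora-postgresql",
    )
Failover, replication, snapshots, and metrics then come with the service rather than being assembled by hand, which is the "experience" part of the argument.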
sgarland [3 hidden]5 mins ago
The person I replied to said they wanted an open-source reimplementation of Aurora. My point - which was probably poorly-worded, or just implied - was that there's a lot of work that goes into something like that, and if you can't put the pieces together on your own, you probably shouldn't be running it for anything you can't afford downtime on.
Managed services have a clear value proposition. I personally think they're grossly overpriced, but I understand the appeal. Asking for that experience but also free / cheap doesn't make any sense.
infecto [3 hidden]5 mins ago
What managed service? Curious; I don’t use the full suite of AWS services, but I'm wondering what would take 45 mins - maybe it was a large cluster of some sort that needed rolling changes?
coliveira [3 hidden]5 mins ago
My observation is that all these services are exploding in complexity, and they justify it by saying that there are more features now, so everyone needs to accept spending more and more time and effort for the same results.
patrick451 [3 hidden]5 mins ago
It's basically the same dynamic as hedonic adjustment in the CPI calculations: cars may cost twice as much, but now they have USB chargers built in, so inflation isn't really that bad.
mrweasel [3 hidden]5 mins ago
I think this was MWAA
coredog64 [3 hidden]5 mins ago
> If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in less billed infrastructure hours.
If ECS is faster, then you're more satisfied with AWS and less likely to migrate. You're also open to additional services that might bring up the spend (e.g. ECS Container Insights or X-Ray)
Source: Former Amazon employee
torginus [3 hidden]5 mins ago
We did some benchmarks and ECS was definitely quite a bit more expensive for a given capacity than just running docker on our own EC2 instances. It also bears pointing out that a lot of applications (either in-house or off-the-shelf) expect a persistent mutable config directory or sqlite database.
We used EFS to solve that issue, but it was very awkward, expensive, and slow; it's certainly not meant for that.
lumost [3 hidden]5 mins ago
I don’t understand why most cloud backend designs seem to strive for maximizing the number of services used.
My biggest gripe with this is async tasks, where the app goes through numerous hijinks to avoid a 10-minute Lambda processing timeout. Rather than structuring the process around many small, independent batches, or simply using a modest container to do the job in a single shot, a myriad of intermediate steps are introduced to write data to Dynamo/S3/Kinesis plus SQS for coordination.
A dynamically provisioned, serverless container with 24 cores and 64 GB of memory can happily process GBs of data transformations.
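As a sketch of the "modest container, single shot" alternative (the paths and the per-batch transform below are placeholder assumptions, not anything from a real system):
    # Process many independent batches in one long-running container,
    # instead of fanning work out across Lambda + SQS + DynamoDB steps.
    # Paths and the per-batch transform are placeholder assumptions.
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    INPUT_DIR = Path("/data/incoming")
    OUTPUT_DIR = Path("/data/processed")

    def transform(path: Path) -> str:
        # Stand-in for the real per-batch work (parse, aggregate, write out).
        out = OUTPUT_DIR / path.name
        out.write_bytes(path.read_bytes())
        return out.name

    def main() -> None:
        OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
        batches = sorted(INPUT_DIR.glob("*.jsonl"))
        # One worker process per core: a 24-core box chews through the batches
        # with no external queue and no 10-minute timeout to engineer around.
        with ProcessPoolExecutor() as pool:
            for name in pool.map(transform, batches):
                print("done:", name)

    if __name__ == "__main__":
        main()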
parentheses [3 hidden]5 mins ago
Fully agree to this. I find the cost of cloud providers is mostly driven by architecture. If you're cost conscious, cloud architectures need to be up-front designed with this in mind.
Microservices are a killer for cost. For each microservice pod:
- you're often running a bunch of side cars - datadog, auth, ingress
- you pay massive workload separation overhead with orchestration, management, monitoring and ofc complexity
I am just flabbergasted that this is how we operate as a norm in our industry.
jdmichal [3 hidden]5 mins ago
It's about fitting your utilization to the model that best serves you.
If you can keep 4 "Java boxes" fed with work 80%+ of the time, then sure EC2 is a good fit.
We do a lot of batch processing and save money over having EC2 boxes always on. Sure we could probably pinch some more pennies if we managed the EC2 box uptime and figured out mechanisms for load balancing the batches... But that's engineering time we just don't really care to spend when ECS nets us most of the savings advantage and is simple to reason about and use.
nthdesign [3 hidden]5 mins ago
Agreed. There is a wide price difference between running a managed AWS or Azure MySQL service and running MySQL on a VM that you spin up in AWS or Azure.
re-thc [3 hidden]5 mins ago
> your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.
You don’t need colocation to save 4x though. Bandwidth pricing is 10x. EC2 is 2-4x, especially outside the US. EBS, for its IOPS, is just bad.
bojangleslover [3 hidden]5 mins ago
Great comment. I agree it's a spectrum, and for those of us who are comfortable with (4), like yourself and probably us at Carolina Cloud [0] as well, (4) seems like a no-brainer. But there's a long tail of semi-technical users who are more comfortable in 2-3 or even 1, which is what ultimately traps them in the ransomware-adjacent situation that is a lot of the modern public cloud. I would push back on "usage-based". Yes, it is technically usage-based, but the base fee also goes way up, and there are sometimes retainers on these services (i.e. minimum spend). So "usage-based" is not wrong, but what it usually means is "more expensive, and potentially far more expensive".
The problem is that clouds have easily become 3 to 5 times the price of managed services, 10x the price of option 3, and 20x the price of option 4. To say nothing of the fact that almost all businesses can run fine on "PC under desk" type situations.
So in practice cloud has become the more expensive option the second your spend goes over the price of 1 engineer.
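As a crude illustration of that break-even point (every figure below is an assumption, not a quote):
    # Back-of-the-envelope: does bare metal plus one extra ops hire beat the cloud bill?
    # All numbers are illustrative assumptions.
    cloud_monthly = 30_000           # current cloud spend
    bare_metal_factor = 0.10         # "10x the price of option 3" from above
    extra_engineer_monthly = 10_000  # fully-loaded cost of the ops hire you'd add

    bare_metal_monthly = cloud_monthly * bare_metal_factor
    net_savings = cloud_monthly - bare_metal_monthly - extra_engineer_monthly

    print(f"Bare metal + engineer: {bare_metal_monthly + extra_engineer_monthly:,.0f}/mo")
    print(f"Net monthly savings vs cloud: {net_savings:,.0f}")
    # The point where net_savings turns positive is the "spend exceeds the
    # price of one engineer" threshold described above.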
boplicity [3 hidden]5 mins ago
I don't know. I rent a bare metal server for $500 a month, which is way overkill. It takes almost no time to manage -- maybe a few hours a year -- and can handle almost anything I throw at it. Maybe my needs are too simple though?
edge17 [3 hidden]5 mins ago
Just curious, what is the spec you pay $6000/year for? Where/what is the line between rent vs buy?
boplicity [3 hidden]5 mins ago
It's a server with:
- 2x Intel Xeon 5218
- 128 GB RAM
- 2x 960 GB SSD
- 30 TB monthly bandwidth
I pay around an extra $200/month for "premium" support and Acronis backups, both of which have come in handy, but are probably not necessary. (Automated backups to AWS are actually pretty cheap.) It definitely helps with peace of mind, though.
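For anyone curious what "automated backups to AWS" can look like at the simple end, here is a minimal dump-and-upload sketch; the bucket name, database name, and paths are placeholders I've assumed.
    # Nightly dump-and-upload backup job (run from cron or a systemd timer).
    # Bucket, database name and paths are placeholder assumptions.
    import datetime
    import subprocess

    import boto3

    BUCKET = "example-offsite-backups"
    DBNAME = "appdb"

    def main() -> None:
        stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
        dump_path = f"/tmp/{DBNAME}-{stamp}.dump"

        # Custom-format dump compresses well and supports selective restore.
        subprocess.run(["pg_dump", "-Fc", "-f", dump_path, DBNAME], check=True)

        # Infrequent-access storage keeps the monthly cost low for cold backups.
        boto3.client("s3").upload_file(
            dump_path,
            BUCKET,
            f"postgres/{DBNAME}/{stamp}.dump",
            ExtraArgs={"StorageClass": "STANDARD_IA"},
        )

    if __name__ == "__main__":
        main()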
cheema33 [3 hidden]5 mins ago
I have a similar system from Hetzner. I pay around $100 for it. No bandwidth cap.
I have set up encrypted backups that go to my backup server in the office. We have gigabit service at the office. Critical data changes are backed up every hour, and a full backup runs once a day.
boplicity [3 hidden]5 mins ago
Yeah -- I know I could probably get a better deal. I pay more for premium support ($200), as well as a North American location. Plus, probably an additional premium for not wanting to go through the effort of switching servers.
Lucasoato [3 hidden]5 mins ago
Hetzner is definitely an interesting option. I’m a bit scared of managing the services on my own (like Postgres, Site2Site VPN, …) but the price difference makes it so appealing. From our financial models, Hetzner can win over AWS when you spend over 10~15K per month on infrastructure and you’re hiring really well. It’s still a risk, but a risk that definitely can be worthy.
mrweasel [3 hidden]5 mins ago
> I’m a bit scared of managing the services on my own
I see it from the other direction: if something fails, I have complete access to everything, meaning I have a chance of fixing it. That goes down to the hardware, even. When I run stuff in the cloud, things get abstracted away, hidden behind APIs, and data lives beyond my reach.
Security and regular mistakes are much the same in the cloud, but then I have to layer whatever complications the cloud provider comes with on top. The cost has to be much, much lower if I'm going to trust a cloud provider over running something in my own data center.
adamcharnock [3 hidden]5 mins ago
You sum it up very neatly. We've heard this from quite a few companies, and that's kind of why we started ours.
We figured, "Okay, if we can do this well, reliably, and de-risk it; then we can offer that as a service and just split the difference on the cost savings"
(plus we include engineering time proportional to cluster size, and also do the migration on our own dime as part of the de-risking)
wulfstan [3 hidden]5 mins ago
I've just shifted my SWE infrastructure from AWS to Hetzner (literally in the last month). My current analysis looks like it will be about 15-20% of the cost - £240 vs 40-50 euros.
Expect a significant exit expense, though, especially if you are shifting large volumes of S3 data. That's been our biggest expense. I've moved this to Wasabi at about 8 euros a month (vs about $70-80 a month on S3), but I've paid transit fees of about $180 - and it was more expensive because I used DataSync.
Retrospectively, I should have just DIYed the transfer, but maybe others can benefit from my error...
adamcharnock [3 hidden]5 mins ago
FYI, AWS offers free egress when leaving them (because they were forced to by EU regulation, but they chose to offer it globally).
But. Don't leave it until the last minute to talk to them about this. They don't make it easy, and require some warning (think months, IIRC)
sciencejerk [3 hidden]5 mins ago
Thank God for the EU regulations. USA has been too lax about cracking down on anti-competitive market practices
wulfstan [3 hidden]5 mins ago
Extremely useful information - unfortunately I just assumed this didn't apply to me because I am in the UK and not the EU. Another mistake, though given it's not huge amounts of money I will chalk it up to experience.
Hopefully someone else will benefit from this helpful advice.
iso1631 [3 hidden]5 mins ago
> I’m a bit scared of managing the services on my own (like Postgres, Site2Site VPN, …)
Out of interest, how old are you? This was quite a normal expectation of a technical department even 15 years ago.
christophilus [3 hidden]5 mins ago
I’m curious to know the answer, too. I used to deploy my software on-prem back in the day, and that always included an installation of Microsoft SQL Server. So, all of my clients had at least one database server they had to keep operational. Most of those clients didn’t have an IT staff at all, so if something went wrong (which was exceedingly rare), they’d call me and I’d walk them through diagnosing and fixing things, or I’d Remote Desktop into the server if their firewalls permitted and fix it myself. Backups were automated and would produce an alert if they failed to verify.
It’s not rocket science, especially when you’re talking about small amounts of data (small credit union systems in my example).
infecto [3 hidden]5 mins ago
No, it was not. 15 years ago Heroku was all the rage. Even the places that had bare metal usually had someone running something similar to DevOps, and at least core infra was not being touched. I am sure such places existed, but 15 years ago, while a long time, was already pretty far along from what you describe. At least in SV.
acdha [3 hidden]5 mins ago
Heroku was popular with startups who didn’t have infrastructure skills but the price was high enough that anyone who wasn’t in that triangle of “lavish budget, small team, limited app diversity” wasn’t using it. Things like AWS IaaS were far more popular due to the lower cost and greater flexibility but even that was far from a majority service class.
infecto [3 hidden]5 mins ago
I am not sure if you are trying to refute my lived experience or what exactly the point is. Heroku was wildly popular with startups at the time, not just those with lavish budgets. I was already touching RDS at this point, and even before RDS came around, no organization I worked at had me jumping on bare metal to provision services myself. There was always a system in place where someone helped engineering deploy systems. I know this was not always the case, but the person I was responding to made it sound like 15 years ago all engineers were provisioning their own databases and doing other types of dev/sys ops on a regular basis. It’s not true, at least in SV.
sanderjd [3 hidden]5 mins ago
A tricky thing on this site is that there are lots of different people with very different kinds of experience, which often results in people talking past each other. A lot of people here have experience as zero-to-one early startup engineers, and yep, I share your experience that Heroku was very popular in that space. A lot of other people have experience at later growth and infrastructure focused startups, and they have totally different experiences. And other people have experience as SREs at big tech, or doing IT / infrastructure for non-tech fortune 500 businesses. All of these are very different experiences, and very different things have been popular over the last couple decades depending on which kind of experience you have.
infecto [3 hidden]5 mins ago
Absolutely true, but I also think it’s a fair callout when the intent of the original post (asking how old someone was) was to imply that 15 years ago everyone was stringing together their own services, which is absolutely not true. There were many shades of gray at that time: in my experience either having a sysops/devops team to help or deploying to Heroku, as well as folks who were indeed stringing together services.
I find it equally disingenuous to suggest that Heroku was only for startups with lavish budgets. Absolutely not true. That’s my only purpose here. Everyone has different experiences, but don’t go and push your own narrative as the only one, especially when it’s not true.
sanderjd [3 hidden]5 mins ago
I kind of thought the "15 years" was just one of those things where people kind of forget what year it is. Wow, 2010 was already over 15 years ago?? That kind of mistake. I think this person was thinking pre-2005. I graduated college just after that, and that's when all this cloud and managed services stuff was just starting to explode. I think it's true that before that, pretty much everyone was maintaining actual servers somewhere. (For instance, I helped out with the physical servers for our CS lab some when I was in college. Most of what we hosted on those would be easier to do on the cloud now, but that wasn't a thing then.)
acdha [3 hidden]5 mins ago
I have no doubt that was your experience. My point was that it wasn’t even common in SV as a whole, just the startup scene. Think about headcount: how many times fewer people worked at your startup than at any one of Apple, Oracle, HP, Salesforce, Intuit, eBay, Yahoo, etc.? Then think about how many other companies there are just in the Bay Area with large IT investments, even if they’re not tech companies.
Even at their peak, Heroku was a niche. If you’d gone to conferences like WWDC or PyCon at the time, they’d have been well represented, yes, and plenty of people liked them, but it wasn’t a secret that they didn’t cover everyone’s needs or that the pricing was off-putting for many people - and that tended to get worse the bigger the company you talked to, because larger organizations have more complex needs, and they use enough stuff that they already have teams of people with those skills.
iso1631 [3 hidden]5 mins ago
> Heroku was wildly popular with startups
The world's a lot bigger than startups
infecto [3 hidden]5 mins ago
Did you fail to finish reading the rest? At the same time I was in touch with organizations that were still in data centers, but I as an engineer had no contact with the bare metal, and ticket systems were in place to help provision the necessary services. I was not deploying my own Postgres database.
Your original statement is factually incorrect.
unethical_ban [3 hidden]5 mins ago
SV and financial services are quite different.
It's 2026 and banks are still running their mainframes, running Windows VMs on VMware, and building their enterprise software with Java.
The big boys still have datacenters they own.
Sure, they try dabbling with cloud services; maybe they've pushed their edge out there, along with some minor services they can afford to experiment with.
infecto [3 hidden]5 mins ago
If you are working at a bank, you are most likely not standing up your own Postgres and related services. Even 15 years ago. I am not saying it never happened; I am saying that even 15 years ago, even large orgs with data centers often had sysops and devops teams in place that helped with provisioning resources. Obviously not the rule, but also not an exception.
unethical_ban [3 hidden]5 mins ago
True. We had separate teams for Oracle and MSSQL management. We had 3 teams each for Windows, "midrange" (Unix) and mainframe server management. That doesn't include IAM.
Lucasoato [3 hidden]5 mins ago
Ahah I'm 31, but deciding if it makes sense to manage your own db doesn't depend on the age of the CTO.
See, turning up a VM, installing and running Postgres is easy.
The hard part is keeping it updated, keeping the OS updated, automating backups, deploying replicas, encrypting the volumes and the backups, demonstrating all of the above to a third-party auditor... and mind that there are probably many other things I'm honestly not even aware of!
I'm not saying I won't go that path; it might be a good idea after a certain scale. But in the first and second year of a startup your mind should be 100% on "How can I make my customer happy?" rather than "We failed the audit again; we won't have the SOC 2 Type I certification in time to sign that new customer."
If deciding between Hetzner and AWS was so easy, one of them might not be pricing its services correctly.
baby [3 hidden]5 mins ago
I’m wondering if it makes sense to distribute your architecture so that workers who do most of the heavy lifting are on Hetzner, while the other stuff is in costly AWS. On the other hand, this means you don’t have easy access to S3, etc.
rockwotj [3 hidden]5 mins ago
networking costs are so high in AWS I doubt this makes sense
mattbillenstein [3 hidden]5 mins ago
Depends on how data-heavy the work is. We run a bunch of GPU training jobs on other clouds with the data ending up in S3 - weighing the extra transfer costs against what we save by getting the GPUs from the cheapest cloud available, it makes a lot of sense.
Also, availability of these things on AWS has been a real pain - I think every startup got a lot of credits there, so there's a flood of people trying to use them.
objektif [3 hidden]5 mins ago
No amount of money will make me maintain my own dbs. We tried it at first and it was a nightmare.
dev_l1x_be [3 hidden]5 mins ago
Or CDN, queues, log service, observability, distributed storage. I am not even sure what the people in the on-prem vs cloud argument think. If you need highly specialised infra with one or two core services and a lower-tier network is okay, then on-prem is okay. Otherwise it is a never-ending quest to re-discover the millions of engineering hours that went into building something like AWS.
g8oz [3 hidden]5 mins ago
It's worth becoming good at.
sanderjd [3 hidden]5 mins ago
Is it though? This is a genuine question. My intuition is that the investment of time / stress / risk to become good at this is unlikely to have high ROI to either the person putting in that time or to the business paying them to do so. But maybe that's not right.
Symbiote [3 hidden]5 mins ago
Managing the PostgreSQL databases is a medium to low complexity task as I see it.
Take two equivalent machines, set up with streaming replication exactly as described in the documentation, add Bacula for backups to an off-site location for point-in-time recovery.
We haven't felt the need to set up auto fail-over to the hot spare; that would take some extra effort (and is included with AWS equivalents?) but nothing I'd be scared of.
Add monitoring that the DB servers are working, replication is up-to-date and the backups are working.
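For the monitoring piece, a minimal replication-lag check (the DSN and threshold below are assumptions) can be as small as:
    # Minimal replication-health check, suitable for a monitoring system to run
    # against the primary. DSN and lag threshold are placeholder assumptions.
    import sys

    import psycopg2

    PRIMARY_DSN = "host=pg-primary.internal dbname=postgres user=monitor"
    MAX_LAG_BYTES = 64 * 1024 * 1024  # alert if a standby falls 64 MiB behind

    def main() -> int:
        with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT application_name,"
                "       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)"
                "  FROM pg_stat_replication"
            )
            rows = cur.fetchall()
        if not rows:
            print("CRITICAL: no standbys connected")
            return 2
        worst = max(lag or 0 for _name, lag in rows)
        if worst > MAX_LAG_BYTES:
            print(f"WARNING: replication lag of {worst} bytes")
            return 1
        print("OK: replication healthy")
        return 0

    if __name__ == "__main__":
        sys.exit(main())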
cheema33 [3 hidden]5 mins ago
> Managing the PostgreSQL databases is a medium to low complexity task as I see it.
Same here. But I assume you have managed PostgreSQL in the past. I have. There are a large number of software devs who have not. For them, it is not a low-complexity task. And I can understand that.
I am a software dev for our small org and I run the servers and services we need. I use ansible and terraform to automate as much as I can. And recently I have added LLMs to the mix. If something goes wrong, I ask Claude to use the ansible and terraform skills that I created for it, to find out what is going on. It is surprisingly good at this. Similarly I use LLMs to create new services or change configuration on existing ones. I review the changes before they are applied, but this process greatly simplifies service management.
Dylan16807 [3 hidden]5 mins ago
> Same here. But, I assume you have managed PostgreSQL in the past. I have. There are a large number of people software devs who have not. For them, it is not a low complexity task. And I can understand that.
I'd say needing to read the documentation for the first time is what bumps it up from low complexity to medium. And then at medium you should still do it if there's a significant cost difference.
sanderjd [3 hidden]5 mins ago
For what it's worth, I have also managed my own databases, but that's exactly why I don't think it's a good use of my time. Because it does take time! And managed database options are abundant, inexpensive, and perform well. So I just don't really see the appeal of putting time into this.
mattbillenstein [3 hidden]5 mins ago
If you have a database, you still have work to do - optimizing, understanding indexes, etc. Managed services don't solve these problems for you magically and once you do them, just running the db itself isn't such a big deal and it's probably easier to tune for what you want to do.
sanderjd [3 hidden]5 mins ago
Absolutely yes. But you have to do this either way. So it's just purely additive work to run the infrastructure as well.
I think if it were true that the tuning is easier if you run the infrastructure yourself, then this would be a good point. But in my experience, this isn't the case for a couple reasons. First of all, the majority of tuning wins (indexes, etc.) are not on the infrastructure side, so it's not a big win to run it yourself. But then also, the professionals working at a managed DB vendor are better at doing the kind of tuning that is useful on the infra side.
objektif [3 hidden]5 mins ago
How do you manage availability zones in your fully self managed setup?
sanderjd [3 hidden]5 mins ago
This sounds medium to high complexity to me. You need to do all those things, and also have multiple people who know how to do them, and also make sure that you don't lose all the people who know how to do them, and have one of those people on call to be able to troubleshoot and fix things if they go wrong, and have processes around all that. (At least if you are running in production with real customers depending on you, you should have all those things.)
With a managed solution, all of that is amortized into your monthly payment, and you're sharing the cost of it across all the customers of the provider of the managed offering.
Personally, I would rather focus on things that are in or at least closer to the core competency of our business, and hire out this kind of thing.
objektif [3 hidden]5 mins ago
You are right. Are you actually seriously considering whether to go fully managed or self managed at this point? Pls go AWS route and thank me later :)
sanderjd [3 hidden]5 mins ago
No not at all, I have the same opinion as you! But I'm curious to understand the opposite view.
riku_iki [3 hidden]5 mins ago
> We haven't felt the need to set up auto fail-over to the hot spare; that would take some extra effort (and is included with AWS equivalents?) but nothing I'd be scared of.
This part is actually the scariest, since there are like 10 different 3rd-party solutions of unknown stability and maintainability.
objektif [3 hidden]5 mins ago
I really do not think so. Most startups should rather focus on their core competency and direct engineering resources to their edge. When you are $100 mln ARR then feel free to mess around with whatever db setup you want.
ibejoeb [3 hidden]5 mins ago
Dead on. Recently, 3 and 4 have been compelling. Cloud costs have rocketed up. I started my casual transition to co-lo 2 years ago and just finished everything in December. I have more capacity at about 30% of the cost. If you go option 3, you even get the benefit of 6+ months of retro pricing for RAM/storage. I'm running all DDR4, but I have so much of it I don't know what to do with it.
The flip side is that compliance is a little more involved. Rather than, say, carve out a whole swathe of SOC-2 ops, I have to coordinate some controls. It's not a lot, and it's still a lot lighter than I used to do 10+ years ago. Just something to consider.
mgaunard [3 hidden]5 mins ago
You're missing 5: what they are doing.
There is a world of difference between renting some cabinets in an Equinix datacenter and operating your own.
adamcharnock [3 hidden]5 mins ago
Fair point!
5 - Datacenter (DC) - Like 4, except also take control of the space/power/HVAC/transit/security side of the equation. Makes sense either at scale, or if you have specific needs. Specific needs could be: specific location, reliability (higher or lower than a DC), resilience (conflict planning).
There are actually some really interesting use cases here. For example, reliability: if your company is in a physical office, how strong is the need to run your internal systems in a data centre? If you run your servers in your office, then there are no connectivity reliability concerns. If the power goes out, then the power is out to your staff's computers anyway (still get a UPS though).
Or perhaps you don't need as high reliability if you're doing only batch workloads? Do you need to pay the premium for redundant network connections and power supplies?
If you want your company to still function in the event of some kind of military conflict, do you really want to rely on fibre optic lines between your office and the data center? Do you want to keep all your infrastructure in such a high-value target?
I think this is one of the more interesting areas to think about, at least for me!
jermaustin1 [3 hidden]5 mins ago
When I worked IT for a school district at the beginning of my career (2006-2007), I was blown away that every school had a MASSIVE server room (my office at each school - the MDF). 3-5 racks filled (depending on school size and connection speed to the central DC - data closet): 50-75% was networking equipment (5 PCs per class, hardwired), 10% was the Novell NetWare server(s) and storage, and the other 15% was application storage for app distributions on login.
mgaunard [3 hidden]5 mins ago
Personally I haven't seen a scenario where it makes sense beyond a small experimental lab where you value the ability to tinker physically with the hardware regularly.
Offices are usually very expensive real estate in city centers, with very limited cooling capabilities.
Then again the US is a different place, they don't have cities like in Europe (bar NYC).
kryptiskt [3 hidden]5 mins ago
If you are a bank or a bookmaker or similar you may well want to have total control of physical access to the machines. I know one bookmaker I worked with had their own mini-datacenter, mainly because of physical security.
tomcam [3 hidden]5 mins ago
I am pretty forward-thinking but even when I started writing my first web server 30+ years ago I didn’t foresee the day when the phrase “my bookie’s datacenter” might cross my lips.
mgaunard [3 hidden]5 mins ago
Most trading venues are in Equinix data centers.
direwolf20 [3 hidden]5 mins ago
If you have less than a rack of hardware, if you have physical security requirements, and/or your hardware is used in the office more than from the internet, it can make sense.
noosphr [3 hidden]5 mins ago
5 was a great option for ML work last year, since rented colo didn't come with a 10 kW cable. With RAM, SSD and GPU prices the way they are now, I have no idea what you'd need to do.
Thank goodness we did all the capex before the OpenAI RAM deal, back when expensive Nvidia GPUs were the worst we had to deal with.
eru [3 hidden]5 mins ago
> 4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.
Is it still the cheapest after you take into account that skills, scale, cap-ex and long term lock-in also have opportunity costs?
graemep [3 hidden]5 mins ago
That is why the second "if" is there.
You can get locked into cloud too.
The lock-in is not really long-term, as it is an easy option to migrate off of.
weavie [3 hidden]5 mins ago
What is the upper limit of Hetzner? Say you have an AWS bill in the $100s of millions; could Hetzner realistically take on that scale?
adamcharnock [3 hidden]5 mins ago
An interesting question, so time for some 100% speculation.
It sounds like they probably have revenue in the €500mm range today. And given that the bare metal cost of AWS-equivalent bills tends to be a 90% reduction, we'll say a €10mm+ bare metal cost.
So I would say a cautious and qualified "yes". But I know even for smaller deployments of tens or hundreds of servers, they'll ask you what the purpose is. If you say something like "blockchain," they're going to say, "Actually, we prefer not to have your business."
I get the strong impression that while they naturally do want business, they also aren't going to take a huge amount of risk on board themselves. Their specialism is optimising on cost, which naturally has to involve avoiding or mitigating risk. I'm sure there'd be business terms to discuss, put it that way.
StilesCrisis [3 hidden]5 mins ago
Why would a client who wants to run a blockchain be risky for Hetzner? I'm not a fan; I just don't see the issue. If the client pays their monthly bill, who cares if they're using the machine to mine Bitcoin?
Symbiote [3 hidden]5 mins ago
They are certain to run the machines at 100% continually, which will cost more than a typical customer who doesn't do this, and leave the old machines with less second-hand value for their auction thing afterwards.
mbreese [3 hidden]5 mins ago
I’d bet the main reason would be power. Running machines at 100% doesn't subtract much extra from their lifespan, but a server running hard 24 hours a day uses far more power than a bursty workload.
(While we’re all speculating)
ndriscoll [3 hidden]5 mins ago
Also very subject to wildly unstable market dynamics. If it's profitable to mine, they'll want as much capacity as they can get, leading Hetzner to over provision. Then once it becomes unprofitable, they'll want to stop all mining, leaving a ton of idle, unpaid machines. Better to have stable customers that don't swing 0-100 utilization depending on ability to arbitrage compute costs.
I wouldn't be surprised if mining is also associated with fraud (e.g. using stolen credit cards to buy compute).
geocar [3 hidden]5 mins ago
Who are you thinking of?
Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (single-buyer) spend more than that with AWS?
Hetzner's revenue is somewhere around $400m, so it's probably a little scary taking on an additional 30% of revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than they are.
Sometimes, if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on a Netflix-sized customer, for nontechnical reasons.
However, increasing PoP capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea.
phiresky [3 hidden]5 mins ago
A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.
geocar [3 hidden]5 mins ago
> A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.
I'm not convinced.
I assume someone at Netflix has thought about this, because if that were true and as simple as you say, Netflix would simply just buy Hetzner.
I think there are lots of reasons you could have this experience, and it still wouldn't be Netflix's experience.
For one, big applications tend to get discounts. A decade ago, I (the company I was working for) was paying Amazon a mere $0.2M a month and getting much better prices from my account manager than were posted on the website.
There are other reasons (mostly from my own experiences pricing/costing big applications, but also due to some exotic/unusual Amazon features I'm sure Netflix depends on) but this is probably big enough: Volume gets discounts, and at Netflix-size I would expect spectacular discounts.
I do not think we can estimate the factor better than 1.5-2x without a really good example/case-study of a company someplace in-between: How big are the companies you're thinking about? If they're not spending at least $5m a month I doubt the figures would be indicative of the kind of savings Netflix could expect.
varsketiz [3 hidden]5 mins ago
We run our own infrastructure, sometimes with our own financing (4), sometimes external (3). The cost is in the tens of millions per year.
When I used to compare to AWS, egress alone at list price cost as much as my whole infra hosting. All of it.
I would be very interested to understand why Netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising the costs of creating original content rather than the cloud bill.
direwolf20 [3 hidden]5 mins ago
That $120m will become $12m when they're not using AWS.
weavie [3 hidden]5 mins ago
I'm largely just thinking $HUGE when throwing out that number, but there are plenty of companies that have cloud costs in that range. A quick search brings up Walmart, Meta, Netflix, Spotify, Snap, JP Morgan.
Quarrel [3 hidden]5 mins ago
> Hetzner's revenue is somewhere around $400m, so probably a little scary taking on an additional 30% revenue from a single customer
A little scary for both sides.
Unless we're misunderstanding something I think the $100Ms figure is hard to consider in a vacuum.
objektif [3 hidden]5 mins ago
Figma apparently spends around $300-400k/day on AWS. I think this puts them up there.
mbreese [3 hidden]5 mins ago
How is this reasonable? At what point do they pull a Dropbox and de-AWS? I can’t think of what they would gain with AWS over in-house hosting at that point.
I’m not surprised, but you’d think there would be some point where they would decide to build a data center of their own. It’s a mature enough company.
sanderjd [3 hidden]5 mins ago
This #2 space that Lithus is in is not something I'm very familiar with, so thank you for the comment that piqued my interest!
If you're willing to share, I'm curious who else you would describe as being in this space.
My last decade and a half or so of experience has all been in cloud services, and prior to that it was #3 or #4. What was striking to me when I went to the Lithus website was that I couldn't figure out any details without hitting a "Schedule a Call" button. This makes it difficult for me to map my experiences in using cloud services onto what Lithus offers. Can I use Terraform? How does the Kubernetes offering work? How do the ML/AI data pipelines work? To me, it would be nice if I could try it out in a very limited way as self-service, or at least read some technical documentation. Without that, I'm left wondering how it works. I'm sure not doing this is a conscious decision, and for good reasons, but I thought I'd share my impressions!
adamcharnock [3 hidden]5 mins ago
Hello! I think this is a fair question, and improving the communication on the website is something that is steadily climbing up our priority list.
We're not really that kind of product company; we're more of a services company. What we do is deploy Kubernetes clusters onto bare metal servers. That's the core technical offering. However, everything beyond that is somewhat per-client. Some clients need a lot of compute. Some clients need a custom object storage cluster. Some clients need a lot of high-speed internal networking. Which is why we prefer to have a call to figure out specifically what your needs are. But I can also see how this isn't necessarily satisfying if you're used to just grabbing the API docs and having a look around.
What we will do is take your company's software stack, migrate it off AWS/Azure/Google, and deploy it onto our new infrastructure. We will then become (or work with) your DevOps team to support you. This can be anything from containerising workloads to diagnosing performance issues to deploying a new multi-region Postgres cluster - whatever you need done on your hardware that we feel we can reasonably support. We are the ones on-call should NATS fall over at 4am.
Your team also has full access to the Kubernetes cluster to deploy to as you wish.
I think the pricing page is the most concrete thing on our website, and it is entirely accurate. If you were to phone us and say, "I want that exact hardware," we would do it for you. But the real value we also offer is in the DevOps support we provide, actually doing the migration up-front (at our own cost), and being there working with your team every week.
sanderjd [3 hidden]5 mins ago
This makes total sense to me. I'm thinking through the flow that would lead me to be a customer of yours.
In my current job, I think we're honestly a bit past the phase where I would want to take on a migration to a service like yours. We already have a good team of infrastructure folks running our cloud infrastructure, and we have accepted the lock-in of various AWS managed services. So the high-touch devops support doesn't sound that useful to me (we already have people who are good at this), and replacing all the locked-in components seems unlikely to have good ROI. I think we'd be more likely to go straight to #3 if we decided to take that on to save money.
But I'll probably be a founder or early employee at a new startup again someday, and I'm intrigued by your offering from that perspective. But it seems pretty clear to me that I shouldn't call you up on day 1, because I'm going to be nowhere near $5k a month, and I want to move faster than calling someone up to talk about my needs. I want to self-serve a small amount of usage, and cloud services seem really great for that. But this is how they get you! Once you've started with a particular cloud service, it's always easiest to take on more lock-in.
At some point between these two situations, though, I can see where your offering would be great. But the decision point isn't all that clear to me. In my experience, by the time you start looking at your AWS bill and thinking "crap, that seems pretty expensive", you have better things to do than an infrastructure migration, and you have taken on some lock-in.
I do like the idea of high-touch services to solve the breaking-the-lock-in challenge! I'll certainly keep this in mind next time I find myself in this middle ground where the cloud starts feeling more expensive than it's worth, but we don't want to go straight to #3.
whiplash451 [3 hidden]5 mins ago
> Option 1 is great for startups
Unfortunately, (successful) startups can quickly get trapped into this option. If they're growing fast, everyone on the board will ask why you'd move to another option in the first place. The cloud becomes a very deep local minimum that's hard to get out of.
Schlagbohrer [3 hidden]5 mins ago
Can someone explain option 2 to me? How is a managed private cloud different from full cloud? Like you are still using AWS or Azure, but you are keeping all your operations in a bundled, portable way, so you can leave that provider easily at any time, rather than becoming very dependent on them? Is it like staying provider-agnostic but still cloud based?
adamcharnock [3 hidden]5 mins ago
To put it plainly: We deploy a Kubernetes cluster on Hetzner dedicated servers and become your DevOps team (or a part thereof).
It works because bare metal is about 10% the cost of cloud, and our value-add is in 1) creating a resilient platform on top of that, 2) supporting it, 3) being on-call, and 4) being or supporting your DevOps team.
This starts with us providing a Kubernetes cluster which we manage, but we also take responsibility for the services running on it. If you want Postgres, Redis, Clickhouse, NATS, etc, we'll deploy it and be SLA-on-call for any issues.
If you don't want to deal with Kubernetes then you don't have to. Just have your software engineers hand us the software and we'll handle deployment.
Everything is deployed on open source tooling, and you have access to all the configuration for the services we deploy. You have server root access. If you want to leave, you can.
Our customers have full root access, and our engineers (myself included) are in a Slack channel with your engineers.
And, FWIW, it doesn't have to be Hetzner. We can colocate or use other providers, but Hetzner offer excellent bang-per-buck.
Edit: And all this is included in the cluster price, which comes out cheaper than the same hardware on the major cloud providers
mancerayder [3 hidden]5 mins ago
You give customers root but you're on call when something goes tits up?
You're a brave DevOps team. That would cause a lot of friction in my experience, since people with root or other administrative privileges do naughty things, but others are getting called in on Saturday afternoon.
belthesar [3 hidden]5 mins ago
From a platform risk perspective, each tenant has dedicated resources, so it's their platform to blow up. If a customer with root access blows up their own system, then the resources from the MSP to fix it are billable, and the after-action meetings would likely include a review of whether that access is appropriate, if additional training is needed to prevent those issues in the future (also billable), or if the customer-provider relationship is the right fit. Will the on-call resource be having a bad time fixing someone else's screw up? Yeah, and having been that guy before, I empathize. The business can and should manage this relationship however, so that it doesn't become an undue burden on their support teams. A customer platform that is always getting broken at 4pm on a Friday when an overzealous customer admin is going in and deciding to run arbitrary kubectl commands takes support capacity away from other customers when a major incident happens, regardless of how much you're making in support billing.
adamcharnock [3 hidden]5 mins ago
This is essentially how it is. Additionally, the reality is that our customers don't often even need to think about using root access, but they have it if they want it. They are putting a lot of trust in us, so we also put trust in them.
victorbjorklund [3 hidden]5 mins ago
Instead of using the cloud's own Kubernetes service, for example, you just buy the compute and run your own Kubernetes cluster. At a certain scale that is going to be cheaper, if you have the know-how. And since you are no longer tied to the services a particular provider offers and just need access to compute and storage, you can also shop around for better prices than Amazon or Azure; you can really go to any VPS provider.
megggan [3 hidden]5 mins ago
Getting rid of a bureaucratic internal IT department is a game changer for productivity. That alone is worth 10x infra costs, especially for big companies where work can grind to a halt dealing with obstructionists through ServiceNow. Good leaders understand this.
bell-cot [3 hidden]5 mins ago
Sadly true. Or, the so-called internal IT Dept. can be a shambolic mess of PHB's, Brunchlords, Catberts, metric maximizers, and micromanagers, presiding over the hollowed-out and burned out remains of the actual workforce that you'd need to reliably do the job.
CrzyLngPwd [3 hidden]5 mins ago
#2.5ish
We rent hardware and also some VPS, as well as use AWS for cheap things such as S3 fronted with Cloudflare, and SES for priority emails.
We have other services we pay for, such as AI content detection, disposable email detection, a small postal email server, and more.
We're only a small business, so having predictable monthly costs is vital.
Our servers are far from maxed out, and we process ~4 million dynamic page and API requests per day.
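To put that traffic figure in perspective, some rough arithmetic (the 5x peak factor below is an assumption, not something stated here):

```python
# Average and assumed-peak request rates for ~4 million requests/day.
requests_per_day = 4_000_000
avg_rps = requests_per_day / 86_400   # 86,400 seconds in a day -> ~46 req/s
peak_rps = avg_rps * 5                # assume peaks around 5x the average
print(f"~{avg_rps:.0f} req/s average, ~{peak_rps:.0f} req/s at an assumed 5x peak")
```

Even at the assumed peak that is a couple of hundred requests per second, which is consistent with the servers being far from maxed out.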
Archelaos [3 hidden]5 mins ago
I am using something in between 2 and 3: a hosted website and database service with excellent customer support. On shared hardware it is 22 €/month. A managed server on dedicated hardware starts at about 50 €/month.
rcpt [3 hidden]5 mins ago
5. On-premise and engineers touch the wires every few days.
doctorpangloss [3 hidden]5 mins ago
Where do AWS reserved instances come into your hierarchy? What if there existed a “perpetual” reserved instance? Is cap-ex vs. op-ex really the key distinction?
preisschild [3 hidden]5 mins ago
Been using Hetzner Cloud for Kubernetes and generally like it, but it has its limitations. The network is highly unpredictable. You get 2 Gbit/s at best, but at worst only a few hundred Mbit/s.
Is that for the virtual private network? I heard some people say that you actually get higher bandwidth if you're using the public network instead of the private network within Hetzner, which is a little bit crazy.
direwolf20 [3 hidden]5 mins ago
Hetzner dedicated is pretty bad at private networks, so bad you should use a VPN instead. Don't know about the cloud side of things.
jgalt212 [3 hidden]5 mins ago
We looked at option 4. And colocation is not cheap. It was cheaper for us to lease VMs from Hetzner than to buy boxes and colocate at Equinix.
DyslexicAtheist [3 hidden]5 mins ago
This is what we did in the '90s into the mid-2000s:
> Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills
Back then this type of "skill" was abundant. You could easily get sysadmin contractors who would take a drive down to the data center (probably rented facilities in real estate that belonged to a bank or an insurer) to exchange some disks that had died for some reason. Such a person was full stack in the sense that they covered backups, networking, firewalls, and knew how to source hardware.
The argument was that this was too expensive and the cloud was better. So hundreds of thousands of SMEs embraced the cloud - most of them never needed Google-type scale, but got sucked into the "recurring revenue" grift that is SaaS.
If you opposed this mentality you were basically saying "we as a company will never scale this much", which was at best "toxic" and at worst "career-ending".
The thing is these ancient skills still exist. And most orgs simply do not need AWS type of scale. European orgs would do well to revisit these basic ideas. And Hetzner or Lithus would be a much more natural (and honest) fit for these companies.
belorn [3 hidden]5 mins ago
I wonder how much companies pay yearly in order to avoid having an employee pick up a drive from a local store, drive to the data center, pull the disk drive, screw out the failing hard drive and put in the new one, add it in the raid, verify the repair process has started, and then return to the office.
Symbiote [3 hidden]5 mins ago
I don't think I've ever seen a non-hot-swap disk in a normal server. The oldest I dealt with had 16 HDDs per server, and only 12 were accessible from the outside, but the 4 internal ones were still hot-swap after taking the cover off.
Even some really old (2000s-era) junk I found in a cupboard at work was all hot-swap drives.
But more realistically in this case, you tell the data centre "remote hands" person that a new HDD will arrive next-day from Dell, and it's to go in server XYZ in rack V-U at drive position T. This may well be a free service, assuming normal failure rates.
belorn [3 hidden]5 mins ago
Yes, I did write that a bit hastily. I changed the above to the normal process. As it happened we just installed a server without hot-swap disks, but to be fair that is the first one I have personally seen in the last 20 years.
Remote hands is a thing indeed. Servers also tend to be mostly pre-built nowadays by server retailers, even when buying more custom-made ones like Supermicro where you pick each component. There aren't that many parts to a generic server purchase. It's a chassis, motherboard, CPU, memory, and disks. The PSU tends to be determined by the motherboard/chassis choice, same with disk backplanes/RAID/IPMI/network/cables/ventilation/shrouds. The biggest work is in making the correct purchase, not in the assembly. Once delivered you put on the rails, install any additional items not pre-built, put it in the rack and plug in the cables.
amluto [3 hidden]5 mins ago
In the Bay Area there are little datacenters that will happily colocate a rack for you and will even provide an engineer who can swap disks. The service is called “remote hands”. It may still be faster to drive over.
It baffles me that my career trajectory somehow managed to insulate me from ever having to deal with the cloud, while such esoteric skills as swapping a hot swap disk or racking and cabling a new blade chassis are apparently on the order of finding a COBOL developer now. Really?
I can promise you that large financial institutions still have datacenters. Many, many, many datacenters!
direwolf20 [3 hidden]5 mins ago
We had two racks in our office of mostly developers. If you have an office you already have a rack for switches and patch panels. Adding a few servers is obvious.
Software development isn't a typical SME however. Mike's Fish and Chips will not buy a server and that's fine.
bpavuk [3 hidden]5 mins ago
if someone on the DevOps team knows Nix, option 3 becomes a lot cheaper time-wise! yeah, Nix flakes still need maintenance, especially on the `nixos-unstable` branch, but you get the quickest disaster recovery route possible!
plus, infra flexibility removes random constraints that e.g. Cloudflare Workers have
slyall [3 hidden]5 mins ago
There are a bunch of ways to manage bare metal servers apart from Nix. People have been doing it for years: Kickstart, Foreman, MAAS, etc. [0]. There are many to choose from according to your needs and the layers you want them to manage.
Reality is these days you just boot a basic image that runs containers
Indeed! We've yet to go down this route, but it's something we're thinking on. A friend and I have been talking about how to bring Nix-like constructs to Kubernetes as well, which has been interesting. (https://github.com/clotodex/kix, very much in the "this is fun to think about" phase)
Option 4 as well, that's how we do it at work and it's been great. However, it can't really be "someone on the team knows Nix", anyone working on Ops will need Nix skills in order to be effective.
lstodd [3 hidden]5 mins ago
Why this fixation on Nix? You don't need Nix to run bare metal.
preisschild [3 hidden]5 mins ago
I'm a NixOS fan, but been using Talos Linux on Hetzner nodes (using Cluster-API) to form a Kubernetes Cluster. Works great too!
scalemaxx [3 hidden]5 mins ago
Everything comes full circle. Back in my day, we just called it a "data center". Or on-premise. You know, before the cloud even existed. A 1990s VP of IT would look at this post and say: what's new? Better computing for sure. Better virtualization and administration software, definitely. Cooling and power and racks? More of the same.
The argument made 2 decades ago was that you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex). The rationale was you exchange ownership for rent. Make your headache someone else's headache.
The ping pong between centralized vs decentralized, owned vs rented, will just keep going. It's never an either or, but when companies make it all-or-nothing then you have to really examine the specifics.
IG_Semmelweiss [3 hidden]5 mins ago
There's a very interesting insight from your message.
The Cloud providers made a lot of sense to finance departments since aside from the promised savings, you would take that cloud expense now and lower your tax rate.
After the passing of the One Big Beautiful Bill ("OBB"), the law allows you to accelerate CapEx depreciation and expense it immediately[1], similar to the benefit you get from cloud service providers.
This puts way more wind in the sails of the on-prem movement, for sure.
> you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex)
That was part of the reason.
The real reason was that the internal infrastructure team in many orgs got you nowhere. There was a huge queue, and many teams instead had to find endless workarounds, including standing up their own infrastructure. The "cloud" provided a standardized way to at least deal with this mess, e.g. a single source of billing.
> A 1990s VP of IT would look at this post and say, what's new?
Speed. The US lives in luxury but outside of that it often takes a LONG time to get proper servers. You don't just go online. There are many places where you have to talk to a vendor with no list price and the drama continues. Being out of capacity can mean weeks to months before you get anywhere.
sanderjd [3 hidden]5 mins ago
Yep! The biggest win for me when AWS came out was that I could self-serve what I needed and put it on a credit card, rather than filing a ticket and waiting some number of days / weeks / months to get a new VM approved and deployed.
scalemaxx [3 hidden]5 mins ago
I agree - my reference to the 1990s VP of IT was looking at the post, which is about on-premise data centers... not the cloud. I don't think there's a speed advantage for on-premise data centers now vs the 1990s, but if there is let me know. Otherwise, indeed, it's a 1990s-era blast from the past.
the_af [3 hidden]5 mins ago
Agreed. Also, a realistic assessment should not downplay the very real overhead and headache of managing your on-premise data center. It comes at a cost in engineering/firefighting hours, it's not painless. There's a reason this eternal ping pong keeps going on!
adolph [3 hidden]5 mins ago
Yeah, I think the major improvement of cloud services was the rationalization of them into services with a cost instead of "ask that person for a whatsit" and "hopefully the associate goomba will approve."
> All teams will henceforth expose their data and functionality through service interfaces
>San Diego has a mild climate and we opted for pure outside air cooling. This gives us less control of the temperature and humidity, but uses only a couple dozen kW. We have dual 48” intake fans and dual 48” exhaust fans to keep the air cool. To ensure low humidity (<45%) we use recirculating fans to mix hot exhaust air with the intake air. One server is connected to several sensors and runs a PID loop to control the fans to optimize the temperature and humidity.
Oh man, this is bad advice. Airborne humidity and contaminants will KILL your servers on a very short horizon in most places - even San Diego. I highly suggest enthalpy wheel coolers (KyotoCooling is one vendor - Switch runs very similar units in their massive datacenters in the Nevada desert) as they remove the heat from the indoor air using outdoor air (and can boost slightly with an integrated refrigeration unit to hit target intake temps) without exchanging the air from one side to the other. This has huge benefits for air quality control and outdoor air tolerance, and a single 500 kW heat rejection unit uses only 25 kW of input power (when it needs to boost the AC unit's output). You can combine this with evaporative cooling on the exterior intakes to lower the temps even further at the expense of some water consumption (typically far cheaper than the extra electricity to boost the cooling through an HVAC cycle).
Not knocking the achievement, just speaking from experience: taking outdoor air (even filtered and mixed) into a datacenter is a recipe for hardware failure, and the mean time to failure is highly dependent on your outdoor air conditions. I've run 3 MW facilities with passive air cooling, and taking outdoor air directly into servers requires a LOT more conditioning and consideration than is outlined in this article.
Torq_boi [3 hidden]5 mins ago
Yes, it's easy to destroy the servers with a lot of dust and/or high humidity. But with filtering and ensuring humidity never exceeds 45% we've had pretty good results.
kccqzy [3 hidden]5 mins ago
I remember visiting a small data center (about half the size of the Comma one) where shoe covers were required. Apparently they were worried about people’s shoes bringing in dust and other contamination.
tgtweak [3 hidden]5 mins ago
It's not a static number as it's also based on ambient air temperature in the form of dew point - 45% RH at low temps can be far more dangerous than 65% RH at warm ambient.
Likewise, the impact on server longevity is not a hard boundary but rather an "exposure over time" gradient that, once it exceeds the "low risk" boundary (>-12°C/10°F dew point or >15°C/59°F dry bulb temp), results in a lower MTBF than designed. This is defined by ASHRAE TC 9.9, which server equipment manufacturers conform and build to. This means that if you're running your servers above the high-risk curve for humidity and temperature, you're shortening their life considerably compared to the low-risk curve.
Generally, 15% RH is considered suboptimal and can be dangerous near freezing temperatures - in San Diego in January there were several 90%+ RH scenarios that would have been dangerous for servers even when mixed down with warm exhaust air - furthermore, the outdoor air at 76°F during that period means you have limited capacity to mix in warm exhaust air (which, by the way, came from that same 99% RH input air) without getting into higher-than-ideal intake temps.
Any dew point above 62.5°F is considered high risk for servers - as is any intake temp exceeding 32°C/90°F. You want to be at the midpoint between those and 16°C/65°F temps & -12°C/10°F dew point to have no impact on server longevity or MTBF rates. (A rough way to compute dew point from temperature and RH is sketched at the end of this comment.)
As a recent example:
KCASANDI6112 - January 2, 2026
                High       Low        Average
Temperature     73.4 °F    59.9 °F    63.5 °F
Dew Point       68.0 °F    60.0 °F    62.6 °F
Humidity        99 %       81 %       96 %
Precipitation   0.12 in    --         --
Lastly, air contaminants - in the form of dust (that can be filtered out) and chemicals (which can't without extensive scrubbing) are probably the most detrimental to server equipment if not properly managed, and require very intentional and frequent filter changes (typically high MERV pleated filters changed on a time or pressure drop signal) to prevent server degradation and equipment risks.
The last consideration is fire suppression - permitted datacenters usually require compliance with a separate fire code, such that direct outdoor air exchange without active shutdown and dry suppression is not permitted - this is to prevent a scenario where your equipment catches fire and a constant supply of fresh, oxygen-rich outdoor air turns that into an inferno. Smoke detection systems don't operate well with outdoor-mixed air or any level of airborne particulates.
So - for those reasons - among a few others - open air datacenters are not recommended unless you're doing them at google or meta scale, and in those scenarios you typically have much more extensive systems and purpose-designed hardware in order to operate for the design life of the equipment without issues.
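For concreteness, here is a rough sketch of the dew-point check described above, using the Magnus approximation; the threshold is the one quoted in this thread, not the full ASHRAE TC 9.9 envelope:

```python
import math

HIGH_RISK_DEW_POINT_C = 16.9  # ~62.5 F, the high-risk line quoted above (assumption)

def dew_point_c(temp_c: float, rh_pct: float) -> float:
    """Approximate dew point in degrees C via the Magnus formula."""
    a, b = 17.62, 243.12
    gamma = math.log(rh_pct / 100.0) + (a * temp_c) / (b + temp_c)
    return (b * gamma) / (a - gamma)

# The San Diego example day above: ~63.5 F (~17.5 C) average at 96% RH.
dp = dew_point_c(17.5, 96)
print(f"dew point ~{dp:.1f} C vs high-risk line at {HIGH_RISK_DEW_POINT_C} C")
```

Which lines up with the table: that day's average dew point was already sitting on the high-risk line, before you even try mixing in warm exhaust air.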
phailhaus [3 hidden]5 mins ago
I didn't even know this is something you had to worry about. This is why I use the cloud, all the unknown unknowns.
speedgoose [3 hidden]5 mins ago
I would suggest to use both on-premise hardware and cloud computing. Which is probably what comma is doing.
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and buying your own can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
Another thing between is colocation, where you put hardware you own in a managed data center. It’s a bit old fashioned, but it may make sense in some cases.
I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use Slurm.
I don't know about the USA, but in Norway you can run your private company's Slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions. But you can also have research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
epolanski [3 hidden]5 mins ago
> but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO
I worked at a company with two server farms (a main and a backup one, essentially) in Italy, located in two different regions, and we had a total of 5 employees taking care of them.
We didn't hear about them, we didn't know their names, but we had almost 100% uptime and terrific performance.
There was one single person out of 40 developers whose main responsibility was deploys, and that's it.
It cost my company €800k per year to run both server farms (hardware, salaries, energy), and it spared the company around €7-8M in cloud costs.
Now I work for clients that spend multiple millions on cloud for a fraction of the output and traffic, and who I think employ around 15+ DevOps engineers.
riku_iki [3 hidden]5 mins ago
It depends on the complexity of your infra.
Running full-scale Kubernetes, with multiple databases and services and an expected 99.99% uptime, likely can't be handled by one person.
lstodd [3 hidden]5 mins ago
Takes a team of 3-4 in my experience. One person doesn't cut it once the talk of uptime percentages starts, no matter the scale (and no matter whether it's cloud, dedicated or on-premises).
olavgg [3 hidden]5 mins ago
> I would rather pay a competent cloud provider than be responsible for reliability issues.
Why do so many developers and sysadmins think they're not competent enough to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
pageandrew [3 hidden]5 mins ago
The point was about redundancy / geo spread / HA. It’s significantly more difficult to operate two physical sites than one. You can only be in one place at a time.
If you want true reliability, you need redundant physical locations, power, networking. That’s extremely easy to achieve on cloud providers.
PunchyHamster [3 hidden]5 mins ago
You can just rent rack space in a datacenter and have that covered. It's still much cheaper than running it all in the cloud.
It doesn't make sense if you only have a few servers, but if you are renting the equivalent of multiple racks of servers from the cloud and run them for most of the day, on-prem is staggeringly cheaper.
We have a few racks and we do a "move to cloud" calculation every few years, and without fail it comes out at least 3x the cost.
And before the "but you need to do more work" whining I hear from people who have never done it: it's not much more work than navigating the forest of cloud APIs and dealing with random black-box issues in the cloud that you can't really debug, only work around.
direwolf20 [3 hidden]5 mins ago
How often does your single site go down?
On cloud it's out of your control when an AZ goes down. When it's your server you can do things to increase reliability. Most colos have redundant power feeds and internet. On prem that's a bit harder, but you can buy a UPS.
If your head office is hit by a meteor your business is over. Don't need to prepare for that.
account42 [3 hidden]5 mins ago
You don't need full "cloud" providers for that, colocation is a thing.
nicman23 [3 hidden]5 mins ago
or just to be good at hiding the round trip of latency
jim180 [3 hidden]5 mins ago
Also I'd add this question: why do so many developers and sysadmins think that cloud companies always hire competent/non-lazy/non-pissed-off employees?
faust201 [3 hidden]5 mins ago
> Why do so many developers and sysadmins think they're not competent enough to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
It is a different skillset. SRE is also under-valued/under-paid (unless one is at a FAANG).
clickety_clack [3 hidden]5 mins ago
It’s all downside. If nothing goes wrong, then the company feels like they’re wasting money on a salary. If things go wrong they’re all your fault.
faust201 [3 hidden]5 mins ago
Correct
sgarland [3 hidden]5 mins ago
SRE has also lost nearly all meaning at this point, and more or less is equivalent to "I run observability" (but that's a SaaS solution too).
infecto [3 hidden]5 mins ago
Maybe you find it fun. I don't; I prefer building software, not running and setting up servers.
It's also nontrivial once you go past some level of complexity and volume. I have made my career building software, and part of that requires understanding the limitations and specifics of the underlying hardware, but at the end of the day I simply want to provision and run a container; I don't want to think about the security and networking setup, it's not worth my time.
tomcam [3 hidden]5 mins ago
Because when I’m running a busy site and I can’t figure out what went wrong, I freak out. I don’t know whether the problem will take 2 hours or 2 days to diagnose.
MaKey [3 hidden]5 mins ago
Usually you can figure out what went wrong pretty quickly. Freaking out doesn't help with the "quickly" part though.
rvz [3 hidden]5 mins ago
> Why do so many developers and sysadmins think they're not competent enough to host services?
Because those services solve the problem for them. It is the same thing with GitHub.
However, as predicted half a decade ago when GitHub started becoming unreliable [0], and as price increases begin to happen, you can see that self-hosting begins to make more sense: you have complete control of the infrastructure, it has never been easier to self-host, and you bring costs back under control.
> it's also fun to solve the technical issues you may have.
What you have just seen with coding agents is going to have the same effect on "developers": their skills decline the moment they become over-reliant on coding agents, to the point where they can't write a single line of code to fix a problem they don't fully understand.
At a previous job, the company had its critical IT infrastructure in its own data centers. It was not in the IT industry, but the company was large and rich enough to justify two small data centers. It notably had batteries, diesel generators, 24/7 teams, and some advanced security (for valid reasons).
I agree that solving technical issues is very fun, and hosting services is usually easy, but having resilient infrastructure is costly and I simply don't like to be woken up at night to fix stuff while the company is bleeding money and customers.
bigfatkitten [3 hidden]5 mins ago
> Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.
Torq_boi [3 hidden]5 mins ago
Yes, we still use Azure for user-facing services and the website. They don't need GPUs and don't need expensive resources, so it's not as worth it to bring those in-house.
We also rely on GitHub. It has historically been a good service, but bringing it in-house is starting to look worth it.
lstodd [3 hidden]5 mins ago
I don't get why most everyone insists on comparing cloud to on-premises and not to dedicated. Why would anyone run own DC infra when there's Hetzner and many others?
Schlagbohrer [3 hidden]5 mins ago
Unfortunately we experienced an issue where our Slurm pool was contaminated by a misconstrained Postgres Daemon. Normally the contaminated slurm pool would drain into a docker container, but due to Rust it overloaded and the daemon ate its own head. Eventually we returned it to a restful state so all's well that ends well.
(hardware engineer trying to understand wtaf software people are saying when they speak)
IFC_LLC [3 hidden]5 mins ago
This is cool. Yet, there are levels of insanity and those depend on your inability to estimate things.
When I'm launching a project it's easier for me to rent $250 worth of compute from AWS. When the project consumes $30k a month, it's easier for me to rent a colocation.
My point is that a good engineer should know how to calculate all the ups and downs here to propose a sound plan to the management. That's the winning thing.
piker [3 hidden]5 mins ago
It goes further than this first order, though. If you're trying to build a business that attracts the type of talent who wants to know the stack up and down, starting with an AWS instance might give you a better shot at funding (and thus a better overall shot), but it's not clear that it gives you a shot at building the business you're aiming for. For the things that "don't make your beer better", sure, but we're talking about training ML models at an ML shop. Here it makes sense for this reason.
infecto [3 hidden]5 mins ago
That last part is exactly it, and while I know the intro sentence nails it, I don't think compute resonates with people (everyone uses compute). If you are running work 24/7 at scale, it absolutely makes sense past the first couple of years to build out your own DC like this.
redrove [3 hidden]5 mins ago
We’re past the point in history where most engineers get to make even a recommendation about which platform to use to management.
In 99.999999% of cases management has already decided and is just informing you, because they know better.
JackSlateur [3 hidden]5 mins ago
I work at a multi-billion-dollar company and do not face what you describe.
Perhaps it's an exception (but so far, I've never encountered the situation you describe).
kevinkatzke [3 hidden]5 mins ago
Feels like I’ve lived through a full infrastructure fashion cycle already. I started my career when cloud was the obvious answer and on-prem was “legacy.”
Now on-prem is cool again.
Makes me wonder whether we’re already setting up the next cycle 10 years from now, when everyone rediscovers why cloud was attractive in the first place and starts saying “on-prem is a bad idea” again.
Aurornis [3 hidden]5 mins ago
> Makes me wonder whether we’re already setting up the next cycle 10 years from now, when everyone rediscovers why cloud was attractive in the first place and starts saying “on-prem is a bad idea” again.
My entire career I’ve encountered people passionately pushing for on-prem and railing against anything cloud. I can’t remember a time when Hacker News comments leaned pro-cloud because it’s always been about self-hosting.
The few times the on-prem people won out in my career never went exactly as they imagined. Buying a couple servers and setting them up at the colo is easy enough, but the slow and steady drag of maintaining your own infrastructure starts to work its way into every development cycle after that. In my experience, every team has significantly underestimated how all the little things add up to a drag on available time for other work.
The best case for on-prem that I saw was when a company was basically in maintenance mode. Engineers had a lot of extra time to optimize, update, maintain, and cost-reduce without subtracting from feature development or bug fixes.
The worst cases for on-prem I’ve seen have been funded startups. In this situation it’s imperative that everyone focus on feature development and rapid iteration. Letting some of the engineers get sidetracked with setting up and maintaining their own hosting to save a dollar amount that barely hires 1-2 more engineers but sets the schedule back by many months was a huge mistake.
In my experience, most engineers become less enchanted with rolling their own on premises hosting as they get older. Their work becomes more about getting the job done quickly and to budget, not hyper-optimizing the hosting situation at the expense of inviting more complexity and miscellaneous tasks into their workload.
Aromasin [3 hidden]5 mins ago
If this were cyclical, I'd be inclined to agree, but this seems to be more of a wave. I also think the push back is more than just one against rented compute. It is tied to a societal ennui that comes from the feeling that we no longer own anything, be it music, housing, movies, land, tools, phones, or cars. Everything is moving to either being rented or on credit. There's a push back against this self-made feudal revival, and that scales all the way from individuals through to corporations; in this case, against the idea that a mega-corporation gets to decide how and when you get to use your compute, and at what variable price.
mbreese [3 hidden]5 mins ago
Just one cycle?
This is cyclical and I see the main axis of contention as centralized vs de-centralized computing.
Mainframes (network) gave way to mini and microcomputers (PCs). PCs gave way to server farms and web-based applications. Private servers and data centers gave way to the Cloud. Edge computing is again a push towards a more decentralized model.
Like all good engineering problems, deciding where data and applications are hosted involves tradeoffs. Priorities change. Technologies change. But oftentimes, what works in one generation doesn't in the next. Part of it is the slow march of progress. But I think some of it is just not wanting to use your parent's technology stack and wanting to build your own.
The cloud vs. on-prem tradeoff is one of flexibility, capacity, maintenance, and capex vs opex.
It's a similar story in application development. At one point, we're navigating text forms on a mainframe, the next it's a GUI local application, followed by Electron or Web applications with remote data. We'll cycle back to local-first data (likely on-phone local models).
When you start to hear about the network being the computer again, you'll know we've started to swing back the other way again.
pizzafeelsright [3 hidden]5 mins ago
Mainframe -> Desktop -> Server Room -> Data Center -> Cloud (rented data center) -> Space (Skynet)
devmor [3 hidden]5 mins ago
Sometimes, I feel like this is indicative of the incredible waste present in IT and development. Granted the cost of this kind of infrastructure upheaval is orders of magnitude cheaper than something like manufacturing - but still, it feels ridiculous that established companies can swap back and forth on a whim.
andrewstuart2 [3 hidden]5 mins ago
The problem was always the platform. For me, I saw very early on that kubernetes was exactly what I wanted after reading about how Google "treats the datacenter like one large computer." And I've been very happily running my own side projects on my own home cluster for 10 ish years (my kube-system namespace is 9y old). But selling any of my employers on this was a very hard proposition until enough people had shown it working at that scale.
3acctforcom [3 hidden]5 mins ago
The lowest grade I got in my business degree was in the "IT management" course. That's because the ONLY acceptable answer to any business IT problem is to move everything to the cloud. Renting is ALWAYS better than owning because you transfer cost and risk to a 3rd party.
That's pretty much the dogma of the 2010s.
It doesn't matter that my org runs a line-of-business datacentre that is a fraction of the cost of public cloud. It doesn't matter that my "big" ERP and admin servers take up half a rack in that datacentre. MBA dogma says that I need to fire every graybeard sysadmin, raze our datacentre facility to the ground, and move to AWS.
Fun fact: salaries and hardware purchases typically track inflation, because the switching cost for hardware is nil and hiring isn't that expensive. Software, on the other hand, usually sees 5-10% price increases every year, because vendors know that lock-in and switching costs for software are expensive.
MagicMoonlight [3 hidden]5 mins ago
Right, but is that a like for like comparison?
AWS has redundant data centres across the world and within each region. A file in S3 will never be lost, even if you store it for a thousand years.
What happens if your city has a tornado and your data centre gets hit? Is your company now dead?
And how much do you spend on all these sysadmins? 200k each? If you’re saving 20k/month by paying 100k/month in salaries, you aren’t saving anything.
drnick1 [3 hidden]5 mins ago
On premises isn't only about saving money (that's not always clear). The article neglects the most important benefits which are freedom (control) and privacy. It's basically the same considerations that apply to owning vs renting a house.
Aurornis [3 hidden]5 mins ago
The entire second section is about different benefits of having your own data centers. Cost is listed as the last one, not the primary one.
vadepaysa [3 hidden]5 mins ago
I was an on-prem maxi (if that's a thing) for a long time. I've run clusters that cost more than $5M, but these days I am a changed man. I start with a PaaS like Vercel and work my way down to on-prem depending on how important and cost-conscious that workload is.
Pains I faced running BIG clusters on-prem.
1. Supply chain management -- everything from power supplies all the way to GPUs and storage has to be procured, shipped, disassembled and installed. You need a labor pool and dedicated management.
2. Inventory Management -- You also need to manage inventory on hand for parts that WILL fail. You can expect 20% of your cluster to have some degree of issues on an ongoing basis
3. Networking and security -- You are on your own defending your network or have to pay a ton of money to vendors to come in and help you. Even with the simplest of storage clusters, we've had to deal with pretty sophisticated attacks.
When I ran massive clusters, I had a large team dealing with these. Obviously, with PaaS, you don't need anyone.
cheema33 [3 hidden]5 mins ago
> I was an on-prem maxi (if that's a thing) for a long time. I've run clusters that cost more than $5M, but these days I am a changed man.
I have had a similar transformation. I still host non-critical services on-prem. They are exceptionally cheap to run. Everything else, I host it on Hetzner.
majormajor [3 hidden]5 mins ago
In addition to those sorts of non-first-hardware-purchase costs, the person writing the check needs to think long and hard about how bad an outage would be, and how much money it makes sense to budget simply to "avoiding outages." And the more important it is not to have any downtime, the more it's gonna cost to build up some sort of substitute for cross-datacenter cloud functionality. (You are also likely not going to be as good at either managing and configuring those networks, or hiring people to do so, as AWS, either.)
jillesvangurp [3 hidden]5 mins ago
At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about hardware but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage; the cost worth optimizing is that one. The hosting cost is usually a rounding error on the staffing cost. And on top of that, the amount of responsibility increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non-zero cost.
The right mindset for hosting cost is to think of it in FTEs (the full-time employee cost for a year). If it's below 1 (most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you could make are going to cost you actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting: think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me. I literally have more valuable things to do.
This flips when you start getting into multiple FTEs' worth of monthly cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTEs of staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for a modest amount of extra staffing FTEs and make net gains (rough numbers below).
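For the sake of concreteness, a back-of-the-envelope version of the FTE framing; every figure here is an illustrative assumption to be replaced with your own:

```python
# Hosting cost measured in FTEs, and payback on a move off managed cloud.
fte_cost_per_year = 120_000        # fully loaded ops/devops cost per year (assumed)
cloud_bill_per_month = 10_000      # current managed-cloud spend (assumed)
savings_ratio = 0.6                # assume bare metal ends up ~40% of the cloud price
migration_effort_ftes = 0.5        # e.g. one person for half a year to migrate
extra_ops_ftes_per_year = 0.25     # ongoing extra ops time after the move

hosting_in_ftes = cloud_bill_per_month * 12 / fte_cost_per_year
gross_savings = cloud_bill_per_month * 12 * savings_ratio
net_yearly_savings = gross_savings - extra_ops_ftes_per_year * fte_cost_per_year
payback_years = migration_effort_ftes * fte_cost_per_year / net_yearly_savings

print(f"hosting = {hosting_in_ftes:.1f} FTEs/year")         # 1.0 with these numbers
print(f"net savings = {net_yearly_savings:,.0f}/year")      # 42,000
print(f"payback on migration = {payback_years:.1f} years")  # ~1.4
```

With these assumptions the move pays for itself in under two years; at 1K/month of spend the same arithmetic never gets close, which is the point above.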
PunchyHamster [3 hidden]5 mins ago
> At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.
You rent datacenter space, which is OPEX not CAPEX, and you can lease the servers, which turns a big CAPEX outlay into a monthly OPEX bill.
Running your own DC is a "we have two dozen racks of servers" endeavour, but even just renting DC space and buying servers is much cheaper than getting the same level of performance from the cloud.
> This flips when you start getting into multiple FTEs' worth of monthly cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTEs of staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for a modest amount of extra staffing FTEs and make net gains.
YOU NEED THOSE PEOPLE TO MANAGE CLOUD TOO. That's what always gets ignored in these calculations; people go "oh, but we really need like 2-3 ops people to cover the datacenter and have shifts on the on-call rota", but you need the same thing for cloud too, it is just dumped on the programmers/devops guys on the team rather than having separate staff.
We have a few racks, and the part related to hardware is a small part of the total workload; most of it is the same work we would do (and do, for a few cloud customers) in the cloud: writing manifests for automation.
input_sh [3 hidden]5 mins ago
> YOU NEED THOSE PEOPLE TO MANAGE CLOUD TOO.
Finally, some sense! "Cloud" was meant to make ops jobs disappear, but they just increased our salary by turning us into "DevOps Engineers" and the company's hosting bill increased fivefold in the process. You will never convince even 1% of devs to learn the ops side properly, therefore you'll still end up hiring ops people and we will cost you more now. On top of that, everyone that started as a "DevOps Engineer" knows less about ops than those that started as ops and transitioned into being "DevOps Engineers" (or some flavour of it like SREs or Platform Engineers).
If you're a programmer scared into thinking AI is going to take away your job, re-read my comment.
tracker1 [3 hidden]5 mins ago
I'm not disagreeing... but it depends on how you shift the complexity/work and how much you lean into the services a given cloud provider offers.
Just database management is a pretty specialized skill, separate from development or optimizing the structures of said data... For a lot of SaaS providers, if you aren't at a point where you can afford dedicated DBA/ops staff just for data, that's one reason you might lean into cloud operations or hybrid ops just for DBMS management, security and backups. This is low-hanging fruit in terms of cloud offerings even... but it can shift a lot of burden in terms of operational overhead.
Again, depending on your business and data models.
mattbillenstein [3 hidden]5 mins ago
Honestly, the way I've seen a lot of cloud done, they need _more_ people to manage that than a sensible private cloud setup.
wobfan [3 hidden]5 mins ago
To be fair, I think people are vastly overestimating the work they would have and the compute power they would need. Yes, if you have to massively scale up, then it'll take some work, but most of it is one-time work. You do it, and when it runs, you only have a fraction of the work over the following months to maintain it. And by fraction, I mean below 5%. And keep in mind that >99% of startups who think "yeah, we need this and that cloud service, because we need to scale" will never scale. Instead they are happily locking themselves into a cloud service. And if they actually do scale at some point, this service will be massively more expensive.
coffeebeqn [3 hidden]5 mins ago
One decent server would be enough to run 99.5% of startups' backends.
maccard [3 hidden]5 mins ago
We have two on-site servers that we use. For various reasons (power cuts, internet outages, cleaners unplugging them) I'd say we have to intervene with them physically about once a month. It's a total pain in the ass, especially when you don't have _an_ IT person sitting in the office to mind them. I'm in the UK and our office is in Spain…
But it is significantly cheaper and faster
meatmanek [3 hidden]5 mins ago
You might want to look into colocating that server at a datacenter nearby. You can get a few U of rack space and the risk of power outages, internet outages, or cleaners unplugging the servers should go way down.
direwolf20 [3 hidden]5 mins ago
Startups don't know how much hardware they need when they release to customers. The extreme flexibility of cloud makes a lot of sense for them.
aforwardslash [3 hidden]5 mins ago
But they should; the cloud won't magically make the architecture scale. A competent CTO should know the limits of the platform; it's called "load testing" or "stress testing". Scalability is independent of the provider. The cloud gives you a nicer interface to add resources, granted, but that's it.
As a hearsay anecdote, that's why some startups have DB servers with hundreds of GB of RAM and dozens of CPUs to run a workload that could be served from a 5-year-old laptop.
lelanthran [3 hidden]5 mins ago
Your calculation assumes that an FTE is needed to maintain a few beefy servers.
Once they are up and running that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.
OTOH you are specifically ignoring that you'll require mostly the same time from a cloud trained person if you're all-in on AWS.
I expect the marginal cost of one employee over the other is zero.
jillesvangurp [3 hidden]5 mins ago
> Once they are up and running
You should also calculate the cost of getting it up and running. With Google Cloud (I don't actually use AWS), I mainly worry about building docker containers in CI and deploying them to vms and triggering rolling restarts as those get replaced with new ones. I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
And for production setups. You need people on stand by to fix the server in case of hardware issues; also outside office hours. Also, where does the hardware live? What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where? Do you pay for security for wherever all that happens? What about cleaning, AC, or a special server room in your building. All that stuff is cost. Some of it is upfront cost. Some of it is recurring cost.
The article is a about a company that owns its own data center. The cost they are citing (5 million) is substantial and probably a bit more complete. That's one end of the spectrum.
Symbiote [3 hidden]5 mins ago
You are massively overcomplicating this.
> I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
These are not difficult problems. You can use the same/similar cloud install images.
A 10 year old nerd can install Linux on a computer; if you're a professional developer I'm sure you can read the documentation and automate that.
> And for production setups. You need people on stand by to fix the server in case of hardware issues; also outside office hours.
You could use the same person who is on standby to fix the cloud system if that has some failure.
> Also, where does the hardware live?
In rented rackspace nearby, and/or in other locations if you need more redundancy.
> What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where?
It will probably report the hardware failure to Dell/HP/etc automatically and open a case. Email or phone to confirm, the part will be sent overnight, and you can either install it yourself (very, very easy for things like failed disks) or ask a technician to do it (I only did this once with a CPU failure on a brand new server). Dell/HP/etc will provide the technician, or your rented datacentre space will have one for simpler tasks like disks.
meatmanek [3 hidden]5 mins ago
My west-coast employer used to have a few racks of hardware on the east coast. Not a single employee of our company saw the hardware for several years after installation.
The installation itself was handled by the vendor and datacenter. For hard drive failures, our vendor (who provided the warranty) shipped a drive and had a technician drive to the site. We had to 1. tell the datacenter to expect the package and let the tech in, and 2. be online to run the command to blink the lights on the drive that needed replacing and then verify that the drive came online. This 6-company dance (us, vendor, DC, tech, fedex, HDD manufacturer) was more annoying than just terminating an EC2 instance and recreating it (or having EBS handle drive failures behind the scenes) but it wasn't that bad in the grand scheme of things.
abc123abc123 [3 hidden]5 mins ago
Shush! The cloud companies want customers to think it is a complicated near death experience to run on their own hardware.
It is sad that the knowledge of how easy it really is, is going extinct. The cloud and SaaS companies benefit greatly.
lelanthran [3 hidden]5 mins ago
> You should also calculate the cost of getting it up and running.
I was not doing the calculation. I was only pointing out that it was not as simple as you make it out to be.
Okay, a few other things that aren't in most calculations:
1. Looking at jobs postings in my area, the highest paid ones require experience with specific cloud vendors. The FTEs you need to "manage" the cloud are a great deal more expensive than developers.
2. You don't need to compare on-prem data center with AWS - you can rent a pretty beefy VPS or colocate for a fraction of the cost of AWS (or GCP, or Azure) services. You're comparing the most expensive alternative when avoiding cloud services, not the most typical.
3. Even if you do want to build your own on-prem rack, FTEs aren't generally paid extra for being on the standby rota. You aren't paying extra. Where you will pay extra is for hot failovers, or machine room maintenance, etc, which you don't actually need if your hot failover is a cheap beefy VPS-on-demand on Hetzner, DO, etc.
4. You are measuring the cost of absolute 0% downtime. I can't think of many businesses that have such high sensitivity to downtime. Even banks handle downtime much larger than that even while their IT systems are still up. With such strict requirements you're getting into the spot where the business itself cannot continue because of catastrophe, but the IT systems can :-/. What use is the IT systems when the business itself may be down?
The TLDR is:
1. If you have highly paid cloud-trained FTEs, and
2. Your only option other than Cloud is on-prem, and
3. Your FTEs are actually FT-contractors who get paid per hour, and
4. Your uptime requirements are more stringent than a national bank's,
yeah, then cloud services are only slightly more expensive.
You know how many businesses fall into that specific narrow set of requirements?
JackSlateur [3 hidden]5 mins ago
Maintenance is real work.
If you do it only a few hours every 6 months, you are not maintaining your infrastructure, you are letting it die (until the need arises and everything must be done and this is a massive project)
tracker1 [3 hidden]5 mins ago
On the software side... depending on your business model, you can factor a lot of these cost structures into your pricing, especially for, say, B2B arrangements.
Cloud integrations, for example, allow you to simply use a different database instance altogether per customer, while you can share services that utilize a given db connection. But actually setting up and managing that type of database infrastructure yourself may be much more resource intensive from a head count perspective.
I mention this because having completely separate databases is an abstraction that cloud operations have already solved... While you can choose other options, such as more complex data models to otherwise isolate or share resources, you have to ask how this complexity affects your services downstream and the overall data complexity across one or all clients.
Harder still, if your data/service is centered around b2b clients of yours that have direct consumer interactions... then what if the industry is health or finance where there are even more legal concerns. Figuring a minimal (off the top) cost of each client of yours and scaling to the number of users under them isn't too hard to consider if you're using a mix of cloud services in concert with your own systems/services.
So yeah.. there's definitely considerations in either direction.
bambax [3 hidden]5 mins ago
> it doesn't make much sense for the majority of startup companies until they become late stage
Here's what TFA says about this:
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out.
and I think they're right. Be careful how you start because you may be stuck in the initial situation for a long time.
sgarland [3 hidden]5 mins ago
> But until then it's a long term cost optimization with really high upfront capital expenditure and risk.
The upfront capex does not need to be that high, unless you're running your own AI models. Other than leasing new ones, as a sibling comment stated, you can buy used. You can get a solid Dell 2U with a full service contract (3 years) for ~$5-10K depending on CPU / memory / storage configuration. Or if you don't mind going older - because honestly, most webapps aren't doing anything compute-heavy - you can drop that to < $1K/node. Replacement parts for those are cheap, so buy an extra of everything.
tracker1 [3 hidden]5 mins ago
And if each of your clients is in the Healthcare industry and dealing with end-user medical data? Or financial data? Are you prepared for appropriate data isolation/sharding and controls? Do you have a strategy for scaling database operations per client or across all clients?
It really depends on the business model as to how well you might support your own infrastructure vs. relying on a new backend instance per client in a cloud infrastructure that has already solved many of the issues at play.
sgarland [3 hidden]5 mins ago
> And if each of your clients is in the Healthcare industry and dealing with end-user medical data? Or financial data?
Then you're probably going to need some combination of HIPAA / SOC 2 / PCI DSS certification, regardless of where your servers are physically located. AWS has certified the infrastructure side for you, but that doesn't remove your obligations for the logical side.
> Are you prepared for appropriate data isolation/sharding and controls? Do you have a strategy for scaling database operations per client or across all clients?
Again, you're going to need that regardless of where your servers physically exist.
> vs. relying on a new backend instance per client in a cloud infrastructure
You want to spin up an EC2 per client, and run an isolated copy of the application, isolated DB, etc. inside of it? That sounds like a nightmare to manage, especially if you want or need HA capabilities.
tracker1 [3 hidden]5 mins ago
>> vs. relying on a new backend instance per client in a cloud infrastructure
> You want to spin up an EC2 per client, and run an isolated copy of the application, isolated DB, etc. inside of it? That sounds like a nightmare to manage, especially if you want or need HA capabilities.
No... just running a new hosted database instance per client, while (re)using your service/application infrastructure and connecting through a different database host/proxy based on the client making the request.
Just that utility at the database management layer is probably worth the price of entry for using cloud resources if you can't justify and cover the cost of say 5+ employees just for the data management infrastructure.
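A minimal sketch of what that per-client routing might look like at the application layer, assuming one Postgres instance (hosted or self-managed) per tenant. The tenant names, DSNs, and helper are illustrative, not from the thread:

    # Hypothetical per-client database routing: shared application tier,
    # separate database instance per tenant, chosen at request time.
    import psycopg2

    # Illustrative mapping; in practice this would come from a config store or proxy.
    TENANT_DSNS = {
        "acme":   "postgresql://app@db-acme.internal:5432/app",
        "globex": "postgresql://app@db-globex.internal:5432/app",
    }

    def connection_for(tenant_id: str):
        """Resolve the tenant from the incoming request and connect to its own database."""
        return psycopg2.connect(TENANT_DSNS[tenant_id])

    with connection_for("acme") as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM invoices")
        print(cur.fetchone()[0])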
sgarland [3 hidden]5 mins ago
That’s going to be enormously expensive. If you need guaranteed tenant isolation, put them in separate schemas, with specific user grants. That scales up much better than you’d think.
Or use Citus Postgres, and get sharding by schema for free, so you have both isolation and more or less infinite growth.
I’m not sure why, if you think it would take 5 employees to manage self-hosted DBs, it wouldn’t take close to that to manage cloud-hosted ones. The only real difference you’re going to have once both are set up is dealing with any possible hardware issues. The initial setup for backups, streaming replication, etc. is a one-time thing, and then it just works. Hire a contractor for that, optionally keeping them on retainer for emergencies if you want.
You still have to deal with DB issues with a managed service: things like schema management, table design, index maintenance, parameter tuning, query optimization are all your responsibility, not the cloud provider’s.
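For comparison, a minimal sketch of the schema-per-tenant approach described above (one schema and one role per tenant, with grants doing the isolation). The role and schema names are made up for illustration:

    # Schema-per-tenant isolation inside a single Postgres database:
    # each tenant gets its own schema and its own login role, and the role
    # is pointed at (and limited to) its own schema.
    import psycopg2
    from psycopg2 import sql

    def provision_tenant(conn, tenant: str) -> None:
        role = f"app_{tenant}"
        with conn.cursor() as cur:
            cur.execute(sql.SQL("CREATE ROLE {} LOGIN").format(sql.Identifier(role)))
            cur.execute(sql.SQL("CREATE SCHEMA {} AUTHORIZATION {}").format(
                sql.Identifier(tenant), sql.Identifier(role)))
            # Keep the tenant role out of the shared public schema.
            cur.execute(sql.SQL("REVOKE ALL ON SCHEMA public FROM {}").format(
                sql.Identifier(role)))
            cur.execute(sql.SQL("ALTER ROLE {} SET search_path = {}").format(
                sql.Identifier(role), sql.Identifier(tenant)))
        conn.commit()

    conn = psycopg2.connect("postgresql://admin@db.internal:5432/app")
    provision_tenant(conn, "acme")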
ActorNightly [3 hidden]5 mins ago
>At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk.
The issue with comma.ai is that the company is HEAVILY burdened with Geohotz's ideals, despite him no longer even being on the board. I used to be very much into his streams and he rants about it plenty. A large reason why they run their own datacenter is that they ideologically refuse to give money to AWS or Google (but I guess Microsoft passes their non-woke test).
Which is quite hilarious to me because they live in a very "woke" state and complain about power costs in the blog post. They could easily move to Wyoming or Montana and with low humidity and colder air in the winter run their servers more optimally.
Torq_boi [3 hidden]5 mins ago
Our preference for training in our own datacenter has nothing to do with wokeness. Did you read the blog post? The reasons are clearly explained.
The climates in Wyoming and Montana are actually worse for this; San Diego's extremes are milder than those places'. Though moving out of CA is a good idea for power cost reasons, also addressed in the blog.
ashu1461 [3 hidden]5 mins ago
And not just any FTEs, probably a few senior / staff level engineers who would cost a lot more.
g-b-r [3 hidden]5 mins ago
You should keep in mind that for a lot of things you can use a servicing contract, rather than hiring full-time employees.
It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.
simianwords [3 hidden]5 mins ago
The reason companies don’t go with on premises even if cloud is way more expensive is because of the risk involved in on premises.
You can see it quite clearly here that there’s so many steps to take. Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
It’s never about “is the expected cost in on premises less than cloud”, it’s about the risk adjusted costs.
Once you’ve spread risk not only on your main product but also on your infrastructure, it becomes hard.
I would be wary of a smallish company building their own Jira in house in a similar way.
fauigerzigerk [3 hidden]5 mins ago
I'm starting to wonder though whether companies even have the in-house competence to compare the options and price this risk correctly.
>Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
Yes, but one differentiating factor is always price and you don't want to lose all your margins to some infrastructure provider.
simianwords [3 hidden]5 mins ago
Software companies have higher margins so these decisions are lower stakes. Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Think of a ~5000 employee startup. Two scenarios:
1. if they win the market, they capture something like ~60% margin
2. if that doesn't happen, they just lose, VC fund runs out and then they leave
In this dynamic, costs associated with infrastructure don't change the bottom line of profitability. The risk involved with rolling out their own infrastructure can hurt their main product's existence itself.
fauigerzigerk [3 hidden]5 mins ago
I'm not disputing that there are situations where it makes sense to pay a high risk premium. What I'm disputing is that price doesn't matter. I get the impression that companies are losing the capability to make rational pricing decisions.
>Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Well, exactly. But the degree to which the price of a specific input affects your bottom line depends on your product.
During the dot com era, some VC funded startups (such as Google) made a decision to avoid using Windows servers, Oracle databases and the whole super expensive scale-up architecture that was the risk-free, professional option at the time. If they hadn't taken this risk, they might not have survived.
[Edit] But I think it's not just about cloud vs on-premises. A more important question may be how you're using the cloud. You don't have to lock yourself into a million proprietary APIs and throw petabytes of your data into an egress jail.
sam_lowry_ [3 hidden]5 mins ago
Precious real-world engineering skills also play a role.
But most importantly, the attractive power that companies doing on-premise infrastructure have towards the best talent.
MagicMoonlight [3 hidden]5 mins ago
Yes, the idea is that you focus on the things that differentiate you from the competition. If you’re a factory that makes nails, a better data centre won’t make you any more money. It won’t help you sell more nails. So you should leave the data centres to the experts, and focus on work which improves your actual product.
If you don’t, you’ll be stuck trying to figure out data centres. Hiring tons of infrastructure experts, trying to manage power consumption. And for what? You won’t sell any more nails.
If you’re a company like Google, having better data centres does relate to your products, so it makes sense to focus on them and build your own.
d1sxeyes [3 hidden]5 mins ago
It’s also opex vs capex, which is a battle opex wins most of the time.
bayindirh [3 hidden]5 mins ago
Opex is faster. Login, click, SSH, get a tea.
Capex needs work. A couple of years, at least.
If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.
iso1631 [3 hidden]5 mins ago
That's because of company policies. An SME owner will buy a server and have it in the rack the next day.
Of course creating a VM is still a Terraform commit away (you're not using clickops in prod, surely).
amluto [3 hidden]5 mins ago
If you want something at all customized, it takes longer than that to receive the server. That being said, you can buy a server that will outperform anything the cloud can give you at much better cost.
bayindirh [3 hidden]5 mins ago
SME and "a server" are doing some heavy lifting here.
If you want a custom server, one or a thousand, it's at least a couple of weeks.
If you want a powerful GPU server, that's rack + power + cooling (and a significant lead time). A respectable GPU server means ~2 kW of power dissipation and considerable heat.
If you want a datacenter of any size, now that's a year at least from breaking ground to power-on.
selkin [3 hidden]5 mins ago
And multiple years from the boardroom making a decision to build a data center to breaking ground.
marcosdumay [3 hidden]5 mins ago
Well, capex has a multi-year depreciation schedule and has to cover interest rates. So the simplified "opex wins most of the time" is right.
But here we are talking about a cost difference of tens of times, maybe a few hundred. The cloud is not a "most of the time" case.
aragilar [3 hidden]5 mins ago
It depends. Grant funding (e.g. in academia) makes capex easier to manage than opex (because when the grant runs out you still have the device).
simianwords [3 hidden]5 mins ago
I think it wins because opex is seen as stable recurring cost and capex is seen as the money you put in your primary differentiation for long term gains.
bonesss [3 hidden]5 mins ago
For mature enterprises, my understanding is that the financial math works out such that the cloud makes sense for market validation, before moving to a cheaper long-term solution once revenue is stable.
Scale up, prove the market and establish operations on the credit card, and if it doesn’t work the money moves onto more promising opportunities. If the operation is profitable you transition away from the too expensive cloud to increase profitability, and use the operations incoming revenue to pay for it (freeing up more money to chase more promising opportunities).
Personally I can’t imagine anything outside of a hybrid approach, if only to maintain power dynamics with suppliers on both sides. Price increases and forced changes can be met with instant redeployments off their services/stack, creating room for more substantive negotiations. When investments come in the form of saving time and money, it’s not hard to get everyone aligned.
d1sxeyes [3 hidden]5 mins ago
True, but for a lot of companies “our servers are on-prem” is not a primary differentiator.
simianwords [3 hidden]5 mins ago
i think we are saying the same thing?
TonyStr [3 hidden]5 mins ago
Capex may also require you to take out loans
spacebanana7 [3 hidden]5 mins ago
Which is incredibly difficult in the public sector. Yes, there are various financing instruments available for capital purchases but they're always annoying, slow and complicated. It's much easier to spend 5k per month than 500k outright.
seg_lol [3 hidden]5 mins ago
Your numbers don't line up. If you are spending 5k/month in cloud costs and on-prem is 1/3 of cloud, then over a 48-month replacement cycle on-prem comes to 1/3 of (5k * 48) = 80k, versus 5k a month for 48 months (240k) in the cloud.
I think the primary reason people over-fixate on the cloud is that they can't do math. So renting is a hedge.
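A quick back-of-envelope in Python, using the parent's assumed numbers (5k/month cloud spend, on-prem at roughly a third of that, 48-month hardware cycle); these are the thread's hypotheticals, not real quotes:

    # Comparing total spend over one hardware replacement cycle.
    cloud_monthly = 5_000
    months = 48
    cloud_total = cloud_monthly * months      # 240_000 over the cycle
    onprem_total = cloud_total / 3            # ~80_000, per the "1/3 of cloud" assumption
    print(cloud_total, onprem_total)          # 240000 80000.0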
spacebanana7 [3 hidden]5 mins ago
It’s not really about the numbers though.
Even spending 10k recurring can be easier administratively than spending 10k on a one-time purchase that depreciates over a 3-year cycle in some organisations, because you don’t have to go into meetings to debate whether it’s actually a 2- or 4-year depreciation, or discuss opportunity costs of locking up capital for 3 years, etc.
Getting things done is mostly a matter of getting through bureaucracy. Projects fail because of getting stuck in approvals far more often than they fail because of going overbudget.
seg_lol [3 hidden]5 mins ago
> It’s not really about the numbers though.
Of course not.
sgarland [3 hidden]5 mins ago
Note that they're running R630/R730s for storage. Those are 12-year old servers, and yet they say each one can do 20 Gbps (2.5 GBps) of random reads. In comparison, AWS instances of the same hardware generation ({c,m,r}4) max out at 50% of that for EBS throughput on m4, and 70% on r4 - and that assumes carefully tuned block sizes.
Old hardware is _plenty_ powerful for a lot of tasks today.
treesknees [3 hidden]5 mins ago
I’m on a project at work replacing our R430s and R730s. They’ve been absolute tanks with very few hardware failures. That said, my company chooses to have OEM support for replacing failed components and keeping firmware/bios/idrac updated. You can absolutely run these if you’re OK with 3rd party replacements or parting out spare machines. Some industries are more tolerant to this than others.
sgarland [3 hidden]5 mins ago
I ran 3x R620s 24/7/365 in my homelab for ~6 years (well, other than when I moved, or shut one down for a clean-and-inspect, or lost power in excess of what my UPS could handle... thanks, Texas). The only things that failed during that time were a couple of sticks of RAM, and a PSU.
hbogert [3 hidden]5 mins ago
Datacenters need cool dry air? <45%
No, low isn't good per se. I worked in a datacenter which in winters had less than 40%; RAM was failing all over the place. Low humidity causes static electricity.
CamperBob2 [3 hidden]5 mins ago
> Low humidity causes static electricity.
RAM that is plugged in and operating isn't subject to external ESD, unless you count lightning strikes. Where are you getting this?
swiftcoder [3 hidden]5 mins ago
The datacenter is in San Diego - a quick Google confirms that external humidity pretty much never drops below 50% there.
Things would be different in a colder climate where humidity goes --> 0% in the winter
mbreese [3 hidden]5 mins ago
Low is good if you are also adding more humidity back in. If you want to maintain 45-50% (guessing), then you would want <45% environmental humidity so that you can raise it to the level you want. You're right about avoiding static, but you'd still want to try to keep it somewhat consistent.
It is much cheaper to use external air for cooling if you can.
hbogert [3 hidden]5 mins ago
Yeah, but the article makes it sound as if lower is better, which it is definitely not. And yeah, you need to control humidity; that might mean sometimes lowering it and sometimes increasing it, by whatever solution you have.
Also, this is where cutting corners indeed results in lower cost, which was the reason for the OP to begin with. It just means you won't get as good a datacenter as people who are actually tuning this all day and have decades of experience.
regular_trash [3 hidden]5 mins ago
The distinction between rent/own is kind of a false dichotomy. You never truly own your platform - you just "rent" it in a more distributed way that shields you from a single stress point. The tradeoff is that you have to manage more resources to take care of it, but you have much greater flexibility.
I have a feeling AI is going to be similar in the future. Sure, you can "rent" access to LLM's and have agents doing all your code. And in the future, it'll likely be as good as most engineers today. But the tradeoff is that you are effectively renting your labor from a single source instead of having a distributed workforce. I don't know what the long-term ramifications are here, if any, but I thought it was an interesting parallel.
MagicMoonlight [3 hidden]5 mins ago
For ML it makes sense, because you’re using so much compute that renting it is just burning money.
For most businesses, it’s a false economy. Hardware is cheap, but having proper redundancy and multiple sites isn’t. Having a 24/7 team available to respond to issues isn’t.
What happens if their data centre loses power? What if it burns down?
butterisgood [3 hidden]5 mins ago
I think this is how IBM is making tons of money on mainframes. A lot of what people are doing with cloud can be done on premises with the right levels of virtualization.
60% YoY growth is pretty excellent for an "outdated" technology.
insuranceguru [3 hidden]5 mins ago
The own vs rent calculus for compute is starting to mirror the market value vs replacement cost divergence we see in physical assets.
Cloud is convenient because it lowers OpEx initially, but you lose control over the long-term CapEx efficiency. Once you reach a certain scale, paying the premium for AWS flexibility stops making sense compared to the raw horsepower of owned metal.
seg_lol [3 hidden]5 mins ago
Using "big" cloud providers is often a mistake. You want to use rented assets to bootstrap and then start deploying on instances that are more and more under your control. With big cloud providers, it is easy to just succumb to their service offerings rather than do the right thing. Do your PoC on Hetzner and DigitalOcean then scale with purpose.
sys42590 [3 hidden]5 mins ago
It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.
I fully lost three small VPS in the OVH fire, and their response was poor: they didn't even refund the time lost, let alone compensate for it (e.g. a couple of months of free VPS), and I got better updates from the news than from them (the news was saying "almost total loss", while they were trying to convince me I had the incredibly bad luck that my three VPS were in the very small zone affected by the fire). The only way I could recover what I lost was from backups on local machines.
When someone points out how safe cloud providers are, as if they have multiple levels of redundancy and are fully protected against even an alien invasion, I remember the OVH fire.
They handled the fire terribly, and after that they improved a bit, but an OVH VPS is just a VM running on a single piece of hardware.
Not quite the same thing as the "Compute" which runs on clusters.
AndroTux [3 hidden]5 mins ago
contingency plan: Don't build your data center out of wood.
srg0 [3 hidden]5 mins ago
Plastic is made from the same stuff as gasoline.
direwolf20 [3 hidden]5 mins ago
Drain cleaner and hydrochloric acid makes salt water. Water is made of highly explosive hydrogen. Salt is made of toxic chlorine and explosive sodium.
fpoling [3 hidden]5 mins ago
They use the datacenter for model training, not to serve online users. Presumably even if it is offline for a week or even a month, it will not be a total disaster as long as they have, for example, offsite tape backups.
instagib [3 hidden]5 mins ago
Flooding due to burst frozen pipe, false sprinkler trigger, or many others.
Something very similar happened at work. Water valve monitoring wasn’t up yet. Fire didn’t respond because reasons. Huge amount of water flooded over a 3 day weekend. Total loss.
twelvechairs [3 hidden]5 mins ago
There's only one solution to this problem and it's 2 data centres in some way or form.
mbreese [3 hidden]5 mins ago
What's the line from Contact?
why build one when you can have two at twice the price?
But, if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs), would still be cheaper than their estimated $25M cloud costs.
golem14 [3 hidden]5 mins ago
Or build two 2.5MM DCs (if you can parallelize your workload well enough) and, in case of disaster, you only lose capacity.
You do, however, need to plan for 1MM+ p.a. in OPEX, because good SREs ain’t cheap (nor are the hardware guys building and maintaining machines).
direwolf20 [3 hidden]5 mins ago
the plan is to not set it on fire. If your office burns down you are already screwed
epistasis [3 hidden]5 mins ago
Ah Slurm, so good to see it still being used. As soon as I touched it in ~2010 I realized this was finally the solid queue management system we needed. Things like Sun Grid Engine or PBS were always such awful and burdensome PoS.
IIRC, Slurm came out of LLNL, and it finally made both usage and management of a cluster of nodes really easy and fun.
Compare Slurm to something like AWS Batch or Google Batch and just laugh at what the cloud has created...
ynac [3 hidden]5 mins ago
Not nearly on the article's level, but I've been operating what I call a fog machine (itsy bitsy personal cloud) for about 15 years. It's just a bunch of local and off-site NAS boxes. It has kinda worked out great. Mostly Synology, but probably won't be when their scheduled retirement comes up. The networking is dead simple, the power use is distributed, and the size of it all is still a monster for me - back in the day, I had to use it for a very large audio project to keep backups of something like 750,000 albums and other audio recordings along with their metadata and assets.
pja [3 hidden]5 mins ago
I’m impressed that San Diego electrical power manages to be even more expensive than in the UK. That takes some doing.
yomismoaqui [3 hidden]5 mins ago
This quote is gold:
The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
rudolph9 [3 hidden]5 mins ago
> Having your own data center is cool
This company sounds more like a hobby interest than a business focused on solving genuine problems.
HanClinto [3 hidden]5 mins ago
It kinda' does, doesn't it?
Re: the "hobby" part is where I agree with you the most. Where you say it's not solving genuine problems is where I differ the most.
It really feels to me like Comma is staffed by people who recognize that they never stopped enjoying playing with Lego -- their bricks just grew up, and they realized they can:
1) solve real-world problems
2) not be jerks about it
3) get paid to do it
Not everything has to be about optimizing for #3.
I'm a happy paying customer of Comma.ai (Comma four, baby!) -- their product is awesome, extremely consumer-friendly, and I hope they can grow in their success!
BirAdam [3 hidden]5 mins ago
To me it sounds more like a return to vertical integration.
This is becoming increasingly common as far as I can tell.
There are benefits either direction, and I think that each company needs to evaluate the pros and cons themselves. Emotional pros/cons are something companies need to evaluate as employee morale can make or break a company. If the company is super technical in culture and they gain something intangible that is boosting the bottom line, having a datacenter as a "cool" factor is probably worth it.
vovavili [3 hidden]5 mins ago
I'd argue that it is in the long-term interest of any genuinely innovative company to attract intellectually curious talent with some coolness factor.
pu_pe [3 hidden]5 mins ago
> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.
It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.
I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies. The same way that comma.ai employees likely don't have an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.
szszrk [3 hidden]5 mins ago
> I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies.
They spend too much time on yet another cloud-native support group call, studying for ThatOneCloudProvider certificates, figuring out that one implementation's caveats, standardizing security procedures between cloud teams, and so on.
Yet the people in the article just throw a 1000-line KV store, mkv [0], onto a huge raw storage server and call it a day. And it's a legit choice: they did an actual study beforehand and concluded that they don't need redundancy in most cases. At all. I respect that.
If it were me, instead of writing all these bespoke services to replicate cloud functionality, I'd just buy oxide.computer systems.
0xbadcafebee [3 hidden]5 mins ago
> If your business relies on compute, and you run that compute in the cloud, you are putting a lot of trust in your cloud provider. Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
This is not a valid reason for running your own datacenter, or running your own server.
> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering. Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
This is not a valid reason for running your own datacenter, or running your own server.
> Avoiding the cloud for ML also creates better incentives for engineers. Engineers generally want to improve things. In ML many problems go away by just using more compute. In the cloud that means improvements are just a budget increase away. This locks you into inefficient and expensive solutions. Instead, when all you have available is your current compute, the quickest improvements are usually speeding up your code, or fixing fundamental issues.
This is not a valid reason for owning a datacenter, or running your own server.
> Finally there’s cost, owning a data center can be far cheaper than renting in the cloud. Especially if your compute or storage needs are fairly consistent, which tends to be true if you are in the business of training or running models. In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.
This is one of only two valid reasons for owning a datacenter, and one of several valid reasons for running your own server.
The only two valid reasons to build/operate a datacenter: 1) what you're doing is so costly that building your own factory is the only profitable way for your business to produce its widgets, 2) you can't find a datacenter with the location or capacity you need and there is no other way to serve your business needs.
There's many valid reasons to run your own servers (colo), although most people will not run into them in a business setting.
siliconc0w [3 hidden]5 mins ago
You can also buy the hardware and hire an IT vendor to rack it and help manage it as smart hands, so you never need to visit the datacenter. With modern beefy hardware, even large web services only need a few racks, so most orgs don't even need to manage a large footprint.
Sure, you have to schedule your own hardware repairs and updates, but it also means you don't need to wrangle with the ridiculous cost-engineering, reserved instances, cloud product support issues or API deprecations, proprietary configuration languages, etc.
Bare metal is better for a lot of non-cost reasons too, as the article notes it's just easier/better to reason about the lower level primitives and you get more reliable and repeatable performance.
j45 [3 hidden]5 mins ago
That’s called managed servers or managed services.
I have run bare metal and managed services; you just have to be clear on what you have coverage for when disaster strikes, or be willing to proactively replace hard drives before they die.
apothegm [3 hidden]5 mins ago
This also depends so much on your scaling needs. If you need 3 mid-sized ECS/EC2 instances, a load balancer, and a database with backups, renting those from AWS isn’t going to be significantly more expensive for a decent-sized company than hiring someone to manage a cluster for you and dealing with all the overhead of keeping it maintained and secure.
If you’re at the scale of hundreds of instances, that math changes significantly.
And a lot of it depends on what type of business you have and what percent of your budget hosting accounts for.
infecto [3 hidden]5 mins ago
I also think it’s a risk model too. Every time I see these kinds of posts I think they miss the point that there is a balance not only on cost, like you describe, but risk as well. You are paying to offload some of the risk from yourself.
betaby [3 hidden]5 mins ago
> You are paying to offload some of the risk from yourself.
The opposite is also true: one is risking being banned by exascalers.
eldenring [3 hidden]5 mins ago
The issue is that they have already paid off their datacenter 5x over compared to cloud. For offline, batch training, I don't see how any amount of risk could offset the savings.
infecto [3 hidden]5 mins ago
It’s no issue, and it’s right up front for their situation: if your business is compute, the cloud makes little sense.
That said, from the risk perspective I assume that for what they’re doing in the data center there is low risk if downtime happens.
komali2 [3 hidden]5 mins ago
> The cloud requires expertise in company-specific APIs and billing systems.
This is one reason I hate dealing with AWS. It feels like a waste of time in some ways. Like learning a fly-by-night javascript library - maybe I'm better off spending that time writing the functionality on my own, to increase my knowledge and familiarity?
JKCalhoun [3 hidden]5 mins ago
Naive comment from a hobbyist with nothing close to $5M: I'm curious about the degree to which you build a "home lab" equivalent. I mean if "scaling" turned out to be just adding another Raspberry Pi to the rack (where is Mr. Geerling when you need him?) I could grow my mini-cloud month by month as spending money allowed.
(And it would be fun too.)
sgarland [3 hidden]5 mins ago
The degree is whatever you want to deal with. I had a rack at my last house (need to redesign the space for it at new house) with 3x Dell R620s in a Proxmox cluster, running K8s, serving Ceph from NVMe drives over Infiniband (for the mesh traffic), and 2x Supermicros running independent ZFS pools.
It was fun to build - especially Infiniband - but my next iteration is going to be a single beefy server, maybe with storage attached externally. What I had had outstanding uptime, but ultimately it was massively overkill, noisy, hot, and sucked down power.
coffeebeqn [3 hidden]5 mins ago
You sure can. Pis are pretty underpowered; you can get machines with more cores, memory, PCIe lanes, and networking out there and virtualize them.
user34283 [3 hidden]5 mins ago
I paid 150€ for a Mini PC with an Intel N100, 16 GB of DDR5 memory, and a 500 GB SSD.
While I have no intention to scale up low spec hardware like this, it at least seems to beat the Azure VMs we use at work with "4 CPUs", which corresponds to two physical cores on an AMD EPYC CPU.
And that super slow machine I understand costs more than $100 per month, and that's without charges for disk space slower than the SSD, or network traffic.
Renting at Azure seems to be a terrible decision, particularly for desktop use.
Maro [3 hidden]5 mins ago
Working at a non-tech regional bigco, where ofc cloud is the default, I see every day how AWS costs get out of hand; it's a constant struggle just to keep costs flat. In our case, the reality is that NONE of our services require scalability, and high uptime is mainly nice for my blood pressure... we only really need uptime during business hours; nobody cares what happens at night when everybody is sleeping.
On the other hand, there's significant vendor lockin, complexity, etc. And I'm not really sure we actually end up with less people over time, headcount always expands over time, and there's always cool new projects like monitoring, observability, AI, etc.
My feeling is, if we rented 20-30 chunky machines and ran Linux on them, with k8s, we'd be 80% there. For specific things I'd still use AWS, like infinite S3 storage, or RDS instances for super-important data.
If I were to do a startup, I would almost certainly not base it off AWS (or other cloud), I'd do what I write above: run chunky servers on OVH (initially just 1-2), and use specific AWS services like S3 and RDS.
A bit unrelated to the above, but I'd also try to keep away from expensive SaaS like Jira, Slack, etc. I'd use the best self-hosted open source version, and be done with it. I'd try Gitea for git hosting, Mattermost for team chat, etc.
And actually, given the geo-political situation as an EU citizen, maybe I wouldn't even put my data on AWS at all and self-host that as well...
wessorh [3 hidden]5 mins ago
What is the underlying filesystem for your KV store? It doesn't appear to use raw devices.
nubela [3 hidden]5 mins ago
Same thing. I was previously spending 5-8K on DigitalOcean, supposedly a "budget" cloud. Then the company was sold, and I started a new company on entirely self-hosted hardware. Cloudflare tunnel + CC + microk8s made it trivial! And I spend close to nothing other than internet that I already am spending on. I do have solar power too.
ex-aws-dude [3 hidden]5 mins ago
I can see how this would work fine if the primary purpose is training rather than serving large volumes of customer traffic in multiple regions.
It would probably even make sense for some companies to still use the cloud for their API but do the training on-prem, as that may be the expensive part.
eubluue [3 hidden]5 mins ago
On top of that, now when the US cloud act is again a weapon against EU, most European companies know better and are migrating in droves to colo, on-prem and EU clouds. Bye bye US hyperscalers!
juvoly [3 hidden]5 mins ago
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.
Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.
Handling health data, Juvoly is happy to run AI workloads on premises.
Hasz [3 hidden]5 mins ago
This is hackernews, do the math for the love of god.
There are good business and technical reasons to choose a public cloud.
There are good business and technical reasons to choose a private cloud.
There are good business and technical reasons to do something in-between or hybrid.
The endless "public cloud is a ripoff" or "private clouds are impossible" is just a circular discussion past each other. Saying to only use one or another is textbook cargo-culting.
bob1029 [3 hidden]5 mins ago
The #1 reason I would advocate for using AWS today is the compliance package they bring to the party. No other cloud provider has anything remotely like Artifact. I can pull Amazon's PCI-DSS compliance documentation using an API call. If you have a heavily regulated business (or work with customers who do), AWS is hard to beat.
If you don't have any kind of serious compliance requirement, using Amazon is probably not ideal. I would say that Azure AD is ok too if you have to do Microsoft stuff, but I'd never host an actual VM on that cloud.
Compliance and "Microsoft stuff" covers a lot of real world businesses. Going on prem should only be done if it's actually going to make your life easier. If you have to replicate all of Azure AD or Route53, it might be better to just use the cloud offerings.
wiether [3 hidden]5 mins ago
> The #1 reason I would advocate for using AWS today is the compliance package they bring to the party.
I was going to post the same comment.
Most of the people agreeing to foot the AWS bill do it because they see how much the compliance is worth to them.
cgsmith [3 hidden]5 mins ago
I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate their own hardware? Proximity isn't an issue when you can have the datacenter perform physical tasks. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.
PS: BX cable instead of conduit for electrical looks cringe.
vidarh [3 hidden]5 mins ago
The main reason not to colocate is if you're somewhere with high real estate costs.
E.g. Hetzner managed servers compete on price with colocation for me because I'm in London.
doublerabbit [3 hidden]5 mins ago
I colocate in London; a single server / firewall comes to around £5k a year. I also colocate two other servers in some northern UK location on an industrial estate for £2k as my backups. I've never enjoyed the cloud, and dedicated servers have their own caveats too.
Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
My kit is old, second-hand old (Cisco UCS 220 M5, 2x Dell somethings), and last night I just discovered I can throw in two NVIDIA T4s and turn it into a personal LLM box.
I'm quite excited to have my own colocated server with basic LLM abilities. My own hardware with my own data and my own cables. Just need my own IPs now.
vidarh [3 hidden]5 mins ago
> Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
The same would apply for any number of hosts. Hetzner/OVH are cheap, but as your own numbers show the location price gap is more than sufficient to cover the costs of servers.
In fact you can colocate with Hetzner too, and you'd get a similar price gap - the lower cost of real-estate is a large part of the reason why they can be as cheap as they are.
Data centre operations is a real estate play - to the point that at least one UK data centre operator is owned by a real estate investment company.
doublerabbit [3 hidden]5 mins ago
Thanks. I hadn't seen it as such and you're right. I guess it comes down to personal preference.
Given that data has become a commodity, in that I can sell your username and email for a few pence, I would rather have my own hardware in my own possession, so that any request for it has to go through me, not some server provider.
vidarh [3 hidden]5 mins ago
That's a totally valid reason. I also have infrastructure I operate because of personal comfort rather than because it's financially optimal.
b8 [3 hidden]5 mins ago
SSDs don't last longer than HDDs. Also, they're much more expensive due to AI now. They should move to cut down on power costs.
kavalg [3 hidden]5 mins ago
This was one of the coolest job ads that I've ever read :). Congrats for what you have done with your infrastructure, team and product!
HanClinto [3 hidden]5 mins ago
Agreed!
Gives a whole new level to the idea of "full stack developer"
Dormeno [3 hidden]5 mins ago
The company I work for used to have a hybrid where 95% was on-prem, but became closer to 90% in the cloud when it became more expensive to do on-prem because of VMware licensing. There are alternatives to VMware, but not officially supported with our hardware configuration, so the switch requires changing all the hardware, which still drives it higher than the cloud. Almost everything we have is cloud agnostic, and for anything that requires resilience, it sits in two different providers.
Now the company is looking at doing further cost savings as the buildings rented for running on-prem are sitting mostly unused, but also the prices of buildings have gone up in recent years, notably too, so we're likely to be saving money moving into the cloud. This is likely to make the cloud transition permanent.
danpalmer [3 hidden]5 mins ago
> Cloud companies generally make onboarding very easy, and offboarding very difficult.
I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
lelanthran [3 hidden]5 mins ago
> As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
It's the other way around. How do you think all businesses moved to the cloud in the first place?
comrade1234 [3 hidden]5 mins ago
15 years ago or so, a spreadsheet was floating around where you could enter server costs, compute power, etc. and it would tell you when you would break even by buying instead of going with AWS. I think it was leaked from Amazon, because it was always three years to break even, even as hardware changed over time.
TonyStr [3 hidden]5 mins ago
Azure provides their own "Total Cost of Ownership" calculator for this purpose [0]. Notably, this makes you estimate peripheral costs such as cost of having a server administrator, electricity, etc.
I plugged in our own numbers (60 servers we own in a data centre we rent) and Microsoft thinks this costs us an order of magnitude more than it does.
Their "assumption" for hardware purchase prices seems way off compared to what we buy from Dell or HP.
It's interesting that the "IT labour" cost they estimate is $140k for DIY, and $120k for Azure.
Their saving is 5 times more than what we spend...
TonyStr [3 hidden]5 mins ago
Thank you, I've wanted to see someone use this in the real world. When doing Azure certifications (AZ900, AZ204, etc.), they force you to learn about this tool.
Symbiote [3 hidden]5 mins ago
I may be out of date with RAM prices. Dell's configuration tool wants £1000 each for 32GB RDIMMs — but prices in Dell's configuration tool are always significantly higher than we get if we write to their sales person.
Even so, a rough configuration for a 2-processor 16 core/processor server with 256GiB RAM comes to $20k, vs $22k + 100% = $44k quoted by MS. (The 100% is MS' 20%-per-year "maintenance cost" that they add on to the estimate. In reality this is 0% as everything is under Dell's warranty.)
And most importantly, the tool is only comparing the cost of Azure to constructing and maintaining a data centre! Unless there are other requirements (which would probably rule out Azure anyway) that's daft, a realistic comparison should be to colocation or hired dedicated servers, depending on the scale.
vidarh [3 hidden]5 mins ago
If you buy, maybe. Leasing or renting tends to be cheaper from day one. Tack on migration costs and ca. 6 months is a more realistic target. If the spreadsheet always said 3 years, it sounds like an intentional "leak".
g-b-r [3 hidden]5 mins ago
Did the AWS part include the egress costs to extract your data from AWS, if you ever want to leave them?
Well, somebody should recreate it. I smell a potential startup idea somewhere. There's a ton of "cloud cost optimizer" software, but most of it involves tweaking AWS knobs and taking a cut of the savings. A startup that could offload non-critical services from AWS to colo and traditional bare-metal hosting like Hetzner has a strong future.
One thing to keep in mind is that the curve for GPU depreciation (in the last 5 years at least) is a little steeper than 3 years. Current estimates are that the capital depreciation cost plunges dramatically around the third year. For a top-tier H100, depreciation kicks in around the 3rd year, but they mentioned that for the less capable ones like the A100 the depreciation is even worse.
Now this is not factoring in the cost of labour. Labor at SF wages is dreadfully expensive; now if your data center is right across the border in Tijuana, on the other hand...
imcritic [3 hidden]5 mins ago
I love articles like this and companies with this kind of openness. Mad respect to them for this article and for sharing software solutions!
durakot [3 hidden]5 mins ago
There's the HN I know and love
evertheylen [3 hidden]5 mins ago
> Maintaining a data center is much more about solving real-world challenges. The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
I find this to be applicable on a smaller scale too! I'd rather set up and debug a beefy Linux VPS via SSH than fiddle with various proprietary cloud APIs/interfaces. It doesn't go as low-level as Watts, bits and FLOPs, but I still consider knowledge about Linux more valuable than knowing which Azure knobs to turn.
bovermyer [3 hidden]5 mins ago
I'm thinking about doing a research project at my university looking into distributed "data centers" hosted by communities instead of centralized cloud providers.
The trick is in how to create mostly self-maintaining deployable/swappable data centers at low cost...
bradley13 [3 hidden]5 mins ago
Goes for small business and individuals as well. Sure, there are times that cloud makes sense, but you can and should do a lot on your own hardware.
arjie [3 hidden]5 mins ago
Realistically, it's the speed with which you can expand and contract. The cloud gives unbounded flexibility - not on the per-request scale or whatever, but on the per-project scale. To try things out with a bunch of EC2s or GCEs is cheap. You have it for a while and then you let it go. I say this as someone with terabytes of RAM in servers, and a cabinet I have in the Bay Area.
rmoriz [3 hidden]5 mins ago
Cloud, in the sense of "another company's infrastructure", always implies losing the competence to select, source and operate hardware. Treating hardware as a commodity will eventually make your very own business a commodity: someone can just copy your software/IP and ruin your business. Every durable business needs some kind of intellectual property and human skills that are not easily replaceable. This sounds binary, but isn't. You can build long-lasting partnerships. The German Mittelstand did that over decades.
satvikpendem [3 hidden]5 mins ago
I just read about Railway doing something similar. Sadly their prices are still high compared to other bare-metal providers, and even a VPS such as Hetzner with Dokploy offers a very similar feature set, yet for the same 5 dollars you get way more CPU, storage and RAM.
Their pricing page is so confusing:
CPU: $0.00000772 per vCPU / sec
This seems to imply $40 / month for 2 vCPU which seems very high?
Or maybe they mean "used" CPU versus idle?
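A quick sanity check of that figure, assuming a vCPU that is billed for every second of the month (i.e. never idle):

    # Convert the quoted per-second vCPU price into a rough monthly figure.
    price_per_vcpu_second = 0.00000772        # $ per vCPU-second, as quoted above
    seconds_per_month = 60 * 60 * 24 * 30     # ~2.59 million seconds
    per_vcpu_month = price_per_vcpu_second * seconds_per_month
    print(round(per_vcpu_month, 2))           # ~20.01, so about $40/month for 2 vCPUs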
Neil44 [3 hidden]5 mins ago
Billing per used (not idle) CPU cycle would be quite interesting. The number of cores would just effectively be your cost cap. Efficiency would be even more important. And if the provider oversubscribes cores, you just pay less. Actually, that's probably why they don't do it...
efreak [3 hidden]5 mins ago
Don't most big clouds not share cores between tenants? I have a vague feeling that around spectre/meltdown this was stopped. I wouldn't be surprised to be wrong, but if you're dedicating a core to a VM, you're not going to charge less for unused CPU that nobody else can use.
lawrenceyan [3 hidden]5 mins ago
Hetzner bare metal ran much of crypto for many years before they cracked down on it.
monster_truck [3 hidden]5 mins ago
Don't even have to go this far. Colocating in a couple regions will give you most of the logistical thrills at a fraction of the cost!
coffeebeqn [3 hidden]5 mins ago
Heavy ML workloads make this more worthwhile since you get to design it to squeeze value out of every facet. For a basic web server and database it’s definitely overkill and something like a colocation makes much more sense
infecto [3 hidden]5 mins ago
I love this article. Great write up. Gave me the same feeling when I would read about Stackoverflows handful of servers that ran all of the sites.
throwaway-aws9 [3 hidden]5 mins ago
The cloud is a psyop, a scam. Except at the tiniest free-tier / near free-tier use cases, or true scale to zero setups.
I've helped a startup with 2.5M revenue reduce their cloud spend from close to 2M/yr to below 1M/yr. They could have reached 250k/yr renting bare-metal servers. Probably 100k/yr in colos by spending 250k once on hardware. They had the staff to do it but the CEO was too scared.
Cloud evangelism (is it advocacy now?) messed up the minds of swaths of software engineers. Suddenly costs didn't matter and scaling was the answer to poor designs. Sizing your resource requirements became a lost art, and getting into reaction mode became law.
Welcome to "move fast and get out of business", all enabled by cloud architecture blogs that recommend tight integration with vendor lock-in mechanisms.
Use the cloud to move fast, but stick to cloud-agnostic tooling so that it doesn't suck you in forever.
I've seen how much cloud vendors are willing to spend to get business. That's when you realize just how massive their margins are.
re-thc [3 hidden]5 mins ago
> The cloud is a psyop, a scam.
You're just young.
> Suddenly costs didn't matter and scaling was the answer to poor designs.
It did.
Did you know that the cloud costs less than what the internal IT team at a company would charge you?
Let's say you worked on product A for a company and needed additional VM. Besides paperwork, the cost to you (for your cost center) would be more than using the company credit card for the cloud.
> Sizing your resource requirements became a lost art
In what way? We used to size for 2-4x since getting additional resources (for the in-house team) would be weeks to months. Same old - just cloud edition.
throwaway-aws9 [3 hidden]5 mins ago
> You're just young.
And I feel great!
> Did you know that cloud cost less than what the internal IT team at a company would charge you?
Yes. Internal IT teams run old-school and are inefficient. And that's what the vendor tells you while they create shadow IT inside your company. Skip ITSM and ITIL... do it the SRE way.
Until the cloud economist (real role) comes in and finds a way to extract more rent out of their customer base (like GCP's upcoming doubling rates on CDN Interconnect). And until internal IT kills shadow IT and regains management of cloud deployments. Cybersecurity and stuff...
Back to square one. ITIL with cloud deployments. Some use cases will be way cheaper... but for your 100s of PBs of enterprise data, that's another story. And data gravity will kill many initiatives just based on bit movement costs.
> Besides paperwork, the cost to you (for your cost center) would be more than using the company credit card for the cloud.
To some extent. One is hard dollars the other is funny money. But I thought paying for cloud with the company credit card was a 2016 thing. Now it's paid through your internal IT cost center, with internal IT markup.
I've seen petabytes of data move to the cloud and then we couldn't perform some queries on it anymore as that store wouldn't support it, and we'd need to spend 7 figures to move to another cloud database to query it. And that's hard dollars.
Yes, during early cloud days it was lean and aimed at startups. Now it's aimed at enterprise, and for some reason lots of startups still think it's optimized for them. It's not and it hasn't been for a long time.
re-thc [3 hidden]5 mins ago
> Yes. Internal IT teams ran old-school are inefficient.
They aren't. It's politics. They want to protect and improve their own headcount and resources.
> One is hard dollars the other is funny money.
All the same to a team / department. It's not like people run it like their own wallet.
> finds a way to extract more rent out of their customer base
I find you just have a grudge against the cloud and hence too young. For every example you have the so-called "internal" IT team can and will do just the same. Go back to 90s, 00s - it was the same. The infra team wanted some fancy new storage arrays and charge everyone 2x for the new service etc.
> and for some reason lots of startups still think it's optimized for them. It's not and it hasn't been for a long time.
The problem isn't the cloud. Startups have always worked like this, even 10-20 years ago. It's about wastage. They can raise and grow faster. So they think. The problem, if any, is that recently money isn't as cheap. Nothing new.
dagi3d [3 hidden]5 mins ago
> San Diego power cost is over 40c/kWh, ~3x the global average. It’s a ripoff, and overpriced simply due to political dysfunction.
Would anyone mind elaborating? I always thought this was a direct result of the free market. Not sure if by dysfunction the OP means lack of intervention.
omoikane [3 hidden]5 mins ago
Electricity cost in California is generally more expensive than most other US states, except Hawaii. Not sure why.
Perhaps Comma needed the datacenter to be in San Diego for latency or other reasons, but if they need it mostly for compute, it would have been cheaper to operate their datacenter elsewhere... but if we keep going down that path, maybe it actually becomes cheaper to rent a cloud after all.
throwawaypath [3 hidden]5 mins ago
>Mind anyone elaborate? Always thought this is was a direct cause of the free market. Not sure if by dysfunction the op means lack of intervention.
The majority of Californians have no say and cannot choose their utilities provider. This is the polar opposite of the "free market".
amluto [3 hidden]5 mins ago
Did you say “free market”? There is one provider. There is a lot of regulation, mostly incompetent. It’s a mess.
nickorlow [3 hidden]5 mins ago
Even at the personal blog level, I'd argue it's worth it to run your own server (even if it's just an old PC in a closet). Gets you on the path to running a home lab.
drnick1 [3 hidden]5 mins ago
Absolutely. I don't have a blog but run my own email, several game servers, Matrix instance, Nextcloud and other internal services on a retired gaming PC. The total cost of my cloud subscriptions is $0, and no one is snooping on me. It's a great setup when combined with Linux machines and GrapheneOS phones, completely private and free of Big Tech.
stego-tech [3 hidden]5 mins ago
IT dinosaur here, who has run and engineered the entire spectrum over the course of my career.
Everything is a trade-off. Every tool has its purpose. There is no "right way" to build your infrastructure, only a right way for you.
In my subjective experience, the trade-offs are generally along these lines:
* Platform as a Service (Vercel, AWS Lambda, Azure Functions, basically anything where you give it your code and it "just works"): great for startups, orgs with minimal talent, and those with deep pockets for inevitable overruns. Maximum convenience means maximum cost. Excellent for weird customer one-offs you can bill for (and slap a 50% margin on top). Trade-off is that everything is abstracted away, making troubleshooting underlying infrastructure issues nigh impossible; also that people forget these things exist until the customer has long since stopped paying for them or a nasty bill arrives.
* Infrastructure as a Service (AWS, GCP, Azure, Vultr, etc; commonly called the "Public Cloud"): great for orgs with modest technical talent but limited budgets or infrastructure that's highly variable (scales up and down frequently). Also excellent for everything customer-facing, like load balancers, frontends, websites, you name it. If you can invoice someone else for it, putting it in here makes a lot of sense. Trade-off is that this isn't yours, it'll never be yours, you'll be renting it forever from someone else who charges you a pretty penny and can cut you off or raise prices anytime they like.
* Managed Service/Hosting Providers (e.g., ye olde Rackspace): you don't own the hardware, but you're also not paying the premium for infrastructure orchestrators. As close to bare metal as you can get without paying for actual servers. Excellent for short-term "testing" of PoCs before committing CapEx, or for modest infrastructure needs that aren't likely to change substantially enough to warrant a shift either on-prem or off to the cloud. You'll need more talent though, and you're ultimately still renting the illusion of sovereignty from someone else in perpetuity.
* Bare Metal, be it colocation or on-premises: you own it, you decide what to do with it, and nobody can stop you. The flip side is you have to bootstrap everything yourself, which can be a PITA depending on what you actually want - or what your stakeholders demand you offer. Running VMs? Easy-peasy. Bare metal K8s clusters? I mean, it can be done, but I'd personally rather chew glass than go without a managed control plane somewhere. CapEx is insane right now (thanks, AI!), but TCO is still measured in two to three years before you're saving more than you'd have spent on comparable infrastructure elsewhere, even with savings plans. Talent needs are highly variable - a generalist or two can get you 80% to basic AWS functionality with something like Nutanix or VCF (even with fancy stuff like DBaaS), but anything cutting edge is going to need more headcount than a comparable IaaS build. God help you if you opt for a Microsoft stack, as any on-prem savings are likely to evaporate at your next True-Up.
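If you want to sanity-check that two-to-three-year break-even claim against your own numbers, the arithmetic fits in a few lines. Here is a minimal sketch in Python; every figure in it is a made-up placeholder, not a number from this thread or any vendor:

    # Rough on-prem vs cloud break-even sketch. All numbers are illustrative
    # placeholders; substitute your own quotes and your current cloud bill.
    capex = 900_000               # hypothetical: servers, racks, networking, install
    monthly_onprem_opex = 20_000  # hypothetical: power, colo/space, spares, support
    monthly_cloud_bill = 50_000   # hypothetical: equivalent cloud spend

    monthly_saving = monthly_cloud_bill - monthly_onprem_opex
    breakeven_months = capex / monthly_saving
    print(f"break-even after ~{breakeven_months:.0f} months")  # 30 months here

With those placeholder figures the hardware pays for itself in about 30 months, roughly the two-to-three-year window mentioned above; the interesting part is how sensitive the result is to the opex and cloud-bill estimates.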
In my experience, companies have bought into the public cloud/IaaS because they thought it'd save them money versus the talent needed for on-prem; to be fair, back when every enterprise absolutely needed a network team and a DB team and a systems team and a datacenter team, this was technically correct. Nowadays, most organizational needs can be handled with a modest team of generalists or a highly competent generalist and one or two specialists for specific needs (e.g., a K8s engineer and a network engineer); modern software and operating systems make managing even huge orgs a comparable breeze, especially if you're running containers or appliances instead of bespoke VMs.
As more orgs like Comma or Basecamp look critically at their infrastructure needs versus their spend, or they seriously reflect on the limited sovereignty they have by outsourcing everything to US Tech companies, I expect workloads and infrastructure to become substantially more diversified than the current AWS/GCP/Azure trifecta.
segmondy [3 hidden]5 mins ago
I cancelled my digital ocean server of almost a decade late last year and replaced it with a raspberry pi 3 that was doing nothing. We can do it, we should do it.
CodeCompost [3 hidden]5 mins ago
Microsoft made the TCO argument and won. Self-hosting is only an option if you can afford expensive SysOps/DevOps/WhateverWeAreCalledTheseDays to manage it.
davsti4 [3 hidden]5 mins ago
So.... you're saying they must be understaffed and paying poverty range wages to afford the San Diego climate and still cut a profit? ;)
faust201 [3 hidden]5 mins ago
Look at the bottom of that page:
An error occurred: API rate limit already exceeded for installation ID 73591946.
Hey, how do SSDs fail these days? Do they still just vanish off the bus? Or do they go into read-only mode?
gwbas1c [3 hidden]5 mins ago
TLDR:
> In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.
IMO, that's the biggie. It's enough to justify paying someone to run their datacenter. I wish there was a bit more detail to justify those assumptions, though.
That being said, if their needs grow by orders of magnitude, I'd anticipate that they would want to move their servers somewhere with cheaper electricity.
Havoc [3 hidden]5 mins ago
Interesting that they go for no redundancy
figmert [3 hidden]5 mins ago
What redundancy are we talking about? AWS has proven to the world on multiple occasions that redundancy across geo locations is useless, because if us-east-1 is down, their whole cloud is effectively down, taking a big chunk of the world with it.
Half sarcasm of course, but it goes to show that the world is not going to fall apart in many cases when it comes to software. Sure, it's not ideal in lots of cases, but we'll survive without redundancy.
assaddayinh [3 hidden]5 mins ago
Is there a client for selling your own unused private cloud capacity?
tirant [3 hidden]5 mins ago
Well, their comment section is for sure not running on premises, but in the cloud:
"An error occurred: API rate limit already exceeded for installation ID 73591946."
langarus [3 hidden]5 mins ago
This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.
hyperbovine [3 hidden]5 mins ago
I agree, and cloud compute is poised to become even more commoditized in the coming years (gazillion new data centers + AI plateauing + efficiency gains, the writing is on the wall). There’s no way this makes sense for most companies.
NitpickLawyer [3 hidden]5 mins ago
> AI plateauing
Ummm is that plateauing with us in the room?
The advantage of renting vs. owning is that you can always get the latest gen, and that brings you newer capabilities (e.g. fp8, fp4, etc.) and cheaper prices for current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets I'd make.
lelanthran [3 hidden]5 mins ago
> Ummm is that plateauing with us in the room?
Well, the capabilities have already plateaued as far as I can tell :-/
Over the next few years we can probably wring out some performance improvements, maybe some efficiency improvements.
A lot of the current AI users right now are businesses trying to on-sell AI (code reviewers/code generators, recipe apps, assistant apps, etc), and there are way too many of them relative to demand, so you can expect maybe 90% of these companies to disappear in the next few years, taking the demand for capacity with them.
ocdtrekkie [3 hidden]5 mins ago
It's the opposite. The more consistent your workload the more practical and cost-effective it is to go on-prem.
Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.
langarus [3 hidden]5 mins ago
Other benefits: easy access to reliable infrastructure and latest hardware which you can swap as you please. There are cases where it makes sense to navigate away from the big players (like dropbox going from aws to on-prem), but again you make this move when you want to optimize costs and are not worried about the trade-offs.
rvz [3 hidden]5 mins ago
Not long ago Railway moved from GCP to their own infrastructure since it was very expensive for them. [0] Some go for an Oxide rack [1] for a full-stack solution (both hardware and software) for intense GPU workloads, instead of building it themselves.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.
I was under the impression that the Oxide rack does not currently ship with GPUs - at least not built in. Has this changed recently?
panick21_ [3 hidden]5 mins ago
Oxide racks don't yet have a GPU solution. But they are a good option for general compute, and even where GPUs are required, general compute hasn't gone away.
RT_max [3 hidden]5 mins ago
The observation about incentives is underappreciated here. When your compute is fixed, engineers optimize code. When compute is a budget line, engineers optimize slide decks. That's not really a cloud vs on-prem argument, it's a psychology-of-engineering argument.
squeefers [3 hidden]5 mins ago
mark my words. cloud will fall out of fashion, but it will come back into fashion under another name in a few years. it's cyclical.
Semaphor [3 hidden]5 mins ago
In case anyone from comma.ai reads this: "CTO @ comma.ai" the link at the end is broken, it’s relative instead of absolute.
croisillon [3 hidden]5 mins ago
no because it's on premise you see? you don't need to access the world wide web, just their server
/s
lovegrenoble [3 hidden]5 mins ago
I've just shifted to Hetzner, no regrets
deadbabe [3 hidden]5 mins ago
Clouds suck. But so does “on premises”. Or co-location.
In the future, what you will need to remain competitive is computing at the edge. Only one company is truly poised to deliver on that at massive scale.
petesergeant [3 hidden]5 mins ago
One thing I don't really understand here is why they're incurring the costs of having this physically in San Diego, rather than further afield with a full-time server tech essentially living on-prem, especially if their power numbers are correct. Is everyone being able to physically show up on site immediately that much better than a 24/7 pair of remote hands + occasional trips for more team members if needed?
mgaunard [3 hidden]5 mins ago
Coolness factor of having a datacenter right in your office.
davsti4 [3 hidden]5 mins ago
... and you can be one good earthquake away from insolvency.
intalentive [3 hidden]5 mins ago
I like Hotz’s style: simply and straightforwardly attempting the difficult and complex. I always get the impression: “You don’t need to be too fancy or clever. You don’t need permission or credentials. You just need to go out and do the thing. What are you waiting for?”
tirant [3 hidden]5 mins ago
This was written by Harald Schäfer, the CTO of comma.ai. I'm not so sure if G. Hotz is still involved in comma.ai.
piker [3 hidden]5 mins ago
Don't think he is, but it does seem like he inspired a hacker mentality in the shop during his tenure.
intalentive [3 hidden]5 mins ago
Ah I missed that.
pelasaco [3 hidden]5 mins ago
if I understood correctly, you don't use Kubernetes, right? Did you consider it?
rob_c [3 hidden]5 mins ago
And finally we reach the point where you're not shot down for explaining that if you invest in ownership, then after everything is over you still have something left with intrinsic value, regardless of what you were doing with it.
Otherwise, well just like that gym membership, you get out what you put into it...
kaon_2 [3 hidden]5 mins ago
Am I the only one who is simply scared of running their own cloud? What happens if your administrator credentials get leaked? At least with Azure I can phone Microsoft and initiate a recovery. Because of backups and soft-deletion policies quite a lot is possible. I guess you can build in these failsafe scenarios locally too? But what if a fire happens, like in South Korea? Sure, most companies run more immediate risks such as going bankrupt, but at least the cloud relieves me of the stuff of nightmares.
Except now I have nightmares that the USA will enforce the patriot act and force Microsoft to hand over all their data in European data centers and then we have to migrate everything to a local cloud provider. Argh...
direwolf20 [3 hidden]5 mins ago
Do you have a computer at home? Are you scared of its credentials leaking? A server is just another computer with a good internet connection.
You can equip your server with a mouse, keyboard and screen and then it doesn't even need credentials. The credential is your physical access to the mouse and keyboard.
geodel [3 hidden]5 mins ago
I mean, people nowadays are really scared of using a microwave oven too. What happens if I heat my coffee 1 min too long? Could be a near-death experience. That's why I always drive down to Starbucks for coffee!
direwolf20 [3 hidden]5 mins ago
True! Decline of defiance or something. Everyone is suddenly a follower. Any idea what caused it? Micro plastics in the brain? Social media?
vachina [3 hidden]5 mins ago
Then literally own the cloud, like run the hardware on-prem yourself.
architsingh15 [3 hidden]5 mins ago
Looks insanely daunting imo
devmor [3 hidden]5 mins ago
> In a future blog post I hope I can tell you about how we produce our own power and you should too.
Rackmounted fusion reactors, I hope. Would solve my homelab wattage issues too.
vasco [3 hidden]5 mins ago
Having worked only with the cloud, I really wonder if these companies don't use other software with subscriptions. Even though AWS is "expensive", it's just another line item compared to most companies' overall SaaS spend. Most businesses don't need that much compute or data transfer in the grand scheme of things.
mrbluecoat [3 hidden]5 mins ago
Stopped reading at "Our main storage arrays have no redundancy". This isn't a data center, it's a volatile AI memory bank.
sgarland [3 hidden]5 mins ago
You should have kept reading:
> Redundancy is not needed since no specific data is critical.
> we have a redundant mkv storage array to store all of our trained models and training metrics.
That's just called understanding your failure domains, and RTO/RPO needs.
huntaub [3 hidden]5 mins ago
This turns out to be a more and more important primitive for companies who are building their own models [1].
Or better: write your software such that you can scale to tens of thousands of concurrent users on a single machine. This can really put the savings into perspective.
swiftcoder [3 hidden]5 mins ago
If you were to read TFA, it is about ML training workloads, not web servers
jongjong [3 hidden]5 mins ago
Well the article starts out with a suggestion that we should all get a data center... It's quite a jump to assume that everyone reading this article needs to train their own LLMs.
macmac_mac [3 hidden]5 mins ago
Chatgpt:
# don’t own the cloud, rent instead
the “build your own datacenter” story is fun (and comma’s setup is undeniably cool), but for most companies it’s a seductive trap: you’ll spend your rarest resource (engineer attention) on watts, humidity, failed disks, supply chains, and “why is this rack hot,” instead of on the product. comma can justify it because their workload is huge and steady, they’re willing to run non-redundant storage, and they’ve built custom GPU boxes and infra around a very specific ML pipeline. ([comma.ai blog][1])
## 1) capex is a tax on flexibility
a datacenter turns “compute” into a big up-front bet: hardware choices, networking choices, facility choices, and a depreciation schedule that does not care about your roadmap. cloud flips that: you pay for what you use, you can experiment cheaply, and you can stop spending the minute a strategy changes. the best feature of renting is that quitting is easy.
## 2) scaling isn’t a vibe, it’s a deadline
real businesses don’t scale smoothly. they spike. they get surprise customers. they do one insane training run. they run a migration. owning means you either overbuild “just in case” (idle metal), or you underbuild and miss the moment. renting means you can burst, use spot/preemptible for the ugly parts, and keep steady stuff on reserved/committed discounts.
## 3) reliability is more than “it’s up most days”
comma explicitly says they keep things simple and don’t need redundancy for ~99% uptime at their scale. ([comma.ai blog][1]) that’s a perfectly valid trade—if your business can tolerate it. many can’t. cloud providers sell multi-zone, multi-region, managed backups, managed databases, and boring compliance checklists because “five nines” isn’t achieved by a couple heroic engineers and a PID loop.
## 4) the hidden cost isn’t power, it’s people
comma spent ~$540k on power in 2025 and runs up to ~450kW, plus all the cooling and facility work. ([comma.ai blog][1]) but the larger, sneakier bill is: on-call load, hiring niche operators, hardware failures, spare parts, procurement, security, audits, vendor management, and the opportunity cost of your best engineers becoming part-time building managers. cloud is expensive, yes—because it bundles labor, expertise, and economies of scale you don’t have.
## 5) “vendor lock-in” is real, but self-lock-in is worse
cloud lock-in is usually optional: you choose proprietary managed services because they’re convenient. if you’re disciplined, you can keep escape hatches: containers, kubernetes, terraform, postgres, object storage abstractions, multi-region backups, and a tested migration plan. owning your datacenter is also lock-in—except the vendor is past you, and the contract is “we can never stop maintaining this.”
## the practical rule
*if you have massive, predictable, always-on utilization, and you want to become good at running infrastructure as a core competency, owning can win.* that’s basically comma’s case. ([comma.ai blog][1])
*otherwise, rent.* buy speed, buy optionality, and keep your team focused on the thing only your company can do.
if you want, tell me your rough workload shape (steady vs spiky, cpu vs gpu, latency needs, compliance), and i’ll give you a blunt “rent / colo / own” recommendation in 5 lines.
And now go do that in another region. Bam, savings gone. /s
What I mean is that I'm assuming the math here works because the primary purpose of the hardware is training models. You don't need 6 or 7 nines for that is what I'm imagining. But when you have customers across geography that use your app hosted on those servers pretty much 24/7 then you can't afford much downtime.
[0] https://lithus.eu, adam@
On top of it all, AWS pricing is about to go up massively due to the RAM price increase. There's no way it won't, since AWS is over half of Amazon's profit while only around 15% of its revenue.
In theory with perfect documentation they’d have a good head start to learn it, but there is always a lot of unwritten knowledge involved in managing an inherited setup.
With AWS the knowledge is at least transferable and you can find people who have worked with that exact thing before.
Engineers also leave for a lot of reasons. Even highly paid engineers go off and retire, change to a job for more novelty, or decide to try starting their own business.
Unfortunately, there are a lot of things in AWS that can also be messed up, so it might be really hard to figure out what is going on. For example, you could have hundreds of Lambdas running without any idea where the original sources are or how they're connected to each other, or complex VPC network routing where some rules and security groups are shared randomly between services, so one small change could degrade a completely different service (you were hired to help with service X, but after your change some service Y went down and you weren't even aware it existed).
"Today, we are going to calculate the power requirements for this rack, rack the equipment, wire power and network up, and learn how to use PXE and iLO to get from zero to operational."
[1] https://xkcd.com/705/
Part of what clouds are selling is experience. A "cloud admin" bootcamp graduate can be a useful "cloud engineer", but it takes some serious years of experience to become a talented on-prem SRE. So it becomes an ouroboros: moving towards clouds makes it easier to move to the clouds.
If by useful you mean "useful at generating revenue for AWS or GCP" then sure, I agree.
These certificates and bootcamps are roughly equivalent to the Cisco CCNA certificate and training courses back in the 90's. That certificate existed to sell more Cisco gear - and Cisco outright admitted this at the time.
That is not true. It takes a lot more than a bootcamp to be useful in this space, unless your definition is to copy-paste some CDK without knowing what it does.
But will the market demand it? AWS just continues to grow.
The number of things that these 24x7 people from AWS will cover for you is small. If your application craps out for any number of reasons that doesn't have anything to do with AWS, that is on you. If your app needs to run 24x7 and it is critical, then you need your own 24x7 person anyway.
Meanwhile AWS breaks once or twice a year.
I've only had one outage I could attribute to running on-prem, meanwhile it's a bit of a joke with the non-IT staff in the office that when "The Internet" (i.e. Cloudflare, Amazon) goes down with news reports etc our own services are all running fine.
I am sure it happens a multitude of ways but I have never seen the case you are describing.
I'll give you an alternative scenario, which IME is more realistic.
I'm a software developer, and I've worked at several companies, big and small and in-between, with poor to abysmal IT/operations. I've introduced and/or advocated cloud at all of them.
The idea that it's "more expensive" is nonsense in these situations. Calculate the cost of the IT/operations incompetence, and the cost of the slowness of getting anything done, and cloud is cheap.
Extremely cheap.
Not only that, it can increase shipping velocity, and enable all kinds of important capabilities that the business otherwise just wouldn't have, or would struggle to implement.
Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers.
This has been my experience as well. There are legitimate points of criticism but every time I’ve seen someone try to make that argument it’s been comparing significantly different levels of service (e.g. a storage comparison equating S3 with tape) or leaving out entire categories of cost like the time someone tried to say their bare metal costs for a two server database cluster was comparable to RDS despite not even having things like power or backups.
> 4) Ends up using a "managed service" to relieve the panic
It's not as though this is unique to cloud.
I've seen multiple managers come in and introduce some SaaS because it fills a gap in their own understanding and abilities. Then when they leave, everyone stops using it and the account is cancelled.
The difference with cloud is that it tends to be more central to the operation, so can't just be canceled when an advocate leaves.
As far as I know, nothing comes close to Aurora functionality. Even in vibecoding world. No, 'apt-get install postgres' is not enough.
What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.
Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed. Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you.
Clustering: you mean read replicas?
Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).
Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).
Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.
What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly designing your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.
tl;dr you can fairly closely replicate the experience of Aurora, but you'll need to know what you're doing. And frankly, if you don't, even if someone built an OSS product that does all of this, you shouldn't be running it in prod - how will you fix issues when they crop up?
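To make the failover recipe above a little more concrete, here is a deliberately minimal sketch of the "monitoring job that checks heartbeats and promotes the standby" idea. The hostnames, data directory and thresholds are hypothetical, and in practice pg_auto_failover (mentioned above) or a tool like Patroni handles fencing and client re-routing far more safely than a hand-rolled loop:

    #!/usr/bin/env python3
    # Minimal heartbeat-and-promote sketch; hosts, paths and thresholds are
    # hypothetical. Real deployments should prefer pg_auto_failover / Patroni.
    import subprocess
    import time

    import psycopg2  # pip install psycopg2-binary

    PRIMARY_DSN = "host=db1.internal dbname=postgres user=monitor connect_timeout=3"
    STANDBY_HOST = "db2.internal"
    PGDATA = "/var/lib/postgresql/16/main"
    FAILURES_BEFORE_PROMOTE = 5

    def primary_is_alive() -> bool:
        try:
            conn = psycopg2.connect(PRIMARY_DSN)
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    return cur.fetchone() == (1,)
            finally:
                conn.close()
        except psycopg2.Error:
            return False

    failures = 0
    while True:
        failures = 0 if primary_is_alive() else failures + 1
        if failures >= FAILURES_BEFORE_PROMOTE:
            # Promote the standby. Repointing clients (floating IP, pgbouncer,
            # DNS) still has to happen separately, as the comment notes.
            subprocess.run(
                ["ssh", STANDBY_HOST, "pg_ctl", "promote", "-D", PGDATA],
                check=True,
            )
            break
        time.sleep(5)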
Nobody doubts one could build something similar to Aurora given enough budget, time, and skills.
But that's not replicating the experience of Aurora. The experience of Aurora is I can have all of that, in like 30 lines of terraform and a few minutes. And then I don't need to worry about managing the zpools, I don't need to ensure the heartbeats are working fine, I don't need to worry about hardware failures (to a large extent), I don't need to drive to multiple different physical locations to set up the hardware, I don't need to worry about handling patching, etc.
You might replicate the features, but you're not replicating the experience.
Managed services have a clear value proposition. I personally think they're grossly overpriced, but I understand the appeal. Asking for that experience but also free / cheap doesn't make any sense.
If ECS is faster, then you're more satisfied with AWS and less likely to migrate. You're also open to additional services that might bring up the spend (e.g. ECS Container Insights or X-Ray)
Source: Former Amazon employee
We used EFS to solve that issue, but it was very awkward, expensive and slow; it's certainly not meant for that.
My biggest gripe with this is async tasks where the app goes through numerous hijinks to avoid a 10-minute Lambda processing timeout. Rather than structuring the process to handle many independent, small batches, or simply using a modest container to do the job in a single shot, a myriad of intermediate steps are introduced to write data to Dynamo/S3/Kinesis plus SQS for coordination.
A dynamically provisioned, serverless container with 24 cores and 64 GB of memory can happily process GBs of data transformations.
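For what it's worth, the "one modest container, many small independent batches" shape described above can be as plain as the sketch below; the input file and transform() are stand-ins for whatever the real job does, so treat it as an illustration rather than anyone's actual pipeline:

    # Process a large file as many small, independent batches in one process,
    # instead of fanning out across Lambdas with S3/SQS coordination between steps.
    from itertools import islice

    def transform(record: str) -> str:
        return record.upper()  # placeholder for the real transformation

    def batches(lines, size=10_000):
        it = iter(lines)
        while batch := list(islice(it, size)):
            yield batch

    with open("input.ndjson") as src, open("output.ndjson", "w") as dst:
        for batch in batches(src):
            # Each batch is independent, so a crash can resume from the last
            # completed batch; no external queue or state store is needed.
            dst.writelines(transform(line) for line in batch)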
Microservices are a killer with cost. For each microservice pod you're often running a bunch of sidecars (Datadog, auth, ingress), and you pay a massive workload-separation overhead in orchestration, management, monitoring and, of course, complexity.
I am just flabbergasted that this is how we operate as a norm in our industry.
If you can keep 4 "Java boxes" fed with work 80%+ of the time, then sure EC2 is a good fit.
We do a lot of batch processing and save money over having EC2 boxes always on. Sure we could probably pinch some more pennies if we managed the EC2 box uptime and figured out mechanisms for load balancing the batches... But that's engineering time we just don't really care to spend when ECS nets us most of the savings advantage and is simple to reason about and use.
You don’t need colocation to save 4x though. Bandwidth pricing is 10x. EC2 is 2-4x especially outside US. EBS for its iops is just bad.
[0] https://carolinacloud.io, derek@
So in practice cloud has become the more expensive option the second your spend goes over the price of 1 engineer.
- 2x Intel Xeon 5218
- 128GB RAM
- 2x960GB SSD
- 30TB monthly bandwidth
I pay around an extra $200/month for "premium" support and Acronis backups, both of which have come in handy, but are probably not necessary. (Automated backups to AWS are actually pretty cheap.) It definitely helps with peace of mind, though.
I have set up encrypted backups to go to my backup server in the office. We have gigabit service at the office. Critical data changes are backed up every hour, with a full backup once a day.
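For anyone wanting to copy that kind of setup, an hourly encrypted push to an office box needs little more than cron and standard tools. A rough sketch, where the paths, host and passphrase file are hypothetical and tar/gpg/rsync are just one reasonable tool choice:

    #!/usr/bin/env python3
    # Archive, encrypt, and ship a data directory to an office backup server.
    # Paths, host and passphrase file are placeholders; run from cron hourly.
    import datetime
    import subprocess

    DATA_DIR = "/srv/critical-data"
    PASSPHRASE_FILE = "/root/backup.passphrase"
    REMOTE = "backup@office-server.example.com:/backups/"

    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M")
    archive = f"/tmp/critical-{stamp}.tar.gz"
    encrypted = archive + ".gpg"

    # 1. Archive the data directory.
    subprocess.run(["tar", "-czf", archive, "-C", DATA_DIR, "."], check=True)

    # 2. Encrypt symmetrically so the office box never stores plaintext.
    subprocess.run(
        ["gpg", "--batch", "--yes", "--symmetric", "--cipher-algo", "AES256",
         "--passphrase-file", PASSPHRASE_FILE, "-o", encrypted, archive],
        check=True,
    )

    # 3. Ship it over SSH to the backup server.
    subprocess.run(["rsync", "-az", encrypted, REMOTE], check=True)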
I see it from the other direction, when if something fails, I have complete access to everything, meaning that I have a chance of fixing it. That's down to hardware even. Things get abstracted away, hidden behind APIs and data lives beyond my reach, when I run stuff in the cloud.
Security and regular mistakes are much the same in the cloud, but I then have to layer whatever complications the cloud provider comes with on top. The cost has to be much, much lower if I'm going to trust a cloud provider over running something in my own data center.
We figured, "Okay, if we can do this well, reliably, and de-risk it; then we can offer that as a service and just split the difference on the cost savings"
(plus we include engineering time proportional to cluster size, and also do the migration on our own dime as part of the de-risking)
Expect a significant exit expense, though, especially if you are shifting large volumes of S3 data. That's been our biggest expense. I've moved this to Wasabi at about 8 euros a month (vs about $70-80 a month on S3), but I've paid transit fees of about $180 - and it was more expensive because I used DataSync.
Retrospectively, I should have just DIYed the transfer, but maybe others can benefit from my error...
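For reference, the DIY route is mostly a list-and-copy loop between two S3-compatible endpoints. A rough boto3 sketch, in which the bucket names, credentials and Wasabi endpoint are placeholders you would need to adapt:

    # Copy every object from an AWS S3 bucket to an S3-compatible target (e.g. Wasabi).
    # Fine for modest object sizes; very large objects would want streaming/multipart.
    import io

    import boto3  # pip install boto3

    src = boto3.client("s3")  # AWS credentials from the usual env/config
    dst = boto3.client(
        "s3",
        endpoint_url="https://s3.eu-central-1.wasabisys.com",  # check your region's endpoint
        aws_access_key_id="WASABI_KEY",
        aws_secret_access_key="WASABI_SECRET",
    )

    SRC_BUCKET, DST_BUCKET = "my-aws-bucket", "my-wasabi-bucket"

    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            buf = io.BytesIO()
            src.download_fileobj(SRC_BUCKET, key, buf)
            buf.seek(0)
            dst.upload_fileobj(buf, DST_BUCKET, key)
            print("copied", key)

Run it from an EC2 instance near the source bucket if you can; the egress still costs money, but you avoid paying for DataSync on top.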
https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
But. Don't leave it until the last minute to talk to them about this. They don't make it easy, and require some warning (think months, IIRC)
Hopefully someone else will benefit from this helpful advice.
Out of interest, how old are you? This was a quite normal expectation of a technical department even 15 years ago.
It’s not rocket science, especially when you’re talking about small amounts of data (small credit union systems in my example).
I find it equally disingenuous to suggest that Heroku was only for startups with lavish budgets. Absolutely not true. That’s my only purpose here. Everyone has different experiences but don’t go and push your own narrative as the only one especially when it’s not true.
Even at their peak, Heroku was a niche. If you'd gone to conferences like WWDC or PyCon at the time, they'd be well represented, yes, and plenty of people liked them, but it wasn't a secret that they didn't cover everyone's needs or that pricing was off-putting for many people. That tended to get worse the bigger the company you talked to, because larger organizations have more complex needs, and they use enough stuff that they already have teams of people with those skills.
The world's a lot bigger than startups
Your original statement is factually incorrect.
It's 2026 and banks are still running their mainframe, running windows VMs on VMware and building their enterprise software with Java.
The big boys still have their own datacenters they own.
Sure, they try dabbling with cloud services, and maybe they've pushed their edge out there, and some minor services they can afford to experiment with.
See, turning up a VM, installing and running Postgres is easy.
The hard part is keeping it updated, keeping the OS updated, automating backups, deploying replicas, encrypting the volumes and the backups, demonstrating all of the above to a third-party auditor... and mind that there are probably many other things I'm not even aware of!
I'm not saying I won't go that path, it might be a good idea after a certain scale, but in the first and second year of a startup your mind should 100% be on "How can I make my customer happy" rather than "We failed again the audit, we won't have the SOC 2 Type I certification in time to sign that new customer".
If deciding between Hetzner and AWS was so easy, one of them might not be pricing its services correctly.
Also, just the availability of these things on AWS has been a real pain - I think every startup got a lot of credits there, so there's a flood of people trying to use them.
Take two equivalent machines, set up with streaming replication exactly as described in the documentation, add Bacula for backups to an off-site location for point-in-time recovery.
We haven't felt the need to set up auto fail-over to the hot spare; that would take some extra effort (and is included with AWS equivalents?) but nothing I'd be scared of.
Add monitoring that the DB servers are working, replication is up-to-date and the backups are working.
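That monitoring can stay very small. A sketch of the three checks (standby still in recovery, replication lag, backup freshness), with hypothetical hosts, paths and thresholds, and alert() standing in for whatever paging you already use:

    #!/usr/bin/env python3
    # Check that the hot spare is replaying WAL, that it isn't too far behind,
    # and that a recent backup exists. Hosts, paths and thresholds are hypothetical.
    import os
    import time

    import psycopg2  # pip install psycopg2-binary

    REPLICA_DSN = "host=db-replica.internal dbname=postgres user=monitor"
    BACKUP_DIR = "/backups/latest"
    MAX_LAG_SECONDS = 60
    MAX_BACKUP_AGE_SECONDS = 26 * 3600  # daily full backup, plus some slack

    def alert(msg: str) -> None:
        print("ALERT:", msg)  # stand-in for email/Slack/pager

    conn = psycopg2.connect(REPLICA_DSN)
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery()")
        if not cur.fetchone()[0]:
            alert("replica is not in recovery (replication broken, or it was promoted?)")
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        lag = cur.fetchone()[0]
        # Note: this looks large on an idle primary, since it measures the last
        # replayed transaction; a periodic no-op write keeps the number meaningful.
        if lag is not None and lag > MAX_LAG_SECONDS:
            alert(f"replication lag is {lag:.0f}s")
    conn.close()

    newest = max(
        (os.path.getmtime(os.path.join(BACKUP_DIR, f)) for f in os.listdir(BACKUP_DIR)),
        default=0,
    )
    if time.time() - newest > MAX_BACKUP_AGE_SECONDS:
        alert("no recent backup found in " + BACKUP_DIR)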
Same here. But, I assume you have managed PostgreSQL in the past. I have. There are a large number of software devs who have not. For them, it is not a low-complexity task. And I can understand that.
I am a software dev for our small org and I run the servers and services we need. I use ansible and terraform to automate as much as I can. And recently I have added LLMs to the mix. If something goes wrong, I ask Claude to use the ansible and terraform skills that I created for it, to find out what is going on. It is surprisingly good at this. Similarly I use LLMs to create new services or change configuration on existing ones. I review the changes before they are applied, but this process greatly simplifies service management.
I'd say needing to read the documentation for the first time is what bumps it up from low complexity to medium. And then at medium you should still do it if there's a significant cost difference.
I think if it were true that the tuning is easier if you run the infrastructure yourself, then this would be a good point. But in my experience, this isn't the case for a couple reasons. First of all, the majority of tuning wins (indexes, etc.) are not on the infrastructure side, so it's not a big win to run it yourself. But then also, the professionals working at a managed DB vendor are better at doing the kind of tuning that is useful on the infra side.
With a managed solution, all of that is amortized into your monthly payment, and you're sharing the cost of it across all the customers of the provider of the managed offering.
Personally, I would rather focus on things that are in or at least closer to the core competency of our business, and hire out this kind of thing.
This part is actually the scariest, since there are like 10 different 3rd-party solutions of unknown stability and maintainability.
The flip side is that compliance is a little more involved. Rather than, say, carve out a whole swathe of SOC-2 ops, I have to coordinate some controls. It's not a lot, and it's still a lot lighter than I used to do 10+ years ago. Just something to consider.
There is a world of difference between renting some cabinets in an Equinix datacenter and operating your own.
5 - Datacenter (DC) - Like 4, except also take control of the space/power/HVAC/transit/security side of the equation. Makes sense either at scale, or if you have specific needs. Specific needs could be: specific location, reliability (higher or lower than a DC), resilience (conflict planning).
There are actually some really interesting use cases here. For example, reliability: If your company is in a physical office, how strong is the need to run your internal systems in a data centre? If you run your servers in your office, then there are no connectivity reliability concerns. If the power goes out, then the power is out to your staff's computers anyway (still get a UPS though).
Or perhaps you don't need as high reliability if you're doing only batch workloads? Do you need to pay the premium for redundant network connections and power supplies?
If you want your company to still function in the event of some kind of military conflict, do you really want to rely on fibre optic lines between your office and the data center? Do you want to keep all your infrastructure in such a high-value target?
I think this is one of the more interesting areas to think about, at least for me!
Offices are usually very expensive real estate in city centers, with very limited cooling capabilities.
Then again the US is a different place, they don't have cities like in Europe (bar NYC).
Thank goodness we did all the capex before the OpenAI RAM deal, back when expensive Nvidia GPUs were the worst we had to deal with.
Is it still the cheapest after you take into account that skills, scale, cap-ex and long term lock-in also have opportunity costs?
You can get locked into cloud too.
The lock in is not really long term as it is an easy option to migrate off.
It sounds like they probably have revenue in the €500mm range today. And given that the bare metal cost of AWS-equivalent bills tends to be a 90% reduction, we'll say a €10mm+ bare metal cost.
So I would say a cautious and qualified "yes". But I know even for smaller deployments of tens or hundreds of servers, they'll ask you what the purpose is. If you say something like "blockchain," they're going to say, "Actually, we prefer not to have your business."
I get the strong impression that while they naturally do want business, they also aren't going to take a huge amount of risk on board themselves. Their specialism is optimising on cost, which naturally has to involve avoiding or mitigating risk. I'm sure there'd be business terms to discuss, put it that way.
(While we’re all speculating)
I wouldn't be surprised if mining is also associated with fraud (e.g. using stolen credit cards to buy compute).
Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (single-buyer) spend more than that with AWS?
Hetzner's revenue is somewhere around $400m, so it's probably a little scary taking on an additional 30% of revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than them.
Sometimes, if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on a Netflix-sized customer, for nontechnical reasons.
However, increasing pop capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea
I'm not convinced.
I assume someone at Netflix has thought about this, because if that were true and as simple as you say, Netflix would simply just buy Hetzner.
I think there are lots of reasons you could have this experience, and it still wouldn't be Netflix's experience.
For one, big applications tend to get discounts. A decade ago, when I (the company I was working for) was paying Amazon a mere $0.2M a month, I was getting much better prices from my account manager than were posted on the website.
There are other reasons (mostly from my own experiences pricing/costing big applications, but also due to some exotic/unusual Amazon features I'm sure Netflix depends on) but this is probably big enough: Volume gets discounts, and at Netflix-size I would expect spectacular discounts.
I do not think we can estimate the factor better than 1.5-2x without a really good example/case-study of a company someplace in-between: How big are the companies you're thinking about? If they're not spending at least $5m a month I doubt the figures would be indicative of the kind of savings Netflix could expect.
When I used to compare to AWS, egress alone at list price cost as much as my entire infra hosting. All of it.
I would be very interested to understand why Netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising the costs of creating original content, rather than the cloud bill.
A little scary for both sides.
Unless we're misunderstanding something I think the $100Ms figure is hard to consider in a vacuum.
I’m not surprised, but you’d think there would be some point where they would decide to build a data center of their own. It’s a mature enough company.
If you're willing to share, I'm curious who else you would describe as being in this space.
My last decade and a half or so of experience has all been in cloud services, and prior to that it was #3 or #4. What was striking to me when I went to the Lithus website was that I couldn't figure out any details without hitting a "Schedule a Call" button. This makes it difficult for me to map my experiences in using cloud services onto what Lithus offers. Can I use Terraform? How does the Kubernetes offering work? How do the ML/AI data pipelines work? To me, it would be nice if I could try it out in a very limited way as self-service, or at least read some technical documentation. Without that, I'm left wondering how it works. I'm sure this is a conscious decision to not do this, and for good reasons, but I thought I'd share my impressions!
We're not really that kind of product company; we're more of a services company. What we do is deploy Kubernetes clusters onto bare metal servers. That's the core technical offering. However, everything beyond that is somewhat per-client. Some clients need a lot of compute. Some clients need a custom object storage cluster. Some clients need a lot of high-speed internal networking. Which is why we prefer to have a call to figure out specifically what your needs are. But I can also see how this isn't necessarily satisfying if you're used to just grabbing the API docs and having a look around.
What we will do is take your company's software stack, migrate it off AWS/Azure/Google, and deploy it onto our new infrastructure. We will then become (or work with) your DevOps team to support you. This can be anything from containerising workloads to diagnosing performance issues to deploying a new multi-region Postgres cluster. Whatever you need done on your hardware that we feel we can reasonably support. We are the ones on call should NATS fall over at 4am.
Your team also has full access to the Kubernetes cluster to deploy to as you wish.
I think the pricing page is the most concrete thing on our website, and it is entirely accurate. If you were to phone us and say, "I want that exact hardware," we would do it for you. But the real value we also offer is in the DevOps support we provide, actually doing the migration up-front (at our own cost), and being there working with your team every week.
In my current job, I think we're honestly a bit past the phase where I would want to take on a migration to a service like yours. We already have a good team of infrastructure folks running our cloud infrastructure, and we have accepted the lock-in of various AWS managed services. So the high-touch devops support doesn't sound that useful to me (we already have people who are good at this), and replacing all the locked-in components seems unlikely to have good ROI. I think we'd be more likely to go straight to #3 if we decided to take that on to save money.
But I'll probably be a founder or early employee at a new startup again someday, and I'm intrigued by your offering from that perspective. But it seems pretty clear to me that I shouldn't call you up on day 1, because I'm going to be nowhere near $5k a month, and I want to move faster than calling someone up to talk about my needs. I want to self-serve a small amount of usage, and cloud services seem really great for that. But this is how they get you! Once you've started with a particular cloud service, it's always easiest to take on more lock-in.
At some point between these two situations, though, I can see where your offering would be great. But the decision point isn't all that clear to me. In my experience, by the time you start looking at your AWS bill and thinking "crap, that seems pretty expensive", you have better things to do than an infrastructure migration, and you have taken on some lock-in.
I do like the idea of high-touch services to solve the breaking-the-lock-in challenge! I'll certainly keep this in mind next time I find myself in this middle ground where the cloud starts feeling more expensive than it's worth, but we don't want to go straight to #3.
Unfortunately, (successful) startups can quickly get trapped in this option. If they're growing fast, everyone on the board will ask why you'd move to another option in the first place. The cloud becomes a very deep local minimum that's hard to get out of.
It works because bare metal is about 10% the cost of cloud, and our value-add is in 1) creating a resilient platform on top of that, 2) supporting it, 3) being on-call, and 4) being or supporting your DevOps team.
This starts with us providing a Kubernetes cluster which we manage, but we also take responsibility for the services run on it. If you want Postgres, Redis, Clickhouse, NATS, etc, we'll deploy it and be SLA-on-call for any issues.
If you don't want to deal with Kubernetes then you don't have to. Just have your software engineers hand us the software and we'll handle deployment.
Everything is deployed on open source tooling, you have access to all the configuration for the services we deploy. You have server root access. If you want to leave you can do.
Our customers have full root access, and our engineers (myself included) are in a Slack channel with your engineers.
And, FWIW, it doesn't have to be Hetzner. We can colocate or use other providers, but Hetzner offer excellent bang-per-buck.
Edit: And all this is included in the cluster price, which comes out cheaper than the same hardware on the major cloud providers
You're a brave DevOps team. That would cause a lot of friction in my experience, since people with root or other administrative privileges do naughty things, but others are getting called in on Saturday afternoon.
We rent hardware and also some VPS, as well as use AWS for cheap things such as S3 fronted with Cloudflare, and SES for priority emails.
We have other services we pay for, such as AI content detection, disposable email detection, a small postal email server, and more.
We're only a small business, so having predictable monthly costs is vital.
Our servers are far from maxed out, and we process ~4 million dynamic page and API requests per day.
https://docs.hetzner.com/cloud/technical-details/faq/#what-k...
> Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills
back then this type of "skill" was abundant. You could easily get sysadmin contractors who would take a drive down to the data center (probably rented facilities in real estate that belonged to a bank or an insurance company) to exchange some disks that had died for some reason. Such a person was full stack in the sense that they covered backups, networking, firewalls, and knew how to source hardware.
the argument was that this was too expensive and the cloud was better. so hundreds of thousands of SME's embraced the cloud - most of them never needed Google-type of scale, but got sucked into the "recurring revenue" grift that is SaaS.
If you opposed this mentality you were basically saying "we as a company will never scale this much" which was at best "toxic" and at worst "career-ending".
The thing is these ancient skills still exist. And most orgs simply do not need AWS type of scale. European orgs would do well to revisit these basic ideas. And Hetzner or Lithus would be a much more natural (and honest) fit for these companies.
Even some really old (2000s-era) junk I found in a cupboard at work was all hot-swap drives.
But more realistically in this case, you tell the data centre "remote hands" person that a new HDD will arrive next-day from Dell, and it's to go in server XYZ in rack V-U at drive position T. This may well be a free service, assuming normal failure rates.
Remote hands is a thing indeed. Servers also tend to be mostly pre-built nowadays by server retailers, even when buying more custom-made ones like Supermicro where you pick each component. There aren't that many parts to a generic server purchase: it's a chassis, motherboard, CPU, memory, and disks. The PSU tends to be determined by the motherboard/chassis choice, same with disk backplanes/RAID/IPMI/network/cables/ventilation/shrouds. The biggest work is in making the correct purchase, not in the assembly. Once delivered, you put on the rails, install any additional items not pre-built, put it in the rack and plug in the cables.
It baffles me that my career trajectory somehow managed to insulate me from ever having to deal with the cloud, while such esoteric skills as swapping a hot swap disk or racking and cabling a new blade chassis are apparently on the order of finding a COBOL developer now. Really?
I can promise you that large financial institutions still have datacenters. Many, many, many datacenters!
Software development isn't a typical SME however. Mike's Fish and Chips will not buy a server and that's fine.
plus, infra flexibility removes random constraints that e.g. Cloudflare Workers have
Reality is these days you just boot a basic image that runs containers
[0] Longer list here: https://github.com/alexellis/awesome-baremetal
The argument made 2 decades ago was that you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex). The rationale was you exchange ownership for rent. Make your headache someone else's headache.
The ping pong between centralized vs decentralized, owned vs rented, will just keep going. It's never an either or, but when companies make it all-or-nothing then you have to really examine the specifics.
The Cloud providers made a lot of sense to finance departments since aside from the promised savings, you would take that cloud expense now and lower your tax rate.
After the passing of the One Beautiful Bill ("OBB"), the law allows you to accelerate CapEx to instead expense it[1], similar to the benefit given by cloud service providers.
This puts way more wind in the sails of the on-prem movement, for sure
[1] https://www.iqxbusiness.com/big-beautiful-bill-impact-on-cap...
That was part of the reason.
The real reason was the internal infrastructure team in many orgs got nowhere. There was a huge queue and many teams instead had to find infinite workarounds including standing up their own. The "cloud" provided a standardized way to at least deal with this mess e.g. single source of billing.
> A 1990s VP of IT would look at this post and say, what's new?
Speed. The US lives in luxury but outside of that it often takes a LONG time to get proper servers. You don't just go online. There are many places where you have to talk to a vendor with no list price and the drama continues. Being out of capacity can mean weeks to months before you get anywhere.
All teams will henceforth expose their data and functionality through service interfaces
https://gist.github.com/chitchcock/1281611
Oh man, this is bad advice. Airborne humidity and contaminants will KILL your servers on a very short horizon in most places - even San Diego. I highly suggest enthalpy wheel coolers (KyotoCooling is one vendor - Switch runs very similar units in their massive datacenters in the Nevada desert) as they remove the heat from the indoor air using outdoor air (and can boost slightly with an integrated refrigeration unit to hit target intake temps) without exchanging the air from one side to the other. This has huge benefits for air quality control and outdoor air tolerance, and a single 500 kW heat rejection unit uses only 25 kW of input power (when it needs to boost the AC unit's output). You can combine this with evaporative cooling on the exterior intakes to lower the temps even further at the expense of some water consumption (typically far cheaper than the extra electricity to boost the cooling through an HVAC cycle).
Not knocking the achievement, just speaking from experience: taking outdoor air (even filtered and mixed) into a datacenter is a recipe for hardware failure, and the mean time to failure is highly dependent on your outdoor air conditions. I've run 3 MW facilities with passive air cooling, and taking outdoor air directly into servers requires a LOT more conditioning and consideration than is outlined in this article.
Likewise, the impact on server longevity is not a hard boundary but rather an "exposure over time" gradient: exceeding the "low risk" boundary (>-12°C/10°F dew point or >15°C/59°F dry bulb temp) results in a lower MTBF than design. This is defined by ASHRAE TC 9.9, which server equipment manufacturers conform and build to. This means that if you're running your servers above the high-risk curve for humidity and temperature, you're shortening their life considerably compared to the low-risk curve.
Generally, 15% RH is considered suboptimal and can be dangerous near freezing temperatures - in San Diego in January there were several 90%+ RH scenarios that would have been dangerous for servers even when mixed down with warm exhaust air - furthermore, the outdoor air at 76°F during that period means you have limited capacity to mix in warm exhaust air (which, by the way, came from that same 99% RH input air) without getting into higher-than-ideal intake temps.
Any dew point above 62.5°F is considered high risk for servers, as are any intake temps exceeding 32°C/90°F. You want to be at the midpoint between those and 16°C/65°F temps & -12°C/10°F dew point to have no impact on server longevity or MTBF rates.
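If you want to sanity-check those figures yourself, the Magnus approximation is close enough for this purpose. A small sketch (my own illustration, using the 76°F / 90% RH San Diego conditions mentioned above):

    # Dew point via the Magnus approximation (a = 17.62, b = 243.12 °C are the
    # standard constants). The 76 °F / 90 % RH inputs are the San Diego figures
    # from the comment above.
    import math

    def dew_point_c(temp_c: float, rh_percent: float) -> float:
        a, b = 17.62, 243.12
        gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
        return b * gamma / (a - gamma)

    def f_to_c(f: float) -> float:
        return (f - 32) * 5 / 9

    def c_to_f(c: float) -> float:
        return c * 9 / 5 + 32

    print(f"dew point ~{c_to_f(dew_point_c(f_to_c(76), 90)):.1f} °F")

That works out to a dew point of roughly 72.8°F, comfortably past the 62.5°F high-risk line quoted above, which is exactly the commenter's point about not feeding that air to servers unconditioned.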
As a recent example:
Lastly, air contaminants - in the form of dust (which can be filtered out) and chemicals (which can't be, without extensive scrubbing) - are probably the most detrimental to server equipment if not properly managed, and they require very intentional and frequent filter changes (typically high-MERV pleated filters, changed on a time or pressure-drop signal) to prevent server degradation and equipment risks.
The last consideration is fire suppression - permitted datacenters usually require compliance with a separate fire code, such that direct outdoor air exchange without active shutdown and dry suppression is not permitted - this is to prevent a scenario where your equipment catches fire and a constant supply of fresh, oxygen-rich outdoor air turns that into an inferno. Smoke detection systems don't operate well with outdoor-mixed air or any level of airborne particulates.
So - for those reasons - among a few others - open air datacenters are not recommended unless you're doing them at google or meta scale, and in those scenarios you typically have much more extensive systems and purpose-designed hardware in order to operate for the design life of the equipment without issues.
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room at headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and on-prem sometimes pays for itself in a matter of months. And who cares if the server room is a total loss after a while - worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
Another in-between option is colocation, where you put hardware you own in a managed data center. It’s a bit old fashioned, but it may make sense in some cases.
I can also mention that research HPC may be worth considering. In research, we have some of the world’s fastest computers at a fraction of the cost of cloud computing. It’s great as long as you don’t mind not being root and having to use Slurm.
I don’t know about the USA, but in Norway you can run your private company’s Slurm AI workloads on research HPC systems, though you will pay quite a bit more than universities and research institutions do. But you can also have research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
I worked at a company with two server farms (essentially a main one and a backup) located in two different regions of Italy, and we had a total of 5 employees taking care of them.
We didn't hear about them, we didn't know their names, but we had almost 100% uptime and terrific performance.
There was a single person out of 40 developers whose main responsibility was deploys, and that's it.
It cost my company 800k euros per year to run both server farms (hardware, salaries, energy), and it spared the company around 7-8M in cloud costs.
Now I work for clients that spend multiple millions on cloud for a fraction of the output and traffic, and who employ, I think, 15+ DevOps engineers.
Running full-scale Kubernetes, with multiple databases and services and an expected 99.99% uptime, likely can't be handled by one person.
Why do so many developers and sysadmins think they're not competent enough to host services? It is a lot easier than you think, and it's also fun to solve the technical issues you may have.
If you want true reliability, you need redundant physical locations, power, networking. That’s extremely easy to achieve on cloud providers.
It doesn't make sense if you only have a few servers, but if you are renting the equivalent of multiple racks of servers from the cloud and run them for most of the day, on-prem is staggeringly cheaper.
We have a few racks and we do the "move to cloud" calculation every few years, and without fail it comes out at least 3x the cost.
And before the "but you need to do more work" whining I hear from people that never did it - it's not much more work than navigating the forest of cloud APIs and dealing with random black-box issues in the cloud that you can't really debug, only work around.
On cloud it's out of your control when an AZ goes down. When it's your server you can do things to increase reliability. Most colos have redundant power feeds and internet. On prem that's a bit harder, but you can buy a UPS.
If your head office is hit by a meteor your business is over. Don't need to prepare for that.
It is a different skillset. SRE is also under-valued and under-paid (unless one is in FAANG).
It’s also nontrivial once you go past some level of complexity and volume. I have made my career out of building software, and part of that requires understanding the limitations and specifics of the underlying hardware, but at the end of the day I simply want to provision and run a container. I don’t want to think about the security and networking setup; it’s not worth my time.
Because those services solve the problem for them. It is the same thing with GitHub.
However, as predicted half a decade ago with GitHub becoming unreliable [0], and as price increases begin to happen, you can see that self-hosting starts to make more sense: you get complete control of the infrastructure, it has never been easier to self-host, and it brings costs back under your control.
> its also fun to solve technical issues you may have.
What you have just seen there is what coding agents are going to do to "developers": a decline in skills the moment they become over-reliant on the agents, until they can't write a single line of code to fix a problem they don't fully understand.
[0] https://news.ycombinator.com/item?id=22867803
I agree that solving technical issues is very fun, and hosting services is usually easy, but having resilient infrastructure is costly and I simply don't like to be woken up at night to fix stuff while the company is bleeding money and customers.
Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.
We also rely on GitHub. It has historically been a good service, but it's getting worse.
(hardware engineer trying to understand wtaf software people are saying when they speak)
When I'm launching a project it's easier for me to rent $250 worth of compute from AWS. When the project consumes $30k a month, it's easier for me to rent a colocation.
My point is that a good engineer should know how to calculate all the ups and downs here to propose a sound plan to the management. That's the winning thing.
In 99.999999% of cases management has already decided and is just informing you, because they know better.
Perhaps there are exceptions (but so far, I've never encountered the situation you describe).
Now on-prem is cool again.
Makes me wonder whether we’re already setting up the next cycle 10 years from now, when everyone rediscovers why cloud was attractive in the first place and starts saying “on-prem is a bad idea” again.
My entire career I’ve encountered people passionately pushing for on-prem and railing against anything cloud. I can’t remember a time when Hacker News comments leaned pro-cloud because it’s always been about self-hosting.
The few times the on-prem people won out in my career never went exactly as they imagined. Buying a couple servers and setting them up at the colo is easy enough, but the slow and steady drag of maintaining your own infrastructure starts to work its way into every development cycle after that. In my experience, every team has significantly underestimated how all the little things add up to a drag on available time for other work.
The best case for on-prem that I saw was when a company was basically in maintenance mode. Engineers had a lot of extra time to optimize, update, maintain, and cost-reduce without subtracting from feature development or bug fixes.
The worst cases for on-prem I’ve seen have been funded startups. In this situation it’s imperative that everyone focus on feature development and rapid iteration. Letting some of the engineers get sidetracked with setting up and maintaining their own hosting to save a dollar amount that barely hires 1-2 more engineers but sets the schedule back by many months was a huge mistake.
In my experience, most engineers become less enchanted with rolling their own on premises hosting as they get older. Their work becomes more about getting the job done quickly and to budget, not hyper-optimizing the hosting situation at the expense of inviting more complexity and miscellaneous tasks into their workload.
This is cyclical and I see the main axis of contention as centralized vs de-centralized computing.
Mainframes (network) gave way to mini and microcomputers (PCs). PCs gave way to server farms and web-based applications. Private servers and data centers gave way to the Cloud. Edge computing is again a push towards a more decentralized model.
Like all good engineering problems, where data and applications are hosted involve tradeoffs. Priorities change. Technologies change. But oftentimes, what works in one generation doesn't in the next. Part of it is the slow march of progress. But I think some of it is just not wanting to use your parent's technology stack and wanting to build your own.
The cloud vs. on-prem tradeoff is one of flexibility, capacity, maintenance, and capex vs opex.
It's a similar story in application development. At one point, we're navigating text forms on a mainframe, the next it's a GUI local application, followed by Electron or Web applications with remote data. We'll cycle back to local-first data (likely on-phone local models).
When you start to hear about the network being the computer again, you'll know we've started to swing back the other way again.
That's pretty much the dogma of the 2010s.
It doesn't matter that my org runs a line-of-business datacentre that is a fraction of the cost of public cloud. It doesn't matter that my "big" ERP and admin servers take up half a rack in that datacentre. MBA dogma says that I need to fire every graybeard sysadmin, raze our datacentre facility to the ground, and move to AWS.
Fun fact, salaries and hardware purchases typically track inflation, because switching cost for hardware is nil and hiring isn't that expensive. Whereas software is usually 5-10% increases every year because they know that vendor lock-in and switching costs for software are expensive.
AWS has redundant data centres across the world and within each region. A file in S3 will never be lost, even if you store it for a thousand years.
What happens if your city has a tornado and your data centre gets hit? Is your company now dead?
And how much do you spend on all these sysadmins? 200k each? If you’re saving 20k/month by paying 100k/month in salaries, you aren’t saving anything.
Pains I faced running BIG clusters on-prem.
1. Supply chain Management -- everything from power supplies all the way to GPUs and storage has to be procured, shipped, disassembled and installed. You need labor pool and dedicated management.
2. Inventory Management -- You also need to manage inventory on hand for parts that WILL fail. You can expect 20% of your cluster to have some degree of issues on an ongoing basis
3. Networking and security -- You are on your own defending your network or have to pay a ton of money to vendors to come in and help you. Even with the simplest of storage clusters, we've had to deal with pretty sophisticated attacks.
When I ran massive clusters, I had a large team dealing with these. Obviously, with PaaS, you don't need anyone.
I have had a similar transformation. I still host non-critical services on-prem. They are exceptionally cheap to run. Everything else, I host it on Hetzner.
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about hardware but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage. The cost to optimize is that cost. The hosting cost usually is a rounding error on the staffing cost. And on top of that the amount of responsibilities increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non zero cost.
The right mindset for hosting cost is to think of it in FTEs (full time employee cost for a year). If it's below 1 (most startups until they are well into scale up territory), you are doing great. Most of the optimizations you are going to get are going to cost you in actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting. Think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me. I literally have more valuable things to do.
This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
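For anyone who wants to run the same sanity check on their own numbers, here's a tiny back-of-envelope script in the spirit of the FTE framing above. Every figure in it is a placeholder you'd replace with your own; nothing here comes from the parent's actual books.

```python
# Back-of-envelope "hosting cost in FTEs" check. All numbers are illustrative
# placeholders, not anyone's real figures.
FTE_COST_PER_YEAR = 150_000      # fully loaded cost of one ops/dev person
BILLABLE_HOURS_PER_YEAR = 1_800

hosting_cost_per_month = 10_000  # current cloud bill
migration_hours = 400            # one-off engineering time to move/optimize
expected_savings_fraction = 0.5  # assume the move halves the bill

hosting_in_ftes = hosting_cost_per_month * 12 / FTE_COST_PER_YEAR
yearly_savings = hosting_cost_per_month * 12 * expected_savings_fraction
migration_cost = migration_hours * FTE_COST_PER_YEAR / BILLABLE_HOURS_PER_YEAR

print(f"hosting today: {hosting_in_ftes:.2f} FTE-equivalents/year")
print(f"payback time:  {migration_cost / yearly_savings * 12:.1f} months")
```

With the ~1K/month bill mentioned above and the same assumed migration effort, the payback stretches out past five years - which is exactly the "not worth my time" point.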
You rent datacenter space, which is OPEX not CAPEX, and you just lease the servers, which turns a big CAPEX outlay into a monthly OPEX bill.
Running your own DC is a "we have two dozen racks of servers" endeavour, but even just renting DC space and buying servers is much cheaper than getting the same level of performance from the cloud.
> This flips when you start getting into the multiple FTEs per month in cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTE in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for modest amount of extra staffing FTEs and make net gains.
YOU NEED THOSE PEOPLE TO MANAGE THE CLOUD TOO. That's what always gets ignored in these calculations. People go "oh, but we really need like 2-3 ops people to cover the datacenter and have shifts on the on-call rota", but you need the same thing for cloud too - it's just dumped on the programmers/DevOps folks in the team rather than having separate staff.
We have a few racks, and the hardware-related part is a small share of the total workload; most of it is the same as what we would do (and do, for a few cloud customers) in the cloud - writing manifests for automation.
Finally, some sense! "Cloud" was meant to make ops jobs disappear, but it just increased our salaries by turning us into "DevOps Engineers", while the company's hosting bill increased fivefold in the process. You will never convince even 1% of devs to learn the ops side properly, therefore you'll still end up hiring ops people, and we will cost you more now. On top of that, everyone that started as a "DevOps Engineer" knows less about ops than those that started as ops and transitioned into being "DevOps Engineers" (or some flavour of it like SREs or Platform Engineers).
If you're a programmer scared into thinking AI is going to take away your job, re-read my comment.
Just database management is a pretty specialized skill, separate from development or from optimizing the structure of said data... For a lot of SaaS providers, if you aren't at a point where you can afford dedicated DBA/ops staff just for data, that's one reason you might lean on cloud or hybrid operations just for DBMS management, security, and backups. This is low-hanging fruit in terms of cloud offerings, even... but it can shift a lot of burden in terms of operational overhead.
Again, depending on your business and data models.
But it is significantly cheaper and faster
As a hearsay anecdote, that's why some startups have DB servers with hundreds of GB of RAM and dozens of CPUs to run a workload that could be served from a 5-year-old laptop.
Once they are up and running that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.
OTOH you are specifically ignoring that you'll require mostly the same time from a cloud trained person if you're all-in on AWS.
I expect the marginal cost of one employee over the other is zero.
You should also calculate the cost of getting it up and running. With Google Cloud (I don't actually use AWS), I mainly worry about building docker containers in CI and deploying them to vms and triggering rolling restarts as those get replaced with new ones. I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
And then there are production setups. You need people on standby to fix the server in case of hardware issues - also outside office hours. Also, where does the hardware live? What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where? Do you pay for security for wherever all that happens? What about cleaning, AC, or a special server room in your building? All that stuff is cost. Some of it is upfront cost. Some of it is recurring cost.
The article is about a company that owns its own data center. The cost they are citing (5 million) is substantial and probably a bit more complete. That's one end of the spectrum.
> I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
These are not difficult problems. You can use the same/similar cloud install images.
A 10 year old nerd can install Linux on a computer; if you're a professional developer I'm sure you can read the documentation and automate that.
> And for production setups. You need people on stand by to fix the server in case of hardware issues; also outside office hours.
You could use the same person who is on standby to fix the cloud system if that has some failure.
> Also, where does the hardware live?
In rented rackspace nearby, and/or in other locations if you need more redundancy.
> What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where?
It will probably report the hardware failure to Dell/HP/etc automatically and open a case. Email or phone to confirm, the part will be sent overnight, and you can either install it yourself (very, very easy for things like failed disks) or ask a technician to do it (I only did this once with a CPU failure on a brand new server). Dell/HP/etc will provide the technician, or your rented datacentre space will have one for simpler tasks like disks.
The installation itself was handled by the vendor and datacenter. For hard drive failures, our vendor (who provided the warranty) shipped a drive and had a technician drive to the site. We had to 1. tell the datacenter to expect the package and let the tech in, and 2. be online to run the command to blink the lights on the drive that needed replacing and then verify that the drive came online. This 6-company dance (us, vendor, DC, tech, fedex, HDD manufacturer) was more annoying than just terminating an EC2 instance and recreating it (or having EBS handle drive failures behind the scenes) but it wasn't that bad in the grand scheme of things.
It is sad that the knowledge of how easy it really is, is getting extinct. The cloud and SaaS companies benefit greatly.
I was not doing the calculation. I was only pointing out that it was not as simple as you make it out to be.
Okay, a few other things that aren't in most calculations:
1. Looking at jobs postings in my area, the highest paid ones require experience with specific cloud vendors. The FTEs you need to "manage" the cloud are a great deal more expensive than developers.
2. You don't need to compare an on-prem data center with AWS - you can rent a pretty beefy VPS or colocate for a fraction of the cost of AWS (or GCP, or Azure) services. You're comparing the most expensive alternative when avoiding cloud services, not the most typical.
3. Even if you do want to build your own on-prem rack, FTEs aren't generally paid extra for being on the standby rota. You aren't paying extra. Where you will pay extra is for hot failovers, or machine room maintenance, etc, which you don't actually need if your hot failover is a cheap beefy VPS-on-demand on Hetzner, DO, etc.
4. You are measuring the cost of absolute 0% downtime. I can't think of many businesses that have such high sensitivity to downtime. Even banks tolerate far more downtime than that, even while their IT systems are still up. With such strict requirements you're getting into territory where the business itself cannot continue because of a catastrophe, but the IT systems can :-/. What use are the IT systems when the business itself may be down?
The TLDR is:
1. If you have highly paid cloud-trained FTEs, and
2. Your only option other than Cloud is on-prem, and
3. Your FTEs are actually FT-contractors who get paid per hour, and
4. Your uptime requirements are more stringent than national banks',
yeah, then cloud services are only slightly more expensive.
You know how many businesses fall into that specific narrow set of requirements?
If you do it only a few hours every 6 months, you are not maintaining your infrastructure, you are letting it die (until the need arises, everything must be done at once, and it becomes a massive project).
Cloud integrations, for example, allow you to simply use a different database instance altogether per customer, while you can share services that utilize a given db connection. But actually setting up and managing that type of database infrastructure yourself may be much more resource intensive from a head count perspective.
I mention this, because having completely separate databases is an abstraction that cloud operations have already solved... while you can choose other options, such as more complex data models to otherwise isolate or share resources how does this complexity affect your services down-stream and the overall data complexities across one or all clients.
Harder still, if your data/service is centered around b2b clients of yours that have direct consumer interactions... then what if the industry is health or finance where there are even more legal concerns. Figuring a minimal (off the top) cost of each client of yours and scaling to the number of users under them isn't too hard to consider if you're using a mix of cloud services in concert with your own systems/services.
So yeah.. there's definitely considerations in either direction.
Here's what TFA says about this:
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out.
and I think they're right. Be careful how you start because you may be stuck in the initial situation for a long time.
The upfront capex does not need to be that high, unless you're running your own AI models. Other than leasing new ones, as a sibling comment stated, you can buy used. You can get a solid Dell 2U with a full service contract (3 years) for ~$5-10K depending on CPU / memory / storage configuration. Or if you don't mind going older - because honestly, most webapps aren't doing anything compute-heavy - you can drop that to < $1K/node. Replacement parts for those are cheap, so buy an extra of everything.
It really depends on the business model as to how well you might support your own infrastructure vs. relying on a new backend instance per client in a cloud infrastructure that has already solved many of the issues at play.
Then you're probably going to need some combination of HIPAA / SOC 2 / PCI DSS certification, regardless of where your servers are physically located. AWS has certified the infrastructure side for you, but that doesn't remove your obligations for the logical side.
> Are you prepared for appropriate data isolation/sharding and controls? Do you have a strategy for scaling database operations per client or across all clients?
Again, you're going to need that regardless of where your servers physically exist.
> vs. relying on a new backend instance per client in a cloud infrastructure
You want to spin up an EC2 per client, and run an isolated copy of the application, isolated DB, etc. inside of it? That sounds like a nightmare to manage, especially if you want or need HA capabilities.
Just that utility at the database management layer is probably worth the price of entry for using cloud resources if you can't justify and cover the cost of say 5+ employees just for the data management infrastructure.
Or use Citus Postgres, and get sharding by schema for free, so you have both isolation and more or less infinite growth.
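For context on what that buys you, here's a minimal sketch of tenant sharding with Citus. It uses the longer-standing row-based create_distributed_table call, which I'm more certain about; the schema-per-tenant mode mentioned above is newer and works differently. The connection string and table layout are made up for illustration.

```python
import psycopg2

# Hypothetical coordinator connection string, for illustration only.
conn = psycopg2.connect("dbname=app host=citus-coordinator.internal user=app")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        tenant_id bigint NOT NULL,
        id        bigserial,
        payload   jsonb,
        PRIMARY KEY (tenant_id, id)
    )
""")

# Distribute the table across worker nodes by tenant_id. Queries that filter
# on tenant_id are routed to a single shard, so tenants stay separated while
# the cluster grows by adding workers.
cur.execute("SELECT create_distributed_table('events', 'tenant_id')")
```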
I’m not sure why if you think it would take 5 employees to manage self-hosted DBs, that it won’t take close to that to manage cloud-hosted ones. The only real difference you’re going to have once both are set up is dealing with any possible hardware issues. The initial setup for backups, streaming replication, etc. is a one-time thing, and then it just works. Hire a contractor for that, optionally keeping them on retainer for emergencies if you want.
You still have to deal with DB issues with a managed service: things like schema management, table design, index maintenance, parameter tuning, query optimization are all your responsibility, not the cloud provider’s.
The issue with comma.ai is that the company is HEAVILY burdened with geohot's ideals, despite him no longer even being on the board. I used to be very much into his streams and he rants about it plenty. A large part of why they run their own datacenter is that they ideologically refuse to give money to AWS or Google (but I guess Microsoft passes their non-woke test).
Which is quite hilarious to me because they live in a very "woke" state and complain about power costs in the blog post. They could easily move to Wyoming or Montana and with low humidity and colder air in the winter run their servers more optimally.
The climate in Wyoming and Montana is actually worse for this; San Diego's climate extremes are less extreme than those places'. Though moving out of CA is a good idea for power cost reasons, which the blog also addresses.
It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.
You can see it quite clearly here that there’s so many steps to take. Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
It’s never about “is the expected cost in on premises less than cloud”, it’s about the risk adjusted costs.
Once you’ve spread risk not only on your main product but also on your infrastructure, it becomes hard.
I would be wary of a smallish company building their own Jira in house in a similar way.
>Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
Yes, but one differentiating factor is always price and you don't want to lose all your margins to some infrastructure provider.
Think of a ~5000 employee startup. Two scenarios:
1. if they win the market, they capture something like ~60% margin
2. if that doesn't happen, they just lose, VC fund runs out and then they leave
In this dynamic, costs associated with infrastructure don't change the bottom line of profitability. The risk involved with rolling out their own infrastructure can hurt their main product's existence itself.
>Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Well, exactly. But the degree to which the price of a specific input affects your bottom line depends on your product.
During the dot com era, some VC funded startups (such as Google) made a decision to avoid using Windows servers, Oracle databases and the whole super expensive scale-up architecture that was the risk-free, professional option at the time. If they hadn't taken this risk, they might not have survived.
[Edit] But I think it's not just about cloud vs on-premises. A more important question may be how you're using the cloud. You don't have to lock yourself into a million proprietary APIs and throw petabytes of your data into an egress jail.
But most importantly, there's the attractive power that companies doing on-premise infrastructure have for the best talent.
If you don’t, you’ll be stuck trying to figure out data centres. Hiring tons of infrastructure experts, trying to manage power consumption. And for what? You won’t sell any more nails.
If you’re a company like Google, having better data centres does relate to your products, so it makes sense to focus on them and build your own.
Capex needs work. A couple of years, at least.
If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.
Of course, creating a VM is still a Terraform commit away (you're not using ClickOps in prod, surely).
If you want a custom server, one or a thousand, it's at least a couple of weeks.
If you want a powerful GPU server, that's rack + power + cooling (and a significant lead time). A respectable GPU server means ~2KW of power dissipation and considerable heat.
If you want a datacenter of any size, now that's a year at least from breaking ground to power-on.
But we are talking about a cost difference of tens of times, maybe a few hundred. The cloud is not like "most of the time".
Scale up, prove the market and establish operations on the credit card, and if it doesn’t work the money moves onto more promising opportunities. If the operation is profitable you transition away from the too expensive cloud to increase profitability, and use the operations incoming revenue to pay for it (freeing up more money to chase more promising opportunities).
Personally I can’t imagine anything outside of a hybrid approach, if only to maintain power dynamics with suppliers on both sides. Price increases and forced changes can be met with instant redeployments off their services/stack, creating room for more substantive negotiations. When investments come in the form of saving time and money, it’s not hard to get everyone aligned.
I think the primary reason that people over fixate on the cloud is that they can't do math. So renting is a hedge.
Even spending 10k recurring can be easier administratively than spending 10k on a one-time purchase that depreciates over a 3-year cycle, because in some organisations you don't have to go into meetings to debate whether it's actually a 2- or 4-year depreciation, or discuss the opportunity cost of locking up capital for 3 years, etc.
Getting things done is mostly a matter of getting through bureaucracy. Projects fail because of getting stuck in approvals far more often than they fail because of going overbudget.
Of course not.
Old hardware is _plenty_ powerful for a lot of tasks today.
No, low isn't good per se. I worked in a datacenter which in winter had less than 40% RH, and RAM was failing all over the place. Low humidity causes static electricity.
RAM that is plugged in and operating isn't subject to external ESD, unless you count lightning strikes. Where are you getting this?
Things would be different in a colder climate where humidity goes --> 0% in the winter
It is much cheaper to use external air for cooling if you can.
Also, this is where cutting corners does indeed result in lower cost, which was the OP's motivation to begin with. It just means you won't get as good a datacenter as people who tune this all day and have decades of experience.
I have a feeling AI is going to be similar in the future. Sure, you can "rent" access to LLM's and have agents doing all your code. And in the future, it'll likely be as good as most engineers today. But the tradeoff is that you are effectively renting your labor from a single source instead of having a distributed workforce. I don't know what the long-term ramifications are here, if any, but I thought it was an interesting parallel.
For most businesses, it’s a false economy. Hardware is cheap, but having proper redundancy and multiple sites isn’t. Having a 24/7 team available to respond to issues isn’t.
What happens if their data centre loses power? What if it burns down?
https://intellectia.ai/news/stock/ibm-mainframe-business-ach...
60% YoY growth is pretty excellent for an "outdated" technology.
[1] https://www.techradar.com/news/remember-the-ovhcloud-data-ce...
[2] https://blocksandfiles.com/wp-content/uploads/2023/03/ovhclo...
When someone points out how safe cloud providers are, as if they have multiple levels of redundancy and are fully protected against even an alien invasion, I remember the OVH fire.
It's their "Compute" under "Public Cloud" that is competing against AWS EC2. https://us.ovhcloud.com/public-cloud/compute/
They handled the fire terribly, and after that they improved a bit, but an OVH VPS is just a VM running on a single piece of hardware. Not quite the same thing as their "Compute", which runs on clusters.
Something very similar happened at work. Water valve monitoring wasn’t up yet. Fire didn’t respond because reasons. Huge amount of water flooded over a 3 day weekend. Total loss.
why build one when you can have two at twice the price?
But, if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs), would still be cheaper than their estimated $25M cloud costs.
You need, however, to plan for $1M+ p.a. in OPEX, because good SREs aren't cheap (nor are the hardware folks building and maintaining the machines).
IIRC, Slurm came out of LLNL, and it finally made both usage and management of a cluster of nodes really easy and fun.
Compare Slurm to something like AWS Batch or Google Batch and just laugh at what the cloud has created...
The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I rather think about.
This company sounds more like a hobby interest than a business focused on solving genuine problems.
Re: the "hobby" part is where I agree with you the most. Where you say it's not solving genuine problems is where I differ the most.
It really feels to me like Comma is staffed by people who recognize that they never stopped enjoying playing with Lego -- their bricks just grew up, and they realized they can:
1) solve real-world problems
2) not be jerks about it
3) get paid to do it
Not everything has to be about optimizing for #3.
I'm a happy paying customer of Comma.ai (Comma four, baby!) -- their product is awesome, extremely consumer-friendly, and I hope they can grow in their success!
This is becoming increasingly common as far as I can tell.
There are benefits either direction, and I think that each company needs to evaluate the pros and cons themselves. Emotional pros/cons are something companies need to evaluate as employee morale can make or break a company. If the company is super technical in culture and they gain something intangible that is boosting the bottom line, having a datacenter as a "cool" factor is probably worth it.
It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.
I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies. The same way that comma.ai employees likely don't have an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.
They spend too much time on yet another cloud-native support group call, studying for ThatOneCloudProvider certificates, figuring out this or that implementation's caveats, standardizing security procedures between cloud teams, and so on.
Yet the people in the article just throw a 1000-line KV store, mkv [0], on a huge raw storage server and call it a day. And it's a legit choice: they did an actual study beforehand and concluded they don't need redundancy in most cases. At all. I respect that.
[0] https://github.com/geohot/minikeyvalue
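For a sense of how little client code such a setup needs, here is a minimal sketch of talking to an mkv-style store over plain HTTP, based on the curl examples in the project's README (PUT/GET/DELETE against the master, which redirects to a volume server). The address and key names are made up, and error handling beyond status checks is omitted.

```python
import requests

MASTER = "http://localhost:3000"  # illustrative master address

def put(key: str, data: bytes) -> None:
    # The master redirects writes to a volume server; requests follows the redirect.
    requests.put(f"{MASTER}/{key}", data=data, allow_redirects=True).raise_for_status()

def get(key: str) -> bytes:
    r = requests.get(f"{MASTER}/{key}", allow_redirects=True)
    r.raise_for_status()
    return r.content

def delete(key: str) -> None:
    requests.delete(f"{MASTER}/{key}").raise_for_status()

put("checkpoint-000123", b"...model bytes...")
print(len(get("checkpoint-000123")))
```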
The only two valid reasons to build/operate a datacenter: 1) what you're doing is so costly that building your own factory is the only profitable way for your business to produce its widgets, 2) you can't find a datacenter with the location or capacity you need and there is no other way to serve your business needs.
There's many valid reasons to run your own servers (colo), although most people will not run into them in a business setting.
Sure you have to schedule your own hardware repairs or updates but it also means you don't need to wrangle with the ridiculous cost-engineering, reserved instances, cloud product support issues or API deprecations, proprietary configuration languages, etc.
Bare metal is better for a lot of non-cost reasons too, as the article notes it's just easier/better to reason about the lower level primitives and you get more reliable and repeatable performance.
I have run bare metal and managed services; you just have to be clear on what you have coverage for when disaster strikes, or be willing to proactively replace hard drives before they die.
If you’re at the scale of hundreds of instances, that math changes significantly.
And a lot of it depends on what type of business you have and what percent of your budget hosting accounts for.
The opposite is also true: one risks being banned by the exascalers.
That said, from a risk perspective I assume that for what they're doing in the data center, the risk if downtime happens is low.
This is one reason I hate dealing with AWS. It feels like a waste of time in some ways. Like learning a fly-by-night javascript library - maybe I'm better off spending that time writing the functionality on my own, to increase my knowledge and familiarity?
(And it would be fun too.)
It was fun to build - especially the Infiniband - but my next iteration is going to be a single beefy server, maybe with storage attached externally. What I had had outstanding uptime, but ultimately it was massively overkill, noisy, hot, and sucked down power.
While I have no intention of scaling up low-spec hardware like this, it at least seems to beat the Azure VMs we use at work with "4 CPUs", which correspond to two physical cores on an AMD EPYC CPU.
And that super slow machine I understand costs more than $100 per month, and that's without charges for disk space slower than the SSD, or network traffic.
Renting at Azure seems to be a terrible decision, particularly for desktop use.
On the other hand, there's significant vendor lockin, complexity, etc. And I'm not really sure we actually end up with less people over time, headcount always expands over time, and there's always cool new projects like monitoring, observability, AI, etc.
My feeling is, if we rented 20-30 chunky machines and ran Linux on them, with k8s, we'd be 80% there. For specific things I'd still use AWS, like infinite S3 storage, or RDS instances for super-important data.
If I were to do a startup, I would almost certainly not base it off AWS (or other cloud), I'd do what I write above: run chunky servers on OVH (initially just 1-2), and use specific AWS services like S3 and RDS.
A bit unrelated to the above, but I'd also try to keep away from expensive SaaS like Jira, Slack, etc. I'd use the best self-hosted open source version, and be done with it. I'd try Gitea for git hosting, Mattermost for team chat, etc.
And actually, given the geo-political situation as an EU citizen, maybe I wouldn't even put my data on AWS at all and self-host that as well...
It would probably even make sense for some companies to still use cloud for their API but do the training on prem as that may be the expensive part.
Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.
Handling health data, Juvoly is happy to run AI workloads on premise.
There are good business and technical reasons to choose a public cloud.
There are good business and technical reasons to choose a private cloud.
There are good business and technical reasons to do something in-between or hybrid.
The endless "public cloud is a ripoff" vs. "private clouds are impossible" back-and-forth is just people talking in circles past each other. Saying to only ever use one or the other is textbook cargo-culting.
If you don't have any kind of serious compliance requirement, using Amazon is probably not ideal. I would say that Azure AD is ok too if you have to do Microsoft stuff, but I'd never host an actual VM on that cloud.
Compliance and "Microsoft stuff" covers a lot of real world businesses. Going on prem should only be done if it's actually going to make your life easier. If you have to replicate all of Azure AD or Route53, it might be better to just use the cloud offerings.
I was going to post the same comment.
Most of the people agreeing to foot the AWS bill do it because they see how much the compliance is worth to them.
Ps... bx cable instead of conduit for electrical looks cringe.
Budget hosts such as Hetzner/OVH have been known to suddenly pull the plug for no reason.
My kit is old, second hand old (Cisco UCS 220 M5, 2xDell somethings), and last night I just discovered I can throw in two NVIDIA T4's and turn it into a personal LLM machine.
I'm quite excited having my own colocated server with basic LLM abilities. My own hardware with my own data and my own cables. Just need my own IP's now.
The same would apply for any number of hosts. Hetzner/OVH are cheap, but as your own numbers show the location price gap is more than sufficient to cover the costs of servers.
In fact you can colocate with Hetzner too, and you'd get a similar price gap - the lower cost of real-estate is a large part of the reason why they can be as cheap as they are.
Data centre operations is a real estate play - to the point that at least one UK data centre operator is owned by a real estate investment company.
Whereas I feel that data has become a commodity - I can sell your username and email for a few pence - I would rather have my own hardware in my own possession, so that any request for it has to go through me, not some server provider.
Gives a whole new level to the idea of "full stack developer"
Now the company is looking at doing further cost savings as the buildings rented for running on-prem are sitting mostly unused, but also the prices of buildings have gone up in recent years, notably too, so we're likely to be saving money moving into the cloud. This is likely to make the cloud transition permanent.
I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
It's the other way around. How do you think all businesses moved to the cloud in the first place?
[0] - https://azure-int.microsoft.com/en-us/pricing/tco/calculator...
Their "assumption" for hardware purchase prices seems way off compared to what we buy from Dell or HP.
It's interesting that the "IT labour" cost they estimate is $140k for DIY, and $120k for Azure.
Their saving is 5 times more than what we spend...
Even so, a rough configuration for a 2-processor 16 core/processor server with 256GiB RAM comes to $20k, vs $22k + 100% = $44k quoted by MS. (The 100% is MS' 20%-per-year "maintenance cost" that they add on to the estimate. In reality this is 0% as everything is under Dell's warranty.)
And most importantly, the tool is only comparing the cost of Azure to constructing and maintaining a data centre! Unless there are other requirements (which would probably rule out Azure anyway) that's daft, a realistic comparison should be to colocation or hired dedicated servers, depending on the scale.
One thing to keep in mind is that the GPU depreciation curve (over the last 5 years at least) is a little steeper than a 3-year schedule. Current estimates are that the capital value plunges dramatically around the third year. For a top-tier H100 the drop kicks in around the 3rd year, and for the less capable ones like the A100 the depreciation is even worse.
https://www.silicondata.com/use-cases/h100-gpu-depreciation/
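To illustrate what "steeper than straight-line" does to the math, here's a toy comparison. The purchase price and the residual-value fractions are illustrative placeholders, not figures from the linked report.

```python
# Toy GPU depreciation comparison. Purchase price and residual-value fractions
# are illustrative placeholders only.
purchase = 250_000  # e.g. one 8-GPU training box

straight_line = [purchase * (1 - 0.2 * year) for year in range(6)]
steep = [purchase * f for f in (1.00, 0.70, 0.45, 0.25, 0.15, 0.10)]

for year, (sl, st) in enumerate(zip(straight_line, steep)):
    print(f"year {year}: straight-line ${sl:>9,.0f}   steep ${st:>9,.0f}")
```

The gap matters because on-prem cases are usually argued on a 3-5 year payback: if most of the resale value is gone by year 3, owning is more expensive than the naive straight-line math suggests.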
Now, this is not factoring in the cost of labour. Labor at SF wages is dreadfully expensive - if your data center is right across the border in Tijuana, on the other hand..
I find this to be applicable on a smaller scale too! I'd rather set up and debug a beefy Linux VPS via SSH than fiddle with various proprietary cloud APIs/interfaces. It doesn't go as low-level as Watts, bits, and FLOPs, but I still consider knowledge of Linux more valuable than knowing which Azure knobs to turn.
The trick is in how to create mostly self-maintaining deployable/swappable data centers at low cost...
https://blog.railway.com/p/launch-week-02-welcome
This seems to imply $40 / month for 2 vCPU which seems very high?
Or maybe they mean "used" CPU versus idle?
I've helped a startup with 2.5M revenue reduce their cloud spend from close to 2M/yr to below 1M/yr. They could have reached 250k/yr renting bare-metal servers. Probably 100k/yr in colos by spending 250k once on hardware. They had the staff to do it but the CEO was too scared.
Cloud evangelism (is it advocacy now?) messed up the minds of swaths of software engineers. Suddenly costs didn't matter and scaling was the answer to poor designs. Sizing your resource requirements became a lost art, and getting into reaction mode became law.
Welcome to "move fast and get out of business", all enabled by cloud architecture blogs that recommend tight integration with vendor lock-in mechanisms.
Use the cloud to move fast, but stick to cloud-agnostic tooling so that it doesn't suck you in forever.
I've seen how much cloud vendors are willing to spend to get business. That's when you realize just how massive their margins are.
You're just young.
> Suddenly costs didn't matter and scaling was the answer to poor designs.
It did.
Did you know that cloud cost less than what the internal IT team at a company would charge you?
Let's say you worked on product A for a company and needed additional VM. Besides paperwork, the cost to you (for your cost center) would be more than using the company credit card for the cloud.
> Sizing your resource requirements became a lost art
In what way? We used to size for 2-4x since getting additional resources (for the in-house team) would be weeks to months. Same old - just cloud edition.
And I feel great!
> Did you know that cloud cost less than what the internal IT team at a company would charge you?
Yes. Internal IT teams run the old-school way are inefficient. And that's what the vendor tells you while they create shadow IT inside your company. Skip ITSM and ITIL... do it the SRE way.
Until the cloud economist (real role) comes in and finds a way to extract more rent out of their customer base (like GCP's upcoming doubling rates on CDN Interconnect). And until internal IT kills shadow IT and regains management of cloud deployments. Cybersecurity and stuff...
Back to square one. ITIL with cloud deployments. Some use cases will be way cheaper... but for your 100s of PBs of enterprise data, that's another story. And data gravity will kill many initiatives just based on bit movement costs.
> Besides paperwork, the cost to you (for your cost center) would be more than using the company credit card for the cloud.
To some extent. One is hard dollars the other is funny money. But I thought paying for cloud with the company credit card was a 2016 thing. Now it's paid through your internal IT cost center, with internal IT markup.
I've seen petabytes of data move to the cloud and then we couldn't perform some queries on it anymore as that store wouldn't support it, and we'd need to spend 7 figures to move to another cloud database to query it. And that's hard dollars.
Yes, during early cloud days it was lean and aimed at startups. Now it's aimed at enterprise, and for some reason lots of startups still think it's optimized for them. It's not and it hasn't been for a long time.
They aren't. It's politics. They want to protect and improve their own headcount and resources.
> One is hard dollars the other is funny money.
All the same to a team / department. It's not like people run it like their own wallet.
> finds a way to extract more rent out of their customer base
I think you just have a grudge against the cloud - and hence, too young. For every example you have, the so-called "internal" IT team can and will do just the same. Go back to the 90s and 00s - it was the same. The infra team wanted some fancy new storage arrays and charged everyone 2x for the new service, etc.
> and for some reason lots of startups still think it's optimized for them. It's not and it hasn't been for a long time.
The problem isn't the cloud. Startups have always worked like this, even 10-20 years ago. It's about wastage. They can raise and grow faster - or so they think. The problem, if any, is that money hasn't been as cheap recently. Nothing new.
Would anyone mind elaborating? I always thought this was a direct result of the free market. Not sure if by dysfunction the OP means a lack of intervention.
Perhaps Comma needed the datacenter to be in San Diego for latency or other reasons, but if they need it mostly for compute, it would have been cheaper to operate their datacenter elsewhere... but if we keep going down that path, maybe it actually becomes cheaper to rent a cloud after all.
The majority of Californians have no say and cannot choose their utilities provider. This is the polar opposite of the "free market".
Everything is a trade-off. Every tool has its purpose. There is no "right way" to build your infrastructure, only a right way for you.
In my subjective experience, the trade-offs are generally along these lines:
* Platform as a Service (Vercel, AWS Lambda, Azure Functions, basically anything where you give it your code and it "just works"): great for startups, orgs with minimal talent, and those with deep pockets for inevitable overruns. Maximum convenience means maximum cost. Excellent for weird customer one-offs you can bill for (and slap a 50% margin on top). Trade-off is that everything is abstracted away, making troubleshooting underlying infrastructure issues nigh impossible; also that people forget these things exist until the customer has long since stopped paying for them or a nasty bill arrives.
* Infrastructure as a Service (AWS, GCP, Azure, Vultr, etc; commonly called the "Public Cloud"): great for orgs with modest technical talent but limited budgets or infrastructure that's highly variable (scales up and down frequently). Also excellent for everything customer-facing, like load balancers, frontends, websites, you name it. If you can invoice someone else for it, putting it in here makes a lot of sense. Trade-off is that this isn't yours, it'll never be yours, you'll be renting it forever from someone else who charges you a pretty penny and can cut you off or raise prices anytime they like.
* Managed Service/Hosting Providers (e.g., ye olde Rackspace): you don't own the hardware, but you're also not paying the premium for infrastructure orchestrators. As close to bare metal as you can get without paying for actual servers. Excellent for short-term "testing" of PoCs before committing CapEx, or for modest infrastructure needs that aren't likely to change substantially enough to warrant a shift either on-prem or off to the cloud. You'll need more talent though, and you're ultimately still renting the illusion of sovereignty from someone else in perpetuity.
* Bare Metal, be it colocation or on-premises: you own it, you decide what to do with it, and nobody can stop you. The flip side is you have to bootstrap everything yourself, which can be a PITA depending on what you actually want - or what your stakeholders demand you offer. Running VMs? Easy-peasy. Bare metal K8s clusters? I mean, it can be done, but I'd personally rather chew glass than go without a managed control plane somewhere. CapEx is insane right now (thanks, AI!), but TCO is still measured in two to three years before you're saving more than you'd have spent on comparable infrastructure elsewhere, even with savings plans. Talent needs are highly variable - a generalist or two can get you 80% to basic AWS functionality with something like Nutanix or VCF (even with fancy stuff like DBaaS), but anything cutting edge is going to need more headcount than a comparable IaaS build. God help you if you opt for a Microsoft stack, as any on-prem savings are likely to evaporate at your next True-Up.
In my experience, companies have bought into the public cloud/IaaS because they thought it'd save them money versus the talent needed for on-prem; to be fair, back when every enterprise absolutely needed a network team and a DB team and a systems team and a datacenter team, this was technically correct. Nowadays, most organizational needs can be handled with a modest team of generalists or a highly competent generalist and one or two specialists for specific needs (e.g., a K8s engineer and a network engineer); modern software and operating systems make managing even huge orgs a comparable breeze, especially if you're running containers or appliances instead of bespoke VMs.
As more orgs like Comma or Basecamp look critically at their infrastructure needs versus their spend, or they seriously reflect on the limited sovereignty they have by outsourcing everything to US Tech companies, I expect workloads and infrastructure to become substantially more diversified than the current AWS/GCP/Azure trifecta.
An error occurred: API rate limit already exceeded for installation ID 73591946.
Error from https://giscus.app/
Fellow says one thing and uses another.
Hey, how do SSDs fail lately? Do they ... vanish off the bus still? Or do they go into read only mode?
> In comma’s case I estimate we’ve spent ~5M on our data center, and we would have spent 25M+ had we done the same things in the cloud.
IMO, that's the biggie. It's enough to justify paying someone to run their datacenter. I wish there was a bit more detail to justify those assumptions, though.
That being said, if their needs grow by orders of magnitude, I'd anticipate that they would want to move their servers somewhere with cheaper electricity.
Half sarcasm of course, but it goes to show that the world is not going to fall apart in many cases when it comes to software. Sure, it's not ideal in lots of cases, but we'll survive without redundancy.
"An error occurred: API rate limit already exceeded for installation ID 73591946."
Ummm is that plateauing with us in the room?
The advantage of renting vs. owning is that you can always get the latest gen, and that brings you newer capabilities (i.e. fp8, fp4, etc) and cheaper prices for current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets i'd make.
Well, the capabilities have already plateaued as far as I can tell :-/
Over the next few years we can probably wring out some performance improvements, maybe some efficiency improvements.
A lot of the current AI users right now are businesses trying to on-sell AI (code reviewers/code generators, recipe apps, assistant apps, etc), and there's way too many of them in the supply/demand ratio, so you can expect maybe 90% of these companies to disappear in the next few years, taking the demand for capacity with them.
Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.
[0] https://blog.railway.com/p/data-center-build-part-one
[1] https://oxide.computer/
/s
In the future, what you will need to remain competitive is computing at the edge. Only one company is truly poised to deliver on that at massive scale.
Otherwise, well just like that gym membership, you get out what you put into it...
Except now I have nightmares that the USA will enforce the patriot act and force Microsoft to hand over all their data in European data centers and then we have to migrate everything to a local cloud provider. Argh...
You can equip your server with a mouse, keyboard and screen and then it doesn't even need credentials. The credential is your physical access to the mouse and keyboard.
Rackmounted fusion reactors, I hope. Would solve my homelab wattage issues too.
> Redundancy is not needed since no specific data is critical.
> we have a redundant mkv storage array to store all of our trained models and training metrics.
That's just called understanding your failure domains, and RTO/RPO needs.
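As a sketch of what that means in practice (the data classes, numbers, and rule below are made up for illustration): classify each dataset by how much loss (RPO) and downtime (RTO) it can tolerate, then only pay for redundancy where it is actually needed.

```python
# Hypothetical classification of data by recovery objectives (all values illustrative).
# RPO = max acceptable data loss, RTO = max acceptable downtime.
from dataclasses import dataclass

@dataclass
class DataClass:
    name: str
    rpo_hours: float   # how much data we can afford to lose
    rto_hours: float   # how long we can afford to be down
    critical: bool     # does losing it hurt the business?

datasets = [
    DataClass("trained models + metrics", rpo_hours=24, rto_hours=48, critical=True),
    DataClass("raw training shards",      rpo_hours=24 * 30, rto_hours=24 * 7, critical=False),
    DataClass("scratch / intermediate",   rpo_hours=float("inf"), rto_hours=float("inf"), critical=False),
]

for d in datasets:
    # Simple rule of thumb: only critical data with tight objectives earns redundancy.
    strategy = "redundant array + offsite backup" if d.critical else "single copy, re-derivable"
    print(f"{d.name:28s} RPO={d.rpo_hours:>8}h RTO={d.rto_hours:>8}h -> {strategy}")
```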
[1] https://si.inc/posts/the-heap/
# don’t own the datacenter, rent the cloud instead
the “build your own datacenter” story is fun (and comma’s setup is undeniably cool), but for most companies it’s a seductive trap: you’ll spend your rarest resource (engineer attention) on watts, humidity, failed disks, supply chains, and “why is this rack hot,” instead of on the product. comma can justify it because their workload is huge and steady, they’re willing to run non-redundant storage, and they’ve built custom GPU boxes and infra around a very specific ML pipeline. ([comma.ai blog][1])
## 1) capex is a tax on flexibility
a datacenter turns “compute” into a big up-front bet: hardware choices, networking choices, facility choices, and a depreciation schedule that does not care about your roadmap. cloud flips that: you pay for what you use, you can experiment cheaply, and you can stop spending the minute a strategy changes. the best feature of renting is that quitting is easy.
## 2) scaling isn’t a vibe, it’s a deadline
real businesses don’t scale smoothly. they spike. they get surprise customers. they do one insane training run. they run a migration. owning means you either overbuild “just in case” (idle metal), or you underbuild and miss the moment. renting means you can burst, use spot/preemptible for the ugly parts, and keep steady stuff on reserved/committed discounts.
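a toy back-of-the-envelope for why spiky workloads favor renting; every price, GPU count, and utilization figure below is an assumption for illustration, not a quote from anyone’s bill:

```python
# Toy model: one big training burst per quarter vs. owning hardware sized for the peak.
# All numbers are illustrative assumptions, not real prices.

peak_gpus       = 64            # GPUs needed during a burst
burst_hours     = 2 * 7 * 24    # each burst lasts two weeks
bursts_per_year = 4
baseline_gpus   = 8             # steady always-on load

on_demand_per_gpu_hour = 4.00   # assumed cloud price
spot_per_gpu_hour      = 1.60   # assumed preemptible discount
owned_gpu_capex        = 30_000 # assumed per-GPU server cost, amortized over 3 years

hours_per_year = 365 * 24

# Rent: baseline on on-demand/committed, bursts on spot.
rent = (baseline_gpus * hours_per_year * on_demand_per_gpu_hour
        + peak_gpus * burst_hours * bursts_per_year * spot_per_gpu_hour)

# Own: buy for the peak, amortize over 3 years (power, people, facility not even counted).
own = peak_gpus * owned_gpu_capex / 3

print(f"rent ≈ ${rent:,.0f}/yr   own (capex only) ≈ ${own:,.0f}/yr")
```

under these made-up numbers the rented bursty setup comes out well ahead, and the gap only grows once you add power and people to the owned side; flip the workload to near-100% steady utilization and the conclusion flips with it.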
## 3) reliability is more than “it’s up most days”
comma explicitly says they keep things simple and don’t need redundancy for ~99% uptime at their scale. ([comma.ai blog][1]) that’s a perfectly valid trade—if your business can tolerate it. many can’t. cloud providers sell multi-zone, multi-region, managed backups, managed databases, and boring compliance checklists because “five nines” isn’t achieved by a couple heroic engineers and a PID loop.
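a quick sanity check on why “five nines” is bought with redundancy rather than heroics: if a single node or zone is up 99% of the time, independent replicas multiply the downtime away (the 99% figure is assumed, and treating failures as independent is generous):

```python
# Availability of N independent replicas, each with the same single-unit availability.
def combined_availability(single: float, replicas: int) -> float:
    return 1 - (1 - single) ** replicas

for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    downtime_min = (1 - a) * 365 * 24 * 60
    print(f"{n} replica(s): {a:.5%} available, ~{downtime_min:,.0f} min downtime/yr")
```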
## 4) the hidden cost isn’t power, it’s people
comma spent ~$540k on power in 2025 and runs up to ~450kW, plus all the cooling and facility work. ([comma.ai blog][1]) but the larger, sneakier bill is: on-call load, hiring niche operators, hardware failures, spare parts, procurement, security, audits, vendor management, and the opportunity cost of your best engineers becoming part-time building managers. cloud is expensive, yes—because it bundles labor, expertise, and economies of scale you don’t have.
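for scale, those quoted figures back out to a fairly ordinary electricity rate, which is the point: power is the visible bill, people are the hidden one. the arithmetic below assumes roughly full ~450kW draw all year, so the implied rate is a lower bound:

```python
# Back-of-the-envelope from the figures quoted above: ~$540k/yr at up to ~450 kW.
power_kw        = 450
annual_cost_usd = 540_000
hours_per_year  = 365 * 24

kwh_per_year = power_kw * hours_per_year        # ≈ 3.9M kWh if drawing the peak all year
implied_rate = annual_cost_usd / kwh_per_year   # upper bound on draw, so lower bound on rate

print(f"{kwh_per_year:,.0f} kWh/yr -> implied rate ≈ ${implied_rate:.3f}/kWh")
```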
## 5) “vendor lock-in” is real, but self-lock-in is worse
cloud lock-in is usually optional: you choose proprietary managed services because they’re convenient. if you’re disciplined, you can keep escape hatches: containers, kubernetes, terraform, postgres, object storage abstractions, multi-region backups, and a tested migration plan. owning your datacenter is also lock-in—except the vendor is past you, and the contract is “we can never stop maintaining this.”
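a minimal sketch of what an “object storage abstraction” escape hatch can look like: write against the S3 API, but inject the endpoint from config so the same code talks to AWS S3, a self-hosted MinIO box, or any other S3-compatible store. assumes boto3; the bucket, object key, and env var names are made up:

```python
import os
import boto3  # S3 API client; works against any S3-compatible endpoint

def object_store():
    # Endpoint comes from config, not from code: AWS S3, MinIO, Ceph RGW, etc.
    # S3_ENDPOINT_URL unset -> real AWS; set -> self-hosted or alternative provider.
    return boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT_URL"),  # e.g. "http://minio.internal:9000"
        aws_access_key_id=os.environ["S3_ACCESS_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    )

s3 = object_store()
s3.upload_file("model.ckpt", "checkpoints", "run-42/model.ckpt")
```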
## the practical rule
*if you have massive, predictable, always-on utilization, and you want to become good at running infrastructure as a core competency, owning can win.* that’s basically comma’s case. ([comma.ai blog][1]) *otherwise, rent.* buy speed, buy optionality, and keep your team focused on the thing only your company can do.
if you want, tell me your rough workload shape (steady vs spiky, cpu vs gpu, latency needs, compliance), and i’ll give you a blunt “rent / colo / own” recommendation in 5 lines.
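or, as a crude stand-in for that offer, the whole recommendation can be collapsed into a heuristic like the sketch below; the thresholds and categories are opinions made up for illustration, not rules:

```python
# Crude rent / colo / own heuristic. Thresholds are illustrative opinions only.
def recommend(steady_utilization: float, monthly_spend_usd: float,
              infra_is_core_competency: bool, needs_sovereignty: bool) -> str:
    if needs_sovereignty and infra_is_core_competency:
        return "own (buy + colocate)"
    if steady_utilization > 0.7 and monthly_spend_usd > 50_000 and infra_is_core_competency:
        return "own or colo, depending on capex appetite"
    if steady_utilization > 0.5 and monthly_spend_usd > 5_000:
        return "rented bare metal / managed private cloud"
    return "cloud: spiky or small workloads rarely justify owning"

print(recommend(steady_utilization=0.9, monthly_spend_usd=120_000,
                infra_is_core_competency=True, needs_sovereignty=False))
```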
[1]: https://blog.comma.ai/datacenter/ "Owning a $5M data center - comma.ai blog"
What I mean is that I'm assuming the math here works because the primary purpose of the hardware is training models; you don't need six or seven nines for that, is what I'm imagining. But when you have customers across geographies using your app on those servers pretty much 24/7, you can't afford much downtime.