I had a 20 GiB JSON file of everything that has ever happened on Hacker News
I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN has existed? That averages to about 3 MB per day, or around 35 bytes per second.
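For anyone who wants to check the averages, here is a minimal back-of-the-envelope sketch. It assumes exactly 20e9 bytes spread evenly over 18 years of 365.25 days each; the function name averageRates is made up for illustration.

```go
package main

import "fmt"

// averageRates returns the mean bytes per day and bytes per second for
// totalBytes spread evenly over the given number of years.
func averageRates(totalBytes, years float64) (perDay, perSecond float64) {
	const daysPerYear = 365.25         // assumption: average year length
	const secondsPerDay = 24 * 60 * 60 // 86400
	perDay = totalBytes / (years * daysPerYear)
	perSecond = perDay / secondsPerDay
	return perDay, perSecond
}

func main() {
	perDay, perSecond := averageRates(20e9, 18)
	fmt.Printf("%.2f MB/day, %.1f bytes/s\n", perDay/1e6, perSecond)
}
```

Under those assumptions it works out to roughly 3 MB per day, or about 35 bytes per second.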
jakegmaths [3 hidden]5 mins ago
Your query for Java will include all instances of JavaScript as well, so you're over representing Java.
smarnach [3 hidden]5 mins ago
Similarly, the Rust query will include "trust", "antitrust", "frustration" and a bunch of other words
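One way to avoid both false positives is to match on word boundaries rather than substrings. This is only a sketch of the idea, not what the article's query actually does; the helper name wholeWord is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// wholeWord builds a case-insensitive pattern that matches term only when
// it stands alone as a word, not when it is embedded in a longer word.
func wholeWord(term string) *regexp.Regexp {
	return regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
}

func main() {
	java := wholeWord("java")
	rust := wholeWord("rust")
	comments := []string{
		"Java 21 is out",
		"I rewrote it in JavaScript",
		"Rust is memory safe",
		"antitrust frustration, trust me",
	}
	for _, c := range comments {
		fmt.Printf("%-35q java=%v rust=%v\n", c, java.MatchString(c), rust.MatchString(c))
	}
}
```

Note that \b in Go's regexp package is an ASCII word boundary, so this approach does not carry over directly to terms ending in punctuation, such as "C++".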
Ah right… maybe even more unexpected then to see a decline
cs02rm0 [3 hidden]5 mins ago
I'm not so sure. While Java has never looked better to me, it does "feel" like it is in significant decline in terms of what people are asking for on LinkedIn.
I'd imagine these days TypeScript or Node might be taking over some of what would have hit on JavaScript.
tacker2000 [3 hidden]5 mins ago
Yeah, I also get the feeling that these Rust evangelists get more annoying every day ;p
stefs [3 hidden]5 mins ago
Please do not use stacked charts! I think it's close to impossible not to distort the reader's impression, because a) it's very hard to gauge the height of a certain data point in the noise and b) they imply a dependency where there _probably_ is none.
ashish01 [3 hidden]5 mins ago
I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing that would be cool to implement is handling the fact that more recent items change more over time, so recently downloaded items go stale faster than older ones.
jasonthorsness [3 hidden]5 mins ago
Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…
I used a function based on the item's age for staleness: it considers things stale after a minute or two initially, and immutable once they are about two weeks old.
// DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
// created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.
const DefaultStaleIf = "(:now-refreshed)>" +
"(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"
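To make the expression above easier to read, here is the same arithmetic written out as plain Go. This is a sketch, not the project's actual evaluator: the function names are invented, and it assumes :now, Time, and refreshed are all Unix timestamps in seconds.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// staleThresholdSeconds returns how many seconds an item may go without a
// refresh, given its age in seconds. It mirrors the expression:
//   60 * (log2(max(0, ageMinutes) + 1) + ageDays^3)
// Plain float64 seconds are used because the cubic term grows very quickly.
func staleThresholdSeconds(ageSeconds float64) float64 {
	ageMinutes := math.Max(0, ageSeconds/60)
	ageDays := ageSeconds / (24 * 60 * 60)
	return 60 * (math.Log2(ageMinutes+1) + math.Pow(ageDays, 3))
}

// isStale reports whether an item created at created and last fetched at
// refreshed should be fetched again at now.
func isStale(now, created, refreshed time.Time) bool {
	sinceRefresh := now.Sub(refreshed).Seconds()
	return sinceRefresh > staleThresholdSeconds(now.Sub(created).Seconds())
}

func main() {
	// Print the refresh interval for a few item ages.
	for _, age := range []time.Duration{
		time.Minute, time.Hour, 24 * time.Hour, 7 * 24 * time.Hour, 14 * 24 * time.Hour,
	} {
		fmt.Printf("age %v: refresh if older than %.0f s\n", age, staleThresholdSeconds(age.Seconds()))
	}
}
```

The logarithmic term dominates for young items, so a one-minute-old item is refreshed after about 60 seconds. The cubic term takes over after a few days, so a two-week-old item is only refreshed if the cached copy is roughly two days old.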
I have done something similar. I cheated by using the BigQuery dataset (which somehow keeps getting updated), exporting the data to Parquet, downloading it, and querying it with DuckDB.
minimaxir [3 hidden]5 mins ago
That's not cheating, that's just pragmatic.
9rx [3 hidden]5 mins ago
> The Rise Of Rust
Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!
emilbratt [3 hidden]5 mins ago
The chart is a stacked one, so we are looking at the height each category takes up, not the height each category reaches.
matsemann [3 hidden]5 mins ago
One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some that I guess could be scrapable: Which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
minimaxir [3 hidden]5 mins ago
The only vote data that is visible via any HN API is the scores on submissions.
Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.
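As a rough illustration of the day/hour map, here is a sketch that buckets item timestamps into a weekday-by-hour grid. It assumes you have already collected the Unix-second time values of one user's public submissions and comments; activityGrid and the sample timestamps are made up, and everything is counted in UTC.

```go
package main

import (
	"fmt"
	"time"
)

// activityGrid counts items per weekday (rows, Sunday = 0) and per hour of
// day (columns, 0-23), interpreting each timestamp as Unix seconds in UTC.
func activityGrid(unixTimes []int64) [7][24]int {
	var grid [7][24]int
	for _, ts := range unixTimes {
		t := time.Unix(ts, 0).UTC()
		grid[t.Weekday()][t.Hour()]++
	}
	return grid
}

func main() {
	// Hypothetical timestamps standing in for a user's items.
	sample := []int64{0, 0, 3600, 86400}
	grid := activityGrid(sample)
	for day, hours := range grid {
		fmt.Printf("%-9s %v\n", time.Weekday(day), hours)
	}
}
```

Since only public submissions and comments carry timestamps, this shows posting activity, not reading or voting activity.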
ryandrake [3 hidden]5 mins ago
Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.
saagarjha [3 hidden]5 mins ago
I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.
nottorp [3 hidden]5 mins ago
> Are there users I constantly upvote/downvote?
Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
matsemann [3 hidden]5 mins ago
Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?
thaumasiotes [3 hidden]5 mins ago
> It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?
9rx [3 hidden]5 mins ago
> What's my upvote/downvote ratio?
Undefined, presumably. What reason would there be to take time out of your day to press a pointless button?
It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But for those not participating in good faith, i.e. trolls, it adds incredible value to know that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.
For those who seek fidget toys, there are better devices for that.
immibis [3 hidden]5 mins ago
Actually, its most useful purpose is to hide opinions you disagree with - if enough people agree with you.
Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.
9rx [3 hidden]5 mins ago
So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?
matsemann [3 hidden]5 mins ago
Since there are no rules on downvoting, people probably use it for different things. Some to show dissent, some only to downvote things they think don't belong, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?
saagarjha [3 hidden]5 mins ago
If Hacker News had reactions I’d put an eye roll here.
9rx [3 hidden]5 mins ago
You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.
hsbauauvhabzb [3 hidden]5 mins ago
Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.
andrewshadura [3 hidden]5 mins ago
Funny nobody's mentioned "correct horse battery staple" in the comments yet…
pier25 [3 hidden]5 mins ago
would love to see the graph of React, Vue, Angular, and Svelte
https://github.com/jasonthorsness/unlurker/blob/main/hn/core...