Hacker News

Four Column ASCII (2017)

320 points by tempodox - 76 comments

HocusLocus [3 hidden]5 mins ago

I have lived my whole professional life with this being 'beyond obvious'... It's hard to imagine a generation where it's not. But then again, I did work with EBCDIC for awhile and we were reading and translating ASCII log tapes (ITT/Alcatel 1210 switch, phone calls, memory dumps).

I once got drunk with my elderly unix supernerd friend and he was talking about TTYs and how his passwords contained embedded ^S and ^Q characters and he traced the login process to learn they were just stalling the tty not actually used to construct the hash. No one else at the bar got the drift. He patched his system to put do 'raw' instead of 'cooked' mode for login passwords. He also used backspaces ^? ^H as part of his passwords. He was a real security tiger. I miss him.

Eduard [3 hidden]5 mins ago

Regarding ^?: shouldn't that be ^_ instead?

dcminter [3 hidden]5 mins ago

It doesn't seem to have been mentioned in the comments so far, but as a floppy-disk era developer I remember my mind was blown by the discovery that DEL was all-bits-set because this allowed a character on paper tape and punched card to be deleted by punching any un-punched holes!

axblount [3 hidden]5 mins ago

Bit-level skeuomorphism! And since NUL is zero, does that mean the program ends wherever you stop punching? I've never used punch cards so I don't know how things were organized.

fix4fun [3 hidden]5 mins ago

For me was interesting that all digits in ASCII starts with 0x3, eg. 0x30 - 0, 0x31 - 1, ..., 0x39 - 9. I thought it was accidental, but in real it was intended. This was giving possibility to build simple counting/accounting machines with minimal circuit logic with BCD (Binary Coded Decimals). That was wow for me ;)

satiated_grue [3 hidden]5 mins ago

ASCII was started in 1960. A terminal then would have been a mostly-mechanical teletype (keyboard and printer, possibly with paper tape reader/punch), without much by way of "circuit logic". Think of it more as a bit caused a physical shift of a linkage to do something like hit the upper or lower part of a hammer, or a separate set of hammers for the same remaining bits.

Look at the Teletype ASR-33, introduced in 1963.

fix4fun [3 hidden]5 mins ago

Yes, that's true ASR-33 was first application, but IBM has impact on ANSI/ASA comeete and ASCII standardisation. In 1963 IBM System/360 was using BCD with digits quick "parse" and in it's peripherals. I remember it from some interview with old IBM tech employee ;)

zahlman [3 hidden]5 mins ago

And this is exactly why I find the usual 16x8 at least as insightful as this proposed 32x4 (well, 4x32, but that's just a rotation).

kibwen [3 hidden]5 mins ago

I still wonder if it wouldn't have been better to let each digit be represented by its exact value, and then use the high end of the scale rather than the low end for the control characters. I suppose by 1970 they were already dealing with the legacy of backwards-compatibility, and people were already accustomed to 0x0 meaning something akin to null?

mmilunic [3 hidden]5 mins ago

Either way you would still need some check to ensure your digits are digits and not some other type of character. Having zeroed out memory read as a bunch of NUL characters instead of like “00000000” would probably be useful, as “000000” is sometimes a legitimate user input

gpvos [3 hidden]5 mins ago

NUL was often sent as padding to slow (printing) terminals. Although that was just before my time.

kazinator [3 hidden]5 mins ago

This is by design, so that case conversion and folding is just a bit operation.

The idea that SOH/1 is "Ctrl-A" or ESC/27 is "Ctrl-[" is not part of ASCII; that idea comes from they way terminals provided access to the control characters, by a Ctrl key that just masked out a few bits.

muyuu [3 hidden]5 mins ago

I guess it's an age thing, but I thought this was really basic CS knowledge. But I can see why this may be much less relevant nowadays.

teddyh [3 hidden]5 mins ago

It’s on the list:

<https://web.archive.org/web/20251103035213/https://www.catb....>

Cthulhu_ [3 hidden]5 mins ago

I've been in IT for decades but never knew that ctrl was (as easy as) masking some bits.

muyuu [3 hidden]5 mins ago

You can go back maybe 2 decades without this being very relevant, but not 3 given the low level scope that was expected in CS and EE back then.

kazinator [3 hidden]5 mins ago

I learned about from 6502 machine language programming, from some example that did a simple bit manipulation to switch lower case to upper case. From that it became obvious that ASCII is divided into four banks of 32.

aa-jv [3 hidden]5 mins ago

Been an ASCII-naut since the 80's, so .. its always amusing to see people type 'man ascii' for the first time, gaze upon its beauty, and wonder at its relevance, even still today ...

nine_k [3 hidden]5 mins ago

Yes, the diagram just shows the ASCII table for the old teletype 6-bit code (and 5-bit code before), with the two most significant bits spread over 4 columns to show the extension that happened while going 5→6→7 bits. It makes obvious what was very simple bit operations on very limited hardware 70–100 years ago.

(I assume everybody knows that on mechanical typewriters and teletypes the "shift" key physically shifted the caret position upwards, so that a different glyph would be printed when hit by a typebar.)

kazinator [3 hidden]5 mins ago

If Unicode had used a full 32 bits from the start, it could have usefully reserved a few bits as flags that would divide it into subspaces, and could be easily tested.

Imagine a Unicode like this:

8:8:16

- 8 bits of flags. - 8 bit script family code: 0 for BMP. - 16 bit plane for every script code and flag combination.

The flags could do usefuil things like indicate character display width, case, and other attributes (specific to a script code).

Unicode peaked too early and applied an economy of encoding which rings false now in an age in which consumer devices have two digit gigabyte memories, multi terabyte of storage, and high definition video is streamed over the internet.

taejavu [3 hidden]5 mins ago

For whatever reason, there are extraordinarily few references that I come back to over and over, across the years and decades. This is one of them.

taejavu [3 hidden]5 mins ago

Tangentially related, there is much insight about Unix idioms to be gained from understanding the key layout of the terminal Bill Joy used to create vi

https://news.ycombinator.com/item?id=21586980

aa-jv [3 hidden]5 mins ago

Not 'man ascii'?

california-og [3 hidden]5 mins ago

I made an interactive viewer some time ago (scroll down a bit):

https://blog.glyphdrawing.club/the-origins-of-del-0x7f-and-i...

It really helps understand the logic of ASCII.

pixelbeat__ [3 hidden]5 mins ago

Some of this elegance discussed from a programmatic point of view

https://www.pixelbeat.org/docs/utf8_programming.html

mbreese [3 hidden]5 mins ago

I came across this a week ago when I was looking at some LLM generated code for a ToUpper() function. At some point I “knew” this relationship, but I didn’t really “grok” it until I read a function that converted lowercase ascii to uppercase by using a bitwise XOR with 0x20.

It makes sense, but it didn’t really hit me until recently. Now, I’m wondering what other hidden cleverness is there that used to be common knowledge, but is now lost in the abstractions.

Findecanor [3 hidden]5 mins ago

A similar bit-flipping trick was used to swap between numeric row + symbol keys on the keyboard, and the shifted symbols on the same keys. These bit-flips made it easier to construct the circuits for keyboards that output ASCII.

I believe the layout of the shifted symbols on the numeric row were based on an early IBM Selectric typewriter for the US market. Then IBM went and changed it, and the latter is the origin of the ANSI keyboard layout we have now.

auselen [3 hidden]5 mins ago

xor should toggle?

munk-a [3 hidden]5 mins ago

That's correct, a toUpper would just use OR.

mbreese [3 hidden]5 mins ago

I left out that the line before there was a check to make sure the input byte was between ‘a’ and ‘z’. This ensures that if the char is already upper case, you don’t do an extraneous OR. And at this point, OR, XOR, or even a subtract 0x20 would work. For some reason the LLM thought the XOR was faster.

I honestly wouldn’t have thought anything of it if I hadn’t seen it written as `b ^ 0x20`.

jez [3 hidden]5 mins ago

I have a command called `ascii-4col.txt` in my personal `bin/` folder that prints this out:

https://github.com/jez/bin/blob/master/ascii-4col.txt

It's neat because it's the only command I have that uses `tail` for the shebang line.

gpvos [3 hidden]5 mins ago

Back in early times, I used to type ctrl-M in some situations because it could be easier to reach than the return key, depending on what I was typing.

dveeden2 [3 hidden]5 mins ago

Also easy to see why Ctrl-D works for exiting sessions.

rbanffy [3 hidden]5 mins ago

This is also why the Teletype layout has parentheses on 8 and 9 unlike modem keyboards that have them on 9 and 0 (a layout popularised by the IBM Selectric). The original Apple IIs had this same layout, with a “bell” on top of the G.

spragl [3 hidden]5 mins ago

Modern keyboards = some keyboards. In the Nordic Countries modern keyboards have parantheses on 8 and 9.

debugnik [3 hidden]5 mins ago

According to the layouts on this site, there're more European layouts with parenthesis on 8, 9 than on 9, 0. (I had to zoom out to see the right-side of the comparisons.)

https://www.farah.cl/Keyboardery/A-Visual-Comparison-of-Diff...

Terretta [3 hidden]5 mins ago

What happened to this block and the keyboard key arrangement?

  ESC  [  {  11011
  FS   \  |  11100
  GS   ]  }  11101

Also curious why the keys open and close braces, but ... the single and double curly quotes don't open and close, but are stacked. Seems nuts every time I type Option-{ and Option-Shift-{ …

kazinator [3 hidden]5 mins ago

You're no longer talking about ASCII. ASCII has only a double quote, apostrophe (which doubles as a single quote) and backtick/backquote.

Note on your Mac that the Option-{ and Option-}, with and without Shift, produce quotes which are all distinct from the characters produced by your '/" key! They are Unicode characters not in ASCII.

In the ASCII standard (1977 version here: https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-197...) the example table shows a glyph for the double quote which is vertical: it is neither an opening nor closing quote.

The apostrophe is shown as a closing quote, by slanting to the right; approximately a mirror image of the backtick. So it looks as though those two are intended to form an opening and closing pair. Except, in many terminal fonts, the apostrophe is a just vertical tick, like half of a double quote.

The ' being veritcal helps programming language '...' literals not look weird.

jolmg [3 hidden]5 mins ago

> What happened to this block and the keyboard key arrangement?

There's also these:

  | ASCII      | US keyboard |
  |------------+-------------|
  | 041/0x21 ! | 1 !         |
  | 042/0x22 " | 2 @         |
  | 043/0x23 # | 3 #         |
  | 044/0x24 $ | 4 $         |
  | 045/0x25 % | 5 %         |
  |            | 6 ^         |
  | 046/0x26 & | 7 &         |

kps [3 hidden]5 mins ago

https://en.wikipedia.org/wiki/Bit-paired_keyboard

seyz [3 hidden]5 mins ago

This is why Ctrl+C is 0x03 and Ctrl+G is the bell. The columns aren't arbitrary. They're the control codes with bit 6 flipped. Once you see it, you can't unsee it. Best ASCII explainer I've read.

dang [3 hidden]5 mins ago

Related. Others?

Four Column ASCII (2017) - https://news.ycombinator.com/item?id=21073463 - Sept 2019 (40 comments)

Four Column ASCII - https://news.ycombinator.com/item?id=13539552 - Feb 2017 (68 comments)

unnah [3 hidden]5 mins ago

If Ctrl sets bit 6 to 0, and Shift sets bit 5 to 1, the logical extension is to use Ctrl and Shift together to set the top bits to 01. Surely there must be a system somewhere that maps Ctrl-Shift-A to !, Ctrl-Shift-B to " etc.

maybewhenthesun [3 hidden]5 mins ago

It's more that shift flips that bit. Also I'd call them bit 0 and 1 and not 5 and 6 as 'normally' you count bits from the right (least significant to most significant). But there are lots of differences for 'normal' of course ('middle endian' :-P )

Leszek [3 hidden]5 mins ago

I guess in this system, you'd also type lowercase letters by holding shift?

ezekiel68 [3 hidden]5 mins ago

I love this stuff. It's the kind of lore that keeps getting forgotten and re-discovered by swathes of curious computer scientists over the years. So easy to assume many of the old artifacts (such as the ASCII table) had no rhyme or reason to them.

renox [3 hidden]5 mins ago

I still find weird that they didn't make A,B... just after the digits, that would make binary to hexadecimal conversion more efficient..

iguessthislldo [3 hidden]5 mins ago

Going off the timelines on Wikipedia, the first version of ASCII was published (1963) before the 0-9,A-F hex notation became widely used (>=1966):

- https://en.wikipedia.org/wiki/ASCII#History

- https://en.wikipedia.org/wiki/Hexadecimal#Cultural_history

jolmg [3 hidden]5 mins ago

The alphanumeric codepoints are well placed hexadecimally-speaking though. I don't imagine that was just an accident. For example, they could've put '0' at 050/0x28, but they put it at 060/0x30. That seems to me that they did have hexadecimal in consideration.

kubanczyk [3 hidden]5 mins ago

It's a binary consideration if you think of it rather than hexadecimal.

If you have to prominently represent 10 things in binary, then it's neat to allocate slot of size 16 and pad the remaining 6 items. Which is to say it's neat to proceed from all zeroes:

    x x x x 0 0 0 0
    x x x x 0 0 0 1
    x x x x 0 0 1 0
    ....
    x x x x 1 1 1 1

It's more of a cause for hexadecimal notation than an effect of it.

jolmg [3 hidden]5 mins ago

Currently 'A' is 0x41 and 0101, 'a' is 0x61 and 0141, and '0' is 0x30 and 060. These are fairly simple to remember for converting between alphanumerics and their codepoint. Seems more advantageous, especially if you might be reasonably looking at punchcards.

tgv [3 hidden]5 mins ago

[0-9A-Z] doesn't fit in 5 bits, which impedes shift/ctrl bits.

vanderZwan [3 hidden]5 mins ago

I'm not sure if our convention for hexadecimal notation is old enough to have been a consideration.

EDIT: it would need to predate the 6-bit teletype codes that preceded ASCII.

kps [3 hidden]5 mins ago

They put : ; immediately after the digits because they were considered the least used of the major punctuation, so that they could be replaced by ‘digits’ 10 and 11 where desired.

(I'm almost reluctant to to spoil the fun for the kids these days, but https://en.wikipedia.org/wiki/%C2%A3sd )

mac3n [3 hidden]5 mins ago

credit to William Crosby, "Note on an ASCII-Octal Code Table", CACM 8.10, Oct 1965

https://dl.acm.org/doi/epdf/10.1145/365628.365652

also defined 6-bit ASCII subset

mac3n [3 hidden]5 mins ago

anyone remember 005 ENQ (also called WRU who are you) and its effect on a teletype?

meken [3 hidden]5 mins ago

Very cool.

Though the 01 column is a bit unsatisfying because it doesn’t seem to have any connection to its siblings.

y42 [3 hidden]5 mins ago

first I was like "What but why? You don't save any space or what's that excercise about" then I read it again and it blew my mind. I thought I knew everything about ASCII. What a fool I am, Sokrates was right. Always.

msarnoff [3 hidden]5 mins ago

On early bit-paired keyboards with parallel 7-bit outputs, possibly going back to mechanical teletypes, I think holding Control literally tied the upper two bits to zero. (citation needed)

Also explains why there is no difference between Ctrl-x and Ctrl-Shift-x.

joshcorbin [3 hidden]5 mins ago

Just wait until someone finally gets why CSI ( aka the “other escape” from the 8-bit ansi realm, which is now eternalized in unicode C1 block ) is written ESC [ in 7-bit systems, such as the equally now eternal utf-8 encoding

SUDEEPSD25 [3 hidden]5 mins ago

Love this!

timonoko [3 hidden]5 mins ago

where does this character set come from? It looks different on xterm.

for x in range(0x0,0x20): print(chr(x),end=" ")

voxelghost [3 hidden]5 mins ago

What are you trying to achieve, none of those characters are printable, and definetly not going to show up on the web.

    for x in range(0x0,0x20): print(f'({chr(x)})', end =' ')
    (0|) (1|) (2|) (3|) (4|) (5|) (6|) (7|) (8) (9| ) (10|
    ) (11|
          ) (12|
    ) (14|) (15|) (16|) (17|) (18|) (19|) (20|) (21|) (22|) (23|) (24|) (25|)    (26|␦) (27|8|) (29|) (30|) (31|)

timonoko [3 hidden]5 mins ago

Just asking why they have different icons in different environments? Maybe it is UTF-8 vs ISO-8859?

rbanffy [3 hidden]5 mins ago

They shouldn't show as visual representations, but some "ASCII" charts show the IBM PC character set instead of the ASCII set. IIRC, up to 0xFF UTF-8 and 8859 are very close with the exceptions being the UTF-8 escapes for the longer characters.

timonoko [3 hidden]5 mins ago

Opera AI solved the problem:

If you want to use symbols for Mars and Venus for example,they are not in range(0,0x20). They are in Miscellanous Symbols block.

timonoko [3 hidden]5 mins ago

Ok this set does not even show on Android, just some boxes. Very strange.

Aardwolf [3 hidden]5 mins ago

Imho ascii wasted over 20 of its precious 128 values on control characters nobody ever needs (except perhaps the first few years of its lifetime) and could easily have had degree symbol, pilcrow sign, paragraph symbol, forward tick and other useful symbols instead :)

ogurechny [3 hidden]5 mins ago

Smaller, 6-bit code pages existed before and after that. They did not even have space for upper and lower case letters, but had control characters. Those codes directly moved the paper, switched to next punch card or cut the punched tape on the receiving end, so you would want them if you ever had to send more than a single line of text (or a block of data), which most users did.

Even smaller 5-bit Baudot code had already had special characters to shift between two sets and discard the previous character. Murray code, used for typewriter-based devices, introduced CR and LF, so they were quite frequently needed in way more than few years.

gpvos [3 hidden]5 mins ago

Maybe 32 was a bit much, but even fitting a useful set of control characters into, say, 16, would be tricky for me. For example, ^S and ^Q are still useful when text is scrolling by too fast.

bee_rider [3 hidden]5 mins ago

On top of the control symbols being useful, providing those symbols would have reduced the motivation for Unicode, right?

ASCII did us all the favor of hitting a good stopping point and leaving the “infinity” solution to the future.

zygentoma [3 hidden]5 mins ago

I started using the separator symbols (file, group, record, unit separator, ascii 60-63 ... though mostly the last two) for CSV like data to store in a database. Not looking back!

gschizas [3 hidden]5 mins ago

ASCII 60-63 is just <=>?

You probably mean 28-31 (∟↔▲▼, or ␜␝␞␟)

Unless this is octal notation? But 0o60-0o63 in octal is 0123

mmooss [3 hidden]5 mins ago

I've wanted to do that but don't you have compatibility problems? What can read/import files with those deliminters? Don't people you are working with have problems?

mmooss [3 hidden]5 mins ago

It is interesting that, as a guess, we waste an average of ~5% of storage capacity for text (12.5% of Unicode's first byte, but many languages regularly use higher bytes of course).

I don't fault the creators of ASCII - those control characters were probably needed at the time. The fault is ours for not moving on from the legacy technology. I think some non-ASCII/Unicode encodings did reuse the control character bytes. Why didn't Unicode implement that? I assume they were trying to be be compatible with some existing encodings, but couldn't they have chosen the encodings that made use of the control character code points?

If Unicode were to change it now (probably not happening, but imagine ...), what would they do with those 32 code points? We couldn't move other common characters over to them - those already have well-known, heavily used code points in Unicode and also iirc Unicode promises backward compability with prior versions.

There still are scripts and glyphs not in Unicode, but those are mostly quite rare and effectively would continue to waste the space. Is there some set of characters that would be used and be a good fit? Duplicate the most commonly used codepoints above 8 bits, as a form of compression? Duplicate combining characters? Have a contest? Make it a private area - I imagine we could do that anyway, because I doubt most systems interpret those bytes now.

Also, how much old data, which legitimately uses the ASCII control characters, would become unreadable?

y42 [3 hidden]5 mins ago

only that would have broken the whole thing back in the days ;)