All About ASCII

Character encodings

Early on in the history of computing, pretty much all of the computer’s output was a bunch of blinkenlights that told the educated operator how the computer was feeling and what the contents of its registers were. This worked reasonably well for the era; computers were made to do number-crunching work and not to baby their users around by using…TEXT. (Shudder.)

Soon enough it was decided that text was actually a useful thing for business and communications, and so printers were made that took various numeric codes as input and raucously outputted readable text. Different computer manufacturers used different numeric coding schemes for text output, different numbers of bits, and different cables, so a printer had to be made by the same manufacturer that made the computer. (As did everything else attached to the computer.)

6-bit character codes were fairly common at the time, and variations on character numbering and ordering abounded. You had 64 slots in which to fit 26 alphabetic characters (upper-case only), 10 decimal digits, and any punctuation that might be useful in the programming language your system supported. (Probably FORTRAN or COBOL.) If you wanted to port your code over to another system, chances are its character encoding was different from yours. This was, to put it succinctly, a sucky situation.

In 1963, the American Standards Association (ASA, now ANSI) dumped out a standard for a 7-bit interchange code called ASCII (American Standard Code for Information Interchange), pronounced “ASS-key”. The 8-bit byte was starting to become more standard, so using a 7-bit character code allowed an extra bit for parity, since one of the common uses for computers was in controlling teletypes long-distance.
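To make the spare-bit idea concrete, here is a minimal sketch of how a parity bit could ride along with a 7-bit ASCII code. It assumes even parity stored in the top bit, which is only one of several conventions that were in use; the function names are made up for illustration.

```python
def add_even_parity(code: int) -> int:
    """Return an 8-bit value: a 7-bit ASCII code plus an even-parity top bit.

    Assumption: even parity in bit 7. Real links also used odd, mark, or space parity.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("ASCII codes are 7 bits (0 to 127)")
    ones = bin(code).count("1")
    parity = ones & 1  # 1 if the code has an odd number of 1-bits
    return (parity << 7) | code


def parity_ok(byte: int) -> bool:
    """True if the 8-bit value has an even number of 1-bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


if __name__ == "__main__":
    sent = add_even_parity(ord("C"))   # 'C' has three 1-bits, so the top bit gets set
    print(hex(sent), parity_ok(sent))
    damaged = sent ^ 0x04              # flip one bit "in transit"
    print(hex(damaged), parity_ok(damaged))
```

A single flipped bit makes the check fail; two flipped bits cancel out and slip through, which is why parity is only a cheap sanity check.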

Most computer manufacturers ever since then have used ASCII as the primary character coding, with the notable exception of IBM, which created EBCDIC. (I’ve only ever heard this pronounced “EBB-ka-DIK”. These were the days before acronyms had to be cute and pronounceable, and maybe recursive if possible.) EBCDIC is an interesting coding, but nobody really uses it much any more, so if you care about it look it up.

Anyway, here’s the Grand Chart of ASCII Characters. Note that this is the most current-ish version of ASCII; earlier versions had slightly different mappings, but nobody cares about them now.

     0x   1x        2x  3x  4x  5x  6x  7x
x0   NUL  DLE       SP  0   @   P   `   p
x1   SOH  DC1/XON   !   1   A   Q   a   q
x2   STX  DC2       "   2   B   R   b   r
x3   ETX  DC3/XOFF  #   3   C   S   c   s
x4   EOT  DC4       $   4   D   T   d   t
x5   ENQ  NAK       %   5   E   U   e   u
x6   ACK  SYN       &   6   F   V   f   v
x7   BEL  ETB       '   7   G   W   g   w
x8   BS   CAN       (   8   H   X   h   x
x9   HT   EM        )   9   I   Y   i   y
xA   LF   SUB       *   :   J   Z   j   z
xB   VT   ESC       +   ;   K   [   k   {
xC   FF   FS        ,   <   L   \   l   |
xD   CR   GS        -   =   M   ]   m   }
xE   SO   RS        .   >   N   ^   n   ~
xF   SI   US        /   ?   O   _   o   DEL
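If you’d rather compute a chart position than look it up, the column is just the high hex digit of the code and the row is the low hex digit. The helper below is a small sketch of that reading of the chart; `chart_position` is a name invented for this example.

```python
def chart_position(ch: str) -> tuple:
    """Return (column label, row label) for an ASCII character in the chart."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    column = code >> 4    # high hex digit: 0 through 7
    row = code & 0x0F     # low hex digit: 0 through F
    return f"{column:X}x", f"x{row:X}"


if __name__ == "__main__":
    for ch in "Aa7~":
        print(repr(ch), hex(ord(ch)), chart_position(ch))
```

One handy consequence of the layout: a letter and its other-case twin sit in the same row, two columns apart, so they differ only in the 0x20 bit.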

The control characters

Most of the characters from ASCII 0 to 0x1F are wasted space nowadays, but I’ll give you a full listing of what they are and how you can use them anyway. Along with the full names, I give you the special character escape in C, if any, and the control sequence you can usually press at a terminal to generate the character.

NUL: Null ('\0', Control+@)
This could be used as a keepalive for a teletype back in the day, and if you print it it should do absolutely nothing. It’s used mostly to terminate strings of text nowadays.
SOH: Start of header (Control+A)
When you sent a message on a teletype, you sent a header and some text, generally. This started a header. Pretty much unused nowadays.
STX: Start of text (Control+B)
This would start the text of a message; again, pretty much unused now.
ETX: End of text (Control+C)
This would end the text of a message; unused now. This is a break character on UNIX and DOS. (The ETX meaning may have something to do with why this is used as a break-this-program character; I don’t know.)
EOT: End of transmission (Control+D)
This would signal to a teletype that you had no more characters to send it and it could disconnect. UNIX picked this up for terminals: typing it at the start of a line tells the terminal driver to signal end-of-file (EOF), which “ends” standard input when a program’s reading it from a terminal. (It isn’t stored in text files as an EOF marker; a UNIX file just ends when it runs out of bytes.)
ENQ: Enquiry (Control+E)
Basically a glorified “ping”. This was sent from one station to another to see if anybody/anything on the other side was available. Unused except in hacker parlance nowadays.
ACK: Acknowledgement (Control+F)
A glorified “yes”. This would generally be sent back in response to an ENQ as the pong to its ping, or after a message was received to indicate that all had gone well with transmission.
BEL: Bell ('\a', Control+G)
Sending this would ring the bell on the teletype/printer to notify whoever was around there that the teletype/printer bell was ringing. (Specific reasons varied from annoying the other party to summoning them for emergencies.) Some modern printers will still beep when you send them this character; try it sometime and see. At a terminal, this generally just beeps nowadays, and is the reason that UNIX console users conjure a horrible din when they use VIM. In C, this is '\a' for “alert”.
Note: If you’d like to disable the beeps at a Linux terminal, you can run the following at a Bash prompt:
echo -en '\e[10;0]\e[11;0]'
This essentially tells the Linux console driver to set the pitch and duration of beeps to 0. May or may not work in other terminals or operating systems.
BS: Backspace ('\b', Control+H)
This moves the print head or cursor to the left by one character, if possible. Back in the day you could use it to overstrike things, and you’ll still see it occasionally used to underline things in text files. (For example, if you want to underline “Hello” you send “Hello” BS BS BS BS BS “_____”, which is actually the purpose of having the underscore in ASCII.) About half the time the Backspace key at a terminal will generate this; the other half, it’ll generate DEL.
HT: Horizontal tabulation ('\t', Control+I or Tab)
This moves the print head or cursor to the next tab stop on the line, technically. Nowadays if you’re at a terminal this generally tabs you out to a multiple of 8 spaces, possibly dumping you onto the next line if there isn’t another multiple on the line. Since the dawn of time, we’ve had 80-character lines, so that gives you 10 stops per line.
LF: Line feed ('\n', Control+J)
This moves the print head or cursor down one line, but leaves the horizontal position the same… unless you’re in UNIX, in which case this has the name “NL” (“New Line”) and it homes the cursor/print head to the first column as well. 99.9% of the time nowadays this is either standing alone or preceded by a CR character; see the section on newlines. Some terminals generate this character for Enter.
VT: Vertical tabulation ('\v', Control+K)
Almost purely historical nowadays. This would move the print head down to a vertical tab stop on the page (say, a multiple of 1" or 2") back in the day; today, just about any printer or terminal will treat it as equivalent to a LF. (The only exception is that you really shouldn’t end lines in VT or CR/VT, because most things won’t know what to do with it.)
FF: Form feed ('\f', Control+L)
This scrolls the print head down to the next sheet of paper, or “scrolls” the terminal down to the next screen (really just clears it). Some terminals treat this as a LF. This is still actually fairly commonly used in printers, and you’ll find these characters lying around in old ASCII text files to control page-breaking.
CR: Carriage return ('\r', Control+M)
This whangs the print head back to the first column, or does the same with the cursor if you’re on a terminal. On DOS/Windows/VMS computers, this is the first half of a newline sequence, and on classic (pre-OS X) Macintosh computers this is the newline sequence. See the section on newlines for more info. Sometimes Enter and Return generate this at a terminal.
SO: Shift out (Control+N)
This is used occasionally in extended character codings to do something called a lock shift into another character set. (For example, in an 8-bit coding you could use SO and SI to control what set of characters were mapped to 0x80–0xFF but leave 0x00–0x7F the same.) Becoming less common in general.
SI: Shift in (Control+O)
This is the counterpart to SO, and generally does the opposite if it does anything at all.
DLE: Data link escape (Control+P)
This escapes one or more characters following so that the receiving system doesn’t try to actually follow the commands encoded with them. For example, if you need to send A B C ETX D E F in a message delimited by STX and ETX, you could send STX A B C DLE ETX D E F ETX. This is pretty much unused now.
DC1: Device control 1 (Control+Q)
Technically this is just a generic device control code that the receiver and sender can use for whatever purpose they desire. In practice, this is almost always XON, which is half of a commonly used software flow control protocol. See the section on XON/XOFF.
DC2: Device control 2 (Control+R)
Just a generic device control code. This doesn’t have any common meaning that I’m aware of.
DC3: Device control 3 (Control+S)
Like DC1 and DC2, this could theoretically be used for any purpose, but it’s almost always XOFF. See the section on XON/XOFF.
DC4: Device control 4 (Control+T)
Like DC2; no special meaning.
NAK: Negative acknowledgement (Control+U)
The counterpart to ACK; used as a “no” or “you screwed up”. It could be used as a response to ENQ to indicate that the ENQed terminal isn’t ready, or it could be used in response to a message to indicate that it wasn’t received intact. Nowadays it’s mainly used in hacker parlance. This is sometimes a kill character on UNIX.
SYN: Synchronous idle (Control+V)
This was used either to establish a connection or as a keepalive to let the receiver know that the sender was still there. A common connection setup sequence was SYN SYN ENQ, to which one hoped the response was ACK. (Similar things still happen in TCP handshaking.) The character’s pretty much unused nowadays.
ETB: End of transmission block (Control+W)
This could be used to chunk data into packets; one could send data, ETB, data, ETB, data, EOT to break up a large transmission into smaller ones. Unused now.
CAN: Cancel (Control+X)
CAN is generally used to cancel a command sequence or discard the most recent outgoing message. It can still be used sometimes if you’re talking to a printer and you need to break out of a non-default printing mode, or if you’ve somehow outputted part of an escape sequence and need to break out of it. Sometimes works on terminals too.
EM: End of medium (Control+Y)
Indicates that the end of the current printing medium has been reached. This would generally be returned from a printer to let you know that you need to send it a FF. Never seen it actually used.
SUB: Substitute (Control+Z)
This could be used to replace a character that was invalid or couldn’t be rendered. (E.g., sending a Chinese character to a 1970s-era American mainframe.) You might want to use this if you convert from larger character sets into ASCII and can’t find any suitable representation, but it’s probably best not to any more. This character is also DOS’s EOF, and has the same function in DOS as Control+D does in UNIX.
ESC: Escape ('\e', Control+[)
Escape is a commonly used escape sequence introducer, which means that if it shows up in the middle of text, a command to the receiver follows it. Different systems over the years have come up with different sets of escape codes to switch character sets, change font, change color, move the cursor around, fill the screen with Es (I’m not kidding, it still works on UNIX), and so forth, although the main set of escape codes that’s still used was standardized by ANSI and used on the DEC VT-series serial terminals.
Most UNIX systems today use escape codes fairly extensively, both to transmit keycodes that aren’t in ASCII (e.g., F1 or Home) and to send commands to the terminal driver (e.g., make text white on a red background). It’s horribly outmoded, but it’s The Way It’s Always Been Done, Dammit. Note that '\e' isn’t standard C; it’s an extension supported by GCC and GCC-like compilers, so use '\033' or '\x1B' if you need portability. Escape on most terminals generates this character.
FS: File separator (Control+\)
This could be used to separate text files if they were being transmitted en masse, but I’ve never really heard of anyone using it for anything except as the quit character on UNIX, where Control+\ usually sends SIGQUIT to the running program.
GS: Group separator (Control+])
This could be used to separate groups of objects transmitted en masse. Never heard of it being used.
RS: Record separator (Control+^)
This could be used to separate records in databases, back when those were the only sorts of files; I’ve never heard of it being used.
US: Unit separator (Control+_)
This could be used to separate units within records in databases, but it’s never used like this any more.
DEL: Delete (Control+?)
Also called “rubout”, this would generally move the cursor left one character and delete the character there. Very few printers can do something like this. This is often generated by the Backspace key instead of BS. (Depends on the terminal and the context.)

Unfortunately, this control code has come to mean something completely different from what it first meant. It’s called “rubout” because, when punched paper tape was used for data storage, you could “rub out” a mistyped character by punching out all the holes in its row. All 7 bits would be 1, which gives you DEL, the deleted-character character.

SP: Space
This is technically not really a control code, but it’s a close relative. It’s our friend the space, no special key combos needed.
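The Control+key combinations in the list above follow a simple pattern: take the ASCII code of the key and flip the 0x40 bit. Here’s a minimal sketch of that rule (the function name is invented for this example). Real terminals vary in the details, and many simply mask with 0x1F, which gives the same answer for everything except Control+?.

```python
def control_code(key: str) -> int:
    """ASCII code generated by Control+key, using the 'flip bit 0x40' rule."""
    code = ord(key.upper())        # Control+a and Control+A are the same thing
    if not 0x3F <= code <= 0x5F:   # only '?', '@', A to Z, and [ \ ] ^ _ take part
        raise ValueError(f"no control character for {key!r}")
    return code ^ 0x40


if __name__ == "__main__":
    for key in "@gq[?":
        print(f"Control+{key.upper()} -> 0x{control_code(key):02X}")
```

So Control+@ lands on NUL, Control+[ on ESC, and Control+? wraps around to DEL at 0x7F.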

Newlines

From a purely technical standpoint, if you want to move the cursor to the first column on the next line in ASCII, you send a CR and a LF (I’ll get to ordering). This homes the cursor and moves it down. In the years since ASCII was created, those who created UNIX realized that one very seldom needs to send a LF without a preceding CR, so they made LF into NL, which both homed the cursor and moved to the next line. Mind you, if you have a preceding CR it doesn’t really matter; it’s just extraneous.

Since then, numerous operating systems have sprung up and numerous programs have been written with Assumptions. UNIX programs now Assume that lines are ended with LF/NL and VMS/Windows/DOS programs now Assume that lines are ended with CR, then LF. Classic Macintosh programs, oh dear oh dear, Assume that lines are ended with CR; the only excuse I can surmise for this is that the Return key would generate a CR character, which could then be included literally in a file one were typing. There are occasionally even misguided programmers who output LF, then CR for newlines.

It’s generally wise to assume that your program might get a text file from a different system, so you should generally accept any of the four forms. When writing files, you can generally go with the default for your system, although most network protocols and many file formats require CR-LF sequences to be used for line ends. (Most servers are polite enough not to argue if you just use a LF character, however.) Make sure you find out what newline type you should be using before you write the program—it’s quite a hassle to have to go through and insert or remove every “\r” in your program.
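Accepting all four forms is less work than it sounds. The sketch below is one way to do it with a regular expression; the function name and the choice to try the two-character forms first are my own, not anything a standard requires.

```python
import re

# Try the two-character sequences first so CR-LF (or LF-CR) counts as ONE line end.
_NEWLINE = re.compile(r"\r\n|\n\r|\r|\n")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Rewrite CR-LF, LF-CR, bare CR, and bare LF line ends as `newline`."""
    return _NEWLINE.sub(lambda match: newline, text)


if __name__ == "__main__":
    mixed = "dos\r\nunix\nmac\rbackwards\n\rend"
    print(normalize_newlines(mixed).split("\n"))
    print(repr(normalize_newlines("a\nb", newline="\r\n")))
```

One caveat: a file that mixes conventions is inherently ambiguous. An LF followed by a CR could be one backwards newline or a UNIX line end followed by an empty Mac line, and this sketch always picks the first reading.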

Now a brief, vaguely interesting historical point about ordering. You may wonder, as yous are wont to do, why it’s always CR, then LF, and not the other way around; the reason stems from the mechanical properties of teletypes. A carriage return requires the carriage to be shoved all the way over from wherever it is on the page to the left side, which takes a while in computing terms, whereas a line feed only requires a tiny jerk of the motor controlling the paper. By sending a CR, then an LF, the LF can happen while the carriage is hurtling leftwards, which can cut down printing time significantly for a large number of lines.

None of this matters much nowadays, of course, unless you’re printing on something with an actual carriage, like a dot-matrix or daisy-wheel printer.

XON and XOFF

I’ve seen countless UNIX newbies edit something at their terminal and press Control+S to save, then freak out because their terminal appears to have locked up. This is a Feature, and By Design.

Back in the day, serial terminals/printers were slow and could only accept a few characters before their buffers filled. Serial lines were also painfully simple affairs, and one sent data by bashing the bits directly onto the line. (Hence NUL’s position at ASCII 0—no transmission would do nothing, as expected.) If the receiving device filled its four-character buffer while those characters were being printed, it would have to tell the sender to stop sending until its buffer emptied.

This is where flow control comes in: The receiver can send an XOFF character to the sender to pause the data flow, then send an XON character when it’s ready to receive again. Of course, this is actually a Very Bad Idea on a full-duplex line, since the sender can have sent one or more characters to the receiver before the sender receives the XOFF character, and those characters would be lost.
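Here’s a toy simulation of that handshake, just to show the moving parts. Everything about it is an assumption made for illustration: the class names, the four-character buffer, a printer three times slower than the line, and (most importantly) control characters that arrive instantly, which is exactly the luxury a real line doesn’t give you.

```python
XON, XOFF = 0x11, 0x13  # DC1 and DC3


class Sender:
    def __init__(self, data: bytes):
        self.pending = list(data)
        self.paused = False

    def receive_control(self, byte: int) -> None:
        if byte == XOFF:
            self.paused = True
        elif byte == XON:
            self.paused = False

    def next_byte(self):
        """Next byte to put on the line, or None if paused or finished."""
        if self.paused or not self.pending:
            return None
        return self.pending.pop(0)


class Receiver:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.buffer = []
        self.printed = []
        self.stopped = False  # True once we've sent XOFF and not yet XON
        self.peak = 0         # fullest the buffer ever got

    def accept(self, byte: int):
        """Buffer an incoming byte; return XOFF if the buffer is now full."""
        self.buffer.append(byte)
        self.peak = max(self.peak, len(self.buffer))
        if len(self.buffer) >= self.capacity and not self.stopped:
            self.stopped = True
            return XOFF
        return None

    def print_one(self):
        """Print one buffered byte; return XON once the buffer has drained."""
        if self.buffer:
            self.printed.append(self.buffer.pop(0))
        if self.stopped and not self.buffer:
            self.stopped = False
            return XON
        return None


def simulate(data: bytes, capacity: int = 4):
    """Run sender and receiver in lockstep; return (printed bytes, peak buffer fill)."""
    sender, receiver = Sender(data), Receiver(capacity)
    tick = 0
    while sender.pending or receiver.buffer:
        byte = sender.next_byte()
        if byte is not None:
            reply = receiver.accept(byte)
            if reply is not None:
                sender.receive_control(reply)  # assumption: zero-latency control path
        if tick % 3 == 0:                      # the printer only keeps up every third tick
            reply = receiver.print_one()
            if reply is not None:
                sender.receive_control(reply)
        tick += 1
    return bytes(receiver.printed), receiver.peak


if __name__ == "__main__":
    print(simulate(b"HELLO, WORLD"))
```

Because the XOFF takes effect immediately here, the buffer never overflows. Add any delay between sending XOFF and the sender noticing it and you get the lost characters described above.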

Software flow control becomes more useful in modern bidirectional connections, such as a text terminal; oftentimes, one wants to slow the progress of the data flying at one (try “tar --help” some time), and in such cases XON/XOFF are one’s friend.

If you’re Mr. Newbie, though, nobody told you about this and you’re losing it because that was a very important document that you neglected to save. Fear not! For every XOFF there may be an XON, and pressing Control+Q will start data flowing again. (Most text-mode UNIX applications won’t quit if they receive this, contrary to our GUI-based expectations.) Hooray for history.

Useful notes about ASCII

ASCII and Unicode

ASCII works reasonably well for most English speakers and for most programs dealing with English speakers. Problem is, though, most of the world isn’t English speakers, and even English speakers run into problems when they try to format text correctly; there are actually no proper quotes or dashes in ASCII—the hyphen character serves as a hyphen, minus, en dash, or part of an em dash, all of which should have distinct renderings and meanings.

In the years since ASCII was formalized, a slew of character encodings has been created, each intended for a particular purpose or audience. Some encodings were created to add English-language formatting marks like the aforementioned dashes and quotes and such, some added mathematical characters, some added a few accented characters, and some added thousands of complex characters. Most of these encodings, fortunately, are roughly supersets of ASCII, so ASCII text is generally a safe default.

More recently, a batch of geeks got together and started working on Unicode, which is intended to encode all real (i.e., not Klingon) characters by their meanings, as well as provide a single coding to which most other existing codings could be converted. Unicode permits the encoding of 1,114,112 characters, although only certain sections of that code space are defined by the standard. Seems like the perfect solution, right?
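In case you’re wondering where 1,114,112 comes from: the code space runs from 0 to 0x10FFFF, which works out to 17 “planes” of 65,536 code points each. (The plane breakdown is my addition here, not something the text above spells out.) A quick check in Python, which uses the same upper limit for its own strings:

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                       # size of the Unicode code space
print(hex(total - 1))              # the highest code point
print(sys.maxunicode == total - 1)

try:
    chr(total)                     # one past the end of the code space
except ValueError as error:
    print("out of range:", error)
```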

Unfortunately, everything from the UNIX era on assumed that a character is a single byte, and most things assume that that byte has a value from 0 to 127. It’s rather hard to cram 1,114,112 values into a single byte.

So nobody tried. There are several low-level data formats for Unicode: UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. UTF-7 is sometimes used for antiquated mail servers that, again, assume that a character is a byte from 0 to 127. UTF-32LE and UTF-32BE use four bytes per character; this wastes quite a few bits, but makes it possible to encode any Unicode character in a single unit. UTF-32LE/BE is not an uncommon internal representation for Unicode characters.

UTF-16BE/LE are fairly common 16-bit representations for Unicode, and use one or two 16-bit words to encode each character. Most characters you’ll ever use will fit comfortably in a single 16-bit word, but if you do need one of the higher ones it’ll be represented by two surrogates, each of which provides part of the higher code value. Note that Java unfortunately uses UTF-16 for its characters; Java was created before any higher characters were assigned, and they apparently failed to pay attention to the large surrogate ranges.
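The surrogate arithmetic is short enough to show. This is a sketch of the standard calculation with a made-up function name: subtract 0x10000 to get a 20-bit number, then put the top ten bits in a high surrogate and the bottom ten in a low surrogate.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
    # Cross-check against Python's own big-endian UTF-16 encoder.
    print(chr(0x1F600).encode("utf-16-be").hex())
```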

UTF-8 is the most common representation for Unicode, and uses 1–4 bytes to represent all of Unicode. This is nice because it works well with our byte-oriented systems. (However, the top bit is used in each byte of a character code >127, so that kills off quite a few systems.) Problem is, you have to hunt through a UTF-8 string sequentially: you can’t just assume that the fifth character is at any given offset, because characters 1–4 could be multi-byte characters.
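The sequential hunt looks something like the sketch below. It leans on one property of UTF-8: every byte after the first in a multi-byte character has the bit pattern 10xxxxxx, so any byte that doesn’t look like that starts a new character. The function name is invented, and the code assumes the input is valid UTF-8.

```python
def char_offset(data: bytes, index: int) -> int:
    """Byte offset of the index-th character (counting from 0) in UTF-8 data."""
    seen = -1
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte, so a character starts here
            seen += 1
            if seen == index:
                return offset
    raise IndexError("string has fewer characters than that")


if __name__ == "__main__":
    data = "naïve €5".encode("utf-8")
    for i, ch in enumerate("naïve €5"):
        print(i, repr(ch), "starts at byte", char_offset(data, i))
```

In that example the fifth character, “e”, sits at byte offset 5 rather than 4, because the “ï” before it takes two bytes.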

One big thing UTF-8 does have going for it is that it isn’t byte-order-dependent. If you’re writing a UTF-32/UTF-16 character, you can either write the highest byte first or the lowest byte first; this is termed big-endian (BE) or little-endian (LE). If somebody who writes a file uses UTF-16BE and sends it to somebody who reads it as UTF-16LE, the reader’s going to see garbage. To remedy this, there’s a byte order mark (BOM) specified in Unicode; if it’s out of order, it becomes an invalid character, and if it’s in order, it becomes a zero-width non-breaking space. (In other words, if it works it doesn’t do anything.) Most UTF-16 and UTF-32 files you’ll ever find will have a BOM as the first character so that whatever reads it knows what to do.

There’s some debate about whether one should write a BOM at the beginning of a UTF-8 file. Technically, UTF-8 needs no BOM, because it only has one possible byte ordering. The only reason to include a BOM is to mark the file as unequivocally UTF-8. While this would be a nice feature, this unfortunately breaks compatibility with normal ASCII text files. For example, scripts in UNIX need to start with a “shebang” (#!), which tells the OS how to run the script:

#!/bin/bash
echo "Lookie, I'm running."
If you saved that file as UTF-8 with a BOM, the shebang would no longer be the first thing in the file, and the OS wouldn’t know how to run the script. (This usually just means it defaults to using the default shell, which won’t know what to do with the BOM, either.)

So the verdict? Unless you’re writing plain text with no annotation, decoration, or interpretation, don’t put a BOM in your UTF-8 text. And now, a table of BOMs:

     UTF-32 BOM    UTF-16 BOM   UTF-8 BOM
BE   00 00 FE FF   FE FF        EF BB BF
LE   FF FE 00 00   FF FE        (same)
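If you need to sniff a file’s encoding from its BOM, a minimal sketch might look like this. The function name and return shape are made up for the example; the one real subtlety is ordering, since the UTF-32LE mark starts with the same two bytes as the UTF-16LE mark and has to be checked first.

```python
# Longest marks first, so UTF-32LE (FF FE 00 00) wins over UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xFE\xFF", "UTF-32BE"),
    (b"\xFF\xFE\x00\x00", "UTF-32LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
]


def detect_bom(data: bytes) -> tuple:
    """Return (encoding name, BOM length), or (None, 0) if there's no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    print(detect_bom(b"\xFF\xFEh\x00i\x00"))
    print(detect_bom(b"\xEF\xBB\xBF#!/bin/bash\n"))
    print(detect_bom(b"#!/bin/bash\n"))
```

This is a guess, not proof: a UTF-16LE file whose first real character is NUL would be misread as UTF-32LE, and a file with no BOM at all tells you nothing.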

Notes about related encodings

You’ll see a lot of “extended ASCII” charts if you go hunting around on the web, and you’ll see a lot of DOS programs refer to such things. Unfortunately, “extended ASCII” doesn’t actually exist, in any formal way. Basically, the video cards on the IBM PC series supported ASCII in the low 128 characters and a bunch of extra characters in a seemingly random order if the top bit was set. In addition, since the video card didn’t need to display control characters on the screen, most control characters were filled in with graphic characters too.

For historical purposes, all PCs in text mode and all DOS programs still use this character set; it’s generally referred to as “code page 437” (CP437) because that’s what one would use to select that set of characters in internationalized versions of DOS. Earlier versions of Windows, unfortunately, used something called “code page 1252”, which has a completely different set of high characters. Many Windows programs default to this code page when they drop characters to an eight-bit form.

The ISO 8859 standard created a bunch of pages based on ASCII, too; it’s fairly common to find ISO 8859-1, the Western European page, in use for text documents under Windows and UNIX. Code page 1252 (CP1252) is basically ISO 8859-1 with a bunch of extra characters jammed into a secondary control character section (the C1 controls, as opposed to the C0 controls in ASCII).
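You can watch the three code pages disagree by decoding the same byte under each one. The sketch below uses the codec names that ship with Python ("cp437", "cp1252", "latin-1"); the helper function is invented for this example.

```python
def decode_all(byte_value: int) -> dict:
    """Decode a single byte under each of the three code pages discussed above."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in ("cp437", "cp1252", "latin-1")}


if __name__ == "__main__":
    for value in (0x41, 0x93, 0xE9):
        decoded = decode_all(value)
        summary = ", ".join(f"{name}: U+{ord(ch):04X}" for name, ch in decoded.items())
        print(f"0x{value:02X} -> {summary}")
```

Byte 0x41 is “A” everywhere, since all three agree with ASCII below 0x80. Byte 0x93 is a curly opening quote in CP1252 but a C1 control in ISO 8859-1, and 0xE9 is “é” in both of those but a Greek theta in CP437. A few bytes (0x81, for one) are unassigned in CP1252, and Python’s codec raises UnicodeDecodeError for them.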
