Early on in the history of computing, pretty much all of the computer’s output was a bunch of blinkenlights that told the educated operator how the computer was feeling and what the contents of its registers were. This worked reasonably well for the era; computers were made to do number-crunching work and not to baby their users around by using…TEXT. (shudder).
Soon enough it was decided that text was actually a useful thing for business and communications, and so printers were made that took various numeric codes as input and raucously outputted readable text. Different computer manufacturers used different numeric coding schemes for text output, different numbers of bits, and different cables, so a printer had to be made by the same manufacturer that made the computer. (As did everything else attached to the computer.)
6-bit character codes were fairly common at the time, and variations on character numbering and ordering abounded. You had 64 slots in which to fit 26 alphabetic characters (upper-case only), 10 decimal digits, and any punctuation that might be useful in the programming language your system supported. (Probably FORTRAN or COBOL.) If you wanted to port your code over to another system, chances are its character encoding was different from yours. This was, to put it succinctly, a sucky situation.
In 1963, the American Standards Association (ASA, now ANSI) dumped out a standard for a 7-bit interchange code called ASCII (American Standard Code for Information Interchange), pronounced “ASS-key”. The 8-bit byte was starting to become more standard, so using a 7-bit character code allowed an extra bit for parity, since one of the common uses for computers was in controlling teletypes long-distance.
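To make the "extra bit for parity" concrete, here is a minimal sketch. It assumes even parity, with the parity bit stored in the top bit of the byte; the text above doesn't say which convention any given teletype link used, and the function names are made up for illustration.

```python
def add_even_parity(code):
    """Pack a 7-bit ASCII code into a byte whose top bit makes the 1-count even.

    Even parity is an assumption here; odd parity was also used on real links.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("ASCII codes are 7 bits wide")
    ones = bin(code).count("1")
    parity_bit = ones & 1  # 1 only when the 7-bit code has an odd number of ones
    return (parity_bit << 7) | code


def parity_ok(byte):
    """Return True if the received byte still has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


if __name__ == "__main__":
    sent = add_even_parity(ord("C"))   # 'C' is 0x43, which has three 1 bits
    print(hex(sent), parity_ok(sent))
    print(parity_ok(sent ^ 0x04))      # flip one bit "in transit"
```

A single flipped bit is detected; two flipped bits cancel out and slip through, which is the usual limitation of a lone parity bit.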
Most computer manufacturers ever since then have used ASCII as the primary character coding, with the notable exception of IBM, which created EBCDIC. (I’ve only ever heard this pronounced “EBB-ka-DIK”. These were the days before acronyms had to be cute and pronounceable, and maybe recursive if possible.) EBCDIC is an interesting coding, but nobody really uses it much any more, so if you care about it look it up.
Anyway, here’s the Grand Chart of ASCII Characters. Note that this is the most current-ish version of ASCII; earlier versions had slightly different mappings, but nobody cares about them now.
| | 0x | 1x | 2x | 3x | 4x | 5x | 6x | 7x |
|---|---|---|---|---|---|---|---|---|
| x0 | NUL | DLE | SP | 0 | @ | P | ` | p |
| x1 | SOH | DC1/XON | ! | 1 | A | Q | a | q |
| x2 | STX | DC2 | " | 2 | B | R | b | r |
| x3 | ETX | DC3/XOFF | # | 3 | C | S | c | s |
| x4 | EOT | DC4 | $ | 4 | D | T | d | t |
| x5 | ENQ | NAK | % | 5 | E | U | e | u |
| x6 | ACK | SYN | & | 6 | F | V | f | v |
| x7 | BEL | ETB | ' | 7 | G | W | g | w |
| x8 | BS | CAN | ( | 8 | H | X | h | x |
| x9 | HT | EM | ) | 9 | I | Y | i | y |
| xA | LF | SUB | * | : | J | Z | j | z |
| xB | VT | ESC | + | ; | K | [ | k | { |
| xC | FF | FS | , | < | L | \ | l | \| |
| xD | CR | GS | - | = | M | ] | m | } |
| xE | SO | RS | . | > | N | ^ | n | ~ |
| xF | SI | US | / | ? | O | _ | o | DEL |
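The chart is laid out so that the column gives the high hex digit and the row gives the low one. A small sketch of reading it programmatically (the helper name `ascii_at` is invented for this example), along with two bit tricks that fall out of the layout:

```python
def ascii_at(column, row):
    """Look up a chart cell: column is the high hex digit (0-7), row the low (0-F)."""
    if not (0 <= column <= 7 and 0 <= row <= 0xF):
        raise ValueError("outside the 7-bit ASCII chart")
    return chr((column << 4) | row)


if __name__ == "__main__":
    print(ascii_at(4, 1))                # column 4x, row x1
    print(ascii_at(6, 1))                # column 6x, row x1
    # Upper and lower case sit two columns apart, so they differ only in bit 0x20.
    print(chr(ord("A") ^ 0x20))
    # The digits sit in column 3x, so the low four bits are the digit's value.
    print(ord("7") & 0x0F)
```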
Most of the characters from ASCII 0 to 0x1F are wasted space nowadays, but I’ll give you a full listing of what they are and how you can use them anyway. Along with the full names, I give you the special character escape in C, if any, and the control sequence you can usually press at a terminal to generate the character.
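The control sequences follow a simple rule on most terminals: holding Control keeps only the low five bits of the key's code, which maps columns 4x and 5x of the chart onto columns 0x and 1x. A quick sketch of that rule (the function name is hypothetical, and real terminals may treat a few keys specially):

```python
def control_code(key):
    """Return the ASCII code a terminal usually sends for Control+key."""
    return ord(key.upper()) & 0x1F  # keep only the low five bits


if __name__ == "__main__":
    for key in "@GIJMQS[":
        print("Control+%s -> 0x%02X" % (key, control_code(key)))
    # Several of these also have C-style escapes, which Python shares:
    print(control_code("G") == ord("\a"))  # BEL
    print(control_code("I") == ord("\t"))  # HT
    print(control_code("J") == ord("\n"))  # LF
    print(control_code("M") == ord("\r"))  # CR
```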
Unfortunately, DEL has come to mean something completely different from what it first meant. It’s called “rubout” because, when punched paper tape was used for data storage, you could “rub out” a character by punching out all the holes. All 7 bits would be 1, which gives you DEL (0x7F), the deleted-character character.
From a purely technical standpoint, if you want to move the cursor to the first column on the next line in ASCII, you send a CR and a LF (I’ll get to ordering). This homes the cursor and moves it down. In the years since ASCII was created, those who created UNIX realized that one very seldom needs to send a LF without a preceding CR, so they made LF into NL, which both homed the cursor and moved to the next line. Mind you, if you have a preceding CR it doesn’t really matter; it’s just extraneous.
Since then, numerous operating systems have sprung up and numerous programs have been written with Assumptions. UNIX programs now Assume that lines are ended with LF/NL and VMS/Windows/DOS programs now Assume that lines are ended with CR, then LF. Macintosh programs, oh dear oh dear, assume that lines are ended with CR; the only excuse I can surmise for this is that the Return key would generate a CR character, which could then be included literally in a file one were typing. There are occasionally even misguided programmers who output LF, then CR for newlines.
It’s generally wise to assume that your program might get a text file from a different system, so you should generally accept any of the four forms. When writing files, you can generally go with the default for your system, although most network protocols and many file formats require CR-LF sequences to be used for line ends. (Most servers are polite enough not to argue if you just use a LF character, however.) Make sure you find out what newline type you should be using before you write the program—it’s quite a hassle to have to go through and insert or remove every “\r” in your program.
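Accepting all four forms can be sketched with a single regular expression. This is one possible approach rather than the canonical one, and `split_lines` is a name made up for the example. The two-character forms have to be tried before the one-character forms, or CR-LF would count as two line ends.

```python
import re

# Two-character forms first, so "\r\n" and "\n\r" each count as one line end.
_NEWLINE = re.compile(r"\r\n|\n\r|\r|\n")


def split_lines(text):
    """Split text on CR-LF, LF-CR, bare CR, or bare LF."""
    return _NEWLINE.split(text)


if __name__ == "__main__":
    mixed = "unix\nwindows\r\nold mac\rmisguided\n\rend"
    print(split_lines(mixed))
```

A file that mixes conventions is inherently ambiguous (LF followed by CR could be one LF-CR or an LF line end followed by a bare CR), so this sketch simply picks the greedy reading. Text that ends with a newline yields a trailing empty string.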
Now a brief, vaguely interesting historical point about ordering. You may wonder, as yous are wont to do, why it’s always CR, then LF, and not the other way around; the reason stems from the mechanical properties of teletypes. A carriage return requires the carriage to be shoved all the way over from wherever it is on the page to the left side, which takes a while in computing terms, whereas a line feed only requires a tiny jerk of the motor controlling the paper. By sending a CR, then an LF, the LF can happen while the carriage is hurtling leftwards, which can cut down printing time significantly for a large number of lines.
None of this matters much nowadays, of course, unless you’re printing on something with an actual carriage, like a dot-matrix or daisy-wheel printer.
I’ve seen countless UNIX newbies edit something at their terminal and press Control+S to save, then freak out because their terminal appears to have locked up. This is a Feature, and By Design.
Back in the day, serial terminals/printers were slow and could only accept a few characters before their buffers filled. Serial lines were also painfully simple affairs, and one sent data by bashing the bits directly onto the line. (Hence NUL’s position at ASCII 0—no transmission would do nothing, as expected.) If the receiving device filled its four-character buffer while those characters were being printed, it would have to tell the sender to stop sending until its buffer emptied.
This is where flow control comes in: The receiver can send an XOFF character to the sender to pause the data flow, then send an XON character when it’s ready to receive again. Of course, this is actually a Very Bad Idea on a full-duplex line, since the sender can have sent one or more characters to the receiver before the sender receives the XOFF character, and those characters would be lost.
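Here is a toy simulation of the sender's side of that handshake. It is a sketch under simplifying assumptions: time advances in ticks, the sender sends at most one byte per tick, and a control byte arriving on a tick is seen before that tick's send. The `simulate_sender` function and its `incoming` argument are invented for the example and don't correspond to any real serial API.

```python
XON = 0x11   # DC1, Control+Q
XOFF = 0x13  # DC3, Control+S


def simulate_sender(data, incoming):
    """Return (tick, byte) pairs showing when each byte of data was sent.

    incoming maps a tick number to a control byte received on that tick.
    """
    sent = []
    paused = False
    position = 0
    tick = 0
    last_event = max(incoming, default=-1)
    while position < len(data):
        control = incoming.get(tick)
        if control == XOFF:
            paused = True
        elif control == XON:
            paused = False
        if not paused:
            sent.append((tick, data[position]))
            position += 1
        elif tick > last_event:
            break  # paused with no XON ever coming; give up
        tick += 1
    return sent


if __name__ == "__main__":
    # The receiver asks for a pause on tick 1 and resumes on tick 3.
    print(simulate_sender(b"abcd", {1: XOFF, 3: XON}))
```

The simulation deliberately ignores line latency, which is exactly the weakness described above: on a real line, bytes already in flight when the XOFF goes out still arrive at a full buffer.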
Software flow control becomes more useful in modern bidirectional connections, such as a text terminal; oftentimes, one wants to slow the progress of the data flying at one (try “tar --help” some time), and in such cases XON/XOFF are one’s friend.
If you’re Mr. Newbie, though, nobody told you about this and you’re losing it because that was a very important document that you neglected to save. Fear not! For every XOFF there may be an XON, and pressing Control+Q will start data flowing again. (Most text-mode UNIX applications won’t quit if they receive this, contrary to our GUI-based expectations.) Hooray for history.
| AT&T syntax | Intel/Microsoft syntax |
|---|---|
| cmpb $10,%al | cmp al,10 |
| sbbb $0x69,%al | sbb al,69h |
| das | das |
If you’re stuck with ASCII and you don’t care about your user’s vision, you can also swap between two characters at high speed to get the same effect on a monitor.
ASCII works reasonably well for most English speakers and for most programs dealing with English speakers. Problem is, though, most of the world isn’t English speakers, and even English speakers run into problems when they try to format text correctly; there are actually no proper quotes or dashes in ASCII—the hyphen character serves as a hyphen, minus, en dash, or part of an em dash, all of which should have distinct renderings and meanings.
In the years since ASCII was formalized, a slew of character encodings has been created, each intended for a particular purpose or audience. Some encodings were created to add English-language formatting marks like the aforementioned dashes and quotes and such, some added mathematical characters, some added a few accented characters, and some added thousands of complex characters. Most of these encodings, fortunately, are roughly supersets of ASCII, so ASCII text is generally a safe default.
More recently, a batch of geeks got together and started working on Unicode, which is intended to encode all real (i.e., not Klingon) characters by their meanings, as well as provide a single coding to which most other existing codings could be converted. Unicode permits the encoding of 1,114,112 characters, although only certain sections of that code space are defined by the standard. Seems like the perfect solution, right?
Unfortunately, everything from the UNIX era on assumed that a character is a single byte, and most things assume that that byte has a value from 0 to 127. It’s rather hard to cram 1,114,112 values into a single byte.
So nobody tried. There are several low-level data formats for Unicode: UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. UTF-7 is sometimes used for antiquated mail servers that, again, assume that a character is a byte from 0 to 127. UTF-32LE and UTF-32BE use four bytes per character; this wastes quite a few bits, but makes it possible to encode any Unicode character in a single unit. UTF-32LE/BE is not an uncommon internal representation for Unicode characters.
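The numbers are easy to check. The sketch below (the helper name `utf32` is made up) shows that the code space tops out at 0x10FFFF, that 21 bits are enough to hold any value in it, and that UTF-32 just writes the code point as a four-byte integer in one byte order or the other.

```python
MAX_CODE_POINT = 0x10FFFF  # highest value in the Unicode code space


def utf32(code_point, byteorder):
    """Encode one code point as UTF-32; byteorder is "big" or "little"."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point.to_bytes(4, byteorder)


if __name__ == "__main__":
    print(MAX_CODE_POINT + 1)           # size of the code space
    print(MAX_CODE_POINT.bit_length())  # bits actually needed out of 32
    print(utf32(ord("A"), "big").hex())
    print(utf32(ord("A"), "little").hex())
```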
UTF-16BE/LE are fairly common 16-bit representations for Unicode, and use one or two 16-bit words to encode each character. Most characters you’ll ever use will fit comfortably in a single 16-bit word, but if you do need one of the higher ones it’ll be represented by two surrogates, each of which provides part of the higher code value. Note that Java unfortunately uses UTF-16 for its characters; Java was created before any higher characters were assigned, and they apparently failed to pay attention to the large surrogate ranges.
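The surrogate arithmetic is short enough to sketch by hand. The function name `to_surrogates` is invented for the example; the constants are the standard surrogate range starts. The code point has 0x10000 subtracted, and the remaining 20 bits are split into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 | (offset >> 10)         # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print("%04X %04X" % (high, low))
    # Cross-check against Python's own UTF-16 encoder.
    print(chr(0x1F600).encode("utf-16-be").hex())
```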
UTF-8 is the most common representation for Unicode, and uses 1–4 bytes to represent all of Unicode. This is nice because it works well with our byte-oriented systems. (However, the top bit is used in each byte of a character code >127, so that kills off quite a few systems.) Problem is, you have to hunt through a UTF-8 string sequentially: you can’t just assume that the fifth character is at any given offset, because characters 1–4 could be multi-byte characters.
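The sequential hunt looks something like the following sketch, which assumes well-formed UTF-8 and uses a made-up function name. Continuation bytes always have the bit pattern 10xxxxxx, so any byte that doesn't match that pattern starts a new character.

```python
def utf8_offset(data, index):
    """Return the byte offset of the character at 0-based index in UTF-8 data.

    Assumes data is well-formed UTF-8; no validation is attempted.
    """
    count = 0
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a character starts here
            if count == index:
                return offset
            count += 1
    raise IndexError("string has fewer characters than that")


if __name__ == "__main__":
    data = "a\u00e9\u20ac\U0001F600z".encode("utf-8")  # 1-, 2-, 3-, 4-, 1-byte characters
    print(len(data))
    print(utf8_offset(data, 4))  # the fifth character, 'z'
```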
One big thing UTF-8 does have going for it is that it isn’t byte-order-dependent. If you’re writing a UTF-32/UTF-16 character, you can either write the highest byte first or the lowest byte first; this is termed big-endian (BE) or little-endian (LE). If somebody who writes a file uses UTF-16BE and sends it to somebody who reads it as UTF-16LE, the reader’s going to see garbage. To remedy this, there’s a byte order mark (BOM) specified in Unicode; if it’s out of order, it becomes an invalid character, and if it’s in order, it becomes a zero-width non-breaking space. (In other words, if it works it doesn’t do anything.) Most UTF-16 and UTF-32 files you’ll ever find will have a BOM as the first character so that whatever reads it knows what to do.
There’s some debate about whether one should write a BOM at the beginning of a UTF-8 file. Technically, UTF-8 needs no BOM, because it only has one possible byte ordering. The only reason to include a BOM is to mark the file as unequivocally UTF-8. While this would be a nice feature, this unfortunately breaks compatibility with normal ASCII text files. For example, scripts in UNIX need to start with a “shebang” (#!), which tells the OS how to run the script:
    #!/bin/bash
    echo "Lookie, I'm running."
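The trouble is that the shebang only counts if `#!` is the very first two bytes of the file. A short sketch of what a UTF-8 BOM does to that, using Python's `utf-8-sig` codec (which prepends the BOM when encoding) to stand in for a BOM-happy text editor:

```python
import codecs

script = '#!/bin/bash\necho "Lookie, I\'m running."\n'

plain = script.encode("utf-8")         # no BOM
with_bom = script.encode("utf-8-sig")  # same text, BOM in front

if __name__ == "__main__":
    print(plain[:2])                     # the two bytes the OS looks for
    print(with_bom[:3].hex())            # the BOM sits where "#!" should be
    print(with_bom.startswith(b"#!"))
    print(with_bom[:3] == codecs.BOM_UTF8)
```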
So the verdict? Unless you’re writing plain text with no annotation, decoration, or interpretation, don’t put a BOM in your UTF-8 text. And now, a table of BOMs:
| | UTF-32 BOM | UTF-16 BOM | UTF-8 BOM |
|---|---|---|---|
| BE | 00 00 FE FF | FE FF | EF BB BF |
| LE | FF FE 00 00 | FF FE | EF BB BF |
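A reader can use the table to guess an encoding from the first few bytes. The sketch below is one way to do it, with an invented function name. The UTF-32 marks have to be tested before the UTF-16 ones, because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
# Longest marks first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def sniff_bom(data):
    """Return (codec_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    data = b"\xfe\xff\x00h\x00i"
    name, skip = sniff_bom(data)
    print(name, data[skip:].decode(name))
```

One caveat: UTF-16LE text whose first real character is U+0000 looks identical to a UTF-32LE BOM, so the guess can be wrong in that (rare) case.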
You’ll see a lot of “extended ASCII” charts if you go hunting around on the web, and you’ll see a lot of DOS programs refer to such things. Unfortunately, “extended ASCII” doesn’t actually exist, in any formal way. Basically, the video cards on the IBM PC series supported ASCII in the low 128 characters and a bunch of extra characters in a seemingly random order if the top bit was set. In addition, since the video card didn’t need to display control characters on the screen, most control characters were filled in with graphic characters too.
For historical purposes, all PCs in text mode and all DOS programs still use this character set; it’s generally referred to as “code page 437” (CP437) because that’s what one would use to select that set of characters in internationalized versions of DOS. Earlier versions of Windows, unfortunately, used something called “code page 1252”, which has a completely different set of high characters. Many Windows programs default to this code page when they drop characters to an eight-bit form.
The ISO 8859 standard created a bunch of pages based on ASCII, too; it’s fairly common to find ISO 8859-1, the Western European page, in use for text documents under Windows and UNIX. Code page 1252 (CP1252) is basically ISO 8859-1 with a bunch of extra characters jammed into a secondary control character section (the C1 controls, as opposed to the C0 controls in ASCII).
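To see how differently these pages treat the same high bytes, here is a small sketch that decodes two bytes under each of the three. The helper name is made up; `cp437`, `cp1252`, and `latin-1` are the names Python's standard codecs use for CP437, CP1252, and ISO 8859-1.

```python
def code_points(raw, codec):
    """Decode raw bytes with the given codec and return the code points as ints."""
    return [ord(ch) for ch in raw.decode(codec)]


if __name__ == "__main__":
    raw = bytes([0x93, 0xE9])
    for codec in ("cp437", "cp1252", "latin-1"):
        text = raw.decode(codec)
        print(codec, ["U+%04X" % ord(ch) for ch in text])
```

Byte 0x93 lands in the C1 control range under ISO 8859-1 but is a curly quote under CP1252, which is the "extra characters jammed into a secondary control character section" in action; CP437 assigns both bytes to entirely different characters.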