Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!doc.ic.ac.uk!uknet!yorkohm!minster!forsyth
From: forsyth@minster.york.ac.uk
Newsgroups: comp.unix.bsd
Subject: multibyte character representations and Unicode
Message-ID: <721993836.11625@minster.york.ac.uk>
Date: 17 Nov 92 09:50:36 GMT
Organization: Department of Computer Science, University of York, England
Lines: 42

Terry Weber suggests that half one's disc space will vanish on adopting Unicode. Not so: I draw your attention to Plan 9, which uses Unicode very successfully. See the Plan 9 documentation on research.att.com (dist/plan9doc, I think).

Basically, there is a multibyte encoding for Unicode that works well. Inside relatively FEW programs the multibyte encoding is converted to an integer representation (the type `Rune') to simplify manipulation. For instance, the text displayed in a text frame by sam or the window manager is kept as Runes, but ONLY the text displayed. Any hidden text -- and text in disc files -- is kept in the multibyte encoding.

Some care is required in specifying the multibyte encoding. It seems that Plan 9 originally followed the encoding specified in the Unicode standard, but it has some messy consequences in practice: not least that the 2nd and 3rd bytes can appear to be valid ASCII. (Why anyone would design an encoding that does this is beyond me, since the problems are fairly obvious, but that's what Unicode did.)

Eventually Plan 9 switched to a new encoding -- which apparently has now been proposed for use in ISO 10646 -- that lacks all the unfortunate features. The second and third bytes of the encoding do not look like ASCII characters. (All bytes of an encoded non-ASCII character have the 0x80 bit set.) The consequence is that even fewer programs are affected: most pass Unicode encodings straight through. In particular, the `normal' file system names can hold Unicode characters without fuss. There is certainly no need to switch to 16-bit representations for them, with all that that entails.

Actually, on Plan 9 you cannot even run the window manager without using Unicode: its name is `eight and a half' (i.e., 8 followed by a 1/2 symbol!), entered as `8 ALT 1 2' (on my keyboard, anyhow).

You can find much of the Plan 9 Rune support in the source for Pike's editor `sam', also on research.att.com (dist/sam, I think). (You also get a very decent editor, a library that gives you a sane interface to X11, and a library for managing text on a bitmap display.)

Obviously programs can store Runes in disc files if that's really what they need, or if their authors work for disc manufacturers, but it isn't necessary.
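
To make the byte layout concrete, here is a minimal sketch in C of an encoder and decoder for a multibyte encoding with the property described above. It assumes the newer Plan 9 encoding is the one now generally known as UTF-8, restricted to 16-bit values. The names Rune16, rune_encode and rune_decode are hypothetical and chosen for illustration; they are not the Plan 9 library interface, and the decoder does not reject overlong forms.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not the Plan 9 library API. */
typedef unsigned short Rune16;            /* one 16-bit Unicode value */

/* Encode r into buf (room for 3 bytes); return the number of bytes written. */
static int rune_encode(unsigned char *buf, Rune16 r)
{
    if (r < 0x80) {                       /* ASCII: a single unchanged byte */
        buf[0] = (unsigned char)r;
        return 1;
    }
    if (r < 0x800) {                      /* 11 bits: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (r >> 6));
        buf[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xE0 | (r >> 12));
    buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (r & 0x3F));
    return 3;
}

/* Decode one character from s (n bytes available) into *r.
   Return the number of bytes consumed, or 0 if the input is
   malformed or truncated. Overlong forms are NOT rejected here. */
static int rune_decode(Rune16 *r, const unsigned char *s, size_t n)
{
    if (n >= 1 && s[0] < 0x80) {
        *r = s[0];
        return 1;
    }
    if (n >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *r = (Rune16)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if (n >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *r = (Rune16)(((s[0] & 0x0F) << 12) |
                      ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    return 0;
}
```

Under these assumptions the 1/2 symbol (0x00BD) encodes as the two bytes 0xC2 0xBD, so the window manager's name occupies three bytes: 0x38 0xC2 0xBD. Neither byte of the encoded symbol can be mistaken for ASCII, which is why a program that only looks for `/' or newline can pass such text through untouched.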