Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!hp9000.csc.cuhk.hk!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!sdd.hp.com!swrinde!gatech!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: multibyte character representations and Unicode
Message-ID: <1992Nov23.193620.9513@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <721993836.11625@minster.york.ac.uk>
Date: Mon, 23 Nov 92 19:36:20 GMT
Lines: 73

In article <721993836.11625@minster.york.ac.uk> forsyth@minster.york.ac.uk writes:
>Terry Weber suggests that half one's disc space will vanish
>on adopting Unicode. Not so: I draw your attention to Plan 9,
>which uses Unicode very successfully. See the Plan 9 documentation
>on research.att.com (dist/plan9doc, I think).

If you are talking about truly using the Unicode standard, then you are
talking about using 16 bits for English characters instead of 8 bits.
While your disk space wouldn't "disappear", it would be halved, unless
you mixed storage of Unicode and ASCII on the same disk.

Unicode contains a total of 34,348 characters. This is 52% of the
largest number of characters representable in 16 bits (65536), and is
also larger than what can be represented with conditional multibyting
(8th bit set on the first byte indicating multibyte, otherwise 7 bit
ASCII), which is 32,896 characters (128 + 128 * 256).

It seems to me that Unicode representation on disk requires 2 bytes per
character (symbol). Thus a document file in English that used to take
2K stored in ASCII would take 4K (2K symbols * 2 bytes per symbol) to
store in Unicode.

>Eventually Plan 9 switched to a new encoding -- which apparently has now been
>proposed for use in ISO 10646 -- that lacks all the unfortunate features.
>The second and third bytes of the encoding do not look like ASCII characters.
>(All bytes of an encoded character have the 0x80 bit set.)
>The consequence is that even fewer programs are affected:
>most pass Unicode encodings straight through.

This isn't really Unicode unless it follows Unicode encoding, and it
lacks the ability to provide a fixed size per symbol storage mechanism,
but I agree that ISO 10646 is a real possibility, although it seems
rather English-centric.

In X, to provide an 8x8 Unicode font, it takes 274784 bytes of storage
for the actual font glyphs, plus overhead; a 10x20 takes 1030440 (just
under a Meg, assuming the overhead is less than 18K). Both could easily
be done in ROM. Without multibyte encoding (i.e., straight 16-bit
characters), the output is straightforward using X.

The same is true for an "English-only" or other ("Cyrillic-only", etc.)
font, since X fonts are allowed to be sparse; thus the full Unicode font
is only necessary for multinational use of the same device... even then,
the number of glyphs in a font need only be enough to cover both sets.

Thus, in many cases, font-fill centric encoding (i.e., this is the font
I used, and these are the 8 bit representations of the Unicode
characters lexically within the font) is sufficient to provide 8 bit
storage for all but Kanji. If the Japanese could limit themselves to
Kana (Katakana/Hiragana), then they could also benefit from this storage
technique (this would also go a long way towards making them
compute-competitive and reduce the hoops one jumps through when using a
Kanji keyboard).

>In particular, the `normal' file system names can hold Unicode
>characters without fuss. There is certainly no need to switch to 16-bit
>representations for them, with all that that entails.

No argument here; however, I would say that picking a font-fill encoding
as a file storage attribute would be sufficient for this as well.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------
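
The storage figures quoted in the post can be checked with a few lines of arithmetic. What follows is only a sketch and not part of the original posting: the function names are made up for illustration, the 34,348 character count is taken from the post as given, and the 30 bytes per glyph used for the 10x20 font is inferred from the post's 1030440 total rather than stated there.

```python
# Sketch only: checks the figures quoted in the post. All names are illustrative.

UNICODE_CHARS = 34_348  # character count as stated in the post (not verified here)


def conditional_multibyte_capacity() -> int:
    """Codes available when a set 8th bit on the first byte signals 2 bytes.

    128 single-byte (7-bit ASCII) codes, plus 128 possible lead bytes,
    each followed by any of 256 second bytes.
    """
    return 128 + 128 * 256


def fixed_width_bytes(symbols: int, bytes_per_symbol: int) -> int:
    """Storage for text where every symbol takes the same number of bytes."""
    return symbols * bytes_per_symbol


def font_bytes(glyphs: int, bytes_per_glyph: int) -> int:
    """Raw bitmap storage for a font, ignoring any per-font overhead."""
    return glyphs * bytes_per_glyph


if __name__ == "__main__":
    share = round(100 * UNICODE_CHARS / 65536)
    print("share of 16-bit code space (%):", share)
    print("conditional multibyte capacity:", conditional_multibyte_capacity())
    print("2K symbols as 8-bit:", fixed_width_bytes(2048, 1))
    print("2K symbols as 16-bit:", fixed_width_bytes(2048, 2))
    # 8x8 at 1 bit per pixel is 8 bytes per glyph.
    print("8x8 font bytes:", font_bytes(UNICODE_CHARS, 8))
    # Assumption: 30 bytes per glyph, inferred from the post's 1030440 total.
    print("10x20 font bytes:", font_bytes(UNICODE_CHARS, 30))
```

A tightly packed 10x20 bitmap is 200 bits, or 25 bytes per glyph, so the 30-byte figure suggests some row padding, though the post does not say how the rows were laid out.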