Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!news.hawaii.edu!ames!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!eff!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: multibyte character representations and Unicode
Message-ID: <1992Nov25.224757.4769@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <721993836.11625@minster.york.ac.uk> <1992Nov23.193620.9513@fcom.cc.utah.edu> <id.C19V.DY@ferranti.com>
Date: Wed, 25 Nov 92 22:47:57 GMT
Lines: 61

In article <id.C19V.DY@ferranti.com> peter@ferranti.com (peter da silva) writes:
>In article <1992Nov23.193620.9513@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>> While your disk space wouldn't "disappear", it would be halved, unless you
>> mixed storage of Unicode and ASCII on the same disk.
>
>What, even for code?

For anything stored in the 16 bit character set (Unicode) instead of some
8 bit set (Extended ASCII, ISO Latin-1).  Yes, this includes code, since
code is something you will access with supposedly "Unicode aware" editors,
just like all other text.  A 16 bit value less than 256 takes the same
number of bits as a 16 bit value greater than 255; the high bits are just
zeroed.

>> This isn't really Unicode unless it follows Unicode encoding,
>
>Which it does.

Which means that ch_type is unsigned short instead of unsigned char.

>> and it lacks
>> the ability to provide a fixed size per symbol storage mechanism,
>
>Why do you want one?

So that I can declare an array of ch_type and be guaranteed that my array
length is not dependent on the encoding -- otherwise I won't be able to
guarantee a fixed number of characters for an input field in a language
independent fashion.

It's ridiculous to think that the number of characters you can input into
a field declared with a fixed storage size would vary based on which
characters were entered.  In particular, can you imagine a database asking
for a surname where you could enter 80 characters for an English surname,
but some lesser number for a surname containing an umlaut-u or a cedilla?
This could easily happen for a fixed byte-length field where "characters"
(glyphs, actually) were encoded as 1-3 bytes each.  This would be bad.

The alternative, like I said, would be to have a set of standard "code
page" mappings from each 256-entry (8 bit) character set to Unicode, and
then store the files 8-bit encoded according to those code pages.  The
file would be attributed with a code page identifier, and everything would
be converted to Unicode on read or write, but stored encoded.  Directory
entries themselves would have to be Unicode, unless the file system was
mkfs'ed "nationalized" to a particular code page.  This would save the
European types, at least, from having to deal with 16 bit storage losses.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------
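
A rough sketch of the fixed-size-per-symbol point above.  The names
(ucs2_t, FIELDCHARS, the surname fields) are assumptions for illustration,
not declarations from any actual 386bsd header; the point is only that an
array of 16 bit ch_type always holds the same number of characters, while
an 8 bit multibyte encoding ties the field's character capacity to which
characters happen to be typed.

	/* Illustrative declarations only -- names are hypothetical. */
	typedef unsigned short	ucs2_t;		/* assumed 16 bit ch_type */

	#define	FIELDCHARS	80

	ucs2_t	surname_fixed[FIELDCHARS];	/* 160 bytes; always holds
						   exactly 80 characters,
						   whatever is stored */

	char	surname_mb[FIELDCHARS];		/* 80 bytes; holds 80
						   characters only if every
						   glyph encodes to one byte,
						   fewer when umlaut-u or
						   cedilla takes 2-3 bytes */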
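The code-page alternative in the last paragraph could look roughly like
the following.  The structure and function names are hypothetical and not
part of any existing file system interface; the sketch only shows the
shape of a per-code-page 256-entry table plus the read-side conversion
from 8-bit on-disk bytes to 16 bit Unicode in core.

	/* Hypothetical sketch of "store 8-bit, convert on read". */
	typedef unsigned short	ucs2_t;

	typedef struct codepage {
		int	cp_id;			/* code page identifier
						   attributed to the file */
		ucs2_t	cp_to_unicode[256];	/* one entry per stored byte */
	} codepage_t;

	/* Convert a buffer as it is read: 8 bits on disk, 16 bits in core. */
	void
	cp_decode(const codepage_t *cp, const unsigned char *in, ucs2_t *out,
	    int len)
	{
		int	i;

		for (i = 0; i < len; i++)
			out[i] = cp->cp_to_unicode[in[i]];
	}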