Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.mel.connect.com.au!munnari.OZ.AU!uunet!in2.uu.net!newsfeed.internetmci.com!gatech!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: conflicting types for `wchar_t'
Date: 1 Mar 1996 22:59:45 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 117
Message-ID: <4h7vh1$q3j@park.uvsc.edu>
References: <4eijt4$5sa@dali.cs.uni-magdeburg.de> <SODA.96Feb17221233@srapc133.sra.CO.JP> <SODA.96Mar1044003@srapc133.sra.CO.JP>
NNTP-Posting-Host: hecate.artisoft.com

soda@sra.CO.JP (Noriyuki Soda) wrote:
] XmString cannot be used to interchange multi-script text;
] COMPOUND_TEXT is used for this purpose.

Actually, it can, by using font sets. This makes the XmString an
acceptable display encoding (note: not a storage encoding and not a
process encoding). The COMPOUND_TEXT abstraction is just "Yet Another
Incompatible Standard", IMO.

] IMHO, COMPOUND_TEXT (and ISO-2022 based encoding) is worth using,
] though it is not an ideal encoding. :-)

It's just another display encoding. It's not terribly useful for
process or storage encoding.

] (We have used ISO-2022 based encoding for about 10 years.)
] For example, ISO-2022 provides a straightforward way to use a
] different font for each script.

There is only a need to change fonts in a multilingual document where
the character sets intersect (more on this below). In general,
typographic representation is a rendering attribute, not a processing
attribute. For instance, spell checking is spell checking without
regard to whether the text being checked is in a 14 or 20 point font,
or has the attributes "bold" or "italic", etc.

] >>> 32 bits is a silly value for the size, since XDrawString16 exists
] >>> and XDrawString32 does not.
]
] I think this way (using only one font for all scripts) is only
] acceptable in low-end applications.

One font set. The difference is that the fonts are only encoded in a
document in a compound document architecture (what the Unicode
standard calls a complex document). The Unicode standard fully expects
that typographic data will be encoded as in-band "escape sequences"
(as SGML, and the SGML DTD HTML, do today).

] > I frequently see the multinationalization argument used to
] > support ISO 10646 32 bit encoding. The problem is that most
] > uses are not for multilingual encoding.
]
] > Even translation, a multilingual use, uses encoding that is
] > frequently non-intersecting. Look at JIS208 + JIS212 for 24
] > language encoding (World24).
]
] ?
] I'm sorry, I missed your point. What does "frequently
] non-intersecting" mean?

It means that characters from multiple languages will generally use
only a single round-trip standard, instead of several standards that
resolve to the same code point in the Unicode character set.

For instance, if I have a multilingual document containing English and
Japanese, there exists a standard (JIS-208) such that I can display
the document with a single font without compromising the characters of
either language. The "multinationalization argument" complains that I
can't do this for Japanese and Chinese simultaneously. What it
neglects to note is that there is no single character set standard
that encodes Japanese and Chinese separately, so there is no way to
encode round-trip information in any case.
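To make the single-font display point concrete, here is a minimal
sketch of pushing 16-bit characters at the server with XDrawString16().
It assumes the caller has already loaded a JIS X 0208 font and set it
in the GC; the helper name draw_string16() and the font name below are
mine for illustration, not any existing API.

    /*
     * Minimal sketch: draw a run of 16-bit characters with a single
     * JIS X 0208 font via XDrawString16().  Display/window/GC setup
     * is elided; the caller is assumed to have done something like:
     *
     *   font = XLoadQueryFont(dpy,
     *       "-*-fixed-medium-r-normal--16-*-jisx0208.1983-0");
     *   XSetFont(dpy, gc, font->fid);
     */
    #include <X11/Xlib.h>

    void
    draw_string16(Display *dpy, Window win, GC gc, int x, int y,
                  const unsigned short *s, int len)
    {
        XChar2b buf[256];
        int i;

        if (len > 256)
            len = 256;

        /* Split each 16-bit code into the byte1/byte2 form Xlib wants. */
        for (i = 0; i < len; i++) {
            buf[i].byte1 = (s[i] >> 8) & 0xff;
            buf[i].byte2 =  s[i]       & 0xff;
        }
        XDrawString16(dpy, win, gc, x, y, buf, len);
    }

The display side never needs more than 16 bits per cell here; the font
encoding (JIS X 0208) is what carries the round-trip information.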
The use of 32 bit process encoding for Unicode (a 32 bit wchar_t)
neglects the fact that for an ISO 10646 code page 0 process encoding,
the high bits *must* remain 0 in all cases. Really, there is *no*
reason for 32 bit process encoding until such time as 10646 code pages
other than 0 have been allocated.

Storage encoding is another matter, though I believe there are very
strong arguments for keeping storage and process encoding the same,
instead of going to UTF or some other standard that destroys
information (record count from file size, etc.).

] > I dislike standardization of proprietary technology, and will be
] > using 16 bit (ISO10646/0) Unicode in most of my future text
] > processing work as storage encoding (just like Windows 95 and
] > Windows NT) for a good long time.
]
] Even if you use 16 bit Unicode, wchar_t should be 32 bits, because a
] 32 bit wchar_t is the right way to handle surrogate pairs in Unicode.

I don't understand why such pairs can't be handled by way of in-band
tokens, just like Shift-JIS or other escape-based compound document
frameworks. That is, the issue is output vs. process encoding. (A
rough sketch of what I mean is appended after my sig.)

I might as well argue that the unification of English longhand (a
ligatured character set, BTW) and printed French by ISO 8859-1 (also
Unicode 16/0 and ISO 10646/0/0) was a mistake similar to the
unification of CJK in Unicode 16 and ISO 10646/0.

I'm against using the ISO 10646 16 bit code page identifier to
segregate on the basis of language. 8-(.

					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
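As promised above, a rough sketch of handling surrogate pairs as
in-band tokens over a 16-bit unit type, rather than widening wchar_t
to 32 bits. The names (u16_next() and the macros) are mine, not from
any standard; recovery from an unpaired surrogate is elided.

    typedef unsigned short u16;   /* one 16-bit process/storage unit */
    typedef unsigned long  u32;   /* wide enough for any code point  */

    #define IS_HI_SURROGATE(c)  ((c) >= 0xD800 && (c) <= 0xDBFF)
    #define IS_LO_SURROGATE(c)  ((c) >= 0xDC00 && (c) <= 0xDFFF)

    /*
     * Return the next character from a NUL-terminated 16-bit string,
     * advancing *pp by one or two units.  A high/low surrogate pair
     * is combined into a single value above 0xFFFF; for plain code
     * page 0 characters the high 16 bits of the result stay 0, which
     * is the point: the wider type is only needed transiently.
     */
    u32
    u16_next(const u16 **pp)
    {
        const u16 *p = *pp;
        u32 c = *p++;

        if (IS_HI_SURROGATE(c) && IS_LO_SURROGATE(*p)) {
            c = 0x10000UL + (((c - 0xD800) << 10) | (u32)(*p - 0xDC00));
            p++;
        }
        *pp = p;
        return c;
    }

The same walker works whether the string came from storage or from the
process side; only the code that consumes the pair ever needs more
than 16 bits.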