Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.mel.connect.com.au!munnari.OZ.AU!news.hawaii.edu!ames!usenet.kornet.nm.kr!usenet.hana.nm.kr!jagalchi.cc.pusan.ac.kr!news.kreonet.re.kr!news.dacom.co.kr!newsrelay.netins.net!solaris.cc.vt.edu!news.mathworks.com!gatech!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: conflicting types for `wchar_t'
Date: 10 Mar 1996 02:01:21 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 291
Message-ID: <4htd5h$dfi@park.uvsc.edu>
References: <4eijt4$5sa@dali.cs.uni-magdeburg.de> <SODA.96Feb17221233@srapc133.sra.CO.JP> <SODA.96Mar8050147@srapc133.sra.CO.JP>
NNTP-Posting-Host: hecate.artisoft.com

soda@sra.CO.JP (Noriyuki Soda) wrote:
] > ] XmString cannot be used to interchange multi-script text,
] > ] COMPOUND_TEXT is used for this purpose.
] >
] > Actually, it can, by using font sets.
] >
] > This makes the XmString an acceptable display encoding (note:
] > not storage encoding and not process encoding).
]
] And XmString is not acceptable for inter-client communication
] encoding. COMPOUND_TEXT is used for it.

Motif, and CDE (which specifies Motif), use a private ICCCM
mechanism which, unlike the COMPOUND_TEXT mechanism, supports
features like "Drag-N-Drop".  Not to say that interoperability
with standard selection mechanisms isn't important; it just seems
that Motif has a lock on the only standard.

] > The COMPOUND_TEXT abstraction is just "Yet Another Incompatible
] > Standard", IMO.
]
] Incompatible? No, COMPOUND_TEXT is *fully* upper compatible with
] ISO-8859-1.
] And, COMPOUND_TEXT also has a compatibility to EUC-chinese,
] EUC-taiwanese, EUC-korean, EUC-japanese, ISO-2022-KR, ISO-2022-JP,
] ISO-2022-JP2, because these are all based on ISO-2022 framework.

OLE?  Motif?  CDE?  CORBA?  It is capable of transferring the type
of data you suggest, yes.  But that is not a measure of compatibility
with software.  Any method which can transfer 8- or 16-bit binary
data can do the same thing using ISO 2022 or ISO 10646/0 encoding.

> # Unicode doesn't have such compatibility. :-)
>
> > ] IMHO, COMPOUND_TEXT (and ISO-2022 based encoding) is worth using,
> > ] though it is not ideal encoding. :-)
> >
> > It's just another display encoding.  It's not terribly useful
> > for process or storage encoding.
>
> No, ISO-2022 based encodings are widely used in storage encoding and
> network encoding in Asia.

Being ubiquitous does not make them more useful.  Let's get down to
it: runic storage encoding of any kind sucks, period:

o  It destroys the ability to use fixed-size buffers for fixed-field
   input screens (which are commonplace, because there are still such
   things as text terminals, and they are still cheaper than X
   terminals).

o  It destroys the ability to get a meaningful glyph count from a
   text file using the file size as a key.

o  It destroys the ability to reliably transfer the data
   interactively over a lossy serial link, because of the associated
   resynchronization problems.

o  It destroys the ability to get a record count by dividing the file
   size by the record size, because it destroys the ability to use
   fixed-length fields in the data: the number of characters in a
   runic-encoded string bears no fixed relation to the field size.

o  It destroys interoperability with existing 8-bit clean OSes that
   have been localized to a locale other than 7-bit ASCII (the 8th
   bit is used as a UTF selector instead of simply representing an
   8-bit encoded value).  An example might be an Italian or French
   NFS server servicing an existing system (using the full domain of
   the 8-bit 8859-1 set to encode characters); no runic
   interoperability is possible in this case.
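To make the second and fourth points concrete, here is a minimal
sketch (assuming a UTF-style encoding in which continuation bytes
carry `10' in their two high bits; utf_glyph_count is a hypothetical
helper, not part of any standard library):

    #include <stdio.h>
    #include <string.h>

    /*
     * Count characters in a UTF-encoded buffer by skipping
     * continuation bytes.  Byte count and character count bear no
     * fixed relation, so file-size / record-size arithmetic fails.
     */
    static size_t
    utf_glyph_count(const char *buf, size_t nbytes)
    {
            size_t i, count = 0;

            for (i = 0; i < nbytes; i++) {
                    /* count only bytes that start a character */
                    if (((unsigned char)buf[i] & 0xC0) != 0x80)
                            count++;
            }
            return count;
    }

    int
    main(void)
    {
            const char *s = "r\xC3\xA9sum\xC3\xA9"; /* "resume", accented */

            printf("bytes = %u, characters = %u\n",
                (unsigned)strlen(s),
                (unsigned)utf_glyph_count(s, strlen(s)));
            /* prints: bytes = 8, characters = 6 */
            return 0;
    }

With a fixed-width encoding the two counts stay in a constant ratio,
which is exactly what the fixed-field and record-count arguments
above depend on.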
The United States, the countries covered by ISO 8859-x, and countries
using 8-bit clean non-ASCII character sets on such systems can't
afford to destroy their existing investment in data.  Now I agree,
the same goes for non-8-bit-clean countries; existing data has to be
handled in terms of backward compatibility where it can't be
accommodated directly.  But given Microsoft's market share, and their
adoption of 16-bit storage for Unicode data in NTFS, in VFATFS, and
in wire traffic for LANMAN directory entry management, it seems the
decision has already been made for us.

] # Note: ISO-8859-1 has also compatibility with ISO-2022 framework, so
] # that ISO-2022 based encoding is widely used in America/Europe. :-)

This isn't true.  My first professional coding project was in 1980 --
16 years ago.  Since then, the only place I have seen ISO 2022 is in
coding for localization in Japan for FIGSJ support.  Typically, the
encoding is *not* used unless it *has* to be used -- and then
reluctantly, for the reasons stated above with regard to runic
encoding.

] I think you are misunderstanding my argument.
] (Probably because my English is not clear :-<)
]
] ISO-2022 represents each character by combination of code-set and
] code-point. So that, we can use code-set information for X11 font
] encoding, code-point information to X11 DrawString data.
]
] This is pretty compatible with X11 fontset abstraction, Unicode
] is not.

First, Unicode is ISO 2022 escape-sequence selectable, so that simply
is not true.

Motif (and CDE), *the* interface specified for UNIX in the X/Open
standardization of Spec 1170, does not *use* the X11 fontset
abstraction.  Applications written to this interface, therefore, do
not use the X11 fontset abstraction.

The benefit of ISO 2022 encoding is language attribution.  It is not
the province of a character set standard to do that, nor is it useful
to do that in any case except that of a multinationalized product
(one capable of displaying multilingual text for non-intersecting
round-trip character sets, unlike an internationalized product, whose
sole purpose is to provide the ability to localize the software to a
*single* locale-specific character set).  Unicode is a tool for
*internationalization*, not a tool for *multinationalization*.

] > One font set. The difference is that the fonts are only
] > encoded in a document in a compound document architecture
] > (what the Unicode Standard calls a complex document).
]
] I didn't talk about complex document. What I want to say is that
] using one *fontset* to display multi-script is good thing, and
] using one *font* to display multi-script (using Unicode and
] XDrawString16) is bad thing. (see below.)

#define "multilingual document" "complex document"

Each of the fonts in a fontset is a different version of the glyphs
for a single character set.  You *don't* put disparate fonts in a
fontset.  Fontset attribution (ie: picking a character set) is a
problem for multilingual software.
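For reference, the fontset mechanism under discussion looks roughly
like this in Xlib (a sketch only; draw_localized is a hypothetical
helper, and the base font name pattern is illustrative):

    #include <X11/Xlib.h>
    #include <string.h>

    /*
     * Draw locale-encoded multibyte text through a fontset: the
     * locale, not the application, decides which fonts cover which
     * character sets.  Assumes setlocale(LC_ALL, "") and
     * XSetLocaleModifiers("") have already been called.
     */
    void
    draw_localized(Display *dpy, Window win, GC gc, const char *text)
    {
            char **missing, *def_string;
            int nmissing;
            XFontSet fs;

            fs = XCreateFontSet(dpy,
                "-*-*-medium-r-normal--16-*-*-*-*-*-*-*",
                &missing, &nmissing, &def_string);
            if (fs == NULL)
                    return;
            if (nmissing > 0)
                    XFreeStringList(missing);

            /* dispatches to the right font for each character set */
            XmbDrawString(dpy, win, fs, gc, 10, 20, text,
                (int)strlen(text));
            XFreeFontSet(dpy, fs);
    }

Contrast XDrawString16, which indexes one font directly with 16-bit
values and leaves character set selection to the application.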
For software within a single defined character set (even if it covers
21 languages, like JIS-208 + JIS-212), the ability to select other
font sets is *unnecessary*.  Those applications that need that
ability are, by definition, multilingual applications.  As such, the
onus is on their authors to deal with encoding the set-selection
information in the complex document as they see fit.

] Certainly No. Japanese uses combination of JIS-X0201 (representing
] English) and JIS-X0208 (representing Japanese) mainly. Displaying
] document by single font is used for only low-end application.
] Middle/high-end application uses different fonts for English and
] Japanese, because English character set has more fonts than Japanese
] character set (making font which contains 256 characters is easier
] than making font which contains 8000 characters).

Why is it necessary to use character set bits for font selection?  A
character set standard is not a glyph-encoding standard.  Yes, it's
possible to abuse ISO 2022 encoding to get something resembling a
glyph-encoding standard, but that is properly the job of data in the
text, not of the character set code points.  In other words, a
complex document.

] > ] Even if you use 16 bit Unicode, wchar_t should be 32bit. Because 32
] > ] bit wchar_t is right way to handle surrogated pairs of Unicode.
] >
] > I don't understand why such pairs can't be handled by way
] > of in-band tokens, just like shift-JIS or other escape-based
] > compound document frameworks.
]
] Because it is the way of multibyte-character (multi16bit-character in
] this case). The benefit of wide-character is treating character as
] single object. If you use multiple objects to represent one character,
] why do you choose wchar_t? Just use multibyte-character.

The type wchar_t is a type for use as a character set index; it is
the people who want to attach font-encoding information (so that they
can choose the presentation for the user, instead of allowing the
user their own choice) who should pick another encoding.

The fact is, Microsoft sells the most prevalent OSes on Earth, and
their compiler's wchar_t is 16 bits.  Anything else is nothing more
than a gratuitous incompatibility designed to make software that
compiles with Microsoft's compiler break when compiled with GCC (or
any other compiler with a non-16-bit wchar_t).

] # We Japanese are using multibyte-character and wide-character
] # more than 10 years, we know about it :-)

You also know that it is inconvenient (it slows development),
destroys information (see above), and is a patch around the need to
represent more than 256 total character-set code points.

I can't suggest you discard non-8-bit alphabets; it would be a silly
idea, and I'm no cultural imperialist.  By the same token, however,
you must admit that the 8-bit limit has been the biggest handicap
faced by Japanese information processing.  One could argue that this
is the single most important reason that the West, in general, has
not been outstripped by Japan in the ability to compete in computer
software.

It's true that computer technology is poised to fix this problem,
finally, many decades after the invention of the computer.
Handwriting recognition, voice recognition, and so on are reducing
the expense associated with inputting data in non-8-bit languages.
But there are obvious disadvantages to the current "workaround" of
using multibyte characters that are just as much of a handicap.
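To make the multibyte/wide-character trade-off concrete, here is a
minimal sketch using the standard C conversion between the two forms
(the input string is a placeholder; which bytes form one character
depends entirely on the locale's multibyte encoding):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>

    int
    main(void)
    {
            /* placeholder: any locale-encoded multibyte text */
            const char *mb = "...";
            wchar_t wbuf[128];
            size_t nbytes, nchars;

            setlocale(LC_CTYPE, "");   /* use the locale's encoding */

            nbytes = strlen(mb);       /* storage units, not characters */
            nchars = mbstowcs(wbuf, mb, 128);
            if (nchars == (size_t)-1)
                    return 1;          /* invalid multibyte sequence */

            printf("bytes = %u, characters = %u\n",
                (unsigned)nbytes, (unsigned)nchars);

            /*
             * In the wide form, wbuf[i] is the i'th character: one
             * object per character, constant-time indexing.  In the
             * multibyte form, finding the i'th character means
             * scanning from the start of the string.
             */
            return 0;
    }

Whether that one-object-per-character type is 16 or 32 bits wide is
exactly the wchar_t sizing dispute above.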
It may not seem so, coming from a greatly handicapped situation into
one less problematic.  The problems multibyte encoding introduces are
*trivial* compared to coming up with an encoding scheme in the first
place.  But why solve even *trivial* problems if they are
artificially created?  Why cause problems for yourself?

] > I'm against using the ISO 10646 16 bit code page identifier to
] > segregate on the basis of language. 8-(.
]
] I think your opinion of this point is quite correct, if Unicode
] doesn't have problems of design. But current Unicode specification
] has several problems which are not resolved without language
] segregation. :-<

All I can say is "lobby for designation of code pages other than 0",
and then "pick a code page to run in when you start up".  This will
buy you much of what you want.  For applications where you want
multiple code pages simultaneously, well, you are free to write your
applications using:

    typedef struct _my_wchar_t_ {
            unsigned short code_page;  /* which 16-bit code page */
            wchar_t code_point;        /* index within that page */
    } my_wchar_t;

And then use my_wchar_t and copy liberally (trading the runic
encoding copy penalty for a display-representation copy penalty).

This is similar to the page inflation I have to do for an 8-bit data
store mounted on a 16-bit process-encoding system: I have to copy
each 8-bit page to two 16-bit pages.  This copy is expensive, but it
is the penalty of backward compatibility (a penalty that only has to
be paid in that case; new data and new systems do not and will not
have the problem).

Personally, I'll code with a 16-bit wchar_t, keeping my tools
compatible with Microsoft's, until such time as I'm called upon to
write another multilingual application.  It is the multilingual
applications, which are much rarer than the single locale-specific
language applications, that should have to pay the penalty for
multinationalization.


					Regards,
					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.