Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.mel.connect.com.au!munnari.OZ.AU!uunet!in2.uu.net!newsfeed.internetmci.com!gatech!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: conflicting types for `wchar_t'
Date: 1 Mar 1996 22:59:45 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 117
Message-ID: <4h7vh1$q3j@park.uvsc.edu>
References: <4eijt4$5sa@dali.cs.uni-magdeburg.de> <SODA.96Feb17221233@srapc133.sra.CO.JP> <SODA.96Mar1044003@srapc133.sra.CO.JP>
NNTP-Posting-Host: hecate.artisoft.com

soda@sra.CO.JP (Noriyuki Soda) wrote:
] XmString cannot be used to interchange multi-script text;
] COMPOUND_TEXT is used for this purpose.

Actually, it can, by using font sets. This makes the XmString an
acceptable display encoding (note: not a storage encoding and not a
process encoding). The COMPOUND_TEXT abstraction is just "Yet Another
Incompatible Standard", IMO.

] IMHO, COMPOUND_TEXT (and ISO-2022 based encoding) is worth using,
] though it is not an ideal encoding. :-)

It's just another display encoding. It's not terribly useful for
process or storage encoding.

] (We have used ISO-2022 based encoding for about 10 years.)
] For example, ISO-2022 provides a straightforward way to use a
] different font for each script.

There is only a need to change fonts in a multilingual document where
the character sets intersect (more on this below). In general,
typographic representation is a rendering attribute, not a processing
attribute. For instance, spell checking is spell checking without
regard to whether the text being checked is in a 14 or 20 point font,
or has the attributes "bold" or "italic", etc.

] >>> 32 bits is a silly value for the size, since XDrawString16 exists
] >>> and XDrawString32 does not.
]
] I think this way (using only one font for all scripts) is only
] acceptable in low-end applications.

One font set. The difference is that the fonts are only encoded in a
document in a compound document architecture (what the Unicode
standard calls a complex document). The Unicode standard fully expects
that typographic data will be encoded as in-band "escape sequences"
(as SGML, and the SGML DTD HTML, do today).

] > I frequently see the multinationalization argument used to
] > support ISO 10646 32 bit encoding. The problem is that most
] > uses are not for multilingual encoding.
]
] > Even translation, a multilingual use, uses encoding that is
] > frequently non-intersecting. Look at JIS208 + JIS212 for 24
] > language encoding (World24).
]
] ?
] I'm sorry, I missed your point. What does "frequently
] non-intersecting" mean?

It means that characters from multiple languages will generally use
only a single round-trip standard, instead of several standards that
resolve to the same code point in the Unicode character set.

For instance, if I have a multilingual document containing English and
Japanese, there exists a standard (JIS-208) such that I can display
the document with a single font without compromising the characters of
either language. The "multinationalization argument" complains that I
can't do this for Japanese and Chinese simultaneously. What it
neglects to note is that there is no single character set standard
that encodes Japanese and Chinese separately, so there is no way to
encode round-trip information in any case.
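To make the single-font display point concrete, here is a minimal
sketch of pushing 16-bit characters at the server with XDrawString16().
It assumes the caller has already loaded a JIS X 0208 font and set it
in the GC; the helper name draw_string16() and the font name below are
mine for illustration, not any existing API.

    /*
     * Minimal sketch: draw a run of 16-bit characters with a single
     * JIS X 0208 font via XDrawString16().  Display/window/GC setup
     * is elided; the caller is assumed to have done something like:
     *
     *   font = XLoadQueryFont(dpy,
     *       "-*-fixed-medium-r-normal--16-*-jisx0208.1983-0");
     *   XSetFont(dpy, gc, font->fid);
     */
    #include <X11/Xlib.h>

    void
    draw_string16(Display *dpy, Window win, GC gc, int x, int y,
                  const unsigned short *s, int len)
    {
        XChar2b buf[256];
        int i;

        if (len > 256)
            len = 256;

        /* Split each 16-bit code into the byte1/byte2 form Xlib wants. */
        for (i = 0; i < len; i++) {
            buf[i].byte1 = (s[i] >> 8) & 0xff;
            buf[i].byte2 =  s[i]       & 0xff;
        }
        XDrawString16(dpy, win, gc, x, y, buf, len);
    }

The display side never needs more than 16 bits per cell here; the font
encoding (JIS X 0208) is what carries the round-trip information.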
The use of 32 bit process encoding for Unicode (a 32 bit wchar_t)
neglects the fact that for an ISO 10646 code page 0 process encoding,
the high bits *must* remain 0 in all cases. Really, there is *no*
reason for 32 bit process encoding until such time as 10646 code pages
other than 0 have been allocated.

Storage encoding is another matter, though I believe there are very
strong arguments for keeping storage and process encoding the same,
instead of going to UTF or some other standard that destroys
information (record count from file size, etc.).

] > I dislike standardization of proprietary technology, and will be
] > using 16 bit (ISO10646/0) Unicode in most of my future text
] > processing work as storage encoding (just like Windows 95 and
] > Windows NT) for a good long time.
]
] Even if you use 16 bit Unicode, wchar_t should be 32 bits, because a
] 32 bit wchar_t is the right way to handle surrogate pairs in Unicode.

I don't understand why such pairs can't be handled by way of in-band
tokens, just like Shift-JIS or other escape-based compound document
frameworks. That is, the issue is output vs. process encoding. (A
rough sketch of what I mean is appended after my sig.)

I might as well argue that the unification of English longhand (a
ligatured character set, BTW) and printed French by ISO 8859-1 (also
Unicode 16/0 and ISO 10646/0/0) was a mistake similar to the
unification of CJK in Unicode 16 and ISO 10646/0.

I'm against using the ISO 10646 16 bit code page identifier to
segregate on the basis of language. 8-(.

					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
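As promised above, a rough sketch of handling surrogate pairs as
in-band tokens over a 16-bit unit type, rather than widening wchar_t
to 32 bits. The names (u16_next() and the macros) are mine, not from
any standard; recovery from an unpaired surrogate is elided.

    typedef unsigned short u16;   /* one 16-bit process/storage unit */
    typedef unsigned long  u32;   /* wide enough for any code point  */

    #define IS_HI_SURROGATE(c)  ((c) >= 0xD800 && (c) <= 0xDBFF)
    #define IS_LO_SURROGATE(c)  ((c) >= 0xDC00 && (c) <= 0xDFFF)

    /*
     * Return the next character from a NUL-terminated 16-bit string,
     * advancing *pp by one or two units.  A high/low surrogate pair
     * is combined into a single value above 0xFFFF; for plain code
     * page 0 characters the high 16 bits of the result stay 0, which
     * is the point: the wider type is only needed transiently.
     */
    u32
    u16_next(const u16 **pp)
    {
        const u16 *p = *pp;
        u32 c = *p++;

        if (IS_HI_SURROGATE(c) && IS_LO_SURROGATE(*p)) {
            c = 0x10000UL + (((c - 0xD800) << 10) | (u32)(*p - 0xDC00));
            p++;
        }
        *pp = p;
        return c;
    }

The same walker works whether the string came from storage or from the
process side; only the code that consumes the pair ever needs more
than 16 bits.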