Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.mel.connect.com.au!munnari.OZ.AU!news.hawaii.edu!ames!usenet.kornet.nm.kr!usenet.hana.nm.kr!jagalchi.cc.pusan.ac.kr!news.kreonet.re.kr!news.dacom.co.kr!newsrelay.netins.net!solaris.cc.vt.edu!news.mathworks.com!gatech!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: conflicting types for `wchar_t'
Date: 10 Mar 1996 02:01:21 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 291
Message-ID: <4htd5h$dfi@park.uvsc.edu>
References: <4eijt4$5sa@dali.cs.uni-magdeburg.de> <SODA.96Feb17221233@srapc133.sra.CO.JP> <SODA.96Mar8050147@srapc133.sra.CO.JP>
NNTP-Posting-Host: hecate.artisoft.com

soda@sra.CO.JP (Noriyuki Soda) wrote:
] > ] XmString cannot be used to interchange multi-script text,
] > ] COMPOUND_TEXT is used for this purpose.
] >
] > Actually, it can, by using font sets.
] >
] > This makes the XmString an acceptable display encoding (note:
] > not storage encoding and not process encoding).
]
] And XmString is not acceptable for inter-client communication
] encoding. COMPOUND_TEXT is used for it.

Motif, and CDE (which specifies Motif), use a private ICCCM
mechanism which, unlike the COMPOUND_TEXT mechanism, supports
features like "Drag-N-Drop".  Not to say that interoperability
with standard selection mechanisms isn't important; it just seems
that Motif has a lock on the only standard.

] > The COMPOUND_TEXT abstraction is just "Yet Another Incompatible
] > Standard", IMO.
]
] Incompatible? No, COMPOUND_TEXT is *fully* upper compatible with
] ISO-8859-1.
] And, COMPOUND_TEXT also has a compatibility to EUC-chinese,
] EUC-taiwanese, EUC-korean, EUC-japanese, ISO-2022-KR, ISO-2022-JP,
] ISO-2022-JP2, because these are all based on ISO-2022 framework.

OLE?  Motif?  CDE?  CORBA?  It is capable of transferring the type
of data you suggest, yes.  But that is not a measure of compatibility
with software.  Any method which can transfer 8- or 16-bit binary
data can do the same thing using ISO 2022 or ISO 10646/0 encoding.

> # Unicode doesn't have such compatibility. :-)
>
> > ] IMHO, COMPOUND_TEXT (and ISO-2022 based encoding) is worth using,
> > ] though it is not ideal encoding. :-)
> >
> > It's just another display encoding.  It's not terribly useful
> > for process or storage encoding.
>
> No, ISO-2022 based encodings are widely used in storage encoding and
> network encoding in Asia.

Being ubiquitous does not make them more useful.  Let's get down to
it: runic storage encoding of any kind sucks, period:

o  It destroys the ability to use fixed-size buffers for fixed-field
   input screens (which are commonplace, because there are still such
   things as text terminals, and they are still cheaper than X
   terminals).

o  It destroys the ability to get a meaningful glyph count from a
   text file using the file size as a key.

o  It destroys the ability to reliably transfer the data
   interactively over a lossy serial link, because of the associated
   resynchronization problems.

o  It destroys the ability to get a record count by dividing the file
   size by the record size, because it destroys the ability to use
   fixed-length fields in the data: the number of characters in a
   runic-encoded string bears no fixed relation to the field size.

o  It destroys interoperability with existing 8-bit clean OSes that
   have been localized to a locale other than 7-bit ASCII (the 8th
   bit is used as a UTF selector instead of simply representing an
   8-bit encoded value).  An example might be an Italian or French
   NFS server servicing an existing system (using the full domain of
   the 8-bit 8859-1 set to encode characters); no runic
   interoperability is possible in this case.
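To make the second and fourth points concrete, here is a minimal
sketch (assuming a UTF-style encoding in which continuation bytes
carry `10' in their two high bits; utf_glyph_count is a hypothetical
helper, not part of any standard library):

    #include <stdio.h>
    #include <string.h>

    /*
     * Count characters in a UTF-encoded buffer by skipping
     * continuation bytes.  Byte count and character count bear no
     * fixed relation, so file-size / record-size arithmetic fails.
     */
    static size_t
    utf_glyph_count(const char *buf, size_t nbytes)
    {
            size_t i, count = 0;

            for (i = 0; i < nbytes; i++) {
                    /* count only bytes that start a character */
                    if (((unsigned char)buf[i] & 0xC0) != 0x80)
                            count++;
            }
            return count;
    }

    int
    main(void)
    {
            const char *s = "r\xC3\xA9sum\xC3\xA9"; /* "resume", accented */

            printf("bytes = %u, characters = %u\n",
                (unsigned)strlen(s),
                (unsigned)utf_glyph_count(s, strlen(s)));
            /* prints: bytes = 8, characters = 6 */
            return 0;
    }

With a fixed-width encoding the two counts stay in a constant ratio,
which is exactly what the fixed-field and record-count arguments
above depend on.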
The United States, the countries covered by ISO 8859-x, and countries
using 8-bit clean non-ASCII character sets on such systems can't
afford to destroy their existing investment in data.  Now I agree,
the same goes for non-8-bit-clean countries; existing data has to be
handled in terms of backward compatibility where it can't be
accommodated directly.  But given Microsoft's market share, and their
adoption of 16-bit storage for Unicode data in NTFS, in VFATFS, and
in wire traffic for LANMAN directory entry management, it seems the
decision has already been made for us.

] # Note: ISO-8859-1 has also compatibility with ISO-2022 framework, so
] # that ISO-2022 based encoding is widely used in America/Europe. :-)

This isn't true.  My first professional coding project was in 1980 --
16 years ago.  Since then, the only place I have seen ISO 2022 is in
coding for localization in Japan for FIGSJ support.  Typically, the
encoding is *not* used unless it *has* to be used -- and then
reluctantly, for the reasons stated above with regard to runic
encoding.

] I think you are misunderstanding my argument.
] (Probably because my English is not clear :-<)
]
] ISO-2022 represents each character by combination of code-set and
] code-point. So that, we can use code-set information for X11 font
] encoding, code-point information to X11 DrawString data.
]
] This is pretty compatible with X11 fontset abstraction, Unicode
] is not.

First, Unicode is ISO 2022 escape-sequence selectable, so that simply
is not true.

Motif (and CDE), *the* interface specified for UNIX in the X/Open
standardization of Spec 1170, does not *use* the X11 fontset
abstraction.  Applications written to this interface, therefore, do
not use the X11 fontset abstraction.

The benefit of ISO 2022 encoding is language attribution.  It is not
the province of a character set standard to do that, nor is it useful
to do that in any case except that of a multinationalized product
(one capable of displaying multilingual text for non-intersecting
round-trip character sets, unlike an internationalized product, whose
sole purpose is to provide the ability to localize the software to a
*single* locale-specific character set).  Unicode is a tool for
*internationalization*, not a tool for *multinationalization*.

] > One font set. The difference is that the fonts are only
] > encoded in a document in a compound document architecture
] > (what the Unicode Standard calls a complex document).
]
] I didn't talk about complex document. What I want to say is that
] using one *fontset* to display multi-script is good thing, and
] using one *font* to display multi-script (using Unicode and
] XDrawString16) is bad thing. (see below.)

#define "multilingual document" "complex document"

Each of the fonts in a fontset is a different version of the glyphs
for a single character set.  You *don't* put disparate fonts in a
fontset.  Fontset attribution (ie: picking a character set) is a
problem for multilingual software.
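For reference, the fontset mechanism under discussion looks roughly
like this in Xlib (a sketch only; draw_localized is a hypothetical
helper, and the base font name pattern is illustrative):

    #include <X11/Xlib.h>
    #include <string.h>

    /*
     * Draw locale-encoded multibyte text through a fontset: the
     * locale, not the application, decides which fonts cover which
     * character sets.  Assumes setlocale(LC_ALL, "") and
     * XSetLocaleModifiers("") have already been called.
     */
    void
    draw_localized(Display *dpy, Window win, GC gc, const char *text)
    {
            char **missing, *def_string;
            int nmissing;
            XFontSet fs;

            fs = XCreateFontSet(dpy,
                "-*-*-medium-r-normal--16-*-*-*-*-*-*-*",
                &missing, &nmissing, &def_string);
            if (fs == NULL)
                    return;
            if (nmissing > 0)
                    XFreeStringList(missing);

            /* dispatches to the right font for each character set */
            XmbDrawString(dpy, win, fs, gc, 10, 20, text,
                (int)strlen(text));
            XFreeFontSet(dpy, fs);
    }

Contrast XDrawString16, which indexes one font directly with 16-bit
values and leaves character set selection to the application.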
For software within a single defined character set (even if it covers
21 languages, like JIS-208 + JIS-212), the ability to select other
font sets is *unnecessary*.  Those applications that need that
ability are, by definition, multilingual applications.  As such, the
onus is on their authors to deal with encoding the set-selection
information in the complex document as they see fit.

] Certainly No. Japanese uses combination of JIS-X0201 (representing
] English) and JIS-X0208 (representing Japanese) mainly. Displaying
] document by single font is used for only low-end application.
] Middle/high-end application uses different fonts for English and
] Japanese, because English character set has more fonts than Japanese
] character set (making font which contains 256 characters is easier
] than making font which contains 8000 characters).

Why is it necessary to use character set bits for font selection?  A
character set standard is not a glyph-encoding standard.  Yes, it's
possible to abuse ISO 2022 encoding to get something resembling a
glyph-encoding standard, but that is properly the job of data in the
text, not of the character set code points.  In other words, a
complex document.

] > ] Even if you use 16 bit Unicode, wchar_t should be 32bit. Because 32
] > ] bit wchar_t is right way to handle surrogated pairs of Unicode.
] >
] > I don't understand why such pairs can't be handled by way
] > of in-band tokens, just like shift-JIS or other escape-based
] > compound document frameworks.
]
] Because it is the way of multibyte-character (multi16bit-character in
] this case). The benefit of wide-character is treating character as
] single object. If you use multiple objects to represent one character,
] why do you choose wchar_t? Just use multibyte-character.

The type wchar_t is a type for use as a character set index; it is
the people who want to attach font-encoding information (so that they
can choose the presentation for the user, instead of allowing the
user their own choice) who should pick another encoding.

The fact is, Microsoft sells the most prevalent OSes on Earth, and
their compiler's wchar_t is 16 bits.  Anything else is nothing more
than a gratuitous incompatibility designed to make software that
compiles with Microsoft's compiler break when compiled with GCC (or
any other compiler with a non-16-bit wchar_t).

] # We Japanese are using multibyte-character and wide-character
] # more than 10 years, we know about it :-)

You also know that it is inconvenient (it slows development),
destroys information (see above), and is a patch around the need to
represent more than 256 total character-set code points.

I can't suggest you discard non-8-bit alphabets; it would be a silly
idea, and I'm no cultural imperialist.  By the same token, however,
you must admit that the 8-bit limit has been the biggest handicap
faced by Japanese information processing.  One could argue that this
is the single most important reason that the West, in general, has
not been outstripped by Japan in the ability to compete in computer
software.

It's true that computer technology is poised to fix this problem,
finally, many decades after the invention of the computer.
Handwriting recognition, voice recognition, and so on are reducing
the expense associated with inputting data in non-8-bit languages.
But there are obvious disadvantages to the current "workaround" of
using multibyte characters that are just as much of a handicap.
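To make the multibyte/wide-character trade-off concrete, here is a
minimal sketch using the standard C conversion between the two forms
(the input string is a placeholder; which bytes form one character
depends entirely on the locale's multibyte encoding):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>

    int
    main(void)
    {
            /* placeholder: any locale-encoded multibyte text */
            const char *mb = "...";
            wchar_t wbuf[128];
            size_t nbytes, nchars;

            setlocale(LC_CTYPE, "");   /* use the locale's encoding */

            nbytes = strlen(mb);       /* storage units, not characters */
            nchars = mbstowcs(wbuf, mb, 128);
            if (nchars == (size_t)-1)
                    return 1;          /* invalid multibyte sequence */

            printf("bytes = %u, characters = %u\n",
                (unsigned)nbytes, (unsigned)nchars);

            /*
             * In the wide form, wbuf[i] is the i'th character: one
             * object per character, constant-time indexing.  In the
             * multibyte form, finding the i'th character means
             * scanning from the start of the string.
             */
            return 0;
    }

Whether that one-object-per-character type is 16 or 32 bits wide is
exactly the wchar_t sizing dispute above.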
It may not seem so, coming from a greatly handicapped situation into
one less problematic.  The problems multibyte encoding introduces are
*trivial* compared to coming up with an encoding scheme in the first
place.  But why solve even *trivial* problems if they are
artificially created?  Why cause problems for yourself?

] > I'm against using the ISO 10646 16 bit code page identifier to
] > segregate on the basis of language. 8-(.
]
] I think your opinion of this point is quite correct, if Unicode
] doesn't have problems of design. But current Unicode specification
] has several problems which are not resolved without language
] segregation. :-<

All I can say is "lobby for designation of code pages other than 0",
and then "pick a code page to run in when you start up".  This will
buy you much of what you want.  For applications where you want
multiple code pages simultaneously, well, you are free to write your
applications using:

    typedef struct _my_wchar_t_ {
            unsigned short code_page;  /* which 16-bit code page */
            wchar_t code_point;        /* index within that page */
    } my_wchar_t;

And then use my_wchar_t and copy liberally (trading the runic
encoding copy penalty for a display-representation copy penalty).

This is similar to the page inflation I have to do for an 8-bit data
store mounted on a 16-bit process-encoding system: I have to copy
each 8-bit page to two 16-bit pages.  This copy is expensive, but it
is the penalty of backward compatibility (a penalty that only has to
be paid in that case; new data and new systems do not and will not
have the problem).

Personally, I'll code with a 16-bit wchar_t, keeping my tools
compatible with Microsoft's, until such time as I'm called upon to
write another multilingual application.  It is the multilingual
applications, which are much rarer than the single locale-specific
language applications, that should have to pay the penalty for
multinationalization.


					Regards,
					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.