Received: by minnie.vk1xwt.ampr.org with NNTP id AA5736 ; Fri, 01 Jan 93 01:54:11 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1992Dec30.061759.8690@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <1992Dec19.083137.4400@fcom.cc.utah.edu> <2564@titccy.cc.titech.ac.jp> <1992Dec30.010216.2550@nobeltech.se>
Date: Wed, 30 Dec 92 06:17:59 GMT
Lines: 393

In article <1992Dec30.010216.2550@nobeltech.se> ppan@nobeltech.se (Per Andersson) writes:
>In article <2564@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>>
>>Do you know that Japan vote AGAINST ISO10646/Unicode, because it's not
>>good for Japanese?
>>
>>>So even if the Unicode standard ignores backward compatability
>>>with Japanese standards (and specific American and European standards),
>>>it better supports true internationalization.
>>
>>The reason of disapproval is not backward compatibility.
>>
>>The reason is that, with Unicode, we can't achieve internationalization.
>
>But, what has Unicode got to do with ISO-10646 ? Has the promised (very much
>needed IMHO) revision of Unicode arrived ? (1.1). Unicode is a 16bit character-
>set which I know did ugly things with asiatic languages. I thought 10646,
>which is a 32bit standard (by ISO !) did not, except for doing something
>the turks didn't like, don't remember what it was. Enlighten me !

The following is my treatise on the Unicode standard, and why I think it
is applicable (or can be made applicable) to the internationalization of
386BSD.
Let me restate here (as I do below) that I do not believe the attribution
of language by the ordinal value of a character is a goal of the storage
representation of the character glyph.  The goal of attributing the
language of a particular character is adequately satisfied by the choice
of I/O mechanisms and mappings, and can also be satisfied by localized
attribution of a file within the storage mechanism for that file.  The
only particular hurdle to overcome is the provision of multiple
concurrent language specific I/O mechanisms for the benefit of
translators.  This has been discussed elsewhere.

======================= ======================= =======================
======================= ======================= =======================
======================= ======================= =======================

ISO10646 is based on the Unicode standard.  The "ugly thing Unicode does
with asiatic languages" is exactly what it does with all other
languages: there is a single lexical assignment for each possible glyph.
This doesn't "screw up" anything, unless you expect your character set
to be attributed by language... ie:

                English '#'
                 character
                     |
                     v
    -+-+-+-+-+-+-+-+-+-+-+-+-          -+-+-+-+-+-+-+-+-+-+-+-+-
.... | |!|"|#|$|%|&|'|(|)|*| ...   ... | |!|"| |$|%|&|'|(|)|*| ...
    -+-+-+-+-+-+-+-+-+-+-+-+-          -+-+-+-+-+-+-+-+-+-+-+-+-
             US ASCII                           UK ASCII

In the example above, the lexical sets for US ASCII and UK ASCII are
not intersecting, even though they contain exactly the same glyphs for
all but one character.  Thus, by the lexical order of a character, you
can tell if it is an American, English, Japanese, or Chinese character.

The argument against Unicode, as I understand it so far, is that the
ordinal value of a character is not an indicator of which language it
came from... ie:

     Character set excerpt                       Which set (count)
    /----------------------\
    |                      |
    -+-+-+-+-+-+-+-+-+-+-+-+-                |
.... | |!|"| |$|%|&|'|(|)|*| ...             |   UK ASCII (96)
    -+-+-+-+-+-+-+-+-+-+-+-+-                |
      | | |   | | | | | | |                  |
      v v v   v v v v v v v                  |
    -+-+-+-+-+-+-+-+-+-+-+-+-                |
.... | |!|"|#|$|%|&|'|(|)|*| ...             |   US ASCII (96)
    -+-+-+-+-+-+-+-+-+-+-+-+-         \------\
      | | | | | | | | | | |                  |
      v v v v v v v v v v v                  |
    -+-+-+-+-+-+-+-+-+-+-+-+-     -+-+-+-+-  |
.... | |!|"|#|$|%|&|'|(|)|*| ... ... | | | | |   Unicode (34348)
    -+-+-+-+-+-+-+-+-+-+-+-+-     -+-+-+-+-

This demonstrates the "problem", wherein the lexical order of the
Unicode character set does not map to lexically adjacent characters in
the ASCII sets.  This behaviour is greatly exaggerated for
Japanese/Chinese character sets, which have relatively large numbers of
non-intersecting characters (as opposed to the 7 non-intersecting
characters for most Western European languages and US ASCII), thus
leaving a relatively large number of "gaps" in the lexical tables for a
particular Asian language... ie:

    * = A valid character in a language

    -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
.... |*|*| |*|*|*|*|*| |*|*| | | | |*|*|*|*|*|*|*| | ...    Japanese
    -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

    -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
.... | |*|*| |*|*| |*|*|*|*|*|*|*|*|*| |*|*| | |*|*| ...    Chinese
    -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

[ I do not view the attribution of language by the ordinal value of a
  character as a goal of the storage representation of the character
  glyph.  This is, perhaps, where I differ with the (stated by various
  individuals) Japanese assessment of the Unicode standard. ]

The fact that ISO-Latin-1 is used as the base character set for Unicode
(and is thus in a pre-existing lexical order), and that languages with
glyphs that are non-existent in other languages (ie: Tamil and Arabic)
are also in a pre-existing adjacent lexical order, is seen as
Eurocentric (or US centric, when one considers that the ISO-Latin-1 set
is a superset of US ASCII).
The rationale is that the Americans and Western Europeans get to keep
the majority of their existing equipment without modification, while the
Japanese are required to give up their existing investment in the JIS
standard and the equipment which supports it.

THIS IS PATENTLY FALSE ON SEVERAL GROUNDS:

1)	The storage mechanism (in this case, Unicode) does not have to
	bear a direct relationship to the input/output mechanism.  For
	instance, a "Unicode font" for use in Japan could contain only
	those glyphs used in Japanese, just as a Unicode font for use in
	the US can be limited to US ASCII.  As long as the additional
	characters are not used, there is no reason for them to actually
	exist in the font.

2)	Localization is, and for the foreseeable future will continue to
	be, necessary for the input mechanism.  This will be true of the
	large glyph count languages forever, because of the awkward input
	mechanisms required.  The most likely immediate change will be in
	the small glyph count languages, such as those of Western Europe.
	The additional penalty for small glyph count languages will be a
	16k table for Unicode to local ISO standard translation, and a
	512b table for local ISO standard to Unicode translation.  There
	is also the lookup penalty that will be paid in referencing these
	tables.

3)	The base change for the Japanese language will be a 128k table
	for 16 bits worth of short-to-short JIS to Unicode lexical
	translation, and a 128k table for the reverse translation for
	output.  The most likely immediate change here will be a direct
	change to the JIS translation table output, from JIS to Unicode,
	thus eliminating any additional translation penalty for Japanese
	over and above that required by the large glyph set.  In the case
	of most existing Japanese hardware, this is a simple ROM
	replacement.  Thus Japanese I/O will realize a performance
	increase over and above that realized by native input of Western
	languages.  This is a large glyph count only advantage, shared by
	Japanese and Chinese.
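The table translations in (2) amount to two direct-index arrays.  The
sketch below is mine, not code from any standard or existing system: the
names, the '?' fallback, and the use of an identity mapping are
assumptions for illustration (though the identity mapping happens to be
correct for ISO-Latin-1, which occupies Unicode ordinals 0x00-0xFF).
Note the sizes match the text: 256 * 2 bytes = 512b, 16384 * 1 byte = 16k.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical direct-index tables for one small glyph count
 * ("8-bit clean") language:
 *   to_unicode -- 512b: each local 8-bit code -> 16-bit Unicode ordinal
 *   to_local   -- 16k:  each of the low 16384 Unicode ordinals -> local
 *                 8-bit code (usable because the small glyph count
 *                 languages sit low in the Unicode lexical order)
 */
static uint16_t to_unicode[256];
static uint8_t  to_local[16384];

static void
init_tables(void)
{
	int i;

	/* Identity mapping: ISO-Latin-1 is Unicode 0x00-0xFF. */
	for (i = 0; i < 256; i++) {
		to_unicode[i] = (uint16_t)i;
		to_local[i] = (uint8_t)i;
	}
}

/* Translation on read: local 8-bit buffer -> Unicode 16-bit buffer. */
static void
read_xlate(const uint8_t *src, uint16_t *dst, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = to_unicode[src[i]];
}

/* Translation on write: Unicode 16-bit buffer -> local 8-bit buffer.
 * Ordinals outside the table become '?' (an assumed policy). */
static void
write_xlate(const uint16_t *src, uint8_t *dst, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = (src[i] < 16384) ? to_local[src[i]] : (uint8_t)'?';
}
```

The per-character cost is exactly the lookup penalty described above:
one indexed memory reference in each direction, no searching.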
4)	The storage requirements for text files in small glyph count
	languages double, from 8 bits to 16 bits per glyph, when using
	any internationalization mechanism which allows for large glyph
	count languages.  This is a significant concession from users of
	small glyph count languages in the interests of allowing for
	future internationalization from which they will not profit
	significantly.  Methods of avoiding this expansion (such as Runic
	encoding [the commonly accepted mechanism] and file system
	language attribution [my concept]) carry performance or
	loss-of-information penalties for small glyph count language
	users.  This is discussed later.

Unicode is NOT to the advantage of Western users, as has been claimed...
to the contrary, internationalization carries significant penalties for
the Western user, discounting entirely the programming issues involved.
The relatively low cost of enforced localization, by simply making code
8-bit clean, is highly attractive... but localization has significant
drawbacks for non-Western users, and in particular for large glyph count
languages like Japanese and Chinese.


PERFORMANCE AND STORAGE OPTIMIZATIONS

It is possible to overcome, or at least somewhat alleviate, several of
the disadvantages to Western users by selective optimization.  In
particular, the selection of optimistic assumptions in the initial
lexical order is of benefit to Western users, and explicitly for US
ASCII or ISO-Latin-1 users.  These benefits are for the most part
dependent upon the small glyph count of Western languages, and do not
translate to specific advantages of lexical order that large glyph count
languages could benefit from.  In other words, using a JIS set to enable
all Japanese characters to be lexically adjacent WOULD NOT RESULT IN THE
JAPANESE USER GAINING THE SAME BENEFITS.  THE BENEFITS *DEPEND* ON BOTH
LEXICAL ORDER AND ON THE FACT OF A SMALL GLYPH COUNT LANGUAGE.

1)	OPTIMIZATION:	Type 1 Runic Encoding.

	BENEFITS:	US ASCII and ISO-Latin-1.
		File size is an indicator of character count for
		beneficiary languages.

		File size is mathematically related to the character
		count for languages not intersecting the beneficiary
		ISO-Latin-1 glyph set.

	COSTS:	Additional storage for Runic encoded languages.

		Additional cost for character 0 in text files in
		beneficiary languages.

		File size is not an indicator of character count for
		languages which partially intersect the beneficiary
		ISO-Latin-1 glyph set.

		Difficulty in glyph substitution.

	IMPLEMENTATION:

		Type 1 Runic Encoding uses the NULL character (lexical
		position 0) to indicate a rune sequence.  For all users of
		the ISO-Latin-1 character set, or lexically adjacent
		subsets (ie: US ASCII), a NULL character is used to
		introduce a sequence of 8-bit characters representing a
		16-bit character whose ordinal value is in excess of the
		ordinal value 255.  Type 1 Runic encoding allows almost
		all existing Western files to remain unchanged, as nearly
		no text files contain NULL characters.

2)	OPTIMIZATION:	Type 2 Runic Encoding.

	BENEFITS:	US ASCII.  Lesser benefits for ISO-Latin-1 and
			other sets intersecting with US ASCII.

		Frequency encoding allows shorter sequences of characters
		to represent the majority of Runic encoded characters in
		large glyph set languages than is possible in Type 1
		Runic Encoding.

		File size is an indicator of character count for US ASCII
		text files.

		File size is mathematically related to the character
		count for languages not intersecting the beneficiary US
		ASCII glyph set.

	COSTS:	Additional storage for Runic encoded languages.

		Additional cost for characters with an ordinal value in
		excess of 127 in text files in beneficiary languages.

		File size is not an indicator of character count for
		languages which partially intersect the beneficiary US
		ASCII glyph set.

		Difficulty in glyph substitution.
	IMPLEMENTATION:

		Type 2 Runic Encoding uses characters with an ordinal
		value in the range 128-255 to introduce a sequence of
		8-bit characters representing a 16-bit character whose
		ordinal value is in excess of the ordinal value 127.
		Type 2 Runic encoding allows frequency encoding (by
		virtue of multiple introducers) of 128*256 (32k) glyphs.
		Since the Unicode set consists of 34348 glyphs, this is
		an average of two 8-bit characters per Runic encoded
		glyph for the vast majority (32768) of encoded glyphs.
		This is significant when compared to the average of three
		8-bit characters per encoded glyph for Type 1 Runic
		encoding.

[ For the obvious reason of file size no longer directly representing
  what has been, in the past, meaningful information, I personally
  dislike the concept of Runic encoding, even though it will tend to
  affect only those languages which are permissible in an "8-bit clean"
  internationalization environment, thus not affecting me personally.

  An additional penalty is in glyph substitution within documents.  A
  single glyph substitution could potentially change the size of a file
  -- requiring a shift of the contents of the file on the storage medium
  to account for the additional space requirements of the substitute
  glyph.

  A final penalty is that the input buffer mechanisms are not able to
  specify a language-independent field length for data input.  This is
  particularly nasty for GUI and screen based input, such as that found
  in most commercial spreadsheets and databases.

  For these reasons, I can not advocate Runic encoding, or the XPG3/XPG4
  standards which appear to require it. ]

3)	OPTIMIZATION:	Glyph Count Attribute (in file system)

	BENEFITS:	Non-direct beneficiary languages of the Runic
			Encoding, assuming use of Type 1 or Type 2 Runic
			Encoding.

	COSTS:	File system modification.

		File I/O modifications.
	IMPLEMENTATION:

		A Glyph Count Attribute, kept as part of the information
		on a file, would restore the ability to relate directly
		available information about a file to the character count
		of the text in the file.  This is something that is
		normally lost with Runic encoding.  There are not
		insignificant costs associated with this, in terms of the
		required modifications to the file I/O system to perform
		"glyph counting".  This is especially significant when
		dealing with the concept of glyph substitution in the
		file.

4)	OPTIMIZATION:	Language Attribution (in file system)

	BENEFITS:	All languages capable of existing in "8-bit
			clean" environments (all small glyph count
			languages).

	COSTS:	File system modification.

		File I/O based translation (buffer modification
		processing time).

		Requirement of conversion to change to/from a
		multilingual storage format with non-intersecting
		"8-bit clean" sets (ie: Arabic and US ASCII).

		Conversion utilities.

		Changes to UNIX utilities to allow access to and
		manipulation of attributions.

	IMPLEMENTATION:

		The Language Attribution, kept as part of the information
		on a file, allows 8-bit storage of any language for which
		an "8-bit clean" character set exists or can be produced.
		Unicode buffers of 16-bit glyphs are converted on write to
		the "8-bit clean" character set glyphs.  This requires a
		64k table to allow for direct index conversion.  In
		practice, this can be a 16k table, due to the lexical
		location of the small glyph count languages within the
		Unicode character set.  The conversion on read requires a
		512b table to allow direct index conversion of 256 8-bit
		values into the 256 corresponding Unicode 16-bit
		characters.

		[ This is clever, if I do say so myself ]

5)	OPTIMIZATION:	Sparse Character Sets For Language Localization

	BENEFITS:	Reduced character set/graphic requirements.

		Continued use of non-graphic devices (depends on being
		used in concert with Language Attribution).

		Reduced memory requirements for fonts in graphical
		environments (like X).
	COSTS:	Non-localized text files can not benefit.

		Device channel mapping for devices supporting less than
		the full Unicode character set.

		Translation tables and lookup time for devices supported
		using this mechanism.

	IMPLEMENTATION:

		[Pre-existing] Language specific fonts for "8-bit clean"
		languages can be used, as can existing fonts for Unicode
		character sets for systems like X, which allow sparse
		font sets.  Basically, since there is no need to display
		multilingual messages in a localized environment, there
		is no need to use fonts/devices which support an
		internationalized character set.  For instance, using a
		DEC VT220, the full ISO-Latin-1 font is available for
		use.  Thus for languages using only characters contained
		in the ISO-Latin-1 set, it is not necessary to supply
		other glyphs within the set, as long as output mapping of
		Unicode to the device set is done (preferably in the tty
		driver).  Similarly, JIS devices for Japanese I/O are not
		required to support, for instance, Finnish, Arabic, or
		French characters.

		[ This is also clever, in that it does not waste the
		  existing investments in hardware. ]


ADMITTED DRAWBACKS IN UNICODE:

The fact that lexical order is not maintained for all existing character
sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means
that a direct arithmetic translation is not possible for, for instance,
JIS to Unicode mappings; instead, a table lookup is required on input
and output.  This is not a significant penalty anywhere except for
languages which do not require multiple keystroke input on their
respective input devices and which are not lexically adjacent in the
Unicode set (ie: Turkish).  The penalty is a table lookup on I/O, rather
than a direct arithmetic translation (an add or subtract, depending on
direction).  NOTE THAT THIS IS NOT A PENALTY FOR JIS INPUT, SINCE
MULTICHARACTER INPUT SEQUENCES REQUIRE A TABLE LOOKUP TO IMPLEMENT
REGARDLESS OF THE STORAGE.
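The table-lookup-versus-arithmetic tradeoff just described can be made
concrete.  A character set that is lexically adjacent within Unicode
translates with a single add; a set whose characters are scattered (the
JIS case) needs a memory reference per character.  The base offset and
the four-entry table below are made-up stand-ins for illustration, not
real JIS or Unicode assignment data:

```c
#include <stdint.h>

/* Assumed: the local set occupies one contiguous Unicode run starting
 * at this (hypothetical) base ordinal. */
#define HYPOTHETICAL_BASE 0x0100

/* Lexically adjacent set: translation is pure arithmetic. */
static uint16_t
arith_to_unicode(uint8_t local)
{
	return (uint16_t)(HYPOTHETICAL_BASE + local);	/* one add */
}

/* Scattered set: a toy 4-entry table standing in for the 128k
 * short-to-short table described in the text (scattered ordinals
 * chosen arbitrarily). */
static const uint16_t jis_like_table[4] = {
	0x3041, 0x4E00, 0x30A1, 0xFF01
};

static uint16_t
table_to_unicode(uint8_t local)
{
	return jis_like_table[local & 3];	/* one memory reference */
}
```

The 128k tables mentioned earlier are this same idea scaled up: 65536
entries of two bytes each, indexed directly by the 16-bit source code.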
The fact that all character sets do not occur in their local lexical
order means that a particular character can not be identified as to
language by its ordinal value.  This is a small penalty to pay for the
vast reduction in storage requirements between a 32-bit and a 16-bit
character set that contains all required glyphs.

The fact that Japanese and Chinese characters can not be distinguished
as to language by ordinal value is no worse than the fact that one can
not distinguish an English 's' in the ISO-Latin-1 set from a French 's'.
The significance of language attribution must be handled by the input
(and potentially output) mechanisms in any case, and thus they must be
locale specific.  This is sufficient to provide information as to the
language being output, since input and output devices are generally
closely associated.

======================= ======================= =======================
======================= ======================= =======================
======================= ======================= =======================

					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------