Received: by minnie.vk1xwt.ampr.org with NNTP id AA5981 ; Sat, 02 Jan 93 13:02:07 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!uunet!usc!zaphod.mps.ohio-state.edu!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan5.090747.29232@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <2564@titccy.cc.titech.ac.jp> <1992Dec28.062554.24144@fcom.cc.utah.edu> <2615@titccy.cc.titech.ac.jp>
Date: Tue, 5 Jan 93 09:07:47 GMT
Lines: 268

In article <2615@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
>In article <1992Dec28.062554.24144@fcom.cc.utah.edu>
>	terry@cs.weber.edu (A Wizard of Earth C) writes:
>
>>|> Do you know what Shift JIS is?  It's a de facto standard for character
>>|> encoding established by Microsoft, NEC, ASCII etc. and common in the
>>|> Japanese PC market.
>>
>> [ ... ]
>
>Sigh... With DOS/V you can use Japanese on YOUR "commodity hardware".
>
>>I think other mechanisms, such as ATOK, Wnn, and KanjiHand deserve to be
>>examined.  One method would be to adopt exactly the input mechanism of
>>"Ichi-Taro" (the most popular NEC 98 word processor).
>
>They run also on IBM/PC.

Well, then this makes them worthy of consideration in the same light as
Shift JIS as input mechanisms.

>>|> In the workstation market in Japan, some support Shift JIS, some
>>|> support EUC and some support both.  Of course, many US companies
>>|> sell Japanized UNIX on their workstations.
>>
>>I think this is precisely what we want to avoid -- localization.  The
>>basic difference, to my mind, is that localization involves the
>>maintenance of multiple code sets, whereas internationalization requires
>>maintenance of multiple data sets, a much smaller job.
>
>>This I don't understand.  The maximum translation table from one 16 bit
>>value to another is 16K.
>
>WHAAAAT?  It's 128KB, not 16K.

Not true for designated sets, the maximum of which spans a 16K region.  If,
for instance, I designate that I am using the German language, I can say
that there will be no input of characters other than German characters;
thus the set of characters is reduced to the span from the first German
character to the last German character.  16K is sufficient for most
spanning sets for a particular language, as it covers the lexical distance
between the characters of a particular designated language.

128K is only necessary if you are to include translation of all characters;
for a translation involving all characters, no spanning set smaller than
the full set exists.  Thus a 128K translation table is [almost] never
useful (an exception exists for JIS input translation, and that table can
exist in the input mechanism just as easily as the existing JIS ordering
table does now).

>>This means 2 16K tables for translation into/out of Unicode for
>>Input/Output devices,
>
>I'm afraid you don't know what Unicode is.  What do you mean, "tables for
>translation" is?

In this particular case, I was thinking of an ASCII->Unicode or
Unicode->ASCII translation to reduce the storage penalty paid by Western
languages for the potential support for large glyph set (ie: greater than
256 character) languages.  Failure to do this translation (presumably to an
ISO font for storage) will result in 1) space loss due to raw encoding, or
2) information loss due to Runic encoding (ie: the byte count of the file
no longer reflects the file's character count).
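To make that optimization concrete, here is a rough sketch of the kind of
storage translation I have in mind (nothing like this exists in 386BSD
today, and the table and function names are invented for illustration).
The stored form here is 8-bit ISO 8859-1, whose 256 code points map
one-for-one onto Unicode U+0000 through U+00FF, so a Western text file
stays one byte per character on disk and is widened to 16-bit Unicode only
when a Unicode-based process touches it:

#include <stddef.h>

typedef unsigned short unichar;         /* 16-bit Unicode value */

static unichar local_to_uni[256];       /* stored form -> Unicode */

void
init_tables(void)
{
        int i;

        /* for ISO 8859-1 the table is just the identity mapping */
        for (i = 0; i < 256; i++)
                local_to_uni[i] = (unichar)i;
}

/* widen an 8-bit stored buffer into Unicode for processing */
void
widen(const unsigned char *in, unichar *out, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                out[i] = local_to_uni[in[i]];
}

/* narrow back for storage; fails when a character has no local form */
int
narrow(const unichar *in, unsigned char *out, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++) {
                /*
                 * Identity case only; a real local set would consult a
                 * reverse table here.
                 */
                if (in[i] > 0xff)
                        return (-1);
                out[i] = (unsigned char)in[i];
        }
        return (0);
}

A different local code (or a 16K spanning set for a designated language)
would simply load a different table; the Unicode form seen by programs is
the same either way, and a monolingual Western file keeps its byte count
equal to its character count.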
>>I don't see why the storage mechanism in any way affects the validity of
>>the data
>
>*I* don't see why the storage mechanism in any way affects the validity of
>the data
>
>>and thus I don't understand *why* you say "with Unicode, we can't
>>achieve internationalization."
>
>Because we can't process data mixed with Japanese and Chinese.

Unicode is a storage mechanism, *not* the display mechanism.  The Japanese
and Chinese data within the file can be tagged as such, and this
information can be used by programs (such as the print utility) to perform
font selection.  If the use of embedded control information is so much more
undesirable than Runic encoding, which seems to be accepted without a peep,
then a DEC-style CDA (Compound Document Architecture) can be used.

In any case, font and language selection is not a property of the storage
mechanism, but a property of the data stored.  Thus the storage mechanism
(in this case, Unicode) is not responsible for the language tagging.

A mixed Japanese and Chinese document is a multilingual document; I can go
into the rationale as to why this is not a penalty specific to languages
with intersecting glyph variants if necessary.  I think that this is beside
the point (which is enabling 386BSD for data-driven localization).

>>I don't understand this, either.  This is like saying PC ASCII cannot
>>cover both the US and the UK because the American and English pound signs
>>are not the same, or that it can't cover German or Dutch because of the
>>7-character difference needed for support of those languages.
>
>Wrong.  The US and UK sign are the same character, while they might be
>assigned different code points in different countries.
>
>Thus, in a universal coded character set, it is correct to assign a
>single code point to the single pound sign, even though the character
>is used both in the US and the UK.
>
>But corresponding characters in China/Japan, which do not share the
>same graphical representation even on moderate quality printers and are
>thus different characters, are assigned the same code point in Unicode.

This is not a requirement for unification if an assumption of language
tagging for the files (either externally to the files or within the data
stream for mixed language documents) is made.  I am making that assumption.

The fact is that localization is a much more desirable goal than
multilingual word processing.  Thus if a penalty is to be paid, it should
be paid in the multinational application rather than in all localized
applications.

One could make the same argument regarding the unification of the Latin,
Latvian, and Lappish { SMALL LETTER G CEDILLA }, since there are glyph
variants there as well.

The point is that the storage technology is divorced from the font
selection process; that is something that is done either based on an ISO
2022 style mechanism within the document itself, or on a per file basis
during localization, or in a Compound Document (what Unicode calls a
"Fancy text" file).  The font selection is based on the language tagging of
the file at the time of display (to a screen or to a hard copy device).
The use of a Unicode character set for storage does not necessarily imply
the use of a Unicode (or other unified) font for display.  When printing
Latin text, a Latin font will be used; when printing Latvian text, a
Latvian font, etc.
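As a rough sketch of that split (mine, not anything defined by Unicode or
present in 386BSD; the run structure, language tags, and font names are all
invented), the display side might carry a language tag alongside the
Unicode text and pick a localized font from the tag at output time:

#include <stddef.h>
#include <string.h>

typedef unsigned short unichar;         /* Unicode storage form */

/*
 * A run of text as the display side sees it: the characters are stored
 * as Unicode, but the language tag travels with the data (externally,
 * or via an ISO 2022 / compound-document style mechanism).
 */
struct text_run {
        const char      *lang;          /* "ja", "zh", "lv", "de", ... */
        const unichar   *text;
        size_t          len;
};

/* pick a per-language font for output; the font names are made up */
const char *
font_for_run(const struct text_run *run)
{
        if (strcmp(run->lang, "ja") == 0)
                return ("kanji16");     /* Japanese glyph shapes */
        if (strcmp(run->lang, "zh") == 0)
                return ("hanzi16");     /* Chinese glyph shapes */
        if (strcmp(run->lang, "lv") == 0)
                return ("latvian8");    /* Latvian G cedilla shape */
        return ("iso8859-1");           /* default Western font */
}

The storage form never changes; only the tag consulted at display time
does, which is why a unified code point in the file does not force a
unified glyph on the screen or the page.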
>>|> Of course, it is possible to LOCALIZE Unicode so that it produces
>>|> Japanese characters only or Chinese characters only.  But don't we
>>|> need internationalization?
>>
>>The point of an internationalization effort (as *opposed* to a
>>localization effort) is the coexistence of languages within the same
>>processing means.  The point is not to produce something which is capable
>>of "only English" or "only French" or "only Japanese" at the flick of an
>>environment variable; the point is to produce something which is *data
>>driven* and localized by a change of data rather than by a change of
>>code.  To do otherwise would require the use of multiple code trees for
>>each language, which was the entire impetus for an internationalization
>>effort in the first place.
>
>That is THE problem of Unicode.

It's not a problem if you don't expect your storage mechanism to dictate
your display mechanism.

>I was informed that Microsoft will provide a LOCALIZATION mechanism
>to print corresponding Chinese/Japanese characters of Unicode
>differently.

Yes.  We will have to do the same if we use Unicode.

>So, HOW can we MIX Chinese and Japanese without LOCALIZATION?

You can't... but neither can you type in both Chinese and Japanese on the
same keyboard without switching the "LOCALIZATION" of the keyboard to the
language being input.  This gives an automatic attribution cue to the
compounding mechanism (ISO 2022 style or DEC style or whatever).

>>This involves yet another set of localization-specific storage tables to
>>translate from an ISO or other local font to Unicode and back on
>>attributed file storage.
>
>FILE ATTRIBUTE!!!!!?????  *IT* *IS* *EVIL*.  Do you really know UNIX?

The basis for this is the assumption of an optimization necessary to reduce
the storage requirements of Western text files to current levels, and the
assumption that it would be better if the file size of a monolingual
document (the most common kind of document) bore an arithmetic relationship
to the number of glyphs within the file.  This will incidentally reduce the
number of bytes required for the storage of Japanese documents from 2-5
down to 2 per glyph.  Of course you can always turn the optimization off...

>How can you "cat" two files with different file attributes?

By localizing the display output mechanism.  In particular, in a
multilingual environment (the only place you will ever have two files with
different file attributes), there would be no alternative to a graphic or
other complex display device (be it printer or X terminal).  Ideally, there
will be a language-localized font for every supported language, defining
only the glyphs in Unicode which pertain to that language.

>What attribute can you attach to a semi binary file, in which some field
>contains an ASCII string and some other field contains a JIS string?

The point of a *data driven* localization is to avoid the existence of
semi-binary files.  In terms of internationalization, you are describing
not a generally usable application, but a bilingual application.  This is
against the rules.

It is possible to do language tagging of the users, wherein all output not
associated with a language tag (such as data embedded in a printf() call)
in a badly behaved application is assumed to be tagged for output as a
particular language.  Thus a German user on a 7-bit VT220 set for the
German language NRCS (National Replacement Character Set) might see some
odd characters on their screen when running a badly behaved application
from France.
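To make the "cat" answer concrete, here is a sketch of a localized display
filter.  It is entirely hypothetical -- UNIX has no file attributes today,
so lang_of() below simply fakes the attribute lookup from the file name --
but it shows where the localization lives: in the output mechanism, not in
the file data.

#include <stdio.h>
#include <string.h>

/*
 * HYPOTHETICAL: derive the language attribute of a file.  A real
 * implementation would read it from wherever the attribute is kept.
 */
static const char *
lang_of(const char *path)
{
        const char *dot = strrchr(path, '.');

        if (dot != NULL && strcmp(dot, ".ja") == 0)
                return ("ja");
        return ("latin1");              /* default Western attribute */
}

int
main(int argc, char *argv[])
{
        int i;

        for (i = 1; i < argc; i++) {
                const char *lang = lang_of(argv[i]);
                FILE *fp = fopen(argv[i], "r");
                int c;

                if (fp == NULL)
                        continue;
                /*
                 * A real filter would switch the output device's font
                 * (or emit ISO 2022 style designations) here, based on
                 * the attribute, before copying the bytes through.
                 */
                printf("\n[switching display to %s font]\n", lang);
                while ((c = getc(fp)) != EOF)
                        putchar(c);
                fclose(fp);
        }
        return (0);
}

The files themselves stay flat; the penalty for mixing languages is paid
once, in the display mechanism, rather than in every localized application.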
>>To do otherwise would require 16 bit storage of files, or worse, runic
>>encoding of any non-US ASCII characters in a file.  This either doubles
>>the file size for all text files (something the west _will_not_accept_),
>
>Do you know what UTF is?

Yes, Runic encoding.  I'll post the Plan 9 and Metis sources for it if you
want.  I describe it fairly accurately below as "pollution":

>>or "pollutes" the files (all files except those stored in US-ASCII have
>>file sizes which no longer reflect the true character counts of the
>>file).

Destruction of this information is basically unacceptable for Western users
and other multinational users currently used to internationalization by way
of 8-bit clean environments.

>That's already true for languages like Japanese, whose characters are
>NOT ALWAYS (but sometimes) represented with a single byte.
>
>But, what's wrong with that?

It doesn't have to be true of Japanese, especially with a fixed storage
coding.  What's wrong with it is that file size equates to character count
for Western users, and there are users who depend on this information.

As a VMS user (I know, *bletch*), I had the wonderful experience of finding
out that tell() didn't return a byte offset for record oriented files.  As
a UNIX user, I depended on that information, and it was vastly frustrating
to me when the information was destroyed by the storage mechanism.  This is
precisely analogous, only picking the file size rather than the data within
the file.

Consider for a moment, if you will, changing the first character of a field
in a 1M file.  Not only does this cause the record size to become variant
on the data within it (thus rendering a computed lseek() useless, since the
records are no longer fixed size), but it requires that the entire file
contents be shifted to accommodate what used to be the rewrite of a single
block.

I realize that as a Japanese user, you are "used to" this behaviour; the
average Western user (or *anyone* using a language representable in only an
8-bit clean environment) is *not*.

>>Admittedly, these mechanisms are adaptable for XPG4 (not widely
>>available) and XPG3 (does not support eastern languages), but the
>>Microsoft adoption of Unicode tells us that at least 90% of the market is
>>now committed to Unicode, if not now, then in the near future.
>
>Do you think Microsoft will use file attributes?

Probably not.  Most likely, the fact that Microsoft did not invent it and
the fact that the HPFS for NT is pretty well locked in place argue against
it.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------