Received: by minnie.vk1xwt.ampr.org with NNTP
	id AA5616 ; Fri, 01 Jan 93 01:51:02 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1992Dec28.062554.24144@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: University of Utah Computer Center
References: <id.M2XV.VTA@ferranti.com> <1992Dec18.043033.14254@midway.uchicago.edu> <1992Dec18.212323.26882@netcom.com> <1992Dec19.083137.4400@fcom.cc.utah.edu> <2564@titccy.cc.titech.ac.jp>
Date: Mon, 28 Dec 92 06:25:54 GMT
Lines: 164

In article <2564@titccy.cc.titech.ac.jp>, mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
|> In article <1992Dec19.083137.4400@fcom.cc.utah.edu>
|> terry@cs.weber.edu (A Wizard of Earth C) writes:
|>
|> >US Engineers produce software for the available market; because of the
|> >input difficulties involved in 6000+ glyph sets of symbols, there has
|> >been a marked lack of standardization in Japanese hardware and software.
|> >This means that the market in Japan consists of mostly "niche" markets,
|> >rather than being a commodity market.
|>
|> Do you know what Shift JIS is?  It's a de facto standard for character
|> encoding established by Microsoft, NEC, ASCII, etc., and common in the
|> Japanese PC market.

I am aware of JIS; however, even you must agree that the Japanese hardware
and software markets have not reached the level of "commodity hardware"
found elsewhere in the world (i.e., the US and Europe).  There are multiple
conflicting platforms, and thus multiple conflicting code sets for
implementation.  If we had to pick one platform to support (I am loath to
do this, as it means support for other platforms may be ignored until
something incompatible has fossilized), it would probably be the NEC 98,
which is not even PC compatible.

I think other mechanisms, such as ATOK, Wnn, and KanjiHand, deserve to be
examined.  One method would be to adopt exactly the input mechanism of
"Ichi-Taro" (the most popular NEC 98 word processor).

|> Now, DOS/V from IBM strongly supports Shift JIS.
|>
|> In the workstation market in Japan, some support Shift JIS, some
|> support EUC, and some support both.  Of course, many US companies
|> sell Japanized UNIX on their workstations.

I think this is precisely what we want to avoid -- localization.  The
basic difference, to my mind, is that localization involves the
maintenance of multiple code sets, whereas internationalization requires
maintenance of multiple data sets, a much smaller job.
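To make the difference concrete, here is a minimal sketch in C of the
"multiple data sets" approach: one binary, selecting a per-language
translation table at run time.  The table path, the file format (256
sixteen-bit values, local code -> Unicode, byte order left aside), and
the use of LANG to pick the table are assumptions invented for this
example, not anything taken from XPG or the Unicode standard.

/*
 * Data-driven localization sketch: the same code serves any language
 * for which a 512 byte mapping table file has been installed.
 */
#include <stdio.h>
#include <stdlib.h>

static unsigned short map[256];		/* 256 * 2 bytes = 512 bytes */

int
load_map(const char *lang)
{
	char path[128];
	FILE *fp;
	size_t n;

	/* hypothetical layout: one table file per language */
	(void) sprintf(path, "/usr/share/i18n/%.32s.map", lang);
	if ((fp = fopen(path, "r")) == NULL)
		return (-1);
	n = fread(map, sizeof (map[0]), 256, fp);
	(void) fclose(fp);
	return (n == 256 ? 0 : -1);
}

int
main(void)
{
	const char *lang = getenv("LANG");
	int i;

	if (lang == NULL || load_map(lang) == -1)
		for (i = 0; i < 128; i++)	/* fall back to 7 bit ASCII */
			map[i] = (unsigned short) i;
	return (0);
}

Supporting a new language is then a matter of shipping a new table file;
no code tree is touched.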
|> >This has changed somewhat with the Nintendo
|> >corporation's recent successes in Japan, where standardized hardware is
|>
|> I'm sure you are just joking here.

Yes, this was intended to be a jab at localization of a system as opposed
to internationalization.  The sets of Nintendo games in the US and Japan
are largely non-intersecting sets of software... games sold in the US are
not sold in Japan, and vice versa.  I feel that "localization" is the
"Nintendo" solution.  I also feel that we need to be striving for a level
of complexity well above that of a toy.

|> >Microsoft has adopted Unicode as a standard.  It will probably be the
|> >prevalent standard because of this -- the software world is too wrapped
|> >up in commodity (read "DOS") hardware for it to be otherwise.  Unicode
|> >has also done something that XPG4 has not: unified the Far Eastern and
|> >all other written character sets in a single font, with room for some
|> >expansion (to the full 16 bits) and a discussion of moving to a full
|> >32 bit mechanism.
|>
|> Do you know that Japan voted AGAINST ISO10646/Unicode, because it's not
|> good for Japanese?
|>
|> >So even if the Unicode standard ignores backward compatibility
|> >with Japanese standards (and specific American and European standards),
|> >it better supports true internationalization.
|>
|> The reason for disapproval is not backward compatibility.
|>
|> The reason is that, with Unicode, we can't achieve internationalization.

This I don't understand.  The maximum translation table from one 16 bit
value to another is 16k.  This means two 16k tables for translation
into/out of Unicode for input/output devices, and one 16k table and one
512 byte table if a compact storage method is used to remove the normal
2X storage penalty for 256 character languages, like most European
languages.  I don't see why the storage mechanism in any way affects the
validity of the data -- and thus I don't understand *why* you say "with
Unicode, we can't achieve internationalization."
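For what it's worth, here is a sketch in C of that 512 byte table, under
the assumption that the local 8 bit set is ISO 8859-1 (whose 256 codes
happen to occupy Unicode positions 0000-00FF exactly, so the table
degenerates to the identity; any other 8 bit set would simply ship a
different 256-entry table).  The function names are mine, not from any
standard.

#include <stdio.h>

typedef unsigned short unichar;		/* a 16 bit Unicode value */

static unichar to_uni[256];		/* 256 * 2 bytes = 512 bytes */

static void
init_latin1(void)
{
	int i;

	/* ISO 8859-1 maps straight onto Unicode 0000-00FF */
	for (i = 0; i < 256; i++)
		to_uni[i] = (unichar) i;
}

/* widen one stored 8 bit character: one table lookup on input */
static unichar
in_char(unsigned char c)
{
	return (to_uni[c]);
}

/*
 * narrow a Unicode value back to 8 bit storage; a non-identity set
 * would consult a reverse table here instead of a range check
 */
static int
out_char(unichar u)
{
	return (u <= 0xff ? (int) u : -1);
}

int
main(void)
{
	init_latin1();
	(void) printf("0xE9 -> U+%04X\n", in_char(0xe9));  /* e-acute */
	return (0);
}

The per-character cost is a single indexed load in each direction, which
is the basis for the "every character" point I make below.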
|> >XPG4, by adopting the JIS standard, appears to be
|> >ignoring HAN (Chinese) and many other languages covered by the Unicode
|> >standard.
|>
|> Unicode can not cover both Japanese and Chinese at the same time, because
|> the same code points are shared between similar characters in Japan
|> and in China.

I don't understand this, either.  This is like saying PC ASCII can not
cover both the US and the UK because the American and English pound signs
are not the same, or that it can't cover German or Dutch because of the 7
characters difference needed for support of those languages.

|> Of course, it is possible to LOCALIZE Unicode so that it produces
|> Japanese characters only or Chinese characters only.  But don't we
|> need internationalization?

The point of an internationalization effort (as *opposed* to a
localization effort) is the coexistence of languages within the same
processing means.  The point is not to produce something which is capable
of "only English" or "only French" or "only Japanese" at the flick of an
environment variable; the point is to produce something which is *data
driven* and localized by a change of data rather than by a change of
code.  To do otherwise would require the use of multiple code trees for
each language, which was the entire impetus for an internationalization
effort in the first place.

|> Or, how can I process a text containing both Japanese and Chinese?

Obviously, the input mechanisms will require localization for the set of
characters out of the Unicode set which will be used for a particular
language; there is no reason JIS input can not be used to input Unicode
as well as any other font; your argument that the lexical order of the
target language affects the usability of a storage standard is invalid.
Sure, the translation mechanisms may be *easier* to code given
localization of lexical ordering, but that doesn't mean they *can't* be
coded otherwise; if it was easy, we'd do it in hardware.  ;-).

|> >I think that Japanese
|> >users (and European and American users, if nothing is done about storage
|> >encoding to 8 bit sets) are going to have to live with the drawbacks of
|> >the standard for a very long time (the primary one being two 16K tables
|> >for input and output for each language representable in 8 bits, and two
|> >16k tables for runic mapping for languages, like Japanese, which don't
|> >fit on keyboards without postprocessing).
|>
|> What?  16K?  Do you think 16K is LARGE?
|>
|> Then, you know nothing about how Japanese is input.  We are happily using
|> several hundred kilobytes or even several megabytes of electronic
|> dictionary, even on PCs.

No, I don't think 16k is large; however, the drawback is not in the size
of the tables, but in their use on every character in from an input
device or out to an output device.

In addition, an optimization of the file system to allow for "lexically
compact storage" (my term) is necessary to make Americans and Europeans
accept the mechanism.  This involves yet another set of
localization-specific storage tables to translate from an ISO or other
local font to Unicode and back on attributed file storage.  To do
otherwise would require 16 bit storage of files, or worse, runic encoding
of any non-US ASCII characters in a file.  This either doubles the file
size for all text files (something the west _will_not_accept_), or
"pollutes" the files (all files except those stored in US-ASCII have file
sizes which no longer reflect true character counts for the file).

Admittedly, these mechanisms are adaptable for XPG4 (not widely
available) and XPG3 (does not support eastern languages), but the
Microsoft adoption of Unicode tells us that at least 90% of the market is
now committed to Unicode, if not now, then in the near future.  I would
like to hear any arguments anyone has regarding *why* Unicode is "bad"
and should not be adopted in the remaining 10% of the market (thus
ensuring incompatibility and a lack of interoperability which is
guaranteed to prevent penetration of the existing 90%).


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-------------------------------------------------------------------------------
"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------