Received: by minnie.vk1xwt.ampr.org with NNTP id AA5981 ; Sat, 02 Jan 93 13:02:07 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!uunet!usc!zaphod.mps.ohio-state.edu!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan5.090747.29232@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <2564@titccy.cc.titech.ac.jp> <1992Dec28.062554.24144@fcom.cc.utah.edu> <2615@titccy.cc.titech.ac.jp>
Date: Tue, 5 Jan 93 09:07:47 GMT
Lines: 268

In article <2615@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
>In article <1992Dec28.062554.24144@fcom.cc.utah.edu>
>	terry@cs.weber.edu (A Wizard of Earth C) writes:
>
>>|> Do you know what Shift JIS is?  It's a de facto standard for character
>>|> encoding established by Microsoft, NEC, ASCII etc. and common in the
>>|> Japanese PC market.
>>
>> [ ... ]
>
>Sigh... With DOS/V you can use Japanese on YOUR "commodity hardware".
>
>>I think other mechanisms, such as ATOK, Wnn, and KanjiHand deserve to be
>>examined.  One method would be to adopt exactly the input mechanism of
>>"Ichi-Taro" (the most popular NEC 98 word processor).
>
>They run also on IBM/PC.

Well, then this makes them worthy of consideration in the same light as
Shift JIS as input mechanisms.

>>|> In the workstation market in Japan, some support Shift JIS, some
>>|> support EUC and some support both.  Of course, many US companies
>>|> sell Japanized UNIX on their workstations.
>>
>>I think this is precisely what we want to avoid -- localization.  The
>>basic difference, to my mind, is that localization involves the
>>maintenance of multiple code sets, whereas internationalization requires
>>maintenance of multiple data sets, a much smaller job.
>
>>This I don't understand.  The maximum translation table from one 16 bit
>>value to another is 16K.
>
>WHAAAAT?  It's 128KB, not 16K.

Not true for designated sets, the maximum of which spans a 16K region.  If,
for instance, I designate that I am using the German language, I can say
that there will be no input of characters other than German characters;
thus the set of characters is reduced to the span from the first German
character to the last German character.  16K is sufficient for most
spanning sets for a particular language, as it covers the lexical distance
between the characters of a particular designated language.

128K is only necessary if you are to include translation of all characters;
for a translation involving all characters, no spanning set smaller than
the full set exists.  Thus a 128K translation table is [almost] never
useful (an exception exists for JIS input translation, and that table can
exist in the input mechanism just as easily as the existing JIS ordering
table does now).

>>This means 2 16K tables for translation into/out of Unicode for
>>Input/Output devices,
>
>I'm afraid you don't know what Unicode is.  What do you mean, "tables for
>translation" is?

In this particular case, I was thinking of an ASCII->Unicode or
Unicode->ASCII translation to reduce the storage penalty paid by Western
languages for the potential support for large glyph set (ie: greater than
256 character) languages.  Failure to do this translation (presumably to an
ISO font for storage) will result in 1) space loss due to raw encoding, or
2) information loss due to Runic encoding (ie: the byte count of the file
no longer reflects the file's character count).
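To make that optimization concrete, here is a rough sketch of the kind of
storage translation I have in mind (nothing like this exists in 386BSD
today, and the table and function names are invented for illustration).
The stored form here is 8-bit ISO 8859-1, whose 256 code points map
one-for-one onto Unicode U+0000 through U+00FF, so a Western text file
stays one byte per character on disk and is widened to 16-bit Unicode only
when a Unicode-based process touches it:

#include <stddef.h>

typedef unsigned short unichar;         /* 16-bit Unicode value */

static unichar local_to_uni[256];       /* stored form -> Unicode */

void
init_tables(void)
{
        int i;

        /* for ISO 8859-1 the table is just the identity mapping */
        for (i = 0; i < 256; i++)
                local_to_uni[i] = (unichar)i;
}

/* widen an 8-bit stored buffer into Unicode for processing */
void
widen(const unsigned char *in, unichar *out, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                out[i] = local_to_uni[in[i]];
}

/* narrow back for storage; fails when a character has no local form */
int
narrow(const unichar *in, unsigned char *out, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++) {
                /*
                 * Identity case only; a real local set would consult a
                 * reverse table here.
                 */
                if (in[i] > 0xff)
                        return (-1);
                out[i] = (unsigned char)in[i];
        }
        return (0);
}

A different local code (or a 16K spanning set for a designated language)
would simply load a different table; the Unicode form seen by programs is
the same either way, and a monolingual Western file keeps its byte count
equal to its character count.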
>>I don't see why the storage mechanism in any way affects the validity of
>>the data
>
>*I* don't see why the storage mechanism in any way affects the validity of
>the data
>
>>and thus I don't understand *why* you say "with Unicode, we can't
>>achieve internationalization."
>
>Because we can't process data mixed with Japanese and Chinese.

Unicode is a storage mechanism, *not* the display mechanism.  The Japanese
and Chinese data within the file can be tagged as such, and this
information can be used by programs (such as the print utility) to perform
font selection.  If the use of embedded control information is so much more
undesirable than Runic encoding, which seems to be accepted without a peep,
then a DEC-style CDA (Compound Document Architecture) can be used.

In any case, font and language selection is not a property of the storage
mechanism, but a property of the data stored.  Thus the storage mechanism
(in this case, Unicode) is not responsible for the language tagging.

A mixed Japanese and Chinese document is a multilingual document; I can go
into the rationale as to why this is not a penalty specific to languages
with intersecting glyph variants if necessary.  I think that this is beside
the point (which is enabling 386BSD for data-driven localization).

>>I don't understand this, either.  This is like saying PC ASCII cannot
>>cover both the US and the UK because the American and English pound signs
>>are not the same, or that it can't cover German or Dutch because of the
>>7-character difference needed for support of those languages.
>
>Wrong.  The US and UK sign are the same character, while they might be
>assigned different code points in different countries.
>
>Thus, in a universal coded character set, it is correct to assign a
>single code point to the single pound sign, even though the character
>is used both in the US and the UK.
>
>But corresponding characters in China/Japan, which do not share the
>same graphical representation even on moderate quality printers and are
>thus different characters, are assigned the same code point in Unicode.

This is not a requirement for unification if an assumption of language
tagging for the files (either externally to the files or within the data
stream for mixed language documents) is made.  I am making that assumption.

The fact is that localization is a much more desirable goal than
multilingual word processing.  Thus if a penalty is to be paid, it should
be paid in the multinational application rather than in all localized
applications.

One could make the same argument regarding the unification of the Latin,
Latvian, and Lappish { SMALL LETTER G CEDILLA }, since there are glyph
variants there as well.

The point is that the storage technology is divorced from the font
selection process; that is something that is done either based on an ISO
2022 style mechanism within the document itself, or on a per file basis
during localization, or in a Compound Document (what Unicode calls a
"Fancy text" file).  The font selection is based on the language tagging of
the file at the time of display (to a screen or to a hard copy device).
The use of a Unicode character set for storage does not necessarily imply
the use of a Unicode (or other unified) font for display.  When printing
Latin text, a Latin font will be used; when printing Latvian text, a
Latvian font, etc.
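As a rough sketch of that split (mine, not anything defined by Unicode or
present in 386BSD; the run structure, language tags, and font names are all
invented), the display side might carry a language tag alongside the
Unicode text and pick a localized font from the tag at output time:

#include <stddef.h>
#include <string.h>

typedef unsigned short unichar;         /* Unicode storage form */

/*
 * A run of text as the display side sees it: the characters are stored
 * as Unicode, but the language tag travels with the data (externally,
 * or via an ISO 2022 / compound-document style mechanism).
 */
struct text_run {
        const char      *lang;          /* "ja", "zh", "lv", "de", ... */
        const unichar   *text;
        size_t          len;
};

/* pick a per-language font for output; the font names are made up */
const char *
font_for_run(const struct text_run *run)
{
        if (strcmp(run->lang, "ja") == 0)
                return ("kanji16");     /* Japanese glyph shapes */
        if (strcmp(run->lang, "zh") == 0)
                return ("hanzi16");     /* Chinese glyph shapes */
        if (strcmp(run->lang, "lv") == 0)
                return ("latvian8");    /* Latvian G cedilla shape */
        return ("iso8859-1");           /* default Western font */
}

The storage form never changes; only the tag consulted at display time
does, which is why a unified code point in the file does not force a
unified glyph on the screen or the page.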
>>|> Of course, it is possible to LOCALIZE Unicode so that it produces
>>|> Japanese characters only or Chinese characters only.  But don't we
>>|> need internationalization?
>>
>>The point of an internationalization effort (as *opposed* to a
>>localization effort) is the coexistence of languages within the same
>>processing means.  The point is not to produce something which is capable
>>of "only English" or "only French" or "only Japanese" at the flick of an
>>environment variable; the point is to produce something which is *data
>>driven* and localized by a change of data rather than by a change of
>>code.  To do otherwise would require the use of multiple code trees for
>>each language, which was the entire impetus for an internationalization
>>effort in the first place.
>
>That is THE problem of Unicode.

It's not a problem if you don't expect your storage mechanism to dictate
your display mechanism.

>I was informed that Microsoft will provide a LOCALIZATION mechanism
>to print corresponding Chinese/Japanese characters of Unicode
>differently.

Yes.  We will have to do the same if we use Unicode.

>So, HOW can we MIX Chinese and Japanese without LOCALIZATION?

You can't... but neither can you type in both Chinese and Japanese on the
same keyboard without switching the "LOCALIZATION" of the keyboard to the
language being input.  This gives an automatic attribution cue to the
compounding mechanism (ISO 2022 style or DEC style or whatever).

>>This involves yet another set of localization-specific storage tables to
>>translate from an ISO or other local font to Unicode and back on
>>attributed file storage.
>
>FILE ATTRIBUTE!!!!!?????  *IT* *IS* *EVIL*.  Do you really know UNIX?

The basis for this is the assumption of an optimization necessary to reduce
the storage requirements of Western text files to current levels, and the
assumption that it would be better if the file size of a monolingual
document (the most common kind of document) bore an arithmetic relationship
to the number of glyphs within the file.  This will incidentally reduce the
number of bytes required for the storage of Japanese documents from 2-5
down to 2 per glyph.  Of course you can always turn the optimization off...

>How can you "cat" two files with different file attributes?

By localizing the display output mechanism.  In particular, in a
multilingual environment (the only place you will ever have two files with
different file attributes), there would be no alternative to a graphic or
other complex display device (be it printer or X terminal).  Ideally, there
will be a language-localized font for every supported language, defining
only the glyphs in Unicode which pertain to that language.

>What attribute can you attach to a semi binary file, in which some field
>contains an ASCII string and some other field contains a JIS string?

The point of a *data driven* localization is to avoid the existence of
semi-binary files.  In terms of internationalization, you are describing
not a generally usable application, but a bilingual application.  This is
against the rules.

It is possible to do language tagging of the users, wherein all output not
associated with a language tag (such as data embedded in a printf() call)
in a badly behaved application is assumed to be tagged for output as a
particular language.  Thus a German user on a 7-bit VT220 set for the
German language NRCS (National Replacement Character Set) might see some
odd characters on their screen when running a badly behaved application
from France.
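To make the "cat" answer concrete, here is a sketch of a localized display
filter.  It is entirely hypothetical -- UNIX has no file attributes today,
so lang_of() below simply fakes the attribute lookup from the file name --
but it shows where the localization lives: in the output mechanism, not in
the file data.

#include <stdio.h>
#include <string.h>

/*
 * HYPOTHETICAL: derive the language attribute of a file.  A real
 * implementation would read it from wherever the attribute is kept.
 */
static const char *
lang_of(const char *path)
{
        const char *dot = strrchr(path, '.');

        if (dot != NULL && strcmp(dot, ".ja") == 0)
                return ("ja");
        return ("latin1");              /* default Western attribute */
}

int
main(int argc, char *argv[])
{
        int i;

        for (i = 1; i < argc; i++) {
                const char *lang = lang_of(argv[i]);
                FILE *fp = fopen(argv[i], "r");
                int c;

                if (fp == NULL)
                        continue;
                /*
                 * A real filter would switch the output device's font
                 * (or emit ISO 2022 style designations) here, based on
                 * the attribute, before copying the bytes through.
                 */
                printf("\n[switching display to %s font]\n", lang);
                while ((c = getc(fp)) != EOF)
                        putchar(c);
                fclose(fp);
        }
        return (0);
}

The files themselves stay flat; the penalty for mixing languages is paid
once, in the display mechanism, rather than in every localized application.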
>>To do otherwise would require 16 bit storage of files, or worse, runic
>>encoding of any non-US ASCII characters in a file.  This either doubles
>>the file size for all text files (something the west _will_not_accept_),
>
>Do you know what UTF is?

Yes, Runic encoding.  I'll post the Plan 9 and Metis sources for it if you
want.  I describe it fairly accurately below as "pollution":

>>or "pollutes" the files (all files except those stored in US-ASCII have
>>file sizes which no longer reflect the true character counts of the
>>file).

Destruction of this information is basically unacceptable for Western users
and other multinational users currently used to internationalization by way
of 8-bit clean environments.

>That's already true for languages like Japanese, whose characters are
>NOT ALWAYS (but sometimes) represented with a single byte.
>
>But, what's wrong with that?

It doesn't have to be true of Japanese, especially with a fixed storage
coding.  What's wrong with it is that file size equates to character count
for Western users, and there are users who depend on this information.

As a VMS user (I know, *bletch*), I had the wonderful experience of finding
out that tell() didn't return a byte offset for record oriented files.  As
a UNIX user, I depended on that information, and it was vastly frustrating
to me when the information was destroyed by the storage mechanism.  This is
precisely analogous, only picking the file size rather than the data within
the file.

Consider for a moment, if you will, changing the first character of a field
in a 1M file.  Not only does this cause the record size to become variant
on the data within it (thus rendering a computed lseek() useless, since the
records are no longer fixed size), but it requires that the entire file
contents be shifted to accommodate what used to be the rewrite of a single
block.

I realize that as a Japanese user, you are "used to" this behaviour; the
average Western user (or *anyone* using a language representable in only an
8-bit clean environment) is *not*.

>>Admittedly, these mechanisms are adaptable for XPG4 (not widely
>>available) and XPG3 (does not support eastern languages), but the
>>Microsoft adoption of Unicode tells us that at least 90% of the market is
>>now committed to Unicode, if not now, then in the near future.
>
>Do you think Microsoft will use file attributes?

Probably not.  Most likely, the fact that Microsoft did not invent it and
the fact that the HPFS for NT is pretty well locked in place argue against
it.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------