Received: by minnie.vk1xwt.ampr.org with NNTP id AA5767 ; Fri, 01 Jan 93 01:55:14 EST
Path: sserve!manuel.anu.edu.au!munnari.oz.au!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.unix.bsd
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 30 Dec 1992 17:48:04 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 97
Message-ID: <1ht8v4INNj7i@rodan.UU.NET>
References: <2564@titccy.cc.titech.ac.jp> <1992Dec30.010216.2550@nobeltech.se> <1992Dec30.061759.8690@fcom.cc.utah.edu>
NNTP-Posting-Host: rodan.uu.net
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages

In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>The "ugly thing Unicode does with asiatic languages" is exactly what it
>does with all other languages:  There is a single lexical assignment
>for each possible glyph.
>....
>ADMITTED DRAWBACKS IN UNICODE:
>
>The fact that lexical order is not maintained for all existing character
>sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means that
>a direct arithmetic translation is not possible for...

It means that:

1) "mechanistic" conversion between upper and lower case is impossible
   (as are case-insensitive comparisons)

   Example:     Latin     T -> t
                Cyrillic  T -> m
                Greek     T -> ?

   This property alone renders Unicode useless for any business
   applications.

2) there is no trivial way to sort anything.
   An elementary sort program will require access to enormous
   tables for all possible languages.

   English:     A B C D E ... T ...
   Russian:     A .. B ... E ... C T ...

3) there is no reasonable way to do hyphenation.
   Since there is no way to tell the language from the text, there is
   no way to make any reasonable attempt to hyphenate.
   [OX -- which language is this word from?]

   Good-bye word processors and formatters?
4) the "similar glyphs" in Unicode are often SLIGHTLY different
   typographical glyphs -- everybody who has ever dealt with
   international publishing knows that fonts are designed as a WHOLE
   and every letter is designed with all the others in mind -- i.e. X in
   Cyrillic is NOT the same X as in Latin even if the fonts are
   variations of the same style. I wish you could see how ugly Russian
   texts printed on American desktop publishing systems with "a few
   characters added" are.

   In reality it means that Unicode is not a solution for
   typesetting.

Having unique glyphs works ONLY WITHIN a group of languages which
are based on variations of a single alphabet with non-conflicting
alphabetical ordering and sets of vowels. You can do that for
European languages. An attempt to do it for different groups (like
Cyrillic and Latin) is disastrous at best -- we already tried it
and finally came to encodings with two absolutely separate
alphabets.

I think that there are not many such groups, though, and it is
possible to identify several "meta-alphabets". The meta-alphabets
have no defined rules for cross-sorting (unlike letters WITHIN one
meta-alphabet; you CAN sort English and German words together
and it will still make sense; sorting Russian and English together
is at best useless). It increases the number of codes, but not
as drastically as codifying languages; there are hundreds of
languages based on a dozen meta-alphabets.

>The fact that all character sets do not occur in their local lexical order
>means that a particular character can not be identified as to language by
>its ordinal value.  This is a small penalty to pay for the vast reduction
>in storage requirements between a 32-bit and a 16-bit character set that
>contains all required glyphs.

Not true. First of all, nothing forces anyone to use a 32-bit
representation where only 10 bits are necessary.

So, as you see, Unicode is more a problem than a solution.
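To make points 1 and 2 above a bit more concrete, here is a rough
sketch in Python 3 (used only as a convenient notation). It assumes
the code values and case tables that ship with a current Python; the
helper names, the word list and the alphabet table are made up for
the illustration and come from no standard. It only shows that case
conversion and sorting end up driven by tables rather than by
arithmetic on code values.

```python
# Illustration only: case_offset, russian_key and the sample words are
# hypothetical names and data chosen for this sketch.

def case_offset(upper):
    """Distance in code values between a capital letter and its lowercase form."""
    return ord(upper.lower()) - ord(upper)

# Point 1: there is no single arithmetic rule for upper -> lower case.
offsets = {
    "Latin T": case_offset("T"),
    "Cyrillic TE": case_offset("\u0422"),
    "Cyrillic IO": case_offset("\u0401"),
    "Latin A-macron": case_offset("\u0100"),
}

# Point 2: sorting by code value is not alphabetical order.
# Build the Russian lowercase alphabet as an explicit table:
# U+0430..U+044F is a..ya without yo; yo (U+0451) belongs after ye.
alphabet = [chr(c) for c in range(0x0430, 0x0450)]
alphabet.insert(6, "\u0451")
RANK = {letter: i for i, letter in enumerate(alphabet)}

def russian_key(word):
    """Sort key from the alphabet table (assumes Russian-only input)."""
    return [RANK[ch] for ch in word.lower()]

words = [
    "\u0416\u0443\u043a",   # "zhuk"
    "\u0401\u0436",         # "yozh"
    "\u0415\u043b\u044c",   # "yel'"
]
by_code_value = sorted(words)                 # yozh, yel', zhuk
by_alphabet = sorted(words, key=russian_key)  # yel', yozh, zhuk

print(offsets)
print(ascii(by_code_value))
print(ascii(by_alphabet))
```

The offsets come out as 32, 32, 80 and 1, so a fixed "add 32" rule
already breaks inside a single script, and the two sorted lists
disagree on where "yozh" belongs. The key function also has to be
told which alphabet table to use, which is the off-text language
information the argument is about.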
The fundamental idea is simply wrong -- it is inadequate for anything
except Latin-based languages. No wonder we're hearing that Unicode
is US-centric.

Unfortunately Unicode looks like a cool solution to people who
never did any real localization work, and i fear that this
particular mistake will be promoted as a standard, presenting us
with a new round of headaches. It does not remove the necessity to
carry off-text information (like "X-Language: english"), and that
makes it no better than the existing ISO 8-bit encodings (if i know
the language i already know its alphabet -- all extra bits are simply
wasted; and programs handling Unicode text have to know the language
for the reasons stated before).

UNICODE IS A *BIG* MISTAKE.

(Don't get me wrong -- i'm for a universal encoding; it's just
that particular idea of unique glyphs that i strongly oppose).

--vadim