Received: by minnie.vk1xwt.ampr.org with NNTP id AA6105 ; Mon, 04 Jan 93 22:09:28 EST
Xref: sserve comp.unix.bsd:9671 comp.std.internat:1617
Newsgroups: comp.unix.bsd,comp.std.internat
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: INTERNATIONALIZATION: JAPAN, FAR EAST
Message-ID: <1993Jan7.045612.13244@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <2628@titccy.cc.titech.ac.jp>
Date: Thu, 7 Jan 93 04:56:12 GMT
Lines: 226

Before I proceed, I will [ once again ] remove the "dumb Americans" from
my original topic line.  You and I both know the original attribution was
by way of a query by an American, and was not intended to be reintegrated
into the main thread.

In article <2628@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
>In article <1993Jan5.090747.29232@fcom.cc.utah.edu>
> terry@cs.weber.edu (A Wizard of Earth C) writes:
>
>>>>This I don't understand. The maximum translation table from one 16 bit value
>>>>to another is 16k.
>>>
>>>WHAAAAT? It's 128KB, not 16k.
>>
>>Not true for designated sets, the maximum of which spans a 16K region.
>
>Uuuurrrghhhh!!! Ia Ia!
>
>>128k is only necessary if you are to include translation of all characters;
>>for a translation involving all characters, no spanning set smaller than the
>>full set exists. Thus a 128k translation table is [almost] never useful.
>
>Then, don't say 16 bit. If you say 16 bit, it IS 128K.

It is still a translation of one 16 bit value to another.  It is *not* an
*arbitrary* translation we are talking about, since the spanning sets will
be known.
>>>>This means 2 16k tables for translation into/out of
>>>>Unicode for Input/Output devices,
>
>>In this particular case, I was thinking an ASCII->Unicode or Unicode->ASCII
>>translation to reduce the storage penalty paid by Western languages for
>
>ASCII <-> Unicode translation does not need any tables. PERIOD.

Sorry; I misspoke (mistyped?) here.  I meant to refer to any arbitrary
8-bit set for which a localization set is available (for example, an
ISO 8859-x set).  I did not mean to limit the scope of the discussion to
ASCII.

For any set containing more than 256 characters, raw representation in
16-bit form is preferable, as it saves space (an 8-bit representation is
not attainable for such a large glyph-set character set).  This would
*not* result in direct Unicode storage of the large glyph-set characters,
since a language attribution is still necessary.

>>>How can you "cat" two files with different file attributes?
>>
>>By localizing the display output mechanism.
>
>Wow! Apparently he thinks "cat" is a command to display content of
>files. No wonder he think file attributes good.

Obviously, by this response, you meant "cat two files to a third file"
rather than what you stated, which would have resulted in the files going
to the screen.  Display device attribution based on supported character
sets has been well discussed, hopefully to both our satisfaction.
"cat" command pseudo-code in a language attributed file environment:

	[ cat f1 f2 > f3 ]

	Shell opens output file		(default attribute: Unicode)
	Open first file			(attribute: Japanese)
	Read data out			(as Unicode data)
	Write Unicode data to stdout
	Close first file
	Open second file		(attribute: Japanese)
	Read data out			(as Unicode data)
	Write Unicode data to stdout
	Close second file
	Shell closes output file	(ala "cat" program exit)

Obviously, what you are asking is "how do I make two monolingual/
bilingual/multilingual files of different language attribution into a
single bilingual/multilingual file using cat" -- not the question as you
have phrased it, nor as I have answered it, but in the context of the
discussion, clearly the intended tack.  Rather than pretending I don't
know what you are getting at, I will answer the target question to avoid
several more cryptic postings on your part as you slowly work towards
the point.

The answer is "you don't use 'cat'".  The "cat" command does not deal
with attribution of files, because its redirected output is arbitrarily
assigned, and because the input files are processed sequentially.

I could argue for open descriptor type attribution and parallelization
of the open mechanism, such that all stream types addressed are known to
the "cat" program, since attribution of an inode implies that attribution
can be directly applied to an open in-core file descriptor for an
instance of a pipe on either end, and to input and output redirection.
This defeats the cat program somewhat, and it is still possible to
contrive examples where this would fail:

	#
	# Contrived example 1.0
	#
	for i in f1 f2 f3
	do
		cat $i
	done | cat > f4

In this case it would not be possible to provide the parallelization I
spoke of above (no doubt you will include this construct in all your
shell scripts from this point on if you read no further).
The correct approach is to note that Unicode does not provide a mechanism
directly for language attribution, and that file attribution is only a
partial solution, since it does not deal with multilingual texts.  While
this has already been discussed to death, I will answer in good faith.

What this means is that all files which are multilingual in nature
require a compound document architecture.  Whether this is done using
"bit stuffing" (a "magic cookie" which can not occur in normal text files
is allocated a code space in the Unicode character set as a non-spacing
language designation character; yes, this means that the data within the
files is stateful), or by some other mechanism implementing what the
Unicode committee unfortunately called "Fancy text", what most people
call "Rich text", or what Adobe calls a "Page Description Language", is
irrelevant.

What this means is that a utility to combine documents (let's call it
"combine") must have the ability to generate either language attributed
files (if the source files are all of a single language attribution) or
our default compound document format (TBD).  This also means that the
only difference between its operation and that of "cat" is that it must
be unable to accept input piped to it, or contrived in any other way to
fool such a utility.  The solution to this may be documentation of
"undefined behaviour" (as ANSI is fond of saying about the C language)
rather than specific limitations on its use.  The ability to redirect or
pipe output is predicated on the ability to attribute existing files
and/or pipes by language as well; if this is not allowed, then it would
be necessary to specify a direct output file (most likely as the first
argument, like the "tar" command).  Sufficient definition of "undefined"
behaviour could allow us to rename "combine" back to "cat" if we wished
to risk it.
Thus, to combine two arbitrary files to produce a third without the user
worrying about language attribution:

	combine f1 f2 > f3

Attribution of output and clever construction of our output device
drivers would even allow us to switch fonts as dictated by the compound
document architecture controls embedded in the file and/or the
attribution of the file descriptor (the absence of such attribution being
an indicator of a compound document).  Thus a modified "more" could:

	more f3

successfully, and a modified "lpr" could:

	lpr f3

successfully.  In the case of two monolingual documents of the same
attribution, the standard "cat" command would produce an output file
missing language attribution (unless both files were compound documents,
and there was no aspect of the compounding mechanism causing the files to
be unable to be simply conjoined).

A POSSIBLE OPTIMIZATION OF STORAGE FOR SMALL GLYPH-SET LANGUAGES.

As a possible optimization to reduce storage costs, we add to the
existing list of potential language attributes for a file, which are:

	Compound Document
	Raw Unicode
	<Specific Language>	(this may take on many values)

To these we add:

	<Language Class>

Reduction to a language class is done by batch processing of files when
the system is idle, both to reduce storage costs and to provide
attribution.  For instance, if I have a file which contains US ASCII,
French, and German, a spanning reduction of the language attribute of the
file from Raw Unicode (the result of the standard "cat" output) to a
language class of "ISO Latin-1" is possible, thus reducing storage
requirements.  For some languages, a 100 percent certain reduction is
possible; for others, such as US ASCII, French, Japanese, or Chinese, it
is not.  In these cases, the system locale can be used to provide a
"strong hint".
Ideally, none of this would be necessary, since we will have replaced the
standard cat, and all non-international user code will be specific to the
locale as long as output of a stream with multiple potential source
languages is restricted.  In a monolingual environment, this will be
possible.  A strict/nonstrict locale switch is one potential solution to
this problem; this would cause instant reduction to the locale language
in the strict case.

A direct language attribute reduction from Unicode to the language class
ISO Latin-1 will result in a halving of file storage requirements.  Note
that a reduction from Raw Unicode to Compound Document is possible based
on our "magic cookie" scenario above.

Admittedly, it is possible to misreduce a file if its contents are not
specific to a particular language or language class.  In this case, no
harm is done directly apart from misattribution of the file; the file can
be considered to have been migrated to a slightly slower storage.  Any
additions to the file not fitting into the language class will cause a
remigration back to the Raw Unicode class; hopefully this would be the
minority case, since most utilities will already be internationalized,
and there should be intrinsic internationalization of new utilities.

Note that since this is simply a storage compaction optimization, there
is no explicit reason that it could not be switchable as to whether or
not it is enforced, at the system administrator's discretion.

Does this answer your "cat" question sufficiently?  The problem seemed to
be that, from your point of view, there was no way around the problem.

					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present or
previous employers.
--
-------------------------------------------------------------------------------
			"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------