Received: by minnie.vk1xwt.ampr.org with NNTP id AA6105 ; Mon, 04 Jan 93 22:09:28 EST
Xref: sserve comp.unix.bsd:9671 comp.std.internat:1617
Newsgroups: comp.unix.bsd,comp.std.internat
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: INTERNATIONALIZATION: JAPAN, FAR EAST
Message-ID: <1993Jan7.045612.13244@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <2615@titccy.cc.titech.ac.jp> <1993Jan5.090747.29232@fcom.cc.utah.edu> <2628@titccy.cc.titech.ac.jp>
Date: Thu, 7 Jan 93 04:56:12 GMT
Lines: 226

Before I proceed, I will [ once again ] remove the "dumb Americans" from
my original topic line.  You and I both know the original attribution was
by way of a query by an American, and was not intended to be reintegrated
into the main thread.

In article <2628@titccy.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp
(Masataka Ohta) writes:
>In article <1993Jan5.090747.29232@fcom.cc.utah.edu>
> terry@cs.weber.edu (A Wizard of Earth C) writes:
>
>>>>This I don't understand. The maximum translation table from one 16 bit value
>>>>to another is 16k.
>>>
>>>WHAAAAT? It's 128KB, not 16k.
>>
>>Not true for designated sets, the maximum of which spans a 16K region.
>
>Uuuurrrghhhh!!! Ia Ia!
>
>>128k is only necessary if you are to include translation of all characters;
>>for a translation involving all characters, no spanning set smaller than the
>>full set exists. Thus a 128k translation table is [almost] never useful.
>
>Then, don't say 16 bit. If you say 16 bit, it IS 128K.

It is still a translation of one 16 bit value to another.  It is *not* an
*arbitrary* translation we are talking about, since the spanning sets will
be known.
>>>>This means 2 16k tables for translation into/out of
>>>>Unicode for Input/Output devices,
>
>>In this particular case, I was thinking an ASCII->Unicode or Unicode->ASCII
>>translation to reduce the storage penalty paid by Western languages for
>
>ASCII <-> Unicode translation does not need any tables. PERIOD.

Sorry; I misspoke (mistyped?) here.  I meant to refer to any arbitrary
8-bit set for which a localization set is available (for example, an
ISO 8859-x set).  I did not mean to limit the scope of the discussion to
ASCII.

For any set containing more than 256 characters, raw representation in
16-bit form is preferable, as it saves space (an 8-bit representation is
not attainable for such a large glyph-set character set).  This would
*not* result in direct Unicode storage of the large glyph-set characters,
since a language attribution is still necessary.

>>>How can you "cat" two files with different file attributes?
>>
>>By localizing the display output mechanism.
>
>Wow! Apparently he thinks "cat" is a command to display content of
>files. No wonder he think file attributes good.

Obviously, by this response, you meant "cat two files to a third file"
rather than what you stated, which would have resulted in the files going
to the screen.  Display device attribution based on supported character
sets has been well discussed, hopefully to both our satisfaction.
"cat" command pseudo-code in a language attributed file environment:

	[ cat f1 f2 > f3 ]

	Shell opens output file		(default attribute: Unicode)
	Open first file			(attribute: Japanese)
	Read data out			(as Unicode data)
	Write Unicode data to stdout
	Close first file
	Open second file		(attribute: Japanese)
	Read data out			(as Unicode data)
	Write Unicode data to stdout
	Close second file
	Shell closes output file	(ala "cat" program exit)

Obviously, what you are asking is "how do I make two monolingual/
bilingual/multilingual files of different language attribution into a
single bilingual/multilingual file using cat" -- not the question as you
have phrased it, nor as I have answered it, but in the context of the
discussion, clearly the intended tack.  Rather than pretending I don't
know what you are getting at, I will answer the target question to avoid
several more cryptic postings on your part as you slowly work towards
the point.

The answer is "you don't use 'cat'".  The "cat" command does not deal
with attribution of files, because its redirected output is arbitrarily
assigned, and because the input files are processed sequentially.

I could argue for open descriptor type attribution and parallelization
of the open mechanism, such that all stream types addressed are known to
the "cat" program, since attribution of an inode implies that attribution
can be directly applied to an open in-core file descriptor for an
instance of a pipe on either end, and to input and output redirection.
This defeats the cat program somewhat, and it is still possible to
contrive examples where this would fail:

	#
	# Contrived example 1.0
	#
	for i in f1 f2 f3
	do
		cat $i
	done | cat > f4

In this case it would not be possible to provide the parallelization I
spoke of above (no doubt you will include this construct in all your
shell scripts from this point on if you read no further).
The correct approach is to note that Unicode does not provide a mechanism
directly for language attribution, and that file attribution is only a
partial solution, since it does not deal with multilingual texts.  While
this has already been discussed to death, I will answer in good faith.

What this means is that all files which are multilingual in nature
require a compound document architecture.  Whether this is done using
"bit stuffing" (a "magic cookie" which can not occur in normal text files
is allocated a code space in the Unicode character set as a non-spacing
language designation character; yes, this means that the data within the
files is stateful), or by some other mechanism implementing what the
Unicode committee unfortunately called "Fancy text", what most people
call "Rich text", or what Adobe calls a "Page Description Language", is
irrelevant.

What this means is that a utility to combine documents (let's call it
"combine") must have the ability to generate either language attributed
files (if the source files are all of a single language attribution) or
our default compound document format (TBD).  This also means that the
only difference between its operation and that of "cat" is that it must
be unable to accept input piped to it, or contrived in any other way to
fool such a utility.  The solution to this may be documentation of
"undefined behaviour" (as ANSI is fond of saying about the C language)
rather than specific limitations on its use.  The ability to redirect or
pipe output is predicated on the ability to attribute existing files
and/or pipes by language as well; if this is not allowed, then it would
be necessary to specify a direct output file (most likely as the first
argument, like the "tar" command).  Sufficient definition of "undefined"
behaviour could allow us to rename "combine" back to "cat" if we wished
to risk it.
Thus, to combine two arbitrary files to produce a third without the user
worrying about language attribution:

	combine f1 f2 > f3

Attribution of output and clever construction of our output device
drivers would even allow us to switch fonts as dictated by the compound
document architecture controls embedded in the file and/or the
attribution of the file descriptor (the absence of such attribution being
an indicator of a compound document).  Thus a modified "more" could:

	more f3

successfully, and a modified "lpr" could:

	lpr f3

successfully.  In the case of two monolingual documents of the same
attribution, the standard "cat" command would produce an output file
missing language attribution (unless both files were compound documents,
and there was no aspect of the compounding mechanism causing the files to
be unable to be simply conjoined).

A POSSIBLE OPTIMIZATION OF STORAGE FOR SMALL GLYPH-SET LANGUAGES.

As a possible optimization to reduce storage costs, we add to the
existing list of potential language attributes for a file, which are:

	Compound Document
	Raw Unicode
	<Specific Language>	(this may take on many values)

To these we add:

	<Language Class>

Reduction to a language class is done by batch processing of files when
the system is idle, both to reduce storage costs and to provide
attribution.  For instance, if I have a file which contains US ASCII,
French, and German, a spanning reduction of the language attribute of the
file from Raw Unicode (the result of the standard "cat" output) to a
language class of "ISO Latin-1" is possible, thus reducing storage
requirements.  For some languages, a 100 percent certain reduction is
possible; for others, such as US ASCII, French, Japanese, or Chinese, it
is not.  In these cases, the system locale can be used to provide a
"strong hint".
Ideally, none of this would be necessary, since we will have replaced the
standard cat, and all non-international user code will be specific to the
locale as long as output of a stream with multiple potential source
languages is restricted.  In a monolingual environment, this will be
possible.  A strict/nonstrict locale switch is one potential solution to
this problem; this would cause instant reduction to the locale language
in the strict case.

A direct language attribute reduction from Unicode to the language class
ISO Latin-1 will result in a halving of file storage requirements.  Note
that a reduction from Raw Unicode to Compound Document is possible based
on our "magic cookie" scenario above.

Admittedly, it is possible to misreduce a file if its contents are not
specific to a particular language or language class.  In this case, no
harm is done directly apart from misattribution of the file; the file can
be considered to have been migrated to a slightly slower storage.  Any
additions to the file not fitting into the language class will cause a
remigration back to the Raw Unicode class; hopefully this would be the
minority case, since most utilities will already be internationalized,
and there should be intrinsic internationalization of new utilities.

Note that since this is simply a storage compaction optimization, there
is no explicit reason that it could not be switchable as to whether or
not it is enforced, at the system administrator's discretion.

Does this answer your "cat" question sufficiently?  The problem seemed to
be that, from your point of view, there was no way around the problem.

					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present or
previous employers.
--
-------------------------------------------------------------------------------
			"I have an 8 user poetic license" - me
Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------