Proposed alternative to JEIDA proposal

To: "X3J13: Character Subcommittee" <cl-natural-languages@sail.stanford.edu>
Subject: Proposed alternative to JEIDA proposal
From: Masinter.pa@Xerox.COM
Date: 13 Nov 87 17:22 PST
Cc: masinter.pa@Xerox.COM
Since it will be up for discussion, I took the time to write up the
variant I had envisioned for the JEIDA proposal. I started with the
JEIDA proposal as forwarded by M. Ida. I've written this by editing the
JEIDA proposal and adding a Q&A section at the end.

I added several sections in {braces} which are my comments about what
edits I did to the JEIDA proposal and why. 

This is still very much a rough draft, and not a formal proposal, but I
thought this might make clearer what I had in mind.

-------------------- Beginning of text --------------------
1. Hierarchy of characters and strings

Let the value of char-code-limit be large enough to include all
characters.

	char >= string-char >= standard-char

{I removed the internal-string-char (which was probably intended to be
internal-thin-char) and changed the > to >=, since we are not requiring
there to be more char than string-char.}

	string >= simple-string >= simple-standard-string

	string = (or (vector standard-char) (vector string-char))

{Rather than introduce a new type of "thin" character, merely use
standard-char. Algorithms that want to assert that their elements are
all "thin" can most likely also assert that they are all "standard".}

Type (vector standard-char) and (vector string-char) are disjoint or
identical.

simple-string = (or (simple-array standard-char (*))
			     (simple-array string-char (*)))


	notes:	A > B means B is a subtype of A,
		A >= B means B is a subtype of A or B is equal to A.
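
As a rough illustration of the relations above, the following sketch
asks a present-day Common Lisp about them. It assumes an ANSI
implementation, where string-char no longer exists, so the type
character stands in for it here. The names proposal-string and
same-representation-p are hypothetical and not part of the proposal.

```lisp
;; Assumption: CHARACTER stands in for the proposal's STRING-CHAR,
;; which ANSI Common Lisp does not define.
(deftype proposal-string ()
  '(or (vector standard-char) (vector character)))

(defun same-representation-p ()
  "True if the two vector types are identical in this implementation,
false if they are disjoint."
  (equal (upgraded-array-element-type 'standard-char)
         (upgraded-array-element-type 'character)))

;; char >= standard-char: STANDARD-CHAR is a subtype of CHARACTER.
(subtypep 'standard-char 'character)

;; An ordinary string satisfies the union type.
(typep "abc" 'proposal-string)
```

Whether same-representation-p returns true is implementation dependent,
which is exactly the "disjoint or identical" latitude described above.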

2. Print width

Only standard characters are required to have fixed-pitch print width.

{I removed WRITE-WIDTH; see notes below.}


3. Functions

Functions dealing with strings should work as before, except ones which
would change the contents of a simple-standard-string to characters that
are not standard-char.

{removed "internal-thin" terminology}

Functions producing strings should create (vector string-char), rather
than any more restricted type, unless a more restricted type is
explicitly specified.

Functions comparing strings should compare them elementwise. Therefore it
is possible that a (vector string-char) is equal to a (vector
standard-char).
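
A minimal sketch of elementwise comparison across the two element
types, assuming an ANSI Common Lisp in which character stands in for
string-char. The function name is hypothetical.

```lisp
(defun mixed-type-strings-equal-p ()
  "Compare a thin string and a general string with the same contents."
  (let ((thin (make-array 3 :element-type 'standard-char
                            :initial-contents "abc"))
        ;; Assumption: CHARACTER plays the role of STRING-CHAR.
        (wide (make-array 3 :element-type 'character
                            :initial-contents "abc")))
    ;; Both predicates look at the elements, not the array element type.
    (and (string= thin wide)
         (equal thin wide))))
```

The result is true even in implementations where the two arrays have
different internal representations.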

{revise terminology}


1. A proposal for embedding multi-byte characters

In order to decide on a final proposal, we chose essential and desirable
characteristics of a working multi-byte character system. Chapter 2
describes these characteristics in some detail.

Chapter 3 describes additional features to Common Lisp which will be
useful not just for multi-byte characters, but also for many other kinds
of character sets. This chapter describes internal data structures. If
this proposal is accepted in Common Lisp, it will be easy for countries
to add their own mechanisms.

Chapter 4 describes proposed changes to @I[Common Lisp -- The Language]
(CLtL).

2. Additional features for embedding multi-byte characters.

This chapter describes design principles which can be used to design
multi-byte character language extensions to Common Lisp.

There are many programming languages which can use multi-byte
characters. Most of them can use multi-byte characters as string
character data but not in variable or function names.

It is necessary for programming languages like Lisp that use symbolic
data to be able to process not only single-byte characters but also
multi-byte characters. That is, it should be possible to use multi-byte
characters in character strings and symbols, and it must be possible to
store both kinds of characters in them.

Treating multi-byte characters just like other alphanumeric characters
means that a multi-byte character must be treated as a single character
object. Many of the present implementations of Lisp treat multi-byte
characters as pairs of bytes. Alternatively, they use a different data
type which doesn't permit multi-byte characters to be mixed with
standard characters. Such systems are not useful for users.

Thus, the basic design principles for embedding multi-byte characters
into Common Lisp are:

* A multi-byte character should be treated like a single-byte character;
that is, a multi-byte character is one character object.

* A program which was coded without explicit attention to multi-byte
characters should handle multi-byte character data as is.

* The performance of the system in terms of CPU and memory utilization
should not be considerably affected in programs which do not use
multi-byte characters.


3.  Implementation notes:

This section describes the implementation of multiple character sets in
Common Lisp. 

To treat multi-byte characters like single-byte characters, the
multi-byte character must be included in the set of possible character
codes.

Add multi-byte characters by setting the variable char-code-limit to a
large number.

The single-byte character set and the multi-byte character set must be
ordered into a single sequence of character codes. This means the
multi-byte character set must not overlap with the single-byte character
set.

It is possible to use multi-byte characters with fonts in Common Lisp,
and operations that work for single-byte characters will also work for
multi-byte characters without any change.

Alone, this implementation method could have problems with efficiency.
If the value of a character code does not fit in 1 byte (multi-byte
characters are in this category), memory utilization is affected: even a
string containing only one single-byte character is 2 bytes long. The
same problem would also occur with symbol p-names. If we can solve the
problem for strings, we can solve the other problems, so we will start
by considering only strings.

To avoid this memory utilization problem, it is possible to optimize
single-byte character strings by packing them internally. In other
words, the implementation has two kinds of data types but does not show
this to the user. There is only one type of data from the viewpoint of
users, which means that every function which uses strings will continue
to work as defined.
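
The packing idea can be sketched at user level as follows. This is only
an illustration under assumptions: it is written for an ANSI Common
Lisp, character stands in for string-char, and make-packed-string is a
hypothetical name. A real implementation would make this choice
internally rather than in user code.

```lisp
(defun make-packed-string (chars)
  "Return a string holding CHARS, thin when every element allows it."
  (let ((element-type (if (every #'standard-char-p chars)
                          'standard-char   ; thin, packed representation
                          'character)))    ; general representation
    (make-array (length chars)
                :element-type element-type
                :initial-contents chars)))
```

Either result is a string, so callers that only use string functions
need not know which representation was chosen.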

This can be implemented almost everywhere without high cost. The
only problem occurs when a function attempts to put a multi-byte
character into an optimized and packed single-byte-only string. To work
according to the definition, the implementation must unpack the original
packed string. This presents an implementation inefficiency which the
user may find undesirable.
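
The unpacking cost can be seen in a user-level sketch. It assumes an
ANSI Common Lisp with character standing in for string-char, and
store-char is a hypothetical name. Unlike a transparent implementation,
this sketch returns a new string when it has to widen, so the copy that
the text calls an inefficiency is explicit.

```lisp
(defun store-char (string index char)
  "Return a string like STRING with CHAR at INDEX.
STRING is modified in place when it can hold CHAR; otherwise a wider
copy is made first."
  (if (typep char (array-element-type string))
      (progn (setf (char string index) char)
             string)
      ;; Unpack: copy every element into a general string.
      (let ((wide (make-array (length string)
                              :element-type 'character
                              :initial-contents string)))
        (setf (char wide index) char)
        wide)))
```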

For this reason, the implementation allows (array standard-char (*)) and
(simple-array standard-char (*)) (along with simple-standard-string) as
types so that users can construct and manipulate strings that are
guaranteed not to require multiple bytes to represent.

This proposal has only three named string types. (Implementations may
add other string types between these, but they are implementation
dependent.)
In particular, since string = (or (array string-char (*)) (array
standard-char (*))), implementations may have distinct representations
for (array string-char (*)) and (array standard-char (*)), or those
arrays may be the same. The named types are: 
 
string (the most general)

simple-string (cannot be displaced and does not have a fill pointer, but
can contain multi-byte characters)

simple-standard-string ("it is an error" to attempt to store a character
that is not a standard-character in a simple-standard-string. These
strings are thus guaranteed to require only one byte because there are
not many standard characters.)

The data type hierarchy for character remains unchanged. The type
hierarchy for string is shown in figure 1.

Fig-1.a  Structure of character type
				character
				    |
			     string-char
				    |
			     standard-char


Fig-1.b  Structure of string type:

string = (or (array string-char (*)) (array standard-char (*)))
  |
simple-string = (or (simple-array string-char (*)) (simple-array
standard-char (*)))
  |
simple-standard-string = (simple-array standard-char (*))


Either simple-string = simple-standard-string or they are disjoint.


The same character is the same object regardless of whether it is found
in a simple-standard-string or a normal string.

Next we must discuss character input. The proposal does not discuss what
is stored in files, nor what happens between the Lisp implementation and
a terminal. Each system will implement this in its own way.  Instead,
let us discuss the data as passed to Lisp programs. We think that
treating all input data as strings is the safest possible course. Since a
symbol's p-name string should not be modified, it can be optimized.

For implementations or programs that know that they are only
manipulating standard characters, the stream can be opened with an
element-type of standard-char.

{I removed *read-default-string-type*; it is poor design because it is a
dynamic property rather than a stream property. Whether strings should
be simple-standard-string or standard-string should depend on the
element type of the stream you are reading from. If it is string-char,
then read can produce simple-string. If it is standard-char,
read can produce simple-standard-string.}

4. Proposed changes to CLtL to support multiple character sets.

This section lists proposed modifications to CLtL.  Only additional and
modified parts are specified.  Those portions which are not mentioned
are unchanged.

Section 2.5.2 Strings:

"a string is a specialized vector .... type string-char"
		=>
"a string is a specialized vector .... type string-char or
@B[standard-char]"


Section 2.15 Overlap,Inclusion and Disjointness of Types:

{No longer need any changes to the character type descriptions.}

Add the following :
    
Type simple-standard-string is a subtype of vector because
simple-standard-string means (simple-array standard-char (*)).

The description of type string is changed to:

Type string is a subtype of vector because string means (or (vector
string-char) (vector standard-char)).  Type (vector string-char) and
@B(vector standard-char) are disjoint or equal.

The description of type simple-vector, simple-string ... is changed to:

Types simple-vector, simple-string and simple-bit-vector are disjoint
subtypes of simple-array because they mean (simple-array t (*)), (or
(simple-array string-char (*)) (simple-array standard-char (*))) and
(simple-array bit (*)), respectively.

add the following:

Type simple-standard-string means (simple-array standard-char (*)). 

Type (simple-array string-char (*)) and (simple-array standard-char (*))
are disjoint or equal.

Section 4.1 Type Specifier Symbols:

Add the following to the system-defined type specifiers:

simple-standard-string

Section 4.5 Type Specifiers That Specialize

"The specialized types (vector string-char) ... data types."
					=>
"The specialized types (or (vector standard-char) (vector string-char))
and (vector bit) are so useful that they have the special names string
and bit-vector.  Every implementation of Common Lisp must provide
distinct representation for string and bit-vector as distinct
specialized data types."

Section 13.2 Predicates on Characters

graphic-char-p char			[function]

"graphic characters of font 0 are all of the same width when printed" =>
"standard-char without #\Newline of font 0 are all of the same width
when printed".

alpha-char-p char			[function]
   only standard characters are alpha-char-p
upper-case-p char			[function]
   only standard characters are upper-case-p
lower-case-p char			[function]
   only standard characters are lower-case-p

both-case-p char			[function]
   only standard characters are both-case-p

digit-char-p char &optional (radix 10)			[function]
   only standard characters are digit-char-p

alphanumericp char			[function]
   only standard characters are alphanumericp


Chapter 18 Strings

"the type string is identical ... (array string-char (*))."
				=>
"the type string is identical to the type (or (vector standard-char)
(vector string-char)), which in turn is the same as (or (array
standard-char (*)) (array string-char (*)))."

Section 18.3 String Construction and Manipulation

make-string size &key :initial-element			[function]

add:

To make a simple-standard-string, use make-array or make-sequence.
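
For example, either of the following would do. The function names are
hypothetical, and the sketch assumes an ANSI Common Lisp, where the
type simple-standard-string itself is not defined, so the array type
is spelled out.

```lisp
(defun make-thin-string-with-make-array (size)
  "A simple one-dimensional array of standard-char, via MAKE-ARRAY."
  (make-array size :element-type 'standard-char
                   :initial-element #\Space))

(defun make-thin-string-with-make-sequence (size)
  "The same kind of array, via MAKE-SEQUENCE."
  (make-sequence '(simple-array standard-char (*)) size
                 :initial-element #\Space))
```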


   
Section  22.2.1 Input from Character Stream

Add a note that the stream-element-type of a stream is used to determine
the element-type of string elements that are read. 
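
One way to picture this rule is the sketch below, written for an ANSI
Common Lisp. The function names are hypothetical, and the reader itself
is not changed; the code only shows how a result type could be derived
from stream-element-type.

```lisp
(defun string-type-for-stream (stream)
  "Choose a result string type from STREAM's element type."
  (if (subtypep (stream-element-type stream) 'standard-char)
      '(simple-array standard-char (*))   ; thin strings suffice
      'simple-string))                    ; general strings needed

(defun read-line-as-stream-type (stream)
  "Read one line and coerce it to the type chosen for STREAM."
  (coerce (read-line stream) (string-type-for-stream stream)))
```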

Section 22.3.1 Output to Character Stream

{Do not add write-width. This does not belong here. There are many
"run-coded" external character representations where the write-width of
a string or character depends on the characters that precede it. Note
that the number of bytes written to a stream by write-char may vary
depending on the system or the stream.}

Appendix Proposed Extended character processing facilities for Common
Lisp.

{I've attempted to extend this section to include all languages and not
just Japanese.}


char-code-limit 			[Constant]

The value of char-code-limit should be large enough to include all JIS
and/or ISO TC97/SC18/WG8 and/or  ISO SC2/WG2 characters. char-code-limit
= 65536 is large enough to meet these purposes currently.  Other
character encodings are possible.

13.2. Predicates on Characters

standard-char-p char 			[Function]

Return nil for all Japanese characters, all Cyrillic, Greek, etc.
characters. (That is, only the characters specified in CLtL are
standard-char-p.)
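
A short sketch of how a program might use this to check that a string
is "thin". The helper name is hypothetical.

```lisp
(defun standard-chars-only-p (string)
  "True if every character of STRING is a standard character."
  (every #'standard-char-p string))

;; A Latin letter is a standard character.
(standard-char-p #\a)
```

In an implementation whose character codes follow Unicode (an
assumption, not something the proposal requires), (standard-char-p
(code-char #x3042)) would be false, since that code is a hiragana
character.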
	
graphic-char-p char 			[Function]

Return t for all characters that have a printable representation in the
encoding in use, including Japanese characters, etc. The predicate
depends only on the character encoding standard used, rather than the
capabilities of any particular printer or output device. Implementations
may choose to also provide additional functions which are able to query
output devices to determine their character representation, but
graphic-char-p has no such capability.


alpha-char-p char 			[Function]

Return NIL for all characters except the alpha-characters of
standard-char. This means that alpha-char-p is portable, although of
limited use in non-standard-char applications.
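
Some current implementations return true from alpha-char-p for letters
outside the standard set, so the proposed behavior can be sketched as a
wrapper. The name proposal-alpha-char-p is hypothetical.

```lisp
(defun proposal-alpha-char-p (char)
  "ALPHA-CHAR-P restricted to standard characters, as proposed."
  (and (standard-char-p char)
       (alpha-char-p char)))
```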

@newpage

{I removed jis-char-p from the proposal because it is encoding specific.
I removed japanese-char-p temporarily because I didn't understand its
use. Perhaps JIS WG can give an example of when japanese-char-p might be
used? Similarly kanji-char-p? Part of the problem is that any one
encoding might over time acquire additional kanji-char-p characters as
part of internal use or representation of names. }
 

kanji-char-p char			[Function]
The argument char has to be a character type object. kanji-char-p is true
if the argument is a kanji character within the encoding of the system.

hiragana-char-p char			[Function]
The argument char has to be a character type object. hiragana-char-p is
true if the argument is one of the 83 hiragana characters in JIS
C6226 (3.1.4), the hiragana repeat symbol, or dakuten, for a total of 85
characters.

katakana-char-p char			[Function]

The argument char has to be a character type object. katakana-char-p is
true if the argument is one of the 86 katakana characters in JIS
C6226 (3.1.5), the long-sound symbol, the katakana repeat symbol, or
katakana-dakuten, for a total of 89 characters that also satisfy
jis-char-p.

kana-char-p char			[Function]
Equivalent to (or (hiragana-char-p char) (katakana-char-p char)).
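
A sketch of the three kana predicates, under a loudly stated
assumption: char-code is taken to return Unicode code points, whereas
the proposal counts characters in JIS C6226. The ranges below are the
Unicode positions of what appear to be the same characters, and
treating the dakuten as shared between the two scripts is a guess.

```lisp
;; Assumption: CHAR-CODE returns Unicode code points.
(defun hiragana-char-p (char)
  (let ((code (char-code char)))
    (or (<= #x3041 code #x3093)   ; the 83 hiragana
        (= code #x309D)           ; hiragana repeat symbol
        (= code #x309B))))        ; dakuten

(defun katakana-char-p (char)
  (let ((code (char-code char)))
    (or (<= #x30A1 code #x30F6)   ; the 86 katakana
        (= code #x30FC)           ; long-sound symbol
        (= code #x30FD)           ; katakana repeat symbol
        (= code #x309B))))        ; dakuten, assumed shared

(defun kana-char-p (char)
  (or (hiragana-char-p char)
      (katakana-char-p char)))
```

With these ranges the predicates accept 85 and 89 characters
respectively, matching the totals given above.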


char= character &rest more-characters			[Function]
char/= character &rest more-characters			[Function]
char< character &rest more-characters			[Function]
char> character &rest more-characters			[Function]
char<= character &rest more-characters			[Function]
char>= character &rest more-characters			[Function]

The ordering of hiragana, katakana, kanji follows the ordering in the
character encoding chosen, e.g. (char< x y) is exactly the same as (<
(char-int x) (char-int y))
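
The stated equivalence can be checked for any pair of characters with a
small helper (the name is hypothetical):

```lisp
(defun char-order-matches-codes-p (x y)
  "True if CHAR< and < on CHAR-INT agree for X and Y."
  (eq (and (char< x y) t)                    ; normalize to T or NIL
      (< (char-int x) (char-int y))))
```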
   

13.4 Character Conversions

char-upcase char			[Function]
char-downcase char			[Function]

These return the argument unchanged if it does not satisfy alpha-char-p
or is not a standard-char.
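
As a sketch, the proposed behavior of char-upcase could be expressed
like this (proposal-char-upcase is a hypothetical name; char-downcase
would be analogous):

```lisp
(defun proposal-char-upcase (char)
  "Upcase CHAR only if it is a standard alphabetic character."
  (if (and (standard-char-p char)
           (alpha-char-p char))
      (char-upcase char)
      char))   ; everything else is returned unchanged
```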

!
Some questions and my answers:


Q. Are characters with different codes always syntactically distinct?

A. Yes.

Q. Can the standard character #\( have two different codes,
corresponding, for example, to two different external file system
representations of that character?  

A. No. READ and READ-CHAR translate the external file system
representations into a single consistent internal character
representation. A Common Lisp implementation can support multiple
external file system representations either by additional stream
properties (e.g., new keyword arguments to OPEN in addition to
ELEMENT-TYPE) or by accessors on character streams.

A Lisp program can deal explicitly with character set conversions by
using READ-BYTE and INT-CHAR or MAKE-CHAR.
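
A minimal sketch of such an explicit conversion. INT-CHAR and MAKE-CHAR
are CLtL functions that ANSI Common Lisp does not provide, so code-char
stands in for them here. The external format, two bytes per character
with the high byte first, is purely hypothetical, and the function
works on a vector of octets such as one collected with READ-BYTE.

```lisp
(defun decode-two-byte-codes (octets)
  "Turn a vector of octets, two per character, into a string."
  (let ((result (make-string (floor (length octets) 2))))
    (loop for i from 0 below (length result)
          for high = (aref octets (* 2 i))
          for low = (aref octets (1+ (* 2 i)))
          ;; Assumption: the internal code is simply high*256 + low.
          do (setf (char result i) (code-char (+ (* 256 high) low))))
    result))
```

A real converter would also have to decide what to do with codes that
have no corresponding character object.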

 
Q. Can two different string-chars have the same print glyph, '(' for
example, but different syntactical properties?

A. Yes. This is consistent with other ISO character standards; for
example, some character representations separate the hyphen, dash,
em-dash and en-dash, yet in some printed representations they have the
same print glyph.

Q. Is it allowable to map both of these sets of codes into the one
internal Lisp character code set when inputting data to Lisp, and adopt
our own conventions for translating output back to single and double
byte? Is it possible for an implementation with 2-byte codes to map some
2-byte character codes and some 1-byte character codes in system files
onto the same set of 2-byte internal codes for the standard characters
when read into Lisp?

A. Yes. READ-CHAR and WRITE-CHAR may do an arbitrary amount of
processing to actually read or write a character object from or to a
file. Explicit run-coding, two-byte codes, one-byte codes with an
external map of coding schemes, etc. are all allowable and
implementation dependent. It is recommended that the handling of
external coding and the type of external coding used be specified by
programmers through extra optional keywords to OPEN.
	

Q. If the character object print syntax "#\a" or "#\A" is read from a
file, is alpha-char-p true

          1. if the 'a' had been encoded as a single byte?
          2. if the 'a' had been encoded as a double byte?
          3. if the 'A' had been encoded as a single byte?
          4. if the 'A' had been encoded as a double byte?

A. #\ can be thought of as operating by performing READ-CHAR; READ-CHAR
hides the encoding of the character, so that #\a and #\A have the same
semantics no matter what the file encoding was.

Q. Even if the Lisp system supports a large character set, only standard
characters have, as a default, non-constituent syntax type, constituent
character attributes relevant to parsing of numbers and symbols, or
defined syntax within a format control string. Correct?

A. False. Readtables should allow any character of type STRING-CHAR
to have a syntax class, and format strings can contain any character of
type STRING-CHAR.

Q. If a Lisp system supports a large character code set, need it allow
every character of type string-char to have a non-constituent syntax
type defined in the readtable, or is the proposal's default that only
standard characters need be represented in the readtable?

A. CLtL says (22.1.5, page 360):
"every character of type string-char must be represented in the
readtable." The members felt that, since we extended the definition of
string-char to include Japanese characters, a natural interpretation of
CLtL is that the readtable must have more than 64k 'logical' entries. A
hash table works well.
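
The hash table idea might look like the sketch below. It is not a real
readtable: the variable and function names are hypothetical, and the
choice of :constituent as the default class for characters without an
entry is an assumption made for illustration.

```lisp
(defvar *syntax-classes* (make-hash-table)
  "Sparse map from character to syntax class.")

(defun syntax-class (char)
  "Syntax class of CHAR; characters with no entry are constituents."
  (values (gethash char *syntax-classes* :constituent)))

(defun (setf syntax-class) (class char)
  (setf (gethash char *syntax-classes*) class))
```

Only characters given a non-default class occupy storage, so every
character has a 'logical' entry without a table of more than 64k slots.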

Q. A specific case related to the previous question: suppose #\% were a
non-standard character, but still a string-char in some implementation
of Lisp.  Is

           (make-dispatch-macro-character #\%)

necessarily permitted in every implementation that supports #\% as a
string-char?

A. Yes.
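
For illustration, here is what that looks like in an ANSI Common Lisp,
where #\% happens to be a standard character, so the sketch only shows
the mechanics and not the non-standard case asked about. The function
name and the sub-character #\! are hypothetical choices.

```lisp
(defun percent-dispatch-readtable ()
  "Return a copy of the standard readtable in which % dispatches."
  (let ((table (copy-readtable nil)))
    (make-dispatch-macro-character #\% nil table)
    (set-dispatch-macro-character
     #\% #\!
     (lambda (stream sub-char arg)
       (declare (ignore sub-char arg))
       ;; Read the next object and tag it.
       (list :shouted (read stream t nil t)))
     table)
    table))
```

Binding *readtable* to the result and reading "%!hello" yields a list
whose first element is :shouted and whose second is the symbol read
after the dispatch characters.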


Q. What about the efficiency of standard non-simple strings? Don't they
take too much space to represent?

A. This proposal allows users to write programs that create only strings
that have only STANDARD-CHAR in them. These are the "thin" strings. It
is quite possible that such strings might contain other character codes
that are not standard-char, but these are not portably elements of a
"thin" string.