raw-byte and char-table

* raw-byte and char-table
@ 2010-08-24  1:11 Kenichi Handa
  2010-08-24  3:06 ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2010-08-24  1:11 UTC (permalink / raw)
  To: emacs-devel

A char-table is a table indexed by character code.  So,
its 0xA0th element is the value for the character U+00A0.
Then, how do we set/get a value for raw-byte 0xA0?  Currently,
this is the way to do that:

  (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
  (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)

But, this is not documented.  Should we document it?
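
For illustration, a minimal sketch of the idiom above, assuming Emacs 23
or later, where raw bytes map to the codepoints #x3FFF80..#x3FFFFF.  The
function name `demo-raw-byte-slots' and the purpose symbol
`raw-byte-demo' are made up for this example and are not part of Emacs:

```elisp
;; Sketch: U+00A0 and raw byte #xA0 occupy different char-table slots.
(defun demo-raw-byte-slots ()
  "Return the values stored for U+00A0 and for raw byte #xA0."
  (let ((table (make-char-table 'raw-byte-demo)) ; hypothetical purpose symbol
        (raw (unibyte-char-to-multibyte #xA0)))  ; eight-bit char for byte #xA0
    (aset table #xA0 'no-break-space)            ; slot for character U+00A0
    (aset table raw 'raw-byte)                   ; slot for raw byte #xA0
    (list (aref table #xA0) (aref table raw))))

(demo-raw-byte-slots)
;; => (no-break-space raw-byte)
```

The two `aset' calls do not overwrite each other because the second index
is #x3FFFA0, not #xA0.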

A display-table is a char-table.  But, the current code uses
0xA0th element of a display-table for both U+00A0 and
raw-byte 0xA0.  For instance, in get_next_display_element of
xdisp.c:

  if (it->dp
      && (dv = DISP_CHAR_VECTOR (it->dp, it->c),
          VECTORP (dv)))

Here, it->c may be 0xA0 coming from a unibyte buffer/string.

Should we change the above code and all other code that sets
the 0x80th..0xA0th elements of a display table?

---
Kenichi Handa
handa@m17n.org


* Re: raw-byte and char-table
  2010-08-24  1:11 Kenichi Handa
@ 2010-08-24  3:06 ` Eli Zaretskii
  2010-08-24  4:29   ` Kenichi Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2010-08-24  3:06 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Tue, 24 Aug 2010 10:11:23 +0900
> 
> A char-table is a table indexed by a character code.  So,
> its 0xA0th element is a value for a character U+00A0.
> Then, how to set/get a value for raw-byte 0xA0?  Currently,
> this is the way to do that:
> 
>   (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
>   (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)

One could also use the codepoint of the corresponding eight-bit
character directly, no?  I mean, unibyte-char-to-multibyte is just the
way of getting that codepoint, right?

> But, this is not documented.  Should we document it?

Yes, IMO.

> Should we change the above code and all other codes setting
> 0x80th..0xA0th elements of a display table?

Yes.  IMO, we should consistently use the codepoints of eight-bit
characters in all char-tables.




* Re: raw-byte and char-table
  2010-08-24  3:06 ` Eli Zaretskii
@ 2010-08-24  4:29   ` Kenichi Handa
  2010-08-24 17:08     ` Eli Zaretskii
  0 siblings, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2010-08-24  4:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

In article <83eidoogkc.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > A char-table is a table indexed by a character code.  So,
> > its 0xA0th element is a value for a character U+00A0.
> > Then, how to set/get a value for raw-byte 0xA0?  Currently,
> > this is the way to do that:
> > 
> >   (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> >   (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)

> One could also use the codepoint of the corresponding eight-bit
> character directly, no?

Like #x3FFFA0?  It's possible but should not be recommended.

> I mean, unibyte-char-to-multibyte is just the
> way of getting that codepoint, right?

Yes.

> > But, this is not documented.  Should we document it?

> Yes, IMO.

> > Should we change the above code and all other codes setting
> > 0x80th..0xA0th elements of a display table?

> Yes.  IMO, we should consistently use the codepoints of eight-bit
> characters in all char-tables.

Ok, if Yidong and Stefan agree too, I'll work on it.

As for a display table, we have one more problem.  Currently
an element of a display table is nil or a vector of
characters.  To directly output the byte #xA0 to a terminal,
perhaps the correct way is to set (unibyte-char-to-multibyte
#xA0) in a vector.  That way, we can specify any byte(s) to
send to a terminal.

But, then, what is the semantics of the vector element
(unibyte-char-to-multibyte #xA0) for a graphic device?  What
should we display for CHAR if we set up
standard-display-table like this?

  (aset standard-display-table 
        CHAR (vector (unibyte-char-to-multibyte #xA0)))

It seems that displaying the glyph with glyph-code 0xA0 from the
font that is usually selected for CHAR is the most natural
interpretation.  But that means we would have a method of
directly specifying a glyph-code only when it is in 0x80..0xFF,
which is very unbalanced.
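
A minimal sketch of the setup in question, using a fresh table from
`make-display-table' instead of modifying `standard-display-table'.  The
function name `demo-display-entry' and the choice of ?x for CHAR are
arbitrary assumptions.  It only shows what gets stored; what a graphic
device should draw for such an entry is exactly the open question above.

```elisp
(require 'disp-table)

(defun demo-display-entry ()
  "Store a raw-byte character in a display-table entry and return the entry."
  (let ((dt (make-display-table)))
    ;; Ask for CHAR (?x here) to be displayed as the raw byte #xA0.
    (aset dt ?x (vector (unibyte-char-to-multibyte #xA0)))
    (aref dt ?x)))

(demo-display-entry)
;; => [4194208]   ; that is, [#x3FFFA0]
```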

---
Kenichi Handa
handa@m17n.org


* Re: raw-byte and char-table
  2010-08-24  4:29   ` Kenichi Handa
@ 2010-08-24 17:08     ` Eli Zaretskii
  2010-08-25  4:05       ` Kenichi Handa
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Zaretskii @ 2010-08-24 17:08 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: emacs-devel@gnu.org
> Date: Tue, 24 Aug 2010 13:29:45 +0900
> 
> In article <83eidoogkc.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> 
> > > A char-table is a table indexed by a character code.  So,
> > > its 0xA0th element is a value for a character U+00A0.
> > > Then, how to set/get a value for raw-byte 0xA0?  Currently,
> > > this is the way to do that:
> > > 
> > >   (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> > >   (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
> 
> > One could also use the codepoint of the corresponding eight-bit
> > character directly, no?
> 
> Like #x3FFFA0?  It's possible but should not be recommended.

Why not recommended?  We already document in the ELisp manual the
codepoints to which we map eight-bit bytes.  It's not a secret, it's
in the open.

> As for a display table, we have one more problem.  Currently
> an element of a display table is nil or a vector of
> characters.  To directly output the byte #xA0 to a terminal,
> perhaps the correct way is to set (unibyte-char-to-multibyte
> #xA0) in a vector.  That way, we can specify any byte(s) to
> send to a terminal.
> 
> But, then, what is the semantics of the vector element
> (unibyte-char-to-multibyte #xA0) for a graphic device?  What
> should we display for CHAR if we setup
> standard-display-table as this?
> 
>   (aset standard-display-table 
>         CHAR (vector (unibyte-char-to-multibyte #xA0)))

There's something I'm missing here: why are text terminals and graphics
terminals different in this context?  It seems that you are saying
that what is correct for a text terminal does not have clear
semantics for a GUI terminal, but why?




* Re: raw-byte and char-table
  2010-08-24 17:08     ` Eli Zaretskii
@ 2010-08-25  4:05       ` Kenichi Handa
  2010-08-26  0:01         ` Stefan Monnier
  0 siblings, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2010-08-25  4:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

In article <8362z0ndke.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > > >   (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> > > >   (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
> > 
> > > One could also use the codepoint of the corresponding eight-bit
> > > character directly, no?
> > 
> > Like #x3FFFA0?  It's possible but should not be recommended.

> Why not recommended?  We already document in the ELisp manual the
> codepoints to which we map eight-bit bytes.  It's not a secret, it's
> in the open.

A number like #x3FFFA0 is so cryptic.  The function name
unibyte-char-to-multibyte is also not ideal, but I think
it's better than #x3FFFA0.

> > As for a display table, we have one more problem.  Currently
> > an element of a display table is nil or a vector of
> > characters.  To directly output the byte #xA0 to a terminal,
> > perhaps the correct way is to set (unibyte-char-to-multibyte
> > #xA0) in a vector.  That way, we can specify any byte(s) to
> > send to a terminal.
> > 
> > But, then, what is the semantics of the vector element
> > (unibyte-char-to-multibyte #xA0) for a graphic device?  What
> > should we display for CHAR if we setup
> > standard-display-table as this?
> > 
> >   (aset standard-display-table 
> >         CHAR (vector (unibyte-char-to-multibyte #xA0)))

> There's something I'm missing here: why text terminals and graphics
> terminals are different in this context?  It seems that you are saying
> that what is correct for a text terminal does not have a clear
> semantics for a GUI terminal,

Yes.

> but why?

What we can send to a terminal is a byte.  So, by having a
method of specifying any raw byte directly, we can send all
possible bytes to a terminal.  For a graphic device, the
natural interpretation corresponding to "directly sending a
raw byte" is, I think, "directly specifying a glyph code".
But, to specify all possible glyph codes, 0x80..0xFF is not
enough.

---
Kenichi Handa
handa@m17n.org




* Re: raw-byte and char-table
  2010-08-25  4:05       ` Kenichi Handa
@ 2010-08-26  0:01         ` Stefan Monnier
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Monnier @ 2010-08-26  0:01 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel

>> Why not recommended?  We already document in the ELisp manual the
>> codepoints to which we map eight-bit bytes.  It's not a secret, it's
>> in the open.

> Number like #x3FFFA0 is so cryptic.  The function name
> unibyte-char-to-multibyte is also not ideal, but I think
> it's better than #x3FFFA0.

We could provide a ?\NNN (or similar) notation for it.  Similarly to
what we do for those bytes in multibyte strings.


        Stefan





* Re: raw-byte and char-table
@ 2010-08-26  2:58 MON KEY
  2010-08-26  3:34 ` Kenichi Handa
  0 siblings, 1 reply; 13+ messages in thread
From: MON KEY @ 2010-08-26  2:58 UTC (permalink / raw)
  To: handa; +Cc: Eli Zaretskii, Stefan Monnier, emacs-devel

> Number like #x3FFFA0 is so cryptic.  The function name
> unibyte-char-to-multibyte is also not ideal, but I think
> it's better than #x3FFFA0.

Maybe I am misunderstanding, but I think the `#x' and `#o' syntax is
not cryptic at all in the context. These at least preserve identity:

 4194208, #o17777640, #x3fffa0

This signals an error:
 (unibyte-char-to-multibyte
  (unibyte-char-to-multibyte 160))

Also, there is the brevity factor:

(aref (syntax-table) #o17777640)

(aref (syntax-table) #x3fffa0)

(aref (syntax-table) 4194208)

(aref (syntax-table)
      (unibyte-char-to-multibyte 160))
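
The points above can be checked with a short sketch.  The function name
`demo-notations' and the symbol `signals-error' are made up for this
example; everything else is standard Emacs Lisp (Emacs 23 or later
assumed):

```elisp
(defun demo-notations ()
  "Compare the notations for the raw-byte character of byte 160."
  (list
   ;; Decimal, octal and hex read as the same integer.
   (= 4194208 #x3fffa0)
   (= #o17777640 #x3fffa0)
   ;; The function call yields that same integer.
   (= (unibyte-char-to-multibyte 160) #x3fffa0)
   ;; Converting twice fails: the first result is no longer a byte.
   (condition-case nil
       (unibyte-char-to-multibyte (unibyte-char-to-multibyte 160))
     (error 'signals-error))))

(demo-notations)
;; => (t t t signals-error)
```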

> We could provide a ?\NNN (or similar) notation for it.  Similarly to
> what we do for those bytes in multibyte strings.

Howsabout just this one for all of them:

 `#\'

:)

>        Stefan

--
/s_P\




* Re: raw-byte and char-table
  2010-08-26  2:58 raw-byte and char-table MON KEY
@ 2010-08-26  3:34 ` Kenichi Handa
  2010-08-26  5:30   ` MON KEY
  0 siblings, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2010-08-26  3:34 UTC (permalink / raw)
  To: MON KEY; +Cc: eliz, monnier, emacs-devel

In article <AANLkTinaF1Z2Rvp_sDv-ciHNjY4=eoW7e46KS3_yN-Hh@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:

> > Number like #x3FFFA0 is so cryptic.  The function name
> > unibyte-char-to-multibyte is also not ideal, but I think
> > it's better than #x3FFFA0.

> Maybe I am misunderstanding, but I think the `#x' and `#o' syntax is
> not cryptic at all in the context.

I'm not arguing that the syntax is cryptic.  What I want to
say is that it is difficult for one who reads the code to
understand what #x3FFFA0 means.

> This signals an error:
>  (unibyte-char-to-multibyte
>   (unibyte-char-to-multibyte 160))

Yes, but is it a problem?

> > We could provide a ?\NNN (or similar) notation for it.  Similarly to
> > what we do for those bytes in multibyte strings.

> Howsabout just this one for all of them:

>  `#\'

Do you mean making #\240 read as #x3FFFA0?

---
Kenichi Handa
handa@m17n.org




* Re: raw-byte and char-table
  2010-08-26  3:34 ` Kenichi Handa
@ 2010-08-26  5:30   ` MON KEY
  2010-08-26  6:48     ` Kenichi Handa
  0 siblings, 1 reply; 13+ messages in thread
From: MON KEY @ 2010-08-26  5:30 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Stefan Monnier, emacs-devel

On Wed, Aug 25, 2010 at 11:34 PM, Kenichi Handa <handa@m17n.org> wrote:
> In article <AANLkTinaF1Z2Rvp_sDv-ciHNjY4=eoW7e46KS3_yN-Hh@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:
>
>> > Number like #x3FFFA0 is so cryptic.  The function name
>> > unibyte-char-to-multibyte is also not ideal, but I think
>> > it's better than #x3FFFA0.
>
>> Maybe I am misunderstanding, but I think the `#x' and `#o' syntax is
>> not cryptic at all in the context.
>
> I'm not arguing that the syntax is cryptic.  What I want to
> say is that it is difficult for one who reads the code to
> understand what #x3FFFA0 means.

So the syntax aren't the problem its their semantic denotation.
This is the realm of Tarski and McDermott[1].

Regardless, right now it is all confusing (esp. for those of us less
inclined to differentiating the multibyte/unibyte distinction).

>
>> This signals an error:
>>  (unibyte-char-to-multibyte
>>   (unibyte-char-to-multibyte 160))
>
> Yes, but is it a problem?

I would urge that it is a problem wherever the numerical denotation
has no visible/nameable/printable corollary.

Why should it be allowed to be a problem if it can be avoided?

>
>> > We could provide a ?\NNN (or similar) notation for it.  Similarly to
>> > what we do for those bytes in multibyte strings.
>
>> Howsabout just this one for all of them:
>
>>  `#\'
>
> Do you mean that making #\240 to be read as #x3FFFA0?

Half-jokingly, Yes.

(assuming the #\240 above is the code-point 0xA0)

Though, I _also_ had these things in mind as well:

#\8-bit-240

or

#\byte-240

Which would allow referencing these chars by something other than a
numeric id.

E.g. in some other dialects of Lisp there is this type of behaviour:

CL-USER> #\	;<-that's a #x9 after the \
;=> #\Tab

CL-USER> #\ ;<- that's a #xa after the \
;=>
;  #\Newline

CL-USER> #\NO-BREAK_SPACE ;<-that's the char-name for #xa0
;=> #\NO-BREAK_SPACE      ;<-return is as per `identity'

CL-USER> (identity #\NO-BREAK_SPACE)
;=> #\NO-BREAK_SPACE

CL-USER> (princ #\ )
;=>
;  #\NO-BREAK_SPACE

CL-USER> (prin1 #\ )
;=> #\NO-BREAK_SPACE
;   #\NO-BREAK_SPACE

CL-USER> #\ ;<- That's a #x20 after the \
;=> #\

CL-USER> (char-code #\ )
32

CL-USER> (describe #\ )
;=> #\
;  [standard-char]
;
;  :_Char-code: 32
;  :_Char-name: Space
;  _

The idea is that although the chars in the above example have no
visibly "printable" representation, the `#\' reader syntax _does_
recognize them, either by char-name or by a readable identity, e.g.:

CL-USER> (read-char)
\x06
;=> #\Ack

Of course, introduction of this type of read syntax to Emacs lisp
would (or at least it should) imply extension to all characters
unibyte and multibyte...

Hence the ":)" smiley in my previous response to Stefan.


[1] McDermott, Drew (1978). Tarskian semantics, or no notation without
    denotation. Cognitive Science 2:277-82.

--
/s_P\




* Re: raw-byte and char-table
  2010-08-26  5:30   ` MON KEY
@ 2010-08-26  6:48     ` Kenichi Handa
  2010-08-26  7:09       ` Miles Bader
  0 siblings, 1 reply; 13+ messages in thread
From: Kenichi Handa @ 2010-08-26  6:48 UTC (permalink / raw)
  To: MON KEY; +Cc: monnier, emacs-devel

In article <AANLkTi=iQqseE5irbKxHCrd5NxGmEH-db+G4FatGZAP4@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:

> > I'm not arguing that the syntax is cryptic.  What I want to
> > say is that it is difficult for one who reads the code to
> > understand what #x3FFFA0 means.

> So the syntax aren't the problem its their semantic denotation.

Sorry, but I can't parse the above sentence.  Could you
please paraphrase it?

> Regardless, right now it is all confusing (esp. for those of us less
> inclined to differentiating the multibyte/unibyte distinction).

I agree that the handling of raw bytes is very confusing.
The root cause is, I think, that we represent a character by an
integer value, and we would have to introduce a character object to
resolve that confusion.  Unfortunately, that requires a huge
amount of work.  Until someone volunteers for that work, we must
live with the current infrastructure of Emacs.

>>> This signals an error:
>>>  (unibyte-char-to-multibyte
>>>   (unibyte-char-to-multibyte 160))
> >
> > Yes, but is it a problem?

> I would urge that it is a problem wherever the numerical denotation
> has no visible/nameable/printable corollary.

> Why should it be allowed to be a problem if it can be avoided?

Conceptually we have "byte", "integer", and "character", and
#x3FFFA0 is both an integer and a character representing
byte 160.

Perhaps we should stop calling a "byte" a "unibyte char", rename
the above function to "byte-to-char", and document it as:

(byte-to-char BYTE)
Convert the byte BYTE to a character representing BYTE.

Then it's clear that (byte-to-char (byte-to-char BYTE))
signals an error.

Likewise multibyte-char-to-unibyte => char-to-byte:

(char-to-byte CH)
Convert the character CH to a byte.
If the character does not represent a byte, return -1.

By the way, we also have decode-char.

(decode-char 'eight-bit 160) => #x3FFFA0
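
As a rough model of the two proposed functions, here is a sketch written
with plain arithmetic.  The names `demo-byte-to-char' and
`demo-char-to-byte' are hypothetical, and this is a strict model of the
documented semantics above, not the actual implementation of
unibyte-char-to-multibyte or multibyte-char-to-unibyte:

```elisp
(defun demo-byte-to-char (byte)
  "Return the character representing BYTE, an integer in 0..255."
  (unless (and (integerp byte) (>= byte 0) (<= byte #xFF))
    (error "Not a byte: %S" byte))
  (if (< byte #x80)
      byte                      ; ASCII bytes are their own characters
    (+ byte #x3FFF00)))         ; #x80..#xFF -> #x3FFF80..#x3FFFFF

(defun demo-char-to-byte (ch)
  "Return the byte that character CH represents, or -1 if there is none."
  (cond ((and (>= ch 0) (< ch #x80)) ch)
        ((and (>= ch #x3FFF80) (<= ch #x3FFFFF)) (- ch #x3FFF00))
        (t -1)))

(list (demo-byte-to-char 160)       ; raw byte 160 as a character
      (demo-char-to-byte #x3FFFA0)  ; and back to the byte
      (demo-char-to-byte #xA0)      ; U+00A0 is a character, not a byte
      (= (demo-byte-to-char 160) (decode-char 'eight-bit 160)))
;; => (4194208 160 -1 t)
```

With this model, (demo-byte-to-char (demo-byte-to-char 160)) signals an
error because 4194208 is outside 0..255, which matches the behaviour
described above.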

---
Kenichi Handa
handa@m17n.org


* Re: raw-byte and char-table
  2010-08-26  6:48     ` Kenichi Handa
@ 2010-08-26  7:09       ` Miles Bader
  2010-08-27  3:30         ` MON KEY
  0 siblings, 1 reply; 13+ messages in thread
From: Miles Bader @ 2010-08-26  7:09 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: MON KEY, monnier, emacs-devel

Kenichi Handa <handa@m17n.org> writes:
> Perhaps we should not call "byte" as "unibyte char", rename
> the above function to "byte-to-char", and document it as:
>
> (byte-to-char BYTE)
> Convert the byte BYTE to a character representing BYTE.
...
> Likewise multibyte-char-to-unibyte => char-to-byte:

Those names seem a bit more clear to me as well.

-Miles

-- 
Monday, n. In Christian countries, the day after the baseball game.




* Re: raw-byte and char-table
  2010-08-26  7:09       ` Miles Bader
@ 2010-08-27  3:30         ` MON KEY
  2010-08-27  3:45           ` Kenichi Handa
  0 siblings, 1 reply; 13+ messages in thread
From: MON KEY @ 2010-08-27  3:30 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-devel, Stefan Monnier, Kenichi Handa

On Thu, Aug 26, 2010 at 3:09 AM, Miles Bader <miles@gnu.org> wrote:
>> Convert the byte BYTE to a character representing BYTE.
> ...
>> Likewise multibyte-char-to-unibyte => char-to-byte:
>
> Those names seem a bit more clear to me as well.
>

Yes. Me too.

Is there some reason why these names might be confusing or otherwise
add to the existing confusion?

> -Miles

--
/s_P\




* Re: raw-byte and char-table
  2010-08-27  3:30         ` MON KEY
@ 2010-08-27  3:45           ` Kenichi Handa
  0 siblings, 0 replies; 13+ messages in thread
From: Kenichi Handa @ 2010-08-27  3:45 UTC (permalink / raw)
  To: MON KEY; +Cc: emacs-devel, monnier, miles

In article <AANLkTimyj413zuWLDZciQMA1xPNo33qxCoY8ZrPqg6DW@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:

> On Thu, Aug 26, 2010 at 3:09 AM, Miles Bader <miles@gnu.org> wrote:
>>> Convert the byte BYTE to a character representing BYTE.
> > ...
>>> Likewise multibyte-char-to-unibyte => char-to-byte:
> >
> > Those names seem a bit more clear to me as well.
> >

> Yes. Me too.

> Is there some reason why these names might be confusing or otherwise
> add to the existing confusion?

Until Emacs 22, we treated an 8-bit byte as a character of the
current unibyte charset.  So, in the Latin-1 environment,
  (unibyte-char-to-multibyte #xC0) => LATIN CAPITAL LETTER A WITH GRAVE
but in the Cyrillic-KOI8 environment,
  (unibyte-char-to-multibyte #xC0) => CYRILLIC SMALL LETTER YU

---
Kenichi Handa
handa@m17n.org
