* raw-byte and char-table
From: Kenichi Handa @ 2010-08-24 1:11 UTC (permalink / raw)
To: emacs-devel
A char-table is a table indexed by character code, so its
0xA0th element holds the value for the character U+00A0.
How, then, do we set or get a value for the raw byte 0xA0?
Currently, this is the way to do it:
  (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
  (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
But this is not documented. Should we document it?
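As a minimal sketch of the idiom above: the table purpose `demo-purpose` and the function name `demo-raw-byte-slots` are made up for illustration, and the result assumes an Emacs in which raw bytes have their own character codes (#x3FFF80..#x3FFFFF). Under those assumptions the two slots can be set and read independently:

```elisp
;; Hypothetical demo: `demo-purpose' and `demo-raw-byte-slots' are
;; illustrative names, not part of Emacs.
(defun demo-raw-byte-slots ()
  "Store separate values for U+00A0 and raw byte 0xA0, then read them."
  (let ((table (make-char-table 'demo-purpose))
        ;; The character that stands for the raw byte 0xA0.
        (raw (unibyte-char-to-multibyte #xA0)))
    (aset table #xA0 'no-break-space)   ; slot for the character U+00A0
    (aset table raw 'raw-byte)          ; slot for the raw byte 0xA0
    (list (aref table #xA0) (aref table raw))))

(demo-raw-byte-slots)
;; => (no-break-space raw-byte)
```

The two `aset` calls do not overwrite each other because the raw byte is indexed by a different character code than U+00A0.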
A display table is a char-table, but the current code uses
the 0xA0th element of a display table for both U+00A0 and
raw byte 0xA0. For instance, in get_next_display_element in
xdisp.c:
  if (it->dp
      && (dv = DISP_CHAR_VECTOR (it->dp, it->c),
          VECTORP (dv)))
Here, it->c may be 0xA0 coming from a unibyte buffer or string.
Should we change the above code and all other code that sets
the 0x80th..0xA0th elements of a display table?
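On the Lisp side, a display table can already hold separate entries for U+00A0 and for the raw byte, because it is an ordinary char-table. The sketch below only shows that storage; `demo-display-table-slots` is a hypothetical name, and whether redisplay consults the second slot for bytes coming from unibyte text is exactly the open question here, so nothing about display behavior is implied:

```elisp
;; Hypothetical demo: `demo-display-table-slots' is an illustrative name.
(defun demo-display-table-slots ()
  "Store distinct display vectors for U+00A0 and for raw byte 0xA0."
  (let ((dt (make-display-table))
        (raw (unibyte-char-to-multibyte #xA0)))
    (aset dt #xA0 (vector ?_))           ; entry for the character U+00A0
    (aset dt raw (vector ?\\ ?2 ?4 ?0))  ; entry for the raw byte: \240
    (list (char-table-p dt) (aref dt #xA0) (aref dt raw))))

(demo-display-table-slots)
;; => (t [95] [92 50 52 48])
```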
---
Kenichi Handa
handa@m17n.org
* Re: raw-byte and char-table
From: Eli Zaretskii @ 2010-08-24 3:06 UTC (permalink / raw)
To: Kenichi Handa; +Cc: emacs-devel
> From: Kenichi Handa <handa@m17n.org>
> Date: Tue, 24 Aug 2010 10:11:23 +0900
>
> A char-table is a table indexed by a character code. So,
> its 0xA0th element is a value for a character U+00A0.
> Then, how to set/get a value for raw-byte 0xA0? Currently,
> this is the way to do that:
>
> (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
One could also use the codepoint of the corresponding eight-bit
character directly, no? I mean, unibyte-char-to-multibyte is just the
way of getting that codepoint, right?
> But, this is not documented. Should we document it?
Yes, IMO.
> Should we change the above code and all other codes setting
> 0x80th..0xA0th elements of a display table?
Yes. IMO, we should consistently use the codepoints of eight-bit
characters in all char-tables.
* Re: raw-byte and char-table
From: Kenichi Handa @ 2010-08-24 4:29 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel
In article <83eidoogkc.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> > A char-table is a table indexed by a character code. So,
> > its 0xA0th element is a value for a character U+00A0.
> > Then, how to set/get a value for raw-byte 0xA0? Currently,
> > this is the way to do that:
> >
> > (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> > (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
> One could also use the codepoint of the corresponding eight-bit
> character directly, no?
Like #x3FFFA0? It's possible but should not be recommended.
> I mean, unibyte-char-to-multibyte is just the
> way of getting that codepoint, right?
Yes.
> > But, this is not documented. Should we document it?
> Yes, IMO.
> > Should we change the above code and all other codes setting
> > 0x80th..0xA0th elements of a display table?
> Yes. IMO, we should consistently use the codepoints of eight-bit
> characters in all char-tables.
Ok, if Yidong and Stefan agree too, I'll work on it.
As for a display table, we have one more problem. Currently
an element of a display table is nil or a vector of
characters. To directly output the byte #xA0 to a terminal,
perhaps the correct way is to set (unibyte-char-to-multibyte
#xA0) in a vector. That way, we can specify any byte(s) to
send to a terminal.
But then, what are the semantics of the vector element
(unibyte-char-to-multibyte #xA0) for a graphic device? What
should we display for CHAR if we set up
standard-display-table like this?
  (aset standard-display-table
        CHAR (vector (unibyte-char-to-multibyte #xA0)))
It seems that displaying a glyph of glyph-code 0xA0 of a
font that is usually selected for CHAR is the most natural
interpretation. But then, it means that we have a method of
directly specifying a glyph-code only if it is 0x80..0xFF;
it's very unbalanced.
---
Kenichi Handa
handa@m17n.org
* Re: raw-byte and char-table
From: Eli Zaretskii @ 2010-08-24 17:08 UTC (permalink / raw)
To: Kenichi Handa; +Cc: emacs-devel
> From: Kenichi Handa <handa@m17n.org>
> Cc: emacs-devel@gnu.org
> Date: Tue, 24 Aug 2010 13:29:45 +0900
>
> In article <83eidoogkc.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
>
> > > A char-table is a table indexed by a character code. So,
> > > its 0xA0th element is a value for a character U+00A0.
> > > Then, how to set/get a value for raw-byte 0xA0? Currently,
> > > this is the way to do that:
> > >
> > > (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> > > (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
>
> > One could also use the codepoint of the corresponding eight-bit
> > character directly, no?
>
> Like #x3FFFA0? It's possible but should not be recommended.
Why not recommended? We already document in the ELisp manual the
codepoints to which we map eight-bit bytes. It's not a secret, it's
in the open.
> As for a display table, we have one more problem. Currently
> an element of a display table is nil or a vector of
> characters. To directly output the byte #xA0 to a terminal,
> perhaps the correct way is to set (unibyte-char-to-multibyte
> #xA0) in a vector. That way, we can specify any byte(s) to
> send to a terminal.
>
> But, then, what is the semantics of the vector element
> (unibyte-char-to-multibyte #xA0) for a graphic device? What
> should we display for CHAR if we setup
> standard-display-table as this?
>
> (aset standard-display-table
> CHAR (vector (unibyte-char-to-multibyte #xA0)))
There's something I'm missing here: why are text terminals and
graphics terminals different in this context? It seems that you are
saying that what is correct for a text terminal does not have clear
semantics for a GUI terminal, but why?
* Re: raw-byte and char-table
From: Kenichi Handa @ 2010-08-25 4:05 UTC (permalink / raw)
To: Eli Zaretskii; +Cc: emacs-devel
In article <8362z0ndke.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:
> > > > (aref CHAR-TABLE (unibyte-char-to-multibyte #xA0))
> > > > (aset CHAR-TABLE (unibyte-char-to-multibyte #xA0) VALUE)
> >
> > > One could also use the codepoint of the corresponding eight-bit
> > > character directly, no?
> >
> > Like #x3FFFA0? It's possible but should not be recommended.
> Why not recommended? We already document in the ELisp manual the
> codepoints to which we map eight-bit bytes. It's not a secret, it's
> in the open.
A number like #x3FFFA0 is so cryptic. The function name
unibyte-char-to-multibyte is also not ideal, but I think
it's better than #x3FFFA0.
> > As for a display table, we have one more problem. Currently
> > an element of a display table is nil or a vector of
> > characters. To directly output the byte #xA0 to a terminal,
> > perhaps the correct way is to set (unibyte-char-to-multibyte
> > #xA0) in a vector. That way, we can specify any byte(s) to
> > send to a terminal.
> >
> > But, then, what is the semantics of the vector element
> > (unibyte-char-to-multibyte #xA0) for a graphic device? What
> > should we display for CHAR if we setup
> > standard-display-table as this?
> >
> > (aset standard-display-table
> > CHAR (vector (unibyte-char-to-multibyte #xA0)))
> There's something I'm missing here: why text terminals and graphics
> terminals are different in this context? It seems that you are saying
> that what is correct for a text terminal does not have a clear
> semantics for a GUI terminal,
Yes.
> but why?
What we can send to a terminal is a byte. So, by having a
method of specifying any raw byte directly, we can send all
possible bytes to a terminal. For a graphic device, the
natural interpretation corresponding to "directly sending a
raw byte" is, I think, "directly specifying a glyph code".
But, to specify all possible glyph codes, 0x80..0xFF is not
enough.
---
Kenichi Handa
handa@m17n.org
* Re: raw-byte and char-table
From: Stefan Monnier @ 2010-08-26 0:01 UTC (permalink / raw)
To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel
>> Why not recommended? We already document in the ELisp manual the
>> codepoints to which we map eight-bit bytes. It's not a secret, it's
>> in the open.
> A number like #x3FFFA0 is so cryptic. The function name
> unibyte-char-to-multibyte is also not ideal, but I think
> it's better than #x3FFFA0.
We could provide a ?\NNN (or similar) notation for it. Similarly to
what we do for those bytes in multibyte strings.
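For comparison, here is a sketch of how such a byte already behaves in strings, on the assumption that the existing notation meant is the octal escape "\240". The function name `demo-raw-byte-in-strings` is hypothetical. The escape reads as a one-byte unibyte string, and converting that string to multibyte yields the eight-bit character discussed in this thread:

```elisp
;; Hypothetical demo: `demo-raw-byte-in-strings' is an illustrative name.
(defun demo-raw-byte-in-strings ()
  "Show how the octal escape \\240 reads in unibyte and multibyte strings."
  (let* ((uni "\240")                        ; one-byte unibyte string
         (multi (string-to-multibyte uni)))  ; same byte as an eight-bit char
    (list (multibyte-string-p uni) (aref uni 0)
          (multibyte-string-p multi) (aref multi 0))))

(demo-raw-byte-in-strings)
;; => (nil 160 t 4194208)   ; 4194208 = #x3FFFA0
```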
Stefan
* Re: raw-byte and char-table
From: MON KEY @ 2010-08-26 2:58 UTC (permalink / raw)
To: handa; +Cc: Eli Zaretskii, Stefan Monnier, emacs-devel
> A number like #x3FFFA0 is so cryptic. The function name
> unibyte-char-to-multibyte is also not ideal, but I think
> it's better than #x3FFFA0.
Maybe I am misunderstanding, but I think the `#x' and `#o' syntax is
not cryptic at all in the context. These at least preserve identity:
4194208, #o17777640, #x3fffa0
This signals an error:
(unibyte-char-to-multibyte
(unibyte-char-to-multibyte 160))
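A small sketch that makes the error observable without aborting evaluation. `demo-double-conversion` is a hypothetical helper, and the exact error message is not assumed here, only that some error is signaled because the inner result is no longer a value below 256:

```elisp
;; Hypothetical demo: `demo-double-conversion' is an illustrative name.
(defun demo-double-conversion ()
  "Apply `unibyte-char-to-multibyte' twice to 160 and report the outcome."
  (condition-case err
      ;; The inner call returns #x3FFFA0, which is not a byte value,
      ;; so the outer call is expected to signal an error.
      (unibyte-char-to-multibyte (unibyte-char-to-multibyte 160))
    (error (list 'signaled (car err)))))

(demo-double-conversion)
```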
Also, there is the brevity factor:
(aref (syntax-table) #o17777640)
(aref (syntax-table) #x3fffa0)
(aref (syntax-table) 4194208)
(aref (syntax-table)
(unibyte-char-to-multibyte 160))
> We could provide a ?\NNN (or similar) notation for it. Similarly to
> what we do for those bytes in multibyte strings.
Howsabout just this one for all of them:
`#\'
:)
> Stefan
--
/s_P\
* Re: raw-byte and char-table
From: Kenichi Handa @ 2010-08-26 3:34 UTC (permalink / raw)
To: MON KEY; +Cc: eliz, monnier, emacs-devel
In article <AANLkTinaF1Z2Rvp_sDv-ciHNjY4=eoW7e46KS3_yN-Hh@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:
> > A number like #x3FFFA0 is so cryptic. The function name
> > unibyte-char-to-multibyte is also not ideal, but I think
> > it's better than #x3FFFA0.
> Maybe I am misunderstanding, but I think the `#x' and `#o' syntax is
> not cryptic at all in the context.
I'm not arguing that the syntax is cryptic. What I want to
say is that it is difficult for one who reads the code to
understand what #x3FFFA0 means.
> This signals an error:
> (unibyte-char-to-multibyte
> (unibyte-char-to-multibyte 160))
Yes, but is it a problem?
> > We could provide a ?\NNN (or similar) notation for it. Similarly to
> > what we do for those bytes in multibyte strings.
> Howsabout just this one for all of them:
> `#\'
Do you mean making #\240 read as #x3FFFA0?
---
Kenichi Handa
handa@m17n.org
* Re: raw-byte and char-table
From: MON KEY @ 2010-08-26 5:30 UTC (permalink / raw)
To: Kenichi Handa; +Cc: Stefan Monnier, emacs-devel
On Wed, Aug 25, 2010 at 11:34 PM, Kenichi Handa <handa@m17n.org> wrote:
> In article <AANLkTinaF1Z2Rvp_sDv-ciHNjY4=eoW7e46KS3_yN-Hh@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:
>
>> > A number like #x3FFFA0 is so cryptic. The function name
>> > unibyte-char-to-multibyte is also not ideal, but I think
>> > it's better than #x3FFFA0.
>
>> Maybe I am misunderstanding, but I think the `#x' and `#o' syntax is
>> not cryptic at all in the context.
>
> I'm not arguing that the syntax is cryptic. What I want to
> say is that it is difficult for one who reads the code to
> understand what #x3FFFA0 means.
So the syntax aren't the problem its their semantic denotation.
This is the realm of Tarski and McDermott[1].
Regardless, right now it is all confusing (esp. for those of us less
inclined to differentiating the multibyte/unibyte distinction).
>
>> This signals an error:
>> (unibyte-char-to-multibyte
>> (unibyte-char-to-multibyte 160))
>
> Yes, but is it a problem?
I would urge that it is a problem wherever the numerical denotation
has no visible/nameable/printable corollary.
Why should it be allowed to be a problem if it can be avoided?
>
>> > We could provide a ?\NNN (or similar) notation for it. Similarly to
>> > what we do for those bytes in multibyte strings.
>
>> Howsabout just this one for all of them:
>
>> `#\'
>
> Do you mean making #\240 read as #x3FFFA0?
Half-jokingly, Yes.
(assuming the #\240 above is the code-point 0xA0)
Though, I _also_ had these things in mind as well:
#\8-bit-240
or
#\byte-240
Which would allow referencing these chars by something other than a
numeric id.
E.g. in some other dialects of Lisp there is this type of behaviour:
CL-USER> #\ ;<-that's a #x9 after the \
;=> #\Tab
CL-USER> #\ ;<- that's a #xa after the \
;=>
; #\Newline
CL-USER> #\NO-BREAK_SPACE ;<-that's the char-name for #xa0
;=> #\NO-BREAK_SPACE ;<-return is as per `identity'
CL-USER> (identity #\NO-BREAK_SPACE)
;=> #\NO-BREAK_SPACE
CL-USER> (princ #\ )
;=>
; #\NO-BREAK_SPACE
CL-USER> (prin1 #\ )
;=> #\NO-BREAK_SPACE
; #\NO-BREAK_SPACE
CL-USER> #\ ;<- That's a #x20 after the \
;=> #\
CL-USER> (char-code #\ )
32
CL-USER> (describe #\ )
;=> #\
; [standard-char]
;
; :_Char-code: 32
; :_Char-name: Space
; _
The idea being that where those chars in the above example don't have
visibly "printable" representations but the `#\' reader syntax _does_
recognize them either by char-name or a readable identity, e.g.:
CL-USER> (read-char)
\x06
;=> #\Ack
Of course, introduction of this type of read syntax to Emacs lisp
would (or at least it should) imply extension to all characters
unibyte and multibyte...
Hence the ":)" smiley in my previous response to Stefan.
[1] McDermott, Drew (1978). Tarskian semantics, or no notation without
denotation. Cognitive Science 2:277-82.
--
/s_P\
* Re: raw-byte and char-table
From: Kenichi Handa @ 2010-08-26 6:48 UTC (permalink / raw)
To: MON KEY; +Cc: monnier, emacs-devel
In article <AANLkTi=iQqseE5irbKxHCrd5NxGmEH-db+G4FatGZAP4@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:
> > I'm not arguing that the syntax is cryptic. What I want to
> > say is that it is difficult for one who reads the code to
> > understand what #x3FFFA0 means.
> So the syntax aren't the problem its their semantic denotation.
Sorry, but I can't parse the above sentence. Could you
please paraphrase it?
> Regardless, right now it is all confusing (esp. for those of us less
> inclined to differentiating the multibyte/unibyte distinction).
I agree that the handling of raw bytes is very confusing.
The root cause is, I think, that we represent a character by
an integer value; to resolve that confusion we would have to
introduce a character object. Unfortunately, that requires a
huge amount of work. Until someone volunteers for that work,
we must live with the current infrastructure of Emacs.
>>> This signals an error:
>>> (unibyte-char-to-multibyte
>>> (unibyte-char-to-multibyte 160))
> >
> > Yes, but is it a problem?
> I would urge that it is a problem wherever the numerical denotation
> has no visible/nameable/printable corollary.
> Why should it be allowed to be a problem if it can be avoided?
Conceptually we have "byte", "integer", and "character", and
#x3FFFA0 is both an integer and a character representing
byte 160.
Perhaps we should stop calling a "byte" a "unibyte char", rename
the above function to "byte-to-char", and document it as:
(byte-to-char BYTE)
Convert the byte BYTE to a character representing BYTE.
Then it's clear that (byte-to-char (byte-to-char BYTE))
signals an error.
Likewise multibyte-char-to-unibyte => char-to-byte:
(char-to-byte CH)
Convert the character CH to a byte.
If the character does not represent a byte, return -1.
By the way, we also have decode-char.
(decode-char 'eight-bit 160) => #x3FFFA0
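The proposal can be sketched as thin wrappers over the existing functions. The `demo-` names below are hypothetical stand-ins for the proposed byte-to-char and char-to-byte, chosen so that they cannot clash with anything Emacs defines, and the docstrings follow the wording proposed above:

```elisp
;; Hypothetical wrappers mirroring the proposed names; they only
;; delegate to functions that already exist.
(defun demo-byte-to-char (byte)
  "Convert the byte BYTE to a character representing BYTE."
  (unibyte-char-to-multibyte byte))

(defun demo-char-to-byte (ch)
  "Convert the character CH to a byte.
If the character does not represent a byte, return -1."
  (multibyte-char-to-unibyte ch))

(list (demo-byte-to-char 160)            ; same value as (decode-char 'eight-bit 160)
      (demo-char-to-byte #x3FFFA0)       ; back to the byte
      (demo-char-to-byte #x3042))        ; HIRAGANA LETTER A is not a byte
;; => (4194208 160 -1)
```

With these names it is easier to see why applying the first function to its own result fails: its argument is documented as a byte, and #x3FFFA0 is not one.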
---
Kenichi Handa
handa@m17n.org
* Re: raw-byte and char-table
From: Miles Bader @ 2010-08-26 7:09 UTC (permalink / raw)
To: Kenichi Handa; +Cc: MON KEY, monnier, emacs-devel
Kenichi Handa <handa@m17n.org> writes:
> Perhaps we should not call "byte" as "unibyte char", rename
> the above function to "byte-to-char", and document it as:
>
> (byte-to-char BYTE)
> Convert the byte BYTE to a character representing BYTE.
...
> Likewise multibyte-char-to-unibyte => char-to-byte:
Those names seem a bit more clear to me as well.
-Miles
--
Monday, n. In Christian countries, the day after the baseball game.
* Re: raw-byte and char-table
From: MON KEY @ 2010-08-27 3:30 UTC (permalink / raw)
To: Miles Bader; +Cc: emacs-devel, Stefan Monnier, Kenichi Handa
On Thu, Aug 26, 2010 at 3:09 AM, Miles Bader <miles@gnu.org> wrote:
>> Convert the byte BYTE to a character representing BYTE.
> ...
>> Likewise multibyte-char-to-unibyte => char-to-byte:
>
> Those names seem a bit more clear to me as well.
>
Yes. Me too.
Is there some reason why these names might be confusing or otherwise
add to the existing confusion?
> -Miles
--
/s_P\
* Re: raw-byte and char-table
From: Kenichi Handa @ 2010-08-27 3:45 UTC (permalink / raw)
To: MON KEY; +Cc: emacs-devel, monnier, miles
In article <AANLkTimyj413zuWLDZciQMA1xPNo33qxCoY8ZrPqg6DW@mail.gmail.com>, MON KEY <monkey@sandpframing.com> writes:
> On Thu, Aug 26, 2010 at 3:09 AM, Miles Bader <miles@gnu.org> wrote:
>>> Convert the byte BYTE to a character representing BYTE.
> > ...
>>> Likewise multibyte-char-to-unibyte => char-to-byte:
> >
> > Those names seem a bit more clear to me as well.
> >
> Yes. Me too.
> Is there some reason why these names might be confusing or otherwise
> add to the existing confusion?
Up to Emacs 22, we treated an 8-bit byte as a character of the
current unibyte charset. So, in a Latin-1 environment,
  (unibyte-char-to-multibyte #xC0) => LATIN CAPITAL LETTER A WITH GRAVE
but in a Cyrillic-KOI8 environment,
  (unibyte-char-to-multibyte #xC0) => CYRILLIC SMALL LETTER YU
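The old environment-dependent behavior cannot be reproduced directly in a current Emacs, but the same ambiguity can be illustrated by decoding one byte value under two explicitly named charsets. This is only a sketch: `demo-decode-c0` is a hypothetical name, and it assumes the charsets `iso-8859-1` and `koi8-r` are available under those names in the running Emacs:

```elisp
;; Hypothetical demo: `demo-decode-c0' is an illustrative name.
(defun demo-decode-c0 ()
  "Decode the code #xC0 under a Latin-1 and a KOI8-R charset."
  (list (decode-char 'iso-8859-1 #xC0)   ; LATIN CAPITAL LETTER A WITH GRAVE
        (decode-char 'koi8-r #xC0)))     ; CYRILLIC SMALL LETTER YU

(demo-decode-c0)
;; => (192 1102)   ; U+00C0 and U+044E
```

Naming the charset in the call removes the dependence on the language environment that made the old function name misleading.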
---
Kenichi Handa
handa@m17n.org