Unibyte characters

unofficial mirror of emacs-devel@gnu.org 

* Unibyte characters
From: Eli Zaretskii @ 2008-10-31 11:05 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

The ELisp manual has (in node "Text Representation") this explanation
of what a "unibyte character" is:

       In unibyte representation, each character occupies one byte and
    therefore the possible character codes range from 0 to 255.  Codes 0
    through 127 are ASCII characters; the codes from 128 through 255 are
    used for one non-ASCII character set [...]

But I think this is inaccurate and even misleading.  For starters,
unibyte buffers and strings can contain DBCS characters and UTF-8
encoded text, where a character certainly does not ``occupy one
byte''.

More generally, I think it is better to say that unibyte buffers and
strings hold raw 8-bit bytes, and that for 8859-x and single-byte
Windows codepages, each such byte represents a single character.
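
To make the byte-versus-character point concrete, here is a minimal
sketch outside Emacs. It uses Python's bytes type purely as an analogy
for a unibyte string; the sample characters and variable names are
assumptions chosen for illustration, not anything from the manual.

```python
# Analogy only: a Python bytes object stands in for a unibyte string.
e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one character
hira_a = "\u3042"    # HIRAGANA LETTER A, one character

latin1 = e_acute.encode("latin-1")    # single-byte charset (ISO 8859-1)
utf8 = e_acute.encode("utf-8")        # variable-length encoding
sjis = hira_a.encode("shift_jis")     # a double-byte (DBCS) encoding

# Each encoded form is just a sequence of values in the range 0..255.
for name, raw in (("latin-1", latin1), ("utf-8", utf8), ("shift_jis", sjis)):
    print(name, len(raw), list(raw))
```

Only in the single-byte charset does one byte correspond to one
character; in the other two cases the same character spans two bytes.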

Am I missing something?

* Re: Unibyte characters
From: Miles Bader @ 2008-10-31 11:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Kenichi Handa

Eli Zaretskii <eliz@gnu.org> writes:
> But I think this is inaccurate and even misleading.  For starters,
> unibyte buffers and strings can contain DBCS characters and UTF-8
> encoded text

It doesn't seem inaccurate or misleading -- it's talking about
characters as emacs (and the user, if the buffer is displayed) sees
them.  Text in a unibyte buffer is simply a bunch of binary characters
0-255; you can interpret them however you want, of course, but that's
not how emacs sees it.

-Miles

-- 
Road, n. A strip of land along which one may pass from where it is too
tiresome to be to where it is futile to go.




* Re: Unibyte characters
From: Eli Zaretskii @ 2008-10-31 11:27 UTC (permalink / raw)
  To: Miles Bader; +Cc: handa, emacs-devel

> From: Miles Bader <miles@gnu.org>
> Date: Fri, 31 Oct 2008 20:18:41 +0900
> Cc: emacs-devel@gnu.org, Kenichi Handa <handa@m17n.org>
> 
> Text in a unibyte buffer is simply a bunch of binary characters
> 0-255

Here you are saying what I was saying: that these are just raw 8-bit
bytes.

> you can interpret them however you want, of course, but that's
> not how emacs sees it.

I don't mind saying that displaying such a buffer or string or
movement by characters _interprets_ each byte as a single character.
But interpretation and essence are two different things, and the
manual does not make a point of telling that what it describes is the
Emacs interpretation of such buffers, not what is actually held there.

Thanks for the feedback, I will try to rephrase that text to make this
distinction more clear.

* Re: Unibyte characters
From: Stefan Monnier @ 2008-10-31 14:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, emacs-devel, Miles Bader

>> Text in a unibyte buffer is simply a bunch of binary characters
>> 0-255

> Here you are saying what I was saying: that these are just raw 8-bit
> bytes.

>> you can interpret them however you want, of course, but that's
>> not how emacs sees it.

> I don't mind saying that displaying such a buffer or string or
> movement by characters _interprets_ each byte as a single character.
> But interpretation and essence are two different things, and the
> manual does not make a point of telling that what it describes is the
> Emacs interpretation of such buffers, not what is actually held there.

> Thanks for the feedback, I will try to rephrase that text to make this
> distinction more clear.

IIUC, this part of the manual dates back to the introduction of Mule,
when many people were using Emacs in unibyte mode.  Nowadays unibyte
mode is not recommended (I'd even be quite happy to remove it altogether)
and unibyte buffers should only be used for binary, undecoded data
(i.e. for bytes, not for chars).

So I agree with Eli that we should update this text to insist that
a unibyte buffer only contains bytes, and then explain that if the
buffer is displayed, those bytes will be interpreted in
a particular way.

BTW IIRC the non-ascii part will just be displayed as \NNN nowadays,
rather than in some locale-dependent charset (such as latin-1).
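
As a rough illustration of that kind of display, the sketch below
renders a byte sequence with ASCII bytes shown as themselves and bytes
128 through 255 shown as backslash escapes. It is written in Python
rather than Emacs Lisp, the function name show_raw_bytes is
hypothetical, and it assumes that NNN is the three-digit octal value of
the byte; it does not reproduce how Emacs handles control characters.

```python
def show_raw_bytes(data: bytes) -> str:
    # ASCII bytes are shown as-is; bytes 128..255 become \NNN in octal.
    parts = []
    for b in data:
        if b < 128:
            parts.append(chr(b))
        else:
            parts.append("\\%03o" % b)
    return "".join(parts)

# The UTF-8 encoding of "cafe" with an acute accent on the final letter.
sample = b"caf\xc3\xa9"
print(show_raw_bytes(sample))  # prints caf\303\251
```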

        Stefan

* Re: Unibyte characters
From: Juanma Barranquero @ 2008-10-31 15:02 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Miles Bader, Eli Zaretskii, emacs-devel, handa

On Fri, Oct 31, 2008 at 15:41, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> BTW IIRC the non-ascii part will just be displayed as \NNN nowadays,
> rather than in some locale-dependent charset (such as latin-1).

Unless you've set unibyte-display-via-language-environment to t.

Which apparently works for everyone except me; I get #872/#1179, a
hard to pinpoint but easy to repeat crash in a random location upwards
(stackwise) of draw_glyphs...

  Juanma

* Re: Unibyte characters
From: Eli Zaretskii @ 2008-10-31 18:44 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: handa, emacs-devel, miles

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Miles Bader <miles@gnu.org>,  handa@m17n.org,  emacs-devel@gnu.org
> Date: Fri, 31 Oct 2008 10:41:47 -0400
> 
> IIUC, this part of the manual dates back to the introduction of Mule,
> when many people were using Emacs in unibyte mode.

Yes.

> Nowadays unibyte mode is not recommended

I agree, but I was talking about the ELisp manual; Lisp programmers do
need to know about unibyte buffers and strings, even if users are
discouraged from using the unibyte mode.




* Re: Unibyte characters
From: Richard M. Stallman @ 2008-10-31 19:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, handa

	   In unibyte representation, each character occupies one byte and
	therefore the possible character codes range from 0 to 255.  Codes 0
	through 127 are ASCII characters; the codes from 128 through 255 are
	used for one non-ASCII character set [...]

    But I think this is inaccurate and even misleading.  For starters,
    unibyte buffers and strings can contain DBCS characters and UTF-8
    encoded text, where a character certainly does not ``occupy one
    byte''.

As far as Emacs is concerned, that UTF-8 sequence is multiple
characters and each of those characters is one byte.  The fact that
one might interpret that byte sequence some other way in another
context is not a part of the Emacs text representation.

So the text is correct.  But it could be useful to add something
to explain how this unibyte text relates to other interpretations
of the same byte sequence.
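
One way to picture that relationship, sketched in Python as an analogy
rather than as Emacs behavior: the stored bytes stay the same, and only
the decoding step chosen by the reader changes what text they stand
for. The two-byte sample and the variable names are assumptions made
for illustration.

```python
# The same two stored bytes, read under two different interpretations.
raw = bytes([0xC3, 0xA9])

as_latin1 = raw.decode("latin-1")  # one character per byte
as_utf8 = raw.decode("utf-8")      # both bytes form a single character

print(len(raw))        # 2 bytes are stored in either case
print(len(as_latin1))  # 2 characters under the byte-per-character reading
print(len(as_utf8))    # 1 character under the UTF-8 reading
```

The byte-per-character reading corresponds to the view described for
unibyte text; the UTF-8 reading is the "other interpretation" of the
same sequence.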

* Re: Unibyte characters
From: Stefan Monnier @ 2008-10-31 21:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, emacs-devel, miles

>> IIUC, this part of the manual dates back to the introduction of Mule,
>> when many people were using Emacs in unibyte mode.
> Yes.
>> Nowadays unibyte mode is not recommended
> I agree, but I was talking about the ELisp manual; Lisp programmers do
> need to know about unibyte buffers and strings, even if users are
> discouraged from using the unibyte mode.

Right, but for Elisp, we should make it even more clear that unibyte
buffers contain only bytes and not chars and that those buffers should
basically never be displayed, other than for debugging ;-)


        Stefan


PS: Of course, editing binary files is also useful sometimes.




* Re: Unibyte characters
From: Stephen J. Turnbull @ 2008-11-01 10:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: miles, Eli Zaretskii, emacs-devel, handa

Stefan Monnier writes:

 > Right, but for Elisp, we should make it even more clear that unibyte
 > buffers contain only bytes and not chars and that those buffers should
 > basically never be displayed, other than for debugging ;-)

There's no reason not to display when the coding system is iso-8859-1.





* Re: Unibyte characters
From: Stefan Monnier @ 2008-11-02  1:59 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: miles, Eli Zaretskii, emacs-devel, handa

>> Right, but for Elisp, we should make it even more clear that unibyte
>> buffers contain only bytes and not chars and that those buffers should
>> basically never be displayed, other than for debugging ;-)

> There's no reason not to display when the coding system is iso-8859-1.

I have no idea what you're talking about.


        Stefan




Thread overview: 10+ messages
2008-10-31 11:05 Unibyte characters Eli Zaretskii
2008-10-31 11:18 ` Miles Bader
2008-10-31 11:27   ` Eli Zaretskii
2008-10-31 14:41     ` Stefan Monnier
2008-10-31 15:02       ` Juanma Barranquero
2008-10-31 18:44       ` Eli Zaretskii
2008-10-31 21:15         ` Stefan Monnier
2008-11-01 10:47           ` Stephen J. Turnbull
2008-11-02  1:59             ` Stefan Monnier
2008-10-31 19:30 ` Richard M. Stallman
