Buffer-local variables affect general-purpose functions

unofficial mirror of emacs-devel@gnu.org 

* Buffer-local variables affect general-purpose functions
@ 2014-03-26 19:04 Eli Zaretskii
  2014-03-26 19:32 ` Paul Eggert
  2014-03-27 14:17 ` Stefan Monnier
  0 siblings, 2 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-26 19:04 UTC (permalink / raw)
  To: emacs-devel

(See bug#17011 for some context.)

In some cases, Emacs uses buffer-local variables in ways that affect
operations which might not have anything to do with buffer text.  One
example, from bug #17011, is this:

  M-x find-file-literally RET some-file RET
  M-x set-variable RET case-fold-search RET t RET
  M-: (char-equal ?à ?À) RET

This produces nil, although the characters should compare equal under
case-fold-search.  Why?  Because we are in a unibyte buffer, where
values between 128 and 255 are interpreted as eight-bit raw bytes, not
as Latin characters, and raw bytes don't have lower/upper-case pairs.

Another example, from the same sequence of commands above, is the fact
that setting case-fold-search for the buffer affects comparison of
characters that don't belong to the buffer, merely because that buffer
happens to be current at the moment of comparison.
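
A minimal sketch of this effect, using ASCII characters so that the
unibyte question does not enter into it.  Neither character comes from
any buffer, yet the result follows whatever value of case-fold-search
is in effect at the time of the call.  The function name my-fold-demo
is made up for illustration.

```elisp
(defun my-fold-demo ()
  "Return the results of comparing ?a and ?A with folding on, then off."
  (list
   ;; With case folding in effect, the two characters compare equal.
   (let ((case-fold-search t))
     (char-equal ?a ?A))
   ;; With folding off, the very same call returns nil.
   (let ((case-fold-search nil))
     (char-equal ?a ?A))))

(my-fold-demo) ; => (t nil)
```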

Yet another example is the 'downcase' and 'upcase' functions -- they use
case tables local to the current buffer, even when they are applied to
characters and strings not from the buffer.

This could produce subtle bugs, and is certainly confusing and
unexpected, at least by some.

The question is: do we want to do something about that?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Buffer-local variables affect general-purpose functions
  2014-03-26 19:04 Buffer-local variables affect general-purpose functions Eli Zaretskii
@ 2014-03-26 19:32 ` Paul Eggert
  2014-03-26 20:03   ` Eli Zaretskii
  2014-03-27 14:17 ` Stefan Monnier
  1 sibling, 1 reply; 103+ messages in thread
From: Paul Eggert @ 2014-03-26 19:32 UTC (permalink / raw)
  To: Eli Zaretskii, emacs-devel

Eli Zaretskii wrote:
> do we want to do something about that?

Yes, and we should start by removing the backwards-compatibility hacks 
in question.  Whether the current buffer is unibyte should not affect 
the behavior of general-purpose functions on characters.

Elisp code that blindly extracts bytes from unibyte buffers or strings, 
and treats these bytes as characters, is broken anyway.  It needs to be 
fixed to convert bytes to characters (using 'unibyte-char-to-multibyte', 
say) before it gives them to general-purpose character functions like 
'downcase' and 'char-equal'.
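
As a rough sketch of that conversion step (assuming Emacs 23 or later):
'unibyte-char-to-multibyte' leaves ASCII bytes alone and maps bytes
128-255 to Emacs's internal raw-byte characters, which are distinct
from the Latin-1 characters with the same numeric value.  The helper
name my-byte-to-char is hypothetical.

```elisp
(defun my-byte-to-char (byte)
  "Convert BYTE (0..255), as extracted from unibyte text, to a character."
  (unibyte-char-to-multibyte byte))

;; ASCII bytes map to themselves.
(= (my-byte-to-char ?a) ?a)                        ; => t
;; Byte 224 becomes a raw-byte character, not the code point 224.
(= (my-byte-to-char 224) 224)                      ; => nil
;; The conversion is reversible.
(multibyte-char-to-unibyte (my-byte-to-char 224))  ; => 224
```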

Years ago, when these backwards-compatibility hacks were put in, it made 
sense to have them, because unibyte non-ASCII locales were widespread 
and converting code to multibyte was a hassle.  But nowadays the vast 
majority of non-ASCII usage is multibyte and these hacks cause more 
trouble than they're worth -- not just core dumps such as Bug#17011, but 
subtle behavioral problems not easily diagnosed.  It's time for the 
hacks to go.


* Re: Buffer-local variables affect general-purpose functions
  2014-03-26 19:32 ` Paul Eggert
@ 2014-03-26 20:03   ` Eli Zaretskii
  2014-03-26 21:50     ` Paul Eggert
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-26 20:03 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Wed, 26 Mar 2014 12:32:05 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> 
> Eli Zaretskii wrote:
> > do we want to do something about that?
> 
> Yes, and we should start by removing the backwards-compatibility hacks 
> in question.  Whether the current buffer is unibyte should not affect 
> the behavior of general-purpose functions on characters.

Well, the change in behavior is not limited to unibyte buffers, as I
said in my OP.  I think the problem is wider.

> Elisp code that blindly extracts bytes from unibyte buffers or strings, 
> and treats these bytes as characters, is broken anyway.  It needs to be 
> fixed to convert bytes to characters (using 'unibyte-char-to-multibyte', 
> say) before it gives them to general-purpose character functions like 
> 'downcase' and 'char-equal'.

But there should still be a way to compare bytes and strings of bytes
in a unibyte buffer, right?  So perhaps we should have special
functions just for that purpose, and char-equal should signal an error
when presented with unibyte non-ASCII values.




* Re: Buffer-local variables affect general-purpose functions
  2014-03-26 20:03   ` Eli Zaretskii
@ 2014-03-26 21:50     ` Paul Eggert
  2014-03-27 17:42       ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Paul Eggert @ 2014-03-26 21:50 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii wrote:
> I think the problem is wider.

Yes, it is.

> But there should still be a way to compare bytes and strings of bytes
> in a unibyte buffer, right?

Byte-strings vs character-strings shouldn't be a problem, as the string 
itself tells you whether it's multibyte.  The problem is bytes vs 
characters, as both are modeled as small integers.

> So perhaps we should have special
> functions just for that purpose, and char-equal should signal an error
> when presented with unibyte non-ASCII values.

Sorry, I don't follow.  How could char-equal know whether 224 is a raw 
byte or the Latin-1 character 'à'?  It'd have to know that, to signal an 
error in the former case.
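
The ambiguity can be seen directly: a character literal is just an
integer, so nothing in the value 224 records where it came from.  A
small illustration (?\xe0 is read syntax for the character with code
#xE0, which is 'à'):

```elisp
;; The character 'à' and the integer 224 are the same Lisp object.
(eq ?\xe0 224)        ; => t
;; Any such small integer passes the character predicate...
(characterp 224)      ; => t
;; ...and is equally a plain integer.
(integerp ?\xe0)      ; => t
```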


* Re: Buffer-local variables affect general-purpose functions
  2014-03-26 19:04 Buffer-local variables affect general-purpose functions Eli Zaretskii
  2014-03-26 19:32 ` Paul Eggert
@ 2014-03-27 14:17 ` Stefan Monnier
  2014-03-27 17:17   ` Eli Zaretskii
  1 sibling, 1 reply; 103+ messages in thread
From: Stefan Monnier @ 2014-03-27 14:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>   M-x find-file-literally RET some-file RET
>   M-x set-variable RET case-fold-search RET t RET
>   M-: (char-equal ?à ?À) RET

> This produces nil, although the characters should compare equal under
> case-fold-search.  Why?  Because we are in a unibyte buffer, where
> values between 128 and 255 are interpreted as eight-bit raw bytes, not
> as Latin characters, and raw bytes don't have lower/upper-case pairs.

I agree with Paul on this one: this should be fixed to disregard
unibyte setting.  `char-equal' compares chars, not bytes (use `eq'
for bytes).
It's an old backward compatibility hack that should go.

> Another example, from the same sequence of commands above, is the fact
> that setting case-fold-search for the buffer affects comparison of
> characters that don't belong to the buffer, merely because that buffer
> happens to be current at the moment of comparison.

IIUC this is the kind of problem you really want to talk about in this
thread, and yes, it's a problem.  Usually case-fold-search is let-bound
rather than set buffer-locally, but we have similar problems with
syntax-tables, case-tables, etc...

> The question is: do we want to do something about that?

Not sure.  It's hard to find all occurrences of this problem.
And I don't think we can find a "general" solution: each case might be
best solved in a different way.  Furthermore the right solution will
sometimes (often?) be to throw away the current functionality and
replace it with something different.

But we can definitely try to solve it on a case-by-case basis.

> Yet another example is the 'downcase' and 'upcase' functions -- they use
> case tables local to the current buffer, even when they are applied to
> characters and strings not from the buffer.

The solution here is simple: throw away buffer-local case-tables.
AFAICT, set-case-table is used at only one place: in with-case-table.

   % grep set-case-table **/*.el    
   emacs-lisp/cl-lib.el:;; (gv-define-simple-setter current-case-table set-case-table)
   subr.el:      (progn (set-case-table ,table)
   subr.el:      (set-case-table ,old-case-table))))))

So the only use of set-case-table is in with-case-table.

   % grep with-case-table **/*.el
   emacs-lisp/lisp-mode.el:                       "eval-and-compile" "eval-when-compile" "with-case-table"
   leim/quail/sisheng.el:  (with-case-table (standard-case-table)
   mail/smtpmail.el:                   (with-case-table ascii-case-table ;Why?
   subr.el:(defmacro with-case-table (table &rest body)

And the only uses of with-case-table are in lisp/leim/quail/sisheng.el
(where it sets the standard case table, so it should have no effect) and
in lisp/mail/smtpmail.el (where it uses ascii-case-table but should only
apply it to ASCII text, so it could just as well use the standard case
table).
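
For reference, a small sketch of what with-case-table does: it
installs the given table as the current buffer's case table for the
duration of the body and restores the previous one afterwards.  The
function name is invented for the example.

```elisp
(defun my-upcase-ascii-only (string)
  "Upcase STRING using `ascii-case-table', then restore the old case table."
  (with-case-table ascii-case-table
    (upcase string)))

(my-upcase-ascii-only "ehlo")   ; => "EHLO"
```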

And then we can use the Unicode 'case tables' as recently discussed.
Patch for that welcome on trunk.

> This could produce subtle bugs, and is certainly confusing and
> unexpected, at least by some.

Agreed.

        Stefan


* Re: Buffer-local variables affect general-purpose functions
  2014-03-27 14:17 ` Stefan Monnier
@ 2014-03-27 17:17   ` Eli Zaretskii
  2014-03-27 21:04     ` Stefan Monnier
  2014-03-28  3:38     ` Stephen J. Turnbull
  0 siblings, 2 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-27 17:17 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@IRO.UMontreal.CA>
> Cc: emacs-devel@gnu.org
> Date: Thu, 27 Mar 2014 10:17:23 -0400
> 
> >   M-x find-file-literally RET some-file RET
> >   M-x set-variable RET case-fold-search RET t RET
> >   M-: (char-equal ?à ?À) RET
> 
> > This produces nil, although the characters should compare equal under
> > case-fold-search.  Why?  Because we are in a unibyte buffer, where
> > values between 128 and 255 are interpreted as eight-bit raw bytes, not
> > as Latin characters, and raw bytes don't have lower/upper-case pairs.
> 
> I agree with Paul on this one: this should be fixed to disregard
> unibyte setting.  `char-equal' compares chars, not bytes (use `eq'
> for bytes).
> It's an old backward compatibility hack that should go.

Paul seemed to say something more broad: that _all_ behaviors specific
to unibyte buffers should go away.  Do you agree?

Anyway, what should replace those hacks?  Arbitrarily interpreting raw
bytes as Latin characters is not TRT, IMO.

Actually, in the above case, we could simply make char-equal disregard
case-fold-search in unibyte buffers -- that would give you and Paul
what you want, but also keep backward compatibility (except for ASCII
characters).

> > The question is: do we want to do something about that?
> 
> Not sure.  It's hard to find all occurrences of this problem.
> And I don't think we can find a "general" solution: each case might be
> best solved in a different way.  Furthermore the right solution will
> sometimes (often?) be to throw away the current functionality and
> replace it with something different.

Maybe so, but something like

  (with-buffer-defaults BODY)

might be the solution, and should be easy enough to implement.  Or
maybe some other way of telling primitives: don't apply
buffer-specific behavior to this code.
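
One way such a macro might look, purely as a sketch: evaluate BODY in
a fresh temporary buffer, so that none of the original buffer's local
settings are current, and additionally pin the case table, syntax
table and case-fold-search to their standard or default values.  The
name my-with-buffer-defaults and this particular choice of settings
are assumptions for illustration; nothing like it exists in Emacs, and
BODY loses direct access to the original buffer's text.

```elisp
(defmacro my-with-buffer-defaults (&rest body)
  "Evaluate BODY with buffer-specific settings replaced by defaults.
Hypothetical sketch: BODY runs in a temporary buffer, not the caller's."
  (declare (indent 0) (debug t))
  `(with-temp-buffer
     ;; A new buffer already starts from the defaults; the explicit
     ;; bindings below just make the intent visible.
     (let ((case-fold-search (default-value 'case-fold-search)))
       (with-case-table (standard-case-table)
         (with-syntax-table (standard-syntax-table)
           ,@body)))))

;; Even if the calling buffer sets case-fold-search to nil locally,
;; the comparison inside the macro sees the global default, so this
;; returns t when that default is t (as it is out of the box).
(with-temp-buffer
  (setq case-fold-search nil)
  (my-with-buffer-defaults
    (char-equal ?a ?A)))
```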

>    % grep with-case-table **/*.el
>    emacs-lisp/lisp-mode.el:                       "eval-and-compile" "eval-when-compile" "with-case-table"
>    leim/quail/sisheng.el:  (with-case-table (standard-case-table)
>    mail/smtpmail.el:                   (with-case-table ascii-case-table ;Why?
>    subr.el:(defmacro with-case-table (table &rest body)
> 
> And the only uses of with-case-table are in lisp/leim/quail/sisheng.el
> (where it sets the standard case table, so it should have no effect) and
> in lisp/mail/smtpmail.el (where it uses ascii-case-table but should only
> apply it to ASCII text, so it could just as well use the standard case
> table).
> 
> And then we can use the Unicode 'case tables' as recently discussed.
> Patch for that welcome on trunk.

OK.





* Re: Buffer-local variables affect general-purpose functions
  2014-03-26 21:50     ` Paul Eggert
@ 2014-03-27 17:42       ` Eli Zaretskii
  2014-03-27 18:55         ` Paul Eggert
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-27 17:42 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Wed, 26 Mar 2014 14:50:52 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> How could char-equal know whether 224 is a raw byte or the Latin-1
> character 'à'?

The same way it "knows" today.





* Re: Buffer-local variables affect general-purpose functions
  2014-03-27 17:42       ` Eli Zaretskii
@ 2014-03-27 18:55         ` Paul Eggert
  0 siblings, 0 replies; 103+ messages in thread
From: Paul Eggert @ 2014-03-27 18:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On 03/27/2014 10:42 AM, Eli Zaretskii wrote:
>> How could char-equal know whether 224 is a raw byte or the Latin-1
>> character 'à'?
> The same way it "knows" today.

So (char-equal ?x ?à) would signal an error in a unibyte buffer (because
?à < 256), and (char-equal ?x ?α) would return nil (because 255 < ?α)?
That doesn't sound right, but most likely I'm misunderstanding the proposal.




* Re: Buffer-local variables affect general-purpose functions
  2014-03-27 17:17   ` Eli Zaretskii
@ 2014-03-27 21:04     ` Stefan Monnier
  2014-03-28  7:11       ` Eli Zaretskii
  2014-03-28  3:38     ` Stephen J. Turnbull
  1 sibling, 1 reply; 103+ messages in thread
From: Stefan Monnier @ 2014-03-27 21:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> Paul seemed to say something more broad: that _all_ behaviors specific
> to unibyte buffers should go away.  Do you agree?

Too broad to answer.  I think this needs to be decided on a case-by-case basis.

> Anyway, what should replace those hacks?  Arbitrarily interpreting raw
> bytes as Latin characters is not TRT, IMO.

I think it is: char-equal compares *chars*, not *bytes*.  IOW it's a bug
to pass bytes to it.

> Maybe so, but something like
>   (with-buffer-defaults BODY)
> might be the solution, and should be easy enough to implement.
> Or maybe some other way of telling primitives: don't apply
> buffer-specific behavior to this code.

That might be a valid option, but in any case it's incompatible and the
incompatibility will have different consequences for different uses, so
we're back to "case-by-case basis".


        Stefan




* Re: Buffer-local variables affect general-purpose functions
  2014-03-27 17:17   ` Eli Zaretskii
  2014-03-27 21:04     ` Stefan Monnier
@ 2014-03-28  3:38     ` Stephen J. Turnbull
  2014-03-28  8:51       ` Unibyte characters, strings, and buffers Eli Zaretskii
  1 sibling, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-28  3:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel

Eli Zaretskii writes:

 > Paul seemed to say something more broad: that _all_ behaviors specific
 > to unibyte buffers should go away.  Do you agree?

Yes, please.  XEmacs has never had the unibyte hack with Mule, and
never has had much trouble with that.  It also has never had an
instance of the \201 bug since Mule was declared stable -- where Emacs
has had *many* regressions.  It's arguable that there are performance
implications, but simply aliasing the binary codec to latin1-unix has
*never* caused a bug in handling binary files -- all bugs are due to
autodetection errors, not the buffer representation.  I don't recall a
case where a programmer "did something stupid" with a character
function that technically is inappropriate for true binary (eg,
upcase) -- invariably they were doing something like upcasing all the
HTML tags as they came off the wire.  Ie, the stream was a binary
protocol where all of the syntax was represented with ASCII bytes, and
therefore "readable words".

If the performance implications bother you, then a buffer
representation like http://www.python.org/dev/peps/pep-0393/ may be
useful.  You could do that halfway, as well (ie, buffers containing
pure Latin1 text or binary text would be represented as a flat buffer
of bytes, buffers containing scalars >= 256 would be represented as
UTF-8b, or whatever the hack for representing undecodable bytes
currently is).

 > Anyway, what should replace those hacks?  Arbitrarily interpreting raw
 > bytes as Latin characters is not TRT, IMO.

Python has a bytes/character distinction, but they have completely
separate implementations.  Emacs doesn't need that, unless you want to
compete with the P-languages as a web framework platform.  OTOH Emacs'
unibyte buffer toggle is a design bug, pure and simple, and it should
be backed up against a wall and immersed in insecticide.

If you stick to the interpretation that bytes contain non-negative
integers less than 256, you won't have a problem in practice if you
think of them as the first 256 Unicode characters, but choose not to use
functions that make sense only with characters.  Python actually
implements many polymorphic functions (ie, they can be interpreted as
bytes->bytes or characters->characters, etc) by converting bytes to
characters as Latin-1, then using the character implementation of the
function.
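
The Python approach described above can be imitated in Emacs Lisp, as
a rough sketch: decode the bytes as ISO-8859-1 so that each byte
becomes the character with the same code, apply the character
function, and encode back.  The function name
my-upcase-bytes-as-latin1 is hypothetical, and the result for bytes
128-255 assumes the standard case table is in effect.

```elisp
(defun my-upcase-bytes-as-latin1 (bytes)
  "Upcase the unibyte string BYTES, treating each byte as Latin-1."
  (let ((text (decode-coding-string bytes 'iso-8859-1)))
    (encode-coding-string (upcase text) 'iso-8859-1)))

;; ASCII bytes behave as expected.
(my-upcase-bytes-as-latin1 "get")                 ; => "GET"
;; Byte #xE0 ('à' in Latin-1) should come back as #xC0 ('À').
(my-upcase-bytes-as-latin1 (unibyte-string #xe0))
```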


* Re: Buffer-local variables affect general-purpose functions
  2014-03-27 21:04     ` Stefan Monnier
@ 2014-03-28  7:11       ` Eli Zaretskii
  2014-03-28  7:46         ` Paul Eggert
  2014-03-28 14:12         ` Buffer-local variables affect general-purpose functions Stefan Monnier
  0 siblings, 2 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-28  7:11 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: emacs-devel@gnu.org
> Date: Thu, 27 Mar 2014 17:04:45 -0400
> 
> > Anyway, what should replace those hacks?  Arbitrarily interpreting raw
> > bytes as Latin characters is not TRT, IMO.
> 
> I think it is: char-equal compares *chars*, not *bytes*.  IOW it's a bug
> to pass bytes to it.

How to compare bytes, then?

Anyway, we don't have a way of distinguishing between characters and
bytes, unless we look at something besides the arguments themselves.




* Re: Buffer-local variables affect general-purpose functions
  2014-03-28  7:11       ` Eli Zaretskii
@ 2014-03-28  7:46         ` Paul Eggert
  2014-03-28  8:18           ` Unibyte characters, strings and buffers Eli Zaretskii
  2014-03-28 14:12         ` Buffer-local variables affect general-purpose functions Stefan Monnier
  1 sibling, 1 reply; 103+ messages in thread
From: Paul Eggert @ 2014-03-28  7:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii wrote:
> How to compare bytes, then?

It depends on what kind of comparison one wants.  Simplest is to use 
'='.  To ignore case and treat bytes 128-255 as Latin-1 characters, use 
'downcase' first.  To ignore case and treat bytes 128-255 as 
uninterpreted bit patterns, use 'unibyte-char-to-multibyte' before 
downcasing.  Etc.
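
The three options above might be written as follows.  This is only a
sketch; the function names are invented, and the results for bytes
128-255 depend on the case table in effect at the time of the call.

```elisp
(defun my-byte-equal (b1 b2)
  "Compare bytes B1 and B2 exactly."
  (= b1 b2))

(defun my-byte-equal-fold-latin1 (b1 b2)
  "Compare B1 and B2 ignoring case, treating 128-255 as Latin-1."
  (= (downcase b1) (downcase b2)))

(defun my-byte-equal-fold-raw (b1 b2)
  "Compare B1 and B2 ignoring case, treating 128-255 as raw bytes."
  (= (downcase (unibyte-char-to-multibyte b1))
     (downcase (unibyte-char-to-multibyte b2))))

(my-byte-equal 65 97)             ; => nil
(my-byte-equal-fold-latin1 65 97) ; => t  (?A and ?a)
(my-byte-equal-fold-raw 65 97)    ; => t
```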

> we don't have a way of distinguishing between characters and
> bytes, unless we look at something besides the arguments themselves.

Yes, that's right.





* Re: Unibyte characters, strings and buffers
  2014-03-28  7:46         ` Paul Eggert
@ 2014-03-28  8:18           ` Eli Zaretskii
  2014-03-28 18:42             ` Paul Eggert
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-28  8:18 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

(I retitled the subject, because the unibyte issue is sufficiently
different from what I originally raised.)

> Date: Fri, 28 Mar 2014 00:46:01 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > How to compare bytes, then?
> 
> It depends on what kind of comparison one wants.  Simplest is to use 
> '='.  To ignore case and treat bytes 128-255 as Latin-1 characters, use 
> 'downcase' first.  To ignore case and treat bytes 128-255 as 
> uninterpreted bit patterns, use 'unibyte-char-to-multibyte' before 
> downcasing.  Etc.
> 
> > we don't have a way of distinguishing between characters and
> > bytes, unless we look at something besides the arguments themselves.
> 
> Yes, that's right.

Which is why your suggestions above will not necessarily DTRT.
Arbitrary interpretation of bytes 128-255 as Latin-1 is not guaranteed
to be correct, and therefore 'downcase' will sometimes produce
unexpected results, unless we can make sure, somehow, that raw bytes
will never be exposed to Lisp as having these values.  Unless you show
a practical way towards the latter goal, what you suggest will just
replace one set of subtly buggy behaviors with another (in which case
I vote for what we already have, because that one is at least well
known and passed some test of time).


* Re: Unibyte characters, strings, and buffers
  2014-03-28  3:38     ` Stephen J. Turnbull
@ 2014-03-28  8:51       ` Eli Zaretskii
  2014-03-28 10:28         ` Stephen J. Turnbull
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-28  8:51 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Fri, 28 Mar 2014 12:38:10 +0900
> Cc: Stefan Monnier <monnier@IRO.UMontreal.CA>, emacs-devel@gnu.org
> 
> Eli Zaretskii writes:
> 
>  > Paul seemed to say something more broad: that _all_ behaviors specific
>  > to unibyte buffers should go away.  Do you agree?
> 
> Yes, please.  XEmacs has never had the unibyte hack with Mule, and
> never has had much trouble with that.  It also has never had an
> instance of the \201 bug since Mule was declared stable -- where Emacs
> has had *many* regressions.

Let's not talk about Emacs 20 vintage problems, that is not useful.
Likewise examples from XEmacs, since the differences in this area
between Emacs and XEmacs are substantial, and that precludes useful
comparison.

> It's arguable that there are performance implications, but simply
> aliasing the binary codec to latin1-unix has *never* caused a bug in
> handling binary files -- all bugs are due to autodetection errors,
> not the buffer representation.

Forget about performance, there are real problems unrelated to that
which need to be solved, and I don't see how you can avoid them by
treating raw bytes as Latin-1 characters.  Let me explain.

First, we must have a way to have buffer "text" that represents a
stream of bytes, not some human-readable text.  (Just as a random
example, a buffer visiting an mbox file, from which you decode
portions into another buffer for display.)  Agreed?

In such unibyte buffers, we need a way to represent raw bytes, which
are parts of as yet un-decoded byte sequences that represent encoded
characters.  We cannot represent each such byte as a Latin-1
character, because Latin-1 characters are stored inside Emacs as
2-byte sequences of their UTF-8 encoding.  If you interpret bytes as
Latin-1 characters, functions like string-bytes will return wrong
results for those raw bytes.  Agreed?
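
The size difference is easy to check.  In this sketch, 'string' builds
a multibyte string holding the character 'à' (code #xE0), while
'unibyte-string' builds a string holding the single byte #xE0.  The
function name my-string-sizes is made up.

```elisp
(defun my-string-sizes ()
  "Compare the storage of character #xE0 with that of byte #xE0."
  (let ((as-char (string #xe0))          ; the Latin-1 character 'à'
        (as-byte (unibyte-string #xe0))) ; the raw byte #xE0
    (list (multibyte-string-p as-char)   ; t
          (string-bytes as-char)         ; 2: stored as UTF-8 internally
          (multibyte-string-p as-byte)   ; nil
          (string-bytes as-byte))))      ; 1

(my-string-sizes) ; => (t 2 nil 1)
```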

So here you have already at least 2 valid reasons why Emacs must be
able to support raw bytes that are distinguishable from Latin-1
characters that have the same byte values, and why we must have
buffers that hold such raw bytes.  If we want to get rid of unibyte,
Someone(TM) should present a complete practical solution to those two
problems (and a few others), otherwise, this whole discussion leads
nowhere.  ("Practical" means that suggestions to introduce a character
data type are out of scope, or at least belong to an entirely
different discussion.)

> OTOH Emacs' unibyte buffer toggle is a design bug, pure and simple,
> and it should be backed up against a wall and immersed in
> insecticide.

I might even agree with you about the toggle.  But eliminating the
toggle doesn't solve the bigger issue, see above.

> If you stick to the interpretation that bytes contain non-negative
> integers less than 256, you won't have a problem in practice if you
> think of them as the first 256 Unicode characters, but choose not to use
> functions that make sense only with characters.

What do you mean by "choose"?  Lisp code is used by many programmers
out there; sometimes, they aren't even aware if the buffer they work
on is unibyte, or what that means.  Even when they are aware, they
just want Emacs to DTRT, for their own value of "RT".  Unless each one
of those programmers "chooses" not to use the problematic functions,
we are back at square one.

And what does "choose not to use" mean, anyway?  How do you choose not
to use 'insert', for example?  What do you use instead?

The issue at hand is how do you pull the trick, in practice, of doing
TRT with the legitimate use cases where Emacs needs to manipulate raw
bytes.

> Python actually implements many polymorphic functions (ie, they can
> be interpreted as bytes->bytes or characters->characters, etc) by
> converting bytes to characters as Latin-1, then using the character
> implementation of the function.

As long as Emacs exposes the character values to Lisp programs as
simple integers, I don't think we can take this path.


* Re: Unibyte characters, strings, and buffers
  2014-03-28  8:51       ` Unibyte characters, strings, and buffers Eli Zaretskii
@ 2014-03-28 10:28         ` Stephen J. Turnbull
  2014-03-28 10:58           ` David Kastrup
                             ` (2 more replies)
  0 siblings, 3 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-28 10:28 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii writes:

 > Let's not talk about Emacs 20 vintage problems, 

If they were *only* Emacs 20 vintage, this thread wouldn't exist.

 > Likewise examples from XEmacs, since the differences in this area
 > between Emacs and XEmacs are substantial, and that precludes useful
 > comparison.

"It works fine" isn't useful information?  XEmacs has *two* reasons to
want to change its internal representation.  (1) A Unicode
representation, especially UTF-8, would allow all autosave files to be
readable by other programs.  (2) A PEP 393-like representation would
be way faster for big buffers and strings.  Bytes-character confusion
is just plain not an issue, not for anybody, not at all.

 > First, we must have a way to have buffer "text" that represents a
 > stream of bytes, not some human-readable text.  (Just as a random
 > example, a buffer visiting an mbox file, from which you decode
 > portions into another buffer for display.)  Agreed?

No, I disagree.  XEmacs/MULE has never had such a feature, yet we can
run all Emacs programs without changing the buffer representation
(modulo inability to represent all Unicode characters properly, but
the JIT charsets are plenty good enough in practice).

 > In such unibyte buffers, we need a way to represent raw bytes, which
 > are parts of as yet un-decoded byte sequences that represent encoded
 > characters.

Again, I disagree.  Unibyte is a design mistake, and unnecessary.
XEmacs proves it -- we use (essentially) the same code in many
applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
The variations for XEmacs and Emacs are due to extents vs. overlays
and such-like, not due to buffer representation.

For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as
no-ops forever, and as far as I can tell nobody's ever needed to worry
about it (of course, maybe the folks who use those are just more clued
than the poor user in my next paragraph).

I agree that having a way to represent "undecodable bytes" in a string
or buffer is extremely convenient.  XEmacs's lack of this capability
is surely a deficiency (Hi, David K!)  But this is a completely
different issue from unibyte buffers.  Emacs doesn't need unibyte
buffers to perform its work, and if they are desirable on the grounds
of space or time efficiency, they should be opaque to Lisp.

 > We cannot represent each such byte as a Latin-1 character, because
 > Latin-1 characters are stored inside Emacs as 2-byte sequences of
 > their UTF-8 encoding.  If you interpret bytes as Latin-1
 > characters, functions like string-bytes will return wrong results
 > for those raw bytes.  Agreed?

No, I still disagree.

`(defun string-bytes (&rest junk) (error))', and live happily ever
after.  You don't need `string-bytes' unless you've exposed internal
representation to Lisp, then you desperately need it to write correct
code (which some users won't be able to do anyway without help, cf. 
https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
*don't expose internal representation* (and the hammer marks on users'
foreheads will disappear in due time, and the headaches even faster!)

 > So here you have already at least 2 valid reasons

No, *you* have them.  XEmacs works perfectly well without them, using
code written for Emacs.

 > If we want to get rid of unibyte, Someone(TM) should present a
 > complete practical solution to those two problems (and a few
 > others), otherwise, this whole discussion leads nowhere.

Complete practical solution: "They are non-problems, forget about
them, and rewrite any code that implies you need to remember them."

Fortunately for me, I am *intimately* familiar with XEmacs internals,
and therefore RMS won't let me write this code for Emacs. :-)

 > > If you stick to the interpretation that bytes contain non-negative
 > > integers less than 256, you won't have a problem in practice if you
 > > think of them as the first 256 Unicode characters, but choose not to use
 > > functions that make sense only with characters.
 > 
 > What do you mean by "choose"?  Lisp code is used by many programmers
 > out there; sometimes, they aren't even aware if the buffer they work
 > on is unibyte, or what that means.

Which is precisely why we're having this thread.  If there were *no*
Lisp-visible unibyte buffers or strings, it couldn't possibly matter.

 > Even when they are aware, they just want Emacs to DTRT, for their
 > own value of "RT".

Too bad for them, as long as Emacs has unibyte buffers.  They have to
be aware, and write code correctly for the mode of the buffer.
Viz. the poor serial port programmer in comp.emacs.

In XEmacs, they don't have to; they just use an appropriate
network-coding-system, and it just works.  That may not be *obvious*
to a programmer coming from a different background (say, Python) who
expects there to be both byte streams and text streams, but since
there's no other way to do it, it's not hard to get it right.

 > And what does "choose not to use" mean, anyway?  How do you choose not
 > to use 'insert', for example?  What do you use instead?

Of course you use `insert'.  What I'm saying is that if you don't want
to trash a binary buffer where each byte is represented by an
ISO-8859-1 character in internal representation, you need to avoid
(1) coding-system-for-write other than 'binary (in XEmacs, aliased to
'iso-8859-1-unix), and (2) functions that mutate characters using
properties of characters that bytes don't have (eg, upcase).  That's
really all there is to it.

 > The issue at hand is how do you pull the trick, in practice, of
 > doing TRT with the legitimate use cases where Emacs needs to
 > manipulate raw bytes.

Follow the Nike advice: Just Do It.  Works fine, I assure you.  I can
understand that you're worried by this:

 > As long as Emacs exposes the character values to Lisp programs as
 > simple integers, I don't think we can take this path.

... but I'm not really sure why not.  I'll grant that after drinking
the Ben Wing Kool-Aid the idea of Emacsen without a character type
gives me hives, but that's because arbitrary integers, if decomposed
into byte-sized fields and inserted into a buffer, can become
non-characters and crash XEmacs.  But surely you have a function like
`char-int-p'[1] that is used (implicitly by `insert') to prevent
non-characters (in Emacs, 0xFFFF and surrogates would be examples, I
suppose) from being inserted in buffers.  Otherwise you'd have crashes
all over the place, I would imagine.  Since you don't, you must be
doing something to prevent arbitrary integers from getting inserted.

It seems to me that the only real issue, given that you have a way in
Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does)
is what to do if somebody reads in data as 'binary, then proceeds to
insert non-Latin-1 characters in the buffer.  I can think of three
possibilities: (1) don't allow it without changing the buffer's output
codec, (2) treat the existing characters as Latin-1, or (3) convert
all the existing "bytes" to undecodable bytes representation.

XEmacs implicitly does (2) ((3) can't be implemented at all, at
present).  I tend to prefer (1), but ISTR that would not have worked
very well with some programs, specifically readmail and VM (whose
author had a lot of influence on how XEmacs internals were designed),
because they narrowed the buffer and converted wire format (including
raw multibyte encodings) to displayed text in-place.

Footnotes: 
[1]  `char-int-p' is a built-in function (char-int-p OBJECT)
Documentation:
Return t if OBJECT is an integer that can be converted into a character.
See `char-int'.


* Re: Unibyte characters, strings, and buffers
  2014-03-28 10:28         ` Stephen J. Turnbull
@ 2014-03-28 10:58           ` David Kastrup
  2014-03-28 11:22             ` Andreas Schwab
  2014-03-28 11:42             ` Stephen J. Turnbull
  2014-03-28 17:29           ` Eli Zaretskii
  2014-03-28 18:45           ` Daniel Colascione
  2 siblings, 2 replies; 103+ messages in thread
From: David Kastrup @ 2014-03-28 10:58 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> I agree that having a way to represent "undecodable bytes" in a string
> or buffer is extremely convenient.  XEmacs's lack of this capability
> is surely a deficiency (Hi, David K!)

Doing this in an utf-8 based internal coding is somewhat doable by
employing non-utf-8 sequences.  Either using code points above the
Unicode code range (2^20 + something, requiring 4 bytes), or by using
non-minimal encodings (since the minimal ones are two bytes, requiring 3
bytes).  Either way, the size increases significantly.

> But this is a completely different issue from unibyte buffers.  Emacs
> doesn't need unibyte buffers to perform its work, and if they are
> desirable on the grounds of space or time efficiency, they should be
> opaque to Lisp.

Well, Emacs is more following the non-opaque philosophy (XEmacs, in
contrast, has even an opaque character type and several other ones).
That has the advantage that you can use all sorts of available tools as
long as they don't break.

It has the disadvantage that the question "what is the right behavior
for x?"  needs to be answered quite more often since you can't take the
"x does not apply to y anyway" route out as often.

>  > We cannot [...]
>
> No, I still disagree.

Sure, everything is actually "We cannot efficiently" rather than "We
cannot".  But we still changed buffer positions from byte counts (as in
early Emacs 20) to character counts.  Efficiency took a dive but the
alternatives were just too horrible API-wise.

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-03-28 10:58           ` David Kastrup
@ 2014-03-28 11:22             ` Andreas Schwab
  2014-03-28 11:34               ` David Kastrup
  2014-03-28 11:42             ` Stephen J. Turnbull
  1 sibling, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-28 11:22 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>> I agree that having a way to represent "undecodable bytes" in a string
>> or buffer is extremely convenient.  XEmacs's lack of this capability
>> is surely a deficiency (Hi, David K!)
>
> Doing this in an utf-8 based internal coding is somewhat doable by
> employing non-utf-8 sequences.  Either using code points above the
> Unicode code range (2^20 + something, requiring 4 bytes), or by using
> non-minimal encodings (since the minimal ones are two bytes, requiring 3
> bytes).  Either way, the size increases significantly.

Emacs uses U3fff80-U3fffff for raw 8-bit bytes, internally represented
by 2 bytes.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-28 11:22             ` Andreas Schwab
@ 2014-03-28 11:34               ` David Kastrup
  0 siblings, 0 replies; 103+ messages in thread
From: David Kastrup @ 2014-03-28 11:34 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>>
>>> I agree that having a way to represent "undecodable bytes" in a string
>>> or buffer is extremely convenient.  XEmacs's lack of this capability
>>> is surely a deficiency (Hi, David K!)
>>
>> Doing this in an utf-8 based internal coding is somewhat doable by
>> employing non-utf-8 sequences.  Either using code points above the
>> Unicode code range (2^20 + something, requiring 4 bytes), or by using
>> non-minimal encodings (since the minimal ones are two bytes, requiring 3
>> bytes).  Either way, the size increases significantly.
>
> Emacs uses U3fff80-U3fffff for raw 8-bit bytes, internally represented
> by 2 bytes.

Well, I forgot the non-minimal encodings for 0x00-0x7f, namely two-byte
sequences starting with 0xc0 or 0xc1 and ending with 0x80-0xbf.

Those would still fit the representation invariants.  Are those the
two-byte encodings used for "raw 0x80 to 0xff"?
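
For illustration only, here is a minimal Python sketch of the kind of two-byte, non-minimal form described above. The layout is an assumption inferred from this exchange, not quoted from the Emacs sources: a raw byte 0x80..0xBF is taken to map to C0 80..BF, and 0xC0..0xFF to C1 80..BF. All function names are hypothetical.

```python
def raw_byte_to_overlong(b: int) -> bytes:
    """Encode a raw byte 0x80..0xFF as a two-byte, non-minimal sequence.

    Assumed layout (hypothetical): the lead byte 0xC0 or 0xC1 carries bit 6
    of the raw byte, and the trail byte 0x80..0xBF carries the low six bits.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("expected a byte in the range 0x80..0xFF")
    lead = 0xC0 | ((b >> 6) & 0x01)
    trail = 0x80 | (b & 0x3F)
    return bytes([lead, trail])


def overlong_to_raw_byte(seq: bytes) -> int:
    """Inverse of raw_byte_to_overlong."""
    lead, trail = seq
    if lead not in (0xC0, 0xC1) or not 0x80 <= trail <= 0xBF:
        raise ValueError("not a two-byte raw-byte sequence")
    return 0x80 | ((lead & 0x01) << 6) | (trail & 0x3F)


def is_strict_utf8(seq: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Under this assumed layout every raw byte round-trips, and none of the two-byte forms is accepted by a strict UTF-8 decoder, so they cannot collide with the encoding of a real character.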

-- 
David Kastrup





* Re: Unibyte characters, strings, and buffers
  2014-03-28 10:58           ` David Kastrup
  2014-03-28 11:22             ` Andreas Schwab
@ 2014-03-28 11:42             ` Stephen J. Turnbull
  1 sibling, 0 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-28 11:42 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > But this is a completely different issue from unibyte buffers.  Emacs
 > > doesn't need unibyte buffers to perform its work, and if they are
 > > desirable on the grounds of space or time efficiency, they should be
 > > opaque to Lisp.
 > 
 > Well, Emacs is more following the non-opaque philosophy (XEmacs, in
 > contrast, has even an opaque character type and several other
 > ones).

Those are irrelevant to my point, though.

The problem here is that unibyte buffers are a second representation
of a single type (the buffer).  "Mr. Foot, meet Mr. Bullet, I'm sure
you'll get along fine!"

 > That has the advantage that you can use all sorts of available tools as
 > long as they don't break.

In this case, it's like being offered the hammer head and the handle
separately.  I'll say one thing for that approach, though -- now you
have *two* excellent ways to give yourself a headache, with two
different (musical?) sounds when you drum on your crown!

 > It has the disadvantage that the question "what is the right behavior
 > for x?"  needs to be answered quite more often since you can't take the
 > "x does not apply to y anyway" route out as often.

The right behavior here is for a unibyte buffer to do *exactly* the
same thing that a multibyte buffer would.  In which case you have a
single (opaque) type, as far as users can tell.

 > Efficiency took a dive but the alternatives were just too horrible
 > API-wise.

Unibyte buffer is just too horrible API-wise.  My advice is: nuke it.

Steve


* Re: Buffer-local variables affect general-purpose functions
  2014-03-28  7:11       ` Eli Zaretskii
  2014-03-28  7:46         ` Paul Eggert
@ 2014-03-28 14:12         ` Stefan Monnier
  1 sibling, 0 replies; 103+ messages in thread
From: Stefan Monnier @ 2014-03-28 14:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

> How to compare bytes, then?

As I said when I first mentioned the problem: `eq'.

> Anyway, we don't have a way of distinguishing between characters and
> bytes, unless we look on something besides the arguments themselves.

`char-equal' has something to distinguish: the fact that we call
`char-equal' instead of `eq' is just the info needed to decide that the
arguments are chars rather than bytes.


        Stefan




* Re: Unibyte characters, strings, and buffers
  2014-03-28 10:28         ` Stephen J. Turnbull
  2014-03-28 10:58           ` David Kastrup
@ 2014-03-28 17:29           ` Eli Zaretskii
  2014-03-28 17:50             ` David Kastrup
                               ` (2 more replies)
  2014-03-28 18:45           ` Daniel Colascione
  2 siblings, 3 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-28 17:29 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Fri, 28 Mar 2014 19:28:56 +0900
> 
> Eli Zaretskii writes:
> 
>  > Let's not talk about Emacs 20 vintage problems, 
> 
> If they were *only* Emacs 20 vintage, this thread wouldn't exist.

This thread is about different issues.

>  > Likewise examples from XEmacs, since the differences in this area
>  > between Emacs and XEmacs are substantial, and that precludes useful
>  > comparison.
> 
> "It works fine" isn't useful information?

No, because it describes a very different implementation.

>  > First, we must have a way to have buffer "text" that represents a
>  > stream of bytes, not some human-readable text.  (Just as a random
>  > example, a buffer visiting an mbox file, from which you decode
>  > portions into another buffer for display.)  Agreed?
> 
> No, I disagree.

Then I guess you will have to suggest how to implement this without
unibyte buffers.

>  > In such unibyte buffers, we need a way to represent raw bytes, which
>  > are parts of as yet un-decoded byte sequences that represent encoded
>  > characters.
> 
> Again, I disagree.  Unibyte is a design mistake, and unnecessary.

Then what do you call a buffer whose "text" is encoded?

> XEmacs proves it -- we use (essentially) the same code in many
> applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.

I asked you not to bring XEmacs into the discussion, because I cannot
talk intelligently about its implementation.  If you insist on doing
that, this discussion is futile from my POV.

> For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as
> no-ops forever

I wasn't talking about those functions.  I was talking about the need
to have unibyte buffers and strings.

> I agree that having a way to represent "undecodable bytes" in a string
> or buffer is extremely convenient.  XEmacs's lack of this capability
> is surely a deficiency (Hi, David K!)  But this is a completely
> different issue from unibyte buffers.

How is it different?  What would be the encoding of a buffer that
contains raw bytes?

>  > We cannot represent each such byte as a Latin-1 character, because
>  > Latin-1 characters are stored inside Emacs as 2-byte sequences of
>  > their UTF-8 encoding.  If you interpret bytes as Latin-1
>  > characters, functions like string-bytes will return wrong results
>  > for those raw bytes.  Agreed?
> 
> No, I still disagree.
> 
> `(defun string-bytes (&rest junk) (error))', and live happily ever
> after.

But that's ridiculous: a raw byte is just a single byte, so
string-bytes should return a meaningful value for a string of such
bytes.

> You don't need `string-bytes' unless you've exposed internal
> representation to Lisp, then you desperately need it to write correct
> code (which some users won't be able to do anyway without help, cf. 
> https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
> *don't expose internal representation* (and the hammer marks on users'
> foreheads will disappear in due time, and the headaches even faster!)

How else would you know how many bytes a string will take on disk?
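
The on-disk size question can be illustrated outside Emacs. The following is a minimal Python sketch, not Emacs Lisp, and the helper name `encoded_size` is hypothetical. It only shows that the byte count depends on the encoding chosen for writing, not on the character count alone.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes `text` occupies once encoded with `encoding`."""
    return len(text.encode(encoding))


# Ten characters; the first one is U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE).
sample = "\u00c0 la carte"

size_latin1 = encoded_size(sample, "latin-1")  # one byte per character
size_utf8 = encoded_size(sample, "utf-8")      # U+00C0 takes two bytes
```

Here the same ten characters occupy ten bytes in Latin-1 and eleven in UTF-8, which is why a character count cannot stand in for a byte count.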

>  > So here you have already at least 2 valid reasons
> 
> No, *you* have them.  XEmacs works perfectly well without them, using
> code written for Emacs.

XEmacs also works "perfectly well" without bidi and other stuff.  That
doesn't help at all in this discussion.

>  > If we want to get rid of unibyte, Someone(TM) should present a
>  > complete practical solution to those two problems (and a few
>  > others), otherwise, this whole discussion leads nowhere.
> 
> Complete practical solution: "They are non-problems, forget about
> them, and rewrite any code that implies you need to remember them."

That's a slogan, not a solution.

> Fortunately for me, I am *intimately* familiar with XEmacs internals,
> and therefore RMS won't let me write this code for Emacs. :-)

Then perhaps you shouldn't be part of this discussion.

>  > > If you stick to the interpretation that bytes contain non-negative
>  > > integers less than 256, you won't have a problem in practice if you
>  > > think of them as the first 256 Unicode characters, but choose not to use
>  > > functions that make sense only with characters.
>  > 
>  > What do you mean by "choose"?  Lisp code is used by many programmers
>  > out there; sometimes, they aren't even aware if the buffer they work
>  > on is unibyte, or what that means.
> 
> Which is precisely why we're having this thread.  If there were *no*
> Lisp-visible unibyte buffers or strings, it couldn't possibly matter.

And if I had $5M in my bank account, I'd probably be elsewhere
enjoying myself.  IOW, how are "if there were no..." arguments useful?

>  > Even when they are aware, they just want Emacs to DTRT, for their
>  > own value of "RT".
> 
> Too bad for them, as long as Emacs has unibyte buffers.  They have to
> be aware, and write code correctly for the mode of the buffer.
> Viz. the poor serial port programmer in comp.emacs.
> 
> In XEmacs, they don't have to; they just use an appropriate
> network-coding-system, and it just works.

This is not a discussion about whose model is better, Emacs or XEmacs.
This is a discussion of whether and how we can remove unibyte buffers,
strings, and characters from Emacs.  You must start by understanding
how they are used in Emacs 24, and then suggest practical ways to
change that.  Saying "look at XEmacs" doesn't help, because we can't,
and you know it.  I explicitly asked not to bring these arguments into
the discussion, and yet you still insist on doing precisely that.

>  > And what does "choose not to use" mean, anyway?  How do you choose not
>  > to use 'insert', for example? what do you use instead?
> 
> Of course you use `insert'.

In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
and characters.  If you use it, you get what it does.

> What I'm saying is that if you don't want to trash a binary buffer
> where each byte is represented by an ISO-8859-1 character in
> internal representation, you need to avoid (1)
> coding-system-for-write other than 'binary (in XEmacs, aliased to
> 'iso-8859-1-unix), and (2) functions that mutate characters using
> properties of characters that bytes don't have (eg, upcase).  That's
> really all there is to it.

If the buffer is not marked specially, how will I know to avoid those?

> But surely you have a function like
> `char-int-p'[1] that is used (implicitly by `insert') to prevent
> non-characters (in Emacs, 0xFFFF and surrogates would be examples, I
> suppose) from being inserted in buffers.  Otherwise you'd have crashes
> all over the place, I would imagine.  Since you don't, you must be
> doing something to prevent arbitrary integers from getting inserted.

There's char-valid-p, but I don't see how that is relevant to the
current discussion.

> It seems to me that the only real issue, given that you have a way in
> Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does)
> is what to do if somebody reads in data as 'binary, then proceeds to
> insert non-Latin-1 characters in the buffer.  I can think of three
> possibilities: (1) don't allow it without changing the buffer's output
> codec, (2) treat the existing characters as Latin-1, or (3) convert
> all the existing "bytes" to undecodable bytes representation.
> 
> XEmacs implicitly does (2) ((3) can't be implemented at all, at
> present).

Not sure I understand what you describe, but if I do, Emacs does (3).

And I still don't see how this is relevant.  You are describing a
marginally valid use case, while I'm talking about use cases we meet
every day, and which must be supported, e.g. when some Lisp wants to
decode or encode text by hand.




* Re: Unibyte characters, strings, and buffers
  2014-03-28 17:29           ` Eli Zaretskii
@ 2014-03-28 17:50             ` David Kastrup
  2014-03-28 18:31               ` Eli Zaretskii
  2014-03-28 20:27             ` Stefan Monnier
  2014-03-29  9:23             ` Stephen J. Turnbull
  2 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-28 17:50 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: "Stephen J. Turnbull" <stephen@xemacs.org>
>
>> Again, I disagree.  Unibyte is a design mistake, and unnecessary.
>
> Then what do you call a buffer whose "text" is encoded?

I can't speak for Stephen, of course, but my impression was he would
call it "a bad idea".

-- 
David Kastrup





* Re: Unibyte characters, strings, and buffers
  2014-03-28 17:50             ` David Kastrup
@ 2014-03-28 18:31               ` Eli Zaretskii
  2014-03-28 19:25                 ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-28 18:31 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Fri, 28 Mar 2014 18:50:02 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> >
> >> Again, I disagree.  Unibyte is a design mistake, and unnecessary.
> >
> > Then what do you call a buffer whose "text" is encoded?
> 
> I can't speak for Stephen, of course, but my impression was he would
> call it "a bad idea".

Then what other ideas to use when Lisp code needs to encode or decode
text manually?




* Re: Unibyte characters, strings and buffers
  2014-03-28  8:18           ` Unibyte characters, strings and buffers Eli Zaretskii
@ 2014-03-28 18:42             ` Paul Eggert
  2014-03-28 18:52               ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Paul Eggert @ 2014-03-28 18:42 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Emacs development discussions

On 03/28/2014 01:18 AM, Eli Zaretskii wrote:
> what you suggest will just
> replace one set of subtly buggy behaviors with another

Code that blithely passes bytes in the range 128-255 to char-equal is
*already* buggy.  Although the proposed change wouldn't fix those bugs, 
it'd fix others, so it'd be a win.

Plus, the change is simpler and easier to explain than what we have now, 
and that is a long-term win.

I'm afraid what I'm hearing is "although it's broken, unless we come up 
with a perfect solution we shouldn't do anything".  I'd rather fix this 
particular problem now, even if it's not practical to fix all the 
related problems now.  We don't need to slay the entire unibyte dragon 
to fix the relatively minor issue of comparing characters.


* Re: Unibyte characters, strings, and buffers
  2014-03-28 10:28         ` Stephen J. Turnbull
  2014-03-28 10:58           ` David Kastrup
  2014-03-28 17:29           ` Eli Zaretskii
@ 2014-03-28 18:45           ` Daniel Colascione
  2014-03-28 19:35             ` Glenn Morris
  2014-03-29 11:17             ` Stephen J. Turnbull
  2 siblings, 2 replies; 103+ messages in thread
From: Daniel Colascione @ 2014-03-28 18:45 UTC (permalink / raw)
  To: Stephen J. Turnbull, Eli Zaretskii; +Cc: monnier, emacs-devel


On 03/28/2014 03:28 AM, Stephen J. Turnbull wrote:
> Fortunately for me, I am *intimately* familiar with XEmacs internals,
> and therefore RMS won't let me write this code for Emacs. :-)

What now? People who have contributed to XEmacs can't contribute to Emacs?




* Re: Unibyte characters, strings and buffers
  2014-03-28 18:42             ` Paul Eggert
@ 2014-03-28 18:52               ` Eli Zaretskii
  2014-03-28 19:21                 ` Paul Eggert
                                   ` (2 more replies)
  0 siblings, 3 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-28 18:52 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Fri, 28 Mar 2014 11:42:16 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: Emacs development discussions <emacs-devel@gnu.org>
> 
> On 03/28/2014 01:18 AM, Eli Zaretskii wrote:
> > what you suggest will just
> > replace one set of subtly buggy behaviors with another
> 
> Code that blithely passes bytes in the range 128-255 to char-equal is
> *already* buggy.

There's nothing wrong with those bytes, certainly not when they stand
for Latin-1 characters.

> Although the proposed change wouldn't fix those bugs, it'd fix
> others, so it'd be a win.

How is it a win, when it actually _adds_ bugs?  E.g., under your
proposal, (char-equal 192 224) will yield non-nil when
case-fold-search is non-nil.

> Plus, the change is simpler and easier to explain than what we have now, 
> and that is a long-term win.

I don't see how it is simpler or easier to explain.  It replaces one
lopsided interpretation of 128-255 values with another.

> I'm afraid what I'm hearing is "although it's broken, unless we come up 
> with a perfect solution we shouldn't do anything".

I don't know where you heard that.  I certainly didn't say anything
like that.

> I'd rather fix this particular problem now, even if it's not
> practical to fix all the related problems now.

I suggested a solution: ignore case-fold-search in unibyte buffers.  I
think that's a greater win.

> We don't need to slay the entire unibyte dragon to fix the
> relatively minor issue of comparing characters.

I agree.  But then you are responding in a wrong thread ;-)


* Re: Unibyte characters, strings and buffers
  2014-03-28 18:52               ` Eli Zaretskii
@ 2014-03-28 19:21                 ` Paul Eggert
  2014-03-29  6:40                   ` Eli Zaretskii
  2014-03-28 20:23                 ` Stefan Monnier
  2014-03-29 19:34                 ` Stefan Monnier
  2 siblings, 1 reply; 103+ messages in thread
From: Paul Eggert @ 2014-03-28 19:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

>> Code that blithely passes bytes in the range 128-255 to char-equal is
>> *already* buggy.
> There's nothing wrong with those bytes, certainly not when they stand
> for Latin-1 characters.

Sure, and if they stand for Latin-1 characters the proposed change will 
do the right thing.

> How is it a win, when it actually _adds_ bugs? E.g., under your 
> proposal, (char-equal 192 224) will yield non-nil when 
> case-fold-search is non-nil. 

That's not a bug, since À and à are the same character, ignoring case.

As I understand it, the scenario you're worried about is that someone is 
visiting a unibyte buffer and is doing a case-folded search involving 
non-ASCII bytes and doesn't want these bytes to match their Latin-1 
case-folded counterparts.  This scenario is not common enough to worry 
about.  Changing the behavior for this rare case is a cost, I suppose, 
but it's outweighed by the benefit of simplifying char-equal and fixing
its semantics to be a bit saner.
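
The two readings of the values 192 and 224 that are in dispute can be sketched in Python. This is not Emacs Lisp and does not model how `char-equal' is actually implemented; the function names are hypothetical. Read as code points, the values are À and à, which fold to the same character. Read as raw bytes, they have no case at all.

```python
def chars_equal_ignoring_case(a: int, b: int) -> bool:
    """Compare two integers as Unicode code points, ignoring case."""
    return chr(a).casefold() == chr(b).casefold()


def bytes_equal_ignoring_case(a: int, b: int) -> bool:
    """Compare two integers as raw bytes; only ASCII letters are case-folded."""
    return bytes([a]).lower() == bytes([b]).lower()


as_chars = chars_equal_ignoring_case(192, 224)  # À vs à, read as characters
as_bytes = bytes_equal_ignoring_case(192, 224)  # 0xC0 vs 0xE0, read as raw bytes
```

The integers alone do not say which reading is intended, so the choice has to come from somewhere else: the function being called, or the buffer that happens to be current.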

>> Plus, the change is simpler and easier to explain than what we have now,
>> and that is a long-term win.
> I don't see how it is simpler or easier to explain.  It replaces one
> lopsided interpretation of 128-255 values with another.
>

It's simpler because it decouples the rules for char-equal from the 
question of whether the current buffer is multibyte.  Separation of 
concerns is a win.

> I suggested a solution: ignore case-fold-search in unibyte buffers.

Sorry, I didn't see that suggestion.  It would be better than what we 
have now for char-equal, but it would have undesirable side effects 
elsewhere.  When I type find-file-literally to visit a buffer in 
raw-text form, it's more convenient if I can type C-s h t m l (or 
whatever) and find "HTML".  I'd rather not lose that capability.


* Re: Unibyte characters, strings, and buffers
  2014-03-28 18:31               ` Eli Zaretskii
@ 2014-03-28 19:25                 ` David Kastrup
  2014-03-29  6:43                   ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-28 19:25 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Date: Fri, 28 Mar 2014 18:50:02 +0100
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> From: "Stephen J. Turnbull" <stephen@xemacs.org>
>> >
>> >> Again, I disagree.  Unibyte is a design mistake, and unnecessary.
>> >
>> > Then what do you call a buffer whose "text" is encoded?
>> 
>> I can't speak for Stephen, of course, but my impression was he would
>> call it "a bad idea".
>
> Then what other ideas to use when Lisp code needs to encode or decode
> text manually?

Redecode right to a "binary" coding system would be my guess.

-- 
David Kastrup




* Re: Unibyte characters, strings, and buffers
  2014-03-28 18:45           ` Daniel Colascione
@ 2014-03-28 19:35             ` Glenn Morris
  2014-03-29 11:17             ` Stephen J. Turnbull
  1 sibling, 0 replies; 103+ messages in thread
From: Glenn Morris @ 2014-03-28 19:35 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: emacs-devel

Daniel Colascione wrote:

> What now? People who have contributed to XEmacs can't contribute to Emacs?

Of course they can; subject to the same conditions as anyone else.




* Re: Unibyte characters, strings and buffers
  2014-03-28 18:52               ` Eli Zaretskii
  2014-03-28 19:21                 ` Paul Eggert
@ 2014-03-28 20:23                 ` Stefan Monnier
  2014-03-29 19:34                 ` Stefan Monnier
  2 siblings, 0 replies; 103+ messages in thread
From: Stefan Monnier @ 2014-03-28 20:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Paul Eggert, emacs-devel

> How is it a win, when it actually _adds_ bugs?  E.g., under your
> proposal, (char-equal 192 224) will yield non-nil when
> case-fold-search is non-nil.

Non-nil is the right answer.  Doesn't sound like a bug to me.


        Stefan




* Re: Unibyte characters, strings, and buffers
  2014-03-28 17:29           ` Eli Zaretskii
  2014-03-28 17:50             ` David Kastrup
@ 2014-03-28 20:27             ` Stefan Monnier
  2014-03-29  9:23             ` Stephen J. Turnbull
  2 siblings, 0 replies; 103+ messages in thread
From: Stefan Monnier @ 2014-03-28 20:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel

>> Again, I disagree.  Unibyte is a design mistake, and unnecessary.
> Then what do you call a buffer whose "text" is encoded?

I think they call it "a buffer" ;-)
More seriously, IIUC they represent bytes 0..7F as ASCII (like we do)
and 80..FF as latin-1-ish chars (i.e. occupying two bytes in the
internal representation, IIRC).


        Stefan




* Re: Unibyte characters, strings and buffers
  2014-03-28 19:21                 ` Paul Eggert
@ 2014-03-29  6:40                   ` Eli Zaretskii
  2014-03-29 18:57                     ` Paul Eggert
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29  6:40 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Fri, 28 Mar 2014 12:21:04 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: emacs-devel@gnu.org
> 
> > I suggested a solution: ignore case-fold-search in unibyte buffers.
> 
> Sorry, I didn't see that suggestion.  It would be better than what we 
> have now for char-equal, but it would have undesirable side effects 
> elsewhere.

I suggested it only for char-equal.




* Re: Unibyte characters, strings, and buffers
  2014-03-28 19:25                 ` David Kastrup
@ 2014-03-29  6:43                   ` Eli Zaretskii
  2014-03-29  7:23                     ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29  6:43 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: emacs-devel@gnu.org
> Date: Fri, 28 Mar 2014 20:25:17 +0100
> 
> >> > Then what do you call a buffer whose "text" is encoded?
> >> 
> >> I can't speak for Stephen, of course, but my impression was he would
> >> call it "a bad idea".
> >
> > Then what other ideas to use when Lisp code needs to encode or decode
> > text manually?
> 
> Redecode right to a "binary" coding system would be my guess.

Sorry, I don't follow.  Can you tell more what that means?

The situation I was describing is that I need to do something with
undecoded bytes before decoding them, or after encoding them.




* Re: Unibyte characters, strings, and buffers
  2014-03-29  6:43                   ` Eli Zaretskii
@ 2014-03-29  7:23                     ` David Kastrup
  2014-03-29  8:24                       ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-29  7:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: emacs-devel@gnu.org
>> Date: Fri, 28 Mar 2014 20:25:17 +0100
>> 
>> >> > Then what do you call a buffer whose "text" is encoded?
>> >> 
>> >> I can't speak for Stephen, of course, but my impression was he would
>> >> call it "a bad idea".
>> >
>> > Then what other ideas to use when Lisp code needs to encode or decode
>> > text manually?
>> 
>> Redecode right to a "binary" coding system would be my guess.
>
> Sorry, I don't follow.  Can you tell more what that means?

It means a buffer where each _character_ has the same value that the
no-longer-available unibyte buffer would have in its bytes/characters.

> The situation I was describing is that I need to do something with
> undecoded bytes before decoding them, or after encoding them.

You can do that whether or not the conceptual array of 0..255 characters
is internally encoded in unibyte or multibyte encodings.

-- 
David Kastrup




* Re: Unibyte characters, strings, and buffers
  2014-03-29  7:23                     ` David Kastrup
@ 2014-03-29  8:24                       ` Eli Zaretskii
  2014-03-29  8:40                         ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29  8:24 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 08:23:33 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: David Kastrup <dak@gnu.org>
> >> Cc: emacs-devel@gnu.org
> >> Date: Fri, 28 Mar 2014 20:25:17 +0100
> >> 
> >> >> > Then what do you call a buffer whose "text" is encoded?
> >> >> 
> >> >> I can't speak for Stephen, of course, but my impression was he would
> >> >> call it "a bad idea".
> >> >
> >> > Then what other ideas to use when Lisp code needs to encode or decode
> >> > text manually?
> >> 
> >> Redecode right to a "binary" coding system would be my guess.
> >
> > Sorry, I don't follow.  Can you tell more what that means?
> 
> It means a buffer where each _character_ has the same value that the
> no-longer-available unibyte buffer would have in its bytes/characters.

This doesn't seem to be a complete description of what is suggested.
E.g., just by looking at the values of characters, it is impossible to
distinguish between Latin characters below 256 and raw bytes.  In a
unibyte buffer, we know how to make that distinction, but if there are
no unibyte buffers, something else is needed for doing that.

> > The situation I was describing is that I need to do something with
> > undecoded bytes before decoding them, or after encoding them.
> 
> You can do that whether or not the conceptual array of 0..255 characters
> is internally encoded in unibyte or multibyte encodings.

What do you mean by "multibyte encodings" in this context?  Are you
suggesting to store the bytes 128..255 as Latin-1 characters,
i.e. using the 2-byte UTF-8 sequences of the corresponding Latin
characters?  Or are you suggesting something else?




* Re: Unibyte characters, strings, and buffers
  2014-03-29  8:24                       ` Eli Zaretskii
@ 2014-03-29  8:40                         ` David Kastrup
  2014-03-29  9:25                           ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-29  8:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: emacs-devel@gnu.org
>> Date: Sat, 29 Mar 2014 08:23:33 +0100
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> From: David Kastrup <dak@gnu.org>
>> >> Cc: emacs-devel@gnu.org
>> >> Date: Fri, 28 Mar 2014 20:25:17 +0100
>> >> 
>> >> >> > Then what do you call a buffer whose "text" is encoded?
>> >> >> 
>> >> >> I can't speak for Stephen, of course, but my impression was he would
>> >> >> call it "a bad idea".
>> >> >
>> >> > Then what other ideas to use when Lisp code needs to encode or decode
>> >> > text manually?
>> >> 
>> >> Redecode right to a "binary" coding system would be my guess.
>> >
>> > Sorry, I don't follow.  Can you tell more what that means?
>> 
>> It means a buffer where each _character_ has the same value that the
>> no-longer-available unibyte buffer would have in its bytes/characters.
>
> This doesn't seem to be a complete description of what is suggested.
> E.g., just by looking at the values of characters, it is impossible to
> distinguish between Latin characters below 256 and raw bytes.  In a
> unibyte buffer, we know how to make that distinction,

Uh, what?  The point of a unibyte buffer is that it does not make the
distinction.

> but if there are no unibyte buffers, something else is needed for
> doing that.

>> You can do that whether or not the conceptual array of 0..255 characters
>> is internally encoded in unibyte or multibyte encodings.
>
> What do you mean by "multibyte encodings" in this context?  Are you
> suggesting to store the bytes 128..255 as Latin-1 characters,
> i.e. using the 2-byte UTF-8 sequences of the corresponding Latin
> characters?

That would make the most sense, yes.

> Or are you suggesting something else?

You could also use the "raw byte" character encodings we use for not
losing information when reading not properly formed utf-8 files into a
multibyte buffer, but that seems less practical when working with the
character codes.

-- 
David Kastrup




* Re: Unibyte characters, strings, and buffers
  2014-03-28 17:29           ` Eli Zaretskii
  2014-03-28 17:50             ` David Kastrup
  2014-03-28 20:27             ` Stefan Monnier
@ 2014-03-29  9:23             ` Stephen J. Turnbull
  2014-03-29  9:52               ` Andreas Schwab
                                 ` (4 more replies)
  2 siblings, 5 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29  9:23 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: monnier, emacs-devel

Eli Zaretskii writes:

 > This thread is about different issues.

*sigh*  No, it's about unibyte being a premature pessimization.

 > >  > Likewise examples from XEmacs, since the differences in this area
 > >  > between Emacs and XEmacs are substantial, and that precludes useful
 > >  > comparison.
 > > 
 > > "It works fine" isn't useful information?
 > 
 > No, because it describes a very different implementation.

Not at all.  The implementation of multibyte buffers is very similar.
What's different is that Emacs complifusticates matters by also having
a separate implementation of unibyte buffers, and then basically
making a union out of the two structures called "buffer".  XEmacs
simply implements binary as a particular coding system in and out of
multibyte buffers.

 > Then I guess you will have to suggest how to implement this without
 > unibyte buffers.

No, I don't.  I already told you how to do it: nuke unibyte buffers
and use iso-8859-1-unix as the binary codec.  Then you're done, except
for those applications that actually make the mistake of using unibyte
text explicitly.  If there are cases where unibyte happens implicitly,
and this transformation causes a bug, I think you'll discover unibyte
itself was problematic.

 > >  > In such unibyte buffers, we need a way to represent raw bytes, which
 > >  > are parts of as yet un-decoded byte sequences that represent encoded
 > >  > characters.
 > > 
 > > Again, I disagree.  Unibyte is a design mistake, and unnecessary.
 > 
 > Then what do you call a buffer whose "text" is encoded?

"Binary."

 > > XEmacs proves it -- we use (essentially) the same code in many
 > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
 > 
 > I asked you not to bring XEmacs into the discussion, because I cannot
 > talk intelligently about its implementation.  If you insist on doing
 > that, this discussion is futile from my POV.

The whole point here is that the details of the XEmacs implementation
are *irrelevant*.  The point is that we implement the same API as GNU
Emacs without unibyte buffers or the annoyances and incoherence that
come with them.

 > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as
 > > no-ops forever
 > 
 > I wasn't talking about those functions.  I was talking about the need
 > to have unibyte buffers and strings.

There is no "need for unibyte."  You're simply afraid to throw it away.

 > How is it different?  What would be the encoding of a buffer that
 > contains raw bytes?

Depends.  If it's uninterpreted bytes, "binary."  If those are
undecodable bytes, they'll be the representation of raw bytes that
occurred in an otherwise sane encoded stream, and the buffer's
encoding will be the nominal encoding of that stream.  If you want to
ensure sanity of output, then you will use an output encoding that
errors on rawbytes, and a program that cleans up those rawbytes in a
way appropriate for the application.  If you expect the next program
in the pipeline to handle them, then you use a variant encoding that
just encodes them back to the original undecodable rawbytes.

 > But that's ridiculous: a raw byte is just a single byte, so
 > string-bytes should return a meaningful value for a string of such
 > bytes.

`string-bytes' should not exist.  As I wrote earlier:

 > > You don't need `string-bytes' unless you've exposed internal
 > > representation to Lisp, then you desperately need it to write correct
 > > code (which some users won't be able to do anyway without help, cf. 
 > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
 > > *don't expose internal representation* (and the hammer marks on users'
 > > foreheads will disappear in due time, and the headaches even faster!)
 > 
 > How else would you know how many bytes will a string take on disk?

How does `string-bytes' help?  You don't know what encoding will be
used to write them, and in general it won't be the same number that
they take up in the string.

If you use iso-8859-1-unix as the coding system, then "bytes on the
wire" == "characters in the string".  No problema, señor.
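
A minimal sketch of the counts being argued about here, in Emacs Lisp.
The helper name `my-byte-counts' is hypothetical; the sketch assumes GNU
Emacs 23 or later (for `unibyte-string') and uses only
`decode-coding-string', `encode-coding-string', `string-bytes' and
`length'.  It decodes a byte string as Latin-1 and reports the number of
characters, the size of the internal representation, and the size after
re-encoding:

```elisp
(defun my-byte-counts (bytes)
  "Hypothetical helper: decode BYTES (a unibyte string) as Latin-1.
Return a list (CHARS INTERNAL-BYTES ENCODED-BYTES)."
  (let ((text (decode-coding-string bytes 'iso-8859-1-unix)))
    (list (length text)        ; characters in the decoded string
          (string-bytes text)  ; bytes in Emacs's internal representation
          ;; bytes that would go "on the wire" as Latin-1
          (length (encode-coding-string text 'iso-8859-1-unix)))))

(my-byte-counts (unibyte-string #x41 #xe0 #xff))
```

In a stock GNU Emacs this should evaluate to (3 5 3): three characters,
five bytes internally (the two non-ASCII characters take two bytes
each), and three bytes again once encoded as Latin-1.  The internal size
matches the on-disk size only when the text is already encoded or is
pure ASCII.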

 > 
 > >  > So here you have already at least 2 valid reasons
 > > 
 > > No, *you* have them.  XEmacs works perfectly well without them, using
 > > code written for Emacs.
 > 
 > XEmacs also works "perfectly well" without bidi and other stuff.  That
 > doesn't help at all in this discussion.

You're right: because XEmacs doesn't handle bidi, it's irrelevant to
this discussion.  Why did *you* bring it up?

What is relevant is how to represent byte streams in Emacs.  The
obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
characters.  It is *extremely* convenient if the first 128 of those
bytes correspond to the ASCII coded character set, because so many
wire protocols use ASCII "words" syntactically.  The other 128 don't
matter much, so why not just use the extremely convenient Latin-1 set
for them?

 > >  > If we want to get rid of unibyte, Someone(TM) should present a
 > >  > complete practical solution to those two problems (and a few
 > >  > others), otherwise, this whole discussion leads nowhere.
 > > 
 > > Complete practical solution: "They are non-problems, forget about
 > > them, and rewrite any code that implies you need to remember them."
 > 
 > That's a slogan, not a solution.

No, it is a precise high-level design for a solution.  The same design
that XEmacs uses, and which would be quite straightforward for Emacs
to adopt since it already has multibyte buffers of the same power as
XEmacs's, though with (currently) a different internal encoding.

 > > Fortunately for me, I am *intimately* familiar with XEmacs internals,
 > > and therefore RMS won't let me write this code for Emacs. :-)
 > 
 > Then perhaps you shouldn't be part of this discussion.

Since I've been invited to leave, I will.  My point is sufficiently
well-made for open minds to deal with the details.  I'll finish this
post on the off chance that somewhere in it will be the key that will
unlock yours.

 > > Which is precisely why we're having this thread.  If there were *no*
 > > Lisp-visible unibyte buffers or strings, it couldn't possibly matter.
 > 
 > And if I had $5M in my bank account, I'd probably be elsewhere
 > enjoying myself.  IOW, how are "if there were no..." arguments useful?

Because they point out that this thread wouldn't have happened with a
different design.  I consider that design better, after experience
with two separate implementations of multibyte only (NEmacs,
XEmacs/MULE), an implementation with strict separation of bytes from
characters (Python 2 with PEP 383), an implementation with strict
separation of bytes from characters and space-efficient character
representation (Python 3 with PEPs 383 and 393), and one implementation
with unibyte (Emacs).

The first four work fine dealing with bytes and characters, and there
is no confusion.  Both Pythons can handle undecodable bytes in encoded
streams (ie, roundtrip).  Only GNU Emacs has issues about dealing with
unibyte vs. multibyte.

 > This is not a discussion about whose model is better, Emacs or XEmacs.
 > This is a discussion of whether and how can we remove unibyte buffers,
 > strings, and characters from Emacs.  You must start by understanding
 > how are they used in Emacs 24, and then suggest practical ways to
 > change that.

Well, I would have said "tell me about it", but you've asked me to
leave, so I won't.  I will say that nothing you've said so far even
hints at issues with simply removing the whole concept of unibyte.

 > In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
 > and characters.  If you use it, you get what it does.

And I'm telling you those subtleties are a *problem* that solves
nothing that an Emacs without a unibyte concept can't handle fine.

 > If the buffer is not marked specially, how will I know to avoid
 > [inserting non-Latin-1 characters in a "binary" buffer]?

All experience with XEmacs says *you* (the human programmer) *won't*
have any problem avoiding that.  As a programmer, if you're working
with a binary protocol, you will be using binary buffers and strings,
and byte-sized integers.  If you accidentally mix things up, you'll
quickly get an encoding error on output (since the binary codec can't
output non-Latin-1 Unicode characters).

It's just not a problem in practice, and that's not why unibyte was
introduced in Emacs anyway.  Unibyte was introduced because some folks
thought working with variable-width-encoded buffers was too
inefficient so they wanted access to a flat buffer of bytes.  That's
why buffer-as-{uni,multi}byte type punning was included.

 > > But surely you have a function like `char-int-p'[1] [...]
 > 
 > There's char-valid-p, but I don't see how that is relevant to the
 > current discussion.

Only insofar as you thought char-int confusion might be an issue.

 > And I still don't see how this is relevant.  You are describing a
 > marginally valid use case, while I'm talking about use cases we meet
 > every day, and which must be supported, e.g. when some Lisp wants to
 > decode or encode text by hand.

You use `encode-coding-region' and `decode-coding-region', same as you
do now.  Do you seriously think that XEmacs doesn't support those use
cases?

o/o


* Re: Unibyte characters, strings, and buffers
  2014-03-29  8:40                         ` David Kastrup
@ 2014-03-29  9:25                           ` Eli Zaretskii
  0 siblings, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29  9:25 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 09:40:03 +0100
> 
> >> It means a buffer where each _character_ has the same value that the
> >> no-longer-available unibyte buffer would have in its bytes/characters.
> >
> > This doesn't seem to be a complete description of what is suggested.
> > E.g., just by looking at the values of characters, it is impossible to
> > distinguish between Latin characters below 256 and raw bytes.  In a
> > unibyte buffer, we know how to make that distinction,
> 
> Uh, what?  The point of a unibyte buffer is that it does not make the
> distinction.

Yes, it does: it treats every character as a raw byte.  So the dilemma
is resolved there by definition.  How to do that without unibyte
buffers remains to be defined, otherwise plans to remove unibyte
buffers are impractical.

> > but if there are no unibyte buffers, something else is needed for
> > doing that.
> 
> >> You can do that whether or not the conceptual array of 0..255 characters
> >> is internally encoded in unibyte or multibyte encodings.
> >
> > What do you mean by "multibyte encodings" in this context?  Are you
> > suggesting to store the bytes 128..255 as Latin-1 characters,
> > i.e. using the 2-byte UTF-8 sequences of the corresponding Latin
> > characters?
> 
> That would make the most sense, yes.

Then the above distinction is impossible, and all kinds of subtly
incorrect behaviors creep in.

> > Or are you suggesting something else?
> 
> You could also use the "raw byte" character encodings we use for not
> losing information when reading not properly formed utf-8 files into a
> multibyte buffer, but that seems less practical when working with the
> character codes.

Why less practical?




* Re: Unibyte characters, strings, and buffers
  2014-03-29  9:23             ` Stephen J. Turnbull
@ 2014-03-29  9:52               ` Andreas Schwab
  2014-03-29 10:48                 ` Eli Zaretskii
  2014-03-29 10:42               ` David Kastrup
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29  9:52 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> No, I don't.  I already told you how to do it: nuke unibyte buffers
> and use iso-8859-1-unix as the binary codec.

No, you use raw-text, representing each non-ascii character in the
eight-bit charset (this is what string-to-multibyte does).  Using
latin-1 would lose information.
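
A minimal sketch of the difference, in Emacs Lisp.  The helper name
`my-byte-as-chars' is hypothetical; the sketch assumes GNU Emacs 23 or
later and uses only `unibyte-string', `string-to-multibyte',
`decode-coding-string' and `aref'.  It maps one byte in the range
128..255 both ways and returns the two resulting character codes:

```elisp
(defun my-byte-as-chars (byte)
  "Hypothetical helper: return (RAW-CHAR LATIN1-CHAR) for BYTE (128..255).
RAW-CHAR is what `string-to-multibyte' produces; LATIN1-CHAR is what
decoding the same byte as ISO-8859-1 produces."
  (let ((s (unibyte-string byte)))
    (list (aref (string-to-multibyte s) 0)
          (aref (decode-coding-string s 'iso-8859-1) 0))))

(my-byte-as-chars #xe0)
```

For byte #xe0 this should give (#x3fffe0 #xe0).  The first code lies in
the eight-bit raw-byte range and cannot be confused with text; the
second is the ordinary character U+00E0, which is indistinguishable
from an "à" that arrived as real text.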

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29  9:23             ` Stephen J. Turnbull
  2014-03-29  9:52               ` Andreas Schwab
@ 2014-03-29 10:42               ` David Kastrup
  2014-03-29 11:07                 ` Eli Zaretskii
  2014-03-29 10:44               ` Eli Zaretskii
                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-29 10:42 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > How is it different?  What would be the encoding of a buffer that
>  > contains raw bytes?
>
> Depends.  If it's uninterpreted bytes, "binary."  If those are
> undecodable bytes, they'll be the representation of raw bytes that
> occurred in an otherwise sane encoded stream, and the buffer's
> encoding will be the nominal encoding of that stream.

It's worth pointing out that there is no such thing as a "buffer's
encoding" in general in Emacs.  Buffers are sequences of characters or,
in the case of a unibyte buffer, bytes.  Encodings come into play for
import/export only but they are not an inherent property of the buffer
as such but rather, for example, of the file association of the buffer.

Emacs has two kinds of internal representation (what one might actually
want to call "buffer encoding"): unibyte and multibyte.  XEmacs, I
think, has only one.

The current point of contention is about changing the behavior of
codepoint-based character operations depending on the unibyte state of
the current buffer.

I consider that an astonishingly bad idea since character and string
operations are not tied to a particular buffer.  The whole point of MULE,
from a rather early point in time on, was to deal with only a single
Unicode-based character set in all of Emacs.  Making character
operations change meaning based on a buffer's unibyte status means a
return to the character set semantics of Emacs 19.

I am not necessarily of the same opinion as Stephen regarding whether or
not abolishing unibyte buffers is a worthwhile goal.  But I am pretty
sure that "unibyte" should not be bleeding over into character and
string operations.

A unibyte buffer or unibyte string might error out when trying to insert
characters out of the range 0..255.  That's an obvious consequence of
the buffer's representation.

If we want different semantics for case-fold-search in binary buffers,
then the solution is setting a buffer-local setting of case-fold-search
when opening a buffer intended to be manipulated in a binary way.

But the unibyte setting of the buffer should not affect normal character
and string operation semantics.  It is a buffer implementation detail
that should not really have a visible effect apart from making some
buffer operations impossible.

Whether or not we want to abolish unibyte buffer representations, we
don't want this to bleed effects beyond the buffer representation.

If something chooses a unibyte buffer representation for some reason, it
is the responsibility of the same something to switch character
operations and case-fold-search etc. to something that makes sense in the
context of its operation.  That may well be through some buffer-local
setting of case-fold-search etc, but it is not tied to the internal
representation of the buffer contents.
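
A minimal sketch of that division of responsibility, in Emacs Lisp.
The helper name `my-make-binary-buffer' is hypothetical; the sketch
relies on `case-fold-search' becoming buffer-local automatically when
set, and on `set-buffer-multibyte' to pick the representation.  The code
that creates the buffer makes both choices explicitly and independently:

```elisp
(defun my-make-binary-buffer (name)
  "Hypothetical helper: return a fresh unibyte buffer for byte-level work.
NAME is passed to `generate-new-buffer'.  Case folding is switched off
by the creator, not inferred from the buffer's representation."
  (let ((buf (generate-new-buffer name)))
    (with-current-buffer buf
      ;; Representation choice: store bytes, not characters.
      (set-buffer-multibyte nil)
      ;; Semantics choice: `case-fold-search' becomes buffer-local
      ;; automatically when set, so this affects only BUF.
      (setq case-fold-search nil))
    buf))
```

With this arrangement a search inside the buffer follows the local
setting, while character and string functions called elsewhere need no
knowledge of how the buffer stores its contents.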

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-03-29  9:23             ` Stephen J. Turnbull
  2014-03-29  9:52               ` Andreas Schwab
  2014-03-29 10:42               ` David Kastrup
@ 2014-03-29 10:44               ` Eli Zaretskii
  2014-03-29 11:06               ` Andreas Schwab
  2014-03-29 17:01               ` Nathan Trapuzzano
  4 siblings, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 10:44 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 18:23:17 +0900
> 
> Eli Zaretskii writes:
> 
>  > This thread is about different issues.
> 
> *sigh*  No, it's about unibyte being a premature pessimization.

*Sigh*, indeed.

>  > >  > Likewise examples from XEmacs, since the differences in this area
>  > >  > between Emacs and XEmacs are substantial, and that precludes useful
>  > >  > comparison.
>  > > 
>  > > "It works fine" isn't useful information?
>  > 
>  > No, because it describes a very different implementation.
> 
> Not at all.  The implementation of multibyte buffers is very similar.

Says you.  But I cannot talk intelligently about that, because I don't
know the details.  And it sounds like you cannot talk about the issue
at hand, because you don't know the details of how Emacs handles raw
bytes.  This discussion is about Emacs's unibyte buffers and strings,
so no useful insights will come from you talking about the XEmacs
implementation without knowing the Emacs one, and me doing the
reverse.  That is why I asked not to bring the XEmacs implementation
into this discussion.

> What's different is that Emacs complifusticates matters by also having
> a separate implementation of unibyte buffers, and then basically
> making a union out of the two structures called "buffer".  XEmacs
> simply implements binary as a particular coding system in and out of
> multibyte buffers.

In Emacs, a coding system is only consulted when a buffer is read or
written.  If you also consult it when inserting text into it, or when
deciding whether 'downcase' should or shouldn't change the character
from the buffer, then you still have unibyte buffers in disguise, you
just call them "buffers whose coding system is 'binary'".

>  > Then I guess you will have to suggest how to implement this without
>  > unibyte buffers.
> 
> No, I don't.  I already told you how to do it: nuke unibyte buffers
> and use iso-8859-1-unix as the binary codec.

"Codec" is XEmacs terminology, I don't understand what that means in
practice, when applied to Emacs.  If it means the same as coding
system, then how can an iso-8859-1-unix byte stream be decoded into, say,
Cyrillic characters (assuming the byte stream was actually UTF-8
encoded Cyrillic text)?

> Then you're done, except for those applications that actually make
> the mistake of using unibyte text explicitly.

What does "explicitly" mean in this context?  Can you show an example
of "explicit" vs "implicit" use of unibyte text?

>  > >  > In such unibyte buffers, we need a way to represent raw bytes, which
>  > >  > are parts of as yet un-decoded byte sequences that represent encoded
>  > >  > characters.
>  > > 
>  > > Again, I disagree.  Unibyte is a design mistake, and unnecessary.
>  > 
>  > Then what do you call a buffer whose "text" is encoded?
> 
> "Binary."

That's just a different name.  If "binary" buffers are treated
differently from any other kind, when processing characters from them,
then they are just unibyte buffers in disguise.

>  > > XEmacs proves it -- we use (essentially) the same code in many
>  > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
>  > 
>  > I asked you not to bring XEmacs into the discussion, because I cannot
>  > talk intelligently about its implementation.  If you insist on doing
>  > that, this discussion is futile from my POV.
> 
> The whole point here is that the details of the XEmacs implementation
> are *irrelevant*.  The point is that we implement the same API as GNU
> Emacs without unibyte buffers or the annoyances and incoherence that
> come with them.

Without knowing the details of the implementation, it is impossible to
talk about merits and demerits of each design and implementation.
Therefore, bringing the XEmacs implementation into this discussion
without describing it in full detail does not help.  Excuse me, but I don't
believe you when you say you have no problems at all in this area,
just because you say that.  If you want that to count, you will have
to delve into the gory details, and then show why and how the problems
are avoided.

>  > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as
>  > > no-ops forever
>  > 
>  > I wasn't talking about those functions.  I was talking about the need
>  > to have unibyte buffers and strings.
> 
> There is no "need for unibyte."  You're simply afraid to throw it away.

I'm not afraid of anything of the kind.  This discussion was started
in order to try figuring out how to get rid of unibyte.  If you want
to help, offer specific technical solutions to specific issues we have
in Emacs.  Copying the XEmacs implementation, even if we were sure it
resolves the problem (and I'm not at all sure), is impractical.

>  > How is it different?  What would be the encoding of a buffer that
>  > contains raw bytes?
> 
> Depends.  If it's uninterpreted bytes, "binary."  If those are
> undecodable bytes, they'll be the representation of raw bytes that
> occurred in an otherwise sane encoded stream, and the buffer's
> encoding will be the nominal encoding of that stream.  If you want to
> ensure sanity of output, then you will use an output encoding that
> errors on rawbytes, and a program that cleans up those rawbytes in a
> way appropriate for the application.  If you expect the next program
> in the pipeline to handle them, then you use a variant encoding that
> just encodes them back to the original undecodable rawbytes.

That's exactly what Emacs does, so I think you actually agree with what
I originally described as requirements, even though you said you disagreed.

>  > But that's ridiculous: a raw byte is just a single byte, so
>  > string-bytes should return a meaningful value for a string of such
>  > bytes.
> 
> `string-bytes' should not exist.  As I wrote earlier:
> 
>  > > You don't need `string-bytes' unless you've exposed internal
>  > > representation to Lisp, then you desperately need it to write correct
>  > > code (which some users won't be able to do anyway without help, cf. 
>  > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
>  > > *don't expose internal representation* (and the hammer marks on users'
>  > > foreheads will disappear in due time, and the headaches even faster!)
>  > 
>  > How else would you know how many bytes will a string take on disk?
> 
> How does `string-bytes' help?

It returns that information.

> You don't know what encoding will be used to write them

Yes, I do know: the buffer's coding system tells me.  And if text is
already encoded, then I know no additional encoding will be applied,
and whatever string-bytes tells me is it.

> If you use iso-8859-1-unix as the coding system, then "bytes on the
> wire" == "characters in the string".  No problema, señor.

Not if you want to recode the string in, say, UTF-8.  When you shuffle
text from one buffer to another, Emacs does not track which encoding
that text came from, so the iso-8859-1-unix information is lost.

>  > >  > So here you have already at least 2 valid reasons
>  > > 
>  > > No, *you* have them.  XEmacs works perfectly well without them, using
>  > > code written for Emacs.
>  > 
>  > XEmacs also works "perfectly well" without bidi and other stuff.  That
>  > doesn't help at all in this discussion.
> 
> You're right: because XEmacs doesn't handle bidi, it's irrelevant to
> this discussion.  Why did *you* bring it up?

To show how your way of arguing doesn't help.

> What is relevant is how to represent byte streams in Emacs.  The
> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
> characters.  It is *extremely* convenient if the first 128 of those
> bytes correspond to the ASCII coded character set, because so many
> wire protocols use ASCII "words" syntactically.  The other 128 don't
> matter much, so why not just use the extremely convenient Latin-1 set
> for them?

Because there are situations when the effect of this is not what Lisp
programs and users expect.  Case folding and case-insensitive search
are among them, although not the only ones.

>  > >  > If we want to get rid of unibyte, Someone(TM) should present a
>  > >  > complete practical solution to those two problems (and a few
>  > >  > others), otherwise, this whole discussion leads nowhere.
>  > > 
>  > > Complete practical solution: "They are non-problems, forget about
>  > > them, and rewrite any code that implies you need to remember them."
>  > 
>  > That's a slogan, not a solution.
> 
> No, it is a precise high-level design for a solution.

We need a low-level design, not high-level.

>  > > Fortunately for me, I am *intimately* familiar with XEmacs internals,
>  > > and therefore RMS won't let me write this code for Emacs. :-)
>  > 
>  > Then perhaps you shouldn't be part of this discussion.
> 
> Since I've been invited to leave, I will.  My point is sufficiently
> well-made for open minds to deal with the details.

No, it isn't made at all.  I tried to explain above why I think so.

>  > > Which is precisely why we're having this thread.  If there were *no*
>  > > Lisp-visible unibyte buffers or strings, it couldn't possibly matter.
>  > 
>  > And if I had $5M in my bank account, I'd probably be elsewhere
>  > enjoying myself.  IOW, how are "if there were no..." arguments useful?
> 
> Because they point out that this thread wouldn't have happened with a
> different design.

But we _are_ with this design, and have been using it for the last 15
years.  Good luck believing that someone will come and replace the
existing design with something radically different.  There wasn't a
comparable revolution in Emacs since 2001, so I very much doubt that
expecting another one any time soon is wise.  We don't even have
people aboard capable of making such changes.

The only practical way of advancing in this area is by low-level
changes that don't throw away the high-level design.  That is why
precisely describing the details of every proposal is so important:
without them, any proposal becomes impractical and thus not
interesting.

>  > This is not a discussion about whose model is better, Emacs or XEmacs.
>  > This is a discussion of whether and how we can remove unibyte buffers,
>  > strings, and characters from Emacs.  You must start by understanding
>  > how they are used in Emacs 24, and then suggest practical ways to
>  > change that.
> 
> Well, I would have said "tell me about it"

And I would have replied "sorry, I have no time for that".  The
sources are there to be studied, and you are welcome to ask questions
about stuff you don't understand just by looking at the sources.

There cannot be any useful discussion of these matters without
thorough understanding of how Emacs stores characters and raw bytes in
its buffers, and where and how the unibyte nuisance comes into play.

> I will say nothing you've said so far even hints at issues with
> simply removing the whole concept of unibyte.

I started by describing some basic requirements that lead to unibyte.
You refuse to even acknowledge those requirements.  How can we
continue a useful discussion when we don't even agree about the
basics?  To convince me, you need first to take my view of the issue,
something that you refuse to do.  I cannot begin to explain "the
issues" to you if you don't even agree with my starting point.

>  > In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
>  > and characters.  If you use it, you get what it does.
> 
> And I'm telling you those subtleties are a *problem* that solves
> nothing that an Emacs without a unibyte concept can't handle fine.

You keep saying that, but without the details (which you cannot or
won't provide), these are just slogans with little technical value.

>  > If the buffer is not marked specially, how will I know to avoid
>  > [inserting non-Latin-1 characters in a "binary" buffer]?
> 
> All experience with XEmacs says *you* (the human programmer) *won't*
> have any problem avoiding that.  As a programmer, if you're working
> with a binary protocol, you will be using binary buffers and strings,
> and byte-sized integers.  If you accidentally mix things up, you'll
> quickly get an encoding error on output (since the binary codec can't
> output non-Latin-1 Unicode characters).

On this level, it sounds like XEmacs does things exactly like Emacs
does, it just calls them differently.  If so, you have the same
problems; e.g., what will 'downcase-word' do in a "binary" buffer,
when it sees a "character" whose value is 192?

> It's just not a problem in practice, and that's not why unibyte was
> introduced in Emacs anyway.  Unibyte was introduced because some folks
> thought working with variable-width-encoded buffers was too
> inefficient so they wanted access to a flat buffer of bytes.  That's
> why buffer-as-{uni,multi}byte type punning was included.

Maybe so, but we are now 15 years after that, so history is only
marginally important.  What _is_ important is how to get rid of the
issues we have, without a complete redesign.

>  > And I still don't see how this is relevant.  You are describing a
>  > marginally valid use case, while I'm talking about use cases we meet
>  > every day, and which must be supported, e.g. when some Lisp wants to
>  > decode or encode text by hand.
> 
> You use `encode-coding-region' and `decode-coding-region', same as you
> do now.  Do you seriously think that XEmacs doesn't support those use
> cases?

"Support" doesn't mean "there're no issues".  Emacs supports them as
well, you know.  That fact in itself doesn't help at all in this
discussion, because we all know (I hope) that at this "slogan level"
things have worked very well for quite some time.


* Re: Unibyte characters, strings, and buffers
  2014-03-29  9:52               ` Andreas Schwab
@ 2014-03-29 10:48                 ` Eli Zaretskii
  2014-03-29 11:00                   ` Andreas Schwab
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 10:48 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  monnier@IRO.UMontreal.CA,  emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 10:52:29 +0100
> 
> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
> 
> > No, I don't.  I already told you how to do it: nuke unibyte buffers
> > and use iso-8859-1-unix as the binary codec.
> 
> No, you use raw-text, representing each non-ascii character in the
> eight-bit charset (this is what string-to-multibyte does).  Using
> latin-1 would lose information.

Right.

So one direction would be to use a normal multibyte buffer where raw
bytes are represented as string-to-multibyte does.  Emacs already
supports that.
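
For reference, the representation in question can be inspected from
Lisp.  The following is only a minimal sketch, assuming Emacs 23 or
later; the helper name `demo-raw-byte-roundtrip' is made up for
illustration, while the functions it calls are existing built-ins.  It
shows that a raw byte turned into an eight-bit character converts back
to the same byte.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-raw-byte-roundtrip (byte)
  "Return (MULTIBYTE-P CODE RECOVERED-BYTE) for raw BYTE in 128..255."
  (let* ((uni (unibyte-string byte))         ; one raw byte, unibyte string
         (multi (string-to-multibyte uni))   ; same byte as an eight-bit char
         (back (string-to-unibyte multi)))   ; converted back to a raw byte
    (list (multibyte-string-p multi)
          (aref multi 0)
          (aref back 0))))

(demo-raw-byte-roundtrip #xC0)
;; => (t 4194240 192), where 4194240 is #x3FFFC0
```

The code #x3FFFC0 lies outside Unicode, so it cannot be confused with
U+00C0, which is the sense in which going through Latin-1 instead
would lose information.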

The next question is what to do with unibyte strings, which are
currently widely used for pure-ASCII text.





* Re: Unibyte characters, strings, and buffers
  2014-03-29 10:48                 ` Eli Zaretskii
@ 2014-03-29 11:00                   ` Andreas Schwab
  2014-03-29 11:18                     ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 11:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> The next question is what to do with unibyte strings, which are
> currently widely used for pure-ASCII text.

You do the same, obviously (the representation wouldn't change for
pure-ascii, of course).

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29  9:23             ` Stephen J. Turnbull
                                 ` (2 preceding siblings ...)
  2014-03-29 10:44               ` Eli Zaretskii
@ 2014-03-29 11:06               ` Andreas Schwab
  2014-03-29 11:12                 ` Eli Zaretskii
  2014-03-29 15:37                 ` Stephen J. Turnbull
  2014-03-29 17:01               ` Nathan Trapuzzano
  4 siblings, 2 replies; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 11:06 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> *sigh*  No, it's about unibyte being a premature pessimization.

Unibyte is a pure space optimisation.  Everything else should work as if
all bytes in the range 128-255 are decoded in the eight-bit charset.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29 10:42               ` David Kastrup
@ 2014-03-29 11:07                 ` Eli Zaretskii
  2014-03-29 11:30                   ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 11:07 UTC (permalink / raw)
  To: David Kastrup; +Cc: stephen, monnier, emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  monnier@IRO.UMontreal.CA,  emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 11:42:43 +0100
> 
> The current point of contention is about changing the way of
> codepoint-based character operations depending on the unibyte state of
> the current buffer.

The point for which this discussion was started was how to get rid of
this dependency, in those few places where we have it in Emacs.

> I am not necessarily of the same opinion as Stephen regarding whether or
> not abolishing unibyte buffers is a worthwhile goal.  But I am pretty
> sure that "unibyte" should not be bleeding over into character and
> string operations.

Indeed, and Emacs tries very hard to contain that distinction, so that
it doesn't leak out of the internals.  Mostly, it succeeds, but
sometimes it doesn't.

> A unibyte buffer or unibyte string might error out when trying to insert
> characters out of the range 0..255.

We currently don't do that.  Try (insert "xyz") in a unibyte buffer,
where "xyz" is some non-ASCII string, and watch the fun.

> If we want different semantics for case-fold-search in binary buffers,
> then the solution is setting a buffer-local setting of case-fold-search
> when opening a buffer intended to be manipulated in a binary way.
> 
> But the unibyte setting of the buffer should not affect normal character
> and string operation semantics.  It is a buffer implementation detail
> that should not really have a visible effect apart from making some
> buffer operations impossible.

But if case-fold-search is set to nil in unibyte buffers, and (as we
know) the buffer-local value of case-fold-search does affect functions
that compare text, either because they consult case-fold-search
directly or because they consult the buffer-local case-table, then the
unibyte setting does affect the semantics, albeit indirectly.

> If something chooses a unibyte buffer representation for some reason, it
> is the responsibility of the same something to switch character
> operations and case-fold-search etc to something making sense in the
> context of its operation.  That may well be through some buffer-local
> setting of case-fold-search etc, but it is not tied to the internal
> representation of the buffer contents.

Not that I disagree with you, but why does it matter whether some code
makes a buffer unibyte or sets its case-fold-search, to achieve that
goal?  In both cases, that something tells Emacs to ignore case
conversion, it just uses 2 different ways of saying that.  If we are
not going to abolish unibyte buffers, how is the difference important?


* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:06               ` Andreas Schwab
@ 2014-03-29 11:12                 ` Eli Zaretskii
  2014-03-29 16:11                   ` Stephen J. Turnbull
  2014-03-29 15:37                 ` Stephen J. Turnbull
  1 sibling, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 11:12 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  monnier@IRO.UMontreal.CA,  emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 12:06:31 +0100
> 
> Unibyte is a pure space optimisation.

I think it is (or at least was) also a speed optimization.  Reading or
writing a huge buffer full of eight-bit characters might be
significantly slower if they are in their multibyte representation.
Perhaps we should measure that.




* Re: Unibyte characters, strings, and buffers
  2014-03-28 18:45           ` Daniel Colascione
  2014-03-28 19:35             ` Glenn Morris
@ 2014-03-29 11:17             ` Stephen J. Turnbull
  2014-03-29 11:22               ` Eli Zaretskii
  1 sibling, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29 11:17 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: Eli Zaretskii, monnier, emacs-devel

Daniel Colascione writes:
 > On 03/28/2014 03:28 AM, Stephen J. Turnbull wrote:
 > > Fortunately for me, I am *intimately* familiar with XEmacs internals,
 > > and therefore RMS won't let me write this code for Emacs. :-)
 > 
 > What now? People who have contributed to XEmacs can't contribute to Emacs?

Not a problem, when put that way.

However, I'm familiar with a specific implementation of the ideas that
I describe.  That implementation is not FSF-assigned, and therefore
anything I write is tainted with the fear of copyright infringement if
I claim it's mine but it looks like Ben's or Martin's.  It would be
possible, but somebody would have to spend a lot of time studying
XEmacs and confirming nothing I wrote was an echo of code I'd studied.
Then they'd be tainted by that knowledge ....

What I can do freely is discuss design in general terms, and that's
what I've done.

This is all in the guidelines for reimplementers of non-GNU software.


* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:00                   ` Andreas Schwab
@ 2014-03-29 11:18                     ` Eli Zaretskii
  2014-03-29 11:30                       ` Andreas Schwab
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 11:18 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Date: Sat, 29 Mar 2014 12:00:32 +0100
> Cc: stephen@xemacs.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > The next question is what to do with unibyte strings, which are
> > currently widely used for pure-ASCII text.
> 
> You do the same, obviously (the representation wouldn't change for
> pure-ascii, of course).

OK, so we get rid of unibyte strings as well.

Next question: what happens to implementation of encoding?  It
currently produces raw bytes.  Should it produce eight-bit characters
instead?  If not, who or what will convert raw bytes into eight-bit
characters, when they are inserted into a buffer or string, and who or
what will convert them back when they are written to a file or sent to
a process?
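
To make "it currently produces raw bytes" concrete, here is a small
sketch.  The helper name `demo-encode-bytes' is hypothetical; only
`encode-coding-string' and the other built-ins are real.  It encodes
one character and reports whether the result is multibyte, followed by
the elements of the resulting string.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-encode-bytes (char coding)
  "Encode CHAR with CODING; return (MULTIBYTE-P BYTE...)."
  (let ((encoded (encode-coding-string (string char) coding)))
    (cons (multibyte-string-p encoded)
          (append encoded nil))))   ; elements of a unibyte string are bytes

(demo-encode-bytes #xE0 'utf-8)
;; => (nil 195 160)
```

The result is a unibyte string whose elements are plain values in the
range 0..255, which is exactly what the questions above are about.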


* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:17             ` Stephen J. Turnbull
@ 2014-03-29 11:22               ` Eli Zaretskii
  2014-03-29 16:03                 ` Stephen J. Turnbull
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 11:22 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,
>     monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 20:17:59 +0900
> 
> What I can do freely is discuss design in general terms

I'm quite sure you can also describe the fine details of the
implementation, as long as you don't describe that by posting the
actual code.

AFAIU, copyright protects only the form, not the ideas.  Ideas can be
described and discussed at any level of detail, because implementation
of those same ideas by another person will never, except by improbable
accident, be so close to the original as to be suspected of copying.


* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:07                 ` Eli Zaretskii
@ 2014-03-29 11:30                   ` David Kastrup
  2014-03-29 12:58                     ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-29 11:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>
>> If we want different semantics for case-fold-search in binary buffers,
>> then the solution is setting a buffer-local setting of case-fold-search
>> when opening a buffer intended to be manipulated in a binary way.
>> 
>> But the unibyte setting of the buffer should not affect normal character
>> and string operation semantics.  It is a buffer implementation detail
>> that should not really have a visible effect apart from making some
>> buffer operations impossible.
>
> But if case-fold-search is set to nil in unibyte buffers, and (as we
> know) the buffer-local value of case-fold-search does affect functions
> that compare text, either because they consult case-fold-search
> directly or because they consult the buffer-local case-table, then the
> unibyte setting does affect the semantics, albeit indirectly.

No, it doesn't.  Correlation is not causation.  Just because some
operations will create a unibyte buffer as well as set a
case-fold-search variable does not mean that the unibyte setting of the
buffer is the cause of the case-fold-search setting in any meaningful
way.

>> If something chooses a unibyte buffer representation for some reason,
>> it is the responsibility of the same something to switch character
>> operations and case-fold-search etc to something making sense in the
>> context of its operation.  That may well be through some buffer-local
>> setting of case-fold-search etc, but it is not tied to the internal
>> representation of the buffer contents.
>
> Not that I disagree with you, but why does it matter whether some code
> makes a buffer unibyte or sets its case-fold-search, to achieve that
> goal?  In both cases, that something tells Emacs to ignore case
> conversion, it just uses 2 different ways of saying that.  If we are
> not going to abolish unibyte buffers, how is the difference important?

Because it makes things predictable.  I can take a look at the setting
of case-fold-search in order to figure out what will happen regarding
the case folding of searches.  If I want them to occur, I can set the
variable, and if I don't want them to occur, I can clear that variable.

I can perfectly well do that with a let-binding, and it will work
throughout the let-binding without having some buffer properties
interfere.
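
A minimal sketch of that let-binding approach, with a made-up helper
name (`demo-match-p'); the strings are ASCII-only on purpose, so the
result does not depend on the case table of whatever buffer happens to
be current:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-match-p (regexp string fold)
  "Return t if REGEXP matches STRING, folding case when FOLD is non-nil."
  (let ((case-fold-search fold))    ; dynamic binding, undone on exit
    (and (string-match-p regexp string) t)))

(demo-match-p "abc" "xABCx" t)    ; => t
(demo-match-p "abc" "xABCx" nil)  ; => nil
```

The caller decides the folding behaviour explicitly, and the previous
value of case-fold-search is restored when the `let' form exits.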

-- 
David Kastrup




* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:18                     ` Eli Zaretskii
@ 2014-03-29 11:30                       ` Andreas Schwab
       [not found]                         ` <83ha6hduzz.fsf@gnu.org>
  0 siblings, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 11:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Next question: what happens to implementation of encoding?  It
> currently produces raw bytes.  Should it produce eight-bit characters
> instead?  If not, who or what will convert raw bytes into eight-bit
> characters, when they are inserted into a buffer or string, and who or
> what will convert them back when they are written to a file or sent to
> a process?

Writing out a character in the eight-bit charset will produce an
eight-bit character, and vice-versa.

The process is the same, just put on a lower level.  The only visible
difference will be the value of aref: it will produce values in the
range of the eight-bit charset instead of 128-255.  The challenge will
be to find and fix all such assumptions.
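
As an illustration of the kind of assumption meant here, the sketch
below (hypothetical helper `demo-element-byte', not existing Emacs
code) recovers the byte value of a string element regardless of
whether the string is unibyte or holds eight-bit characters.  It
relies on the eight-bit characters occupying the codes #x3FFF80
through #x3FFFFF.

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-element-byte (string index)
  "Return the byte value of element INDEX of STRING.
Return nil if the element is a non-ASCII character rather than a byte."
  (let ((c (aref string index)))
    (cond ((< c 128) c)                          ; ASCII: same either way
          ((not (multibyte-string-p string)) c)  ; unibyte: already 128..255
          ((>= c #x3FFF80) (- c #x3FFF00))       ; eight-bit char -> byte
          (t nil))))                             ; a real character

(demo-element-byte (unibyte-string #xC0) 0)                       ; => 192
(demo-element-byte (string-to-multibyte (unibyte-string #xC0)) 0) ; => 192
(demo-element-byte (string #xC0) 0)                               ; => nil
```

Code that compares the result of `aref' directly against 128..255
would have to be audited along these lines.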

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:30                   ` David Kastrup
@ 2014-03-29 12:58                     ` Eli Zaretskii
  2014-03-29 13:15                       ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 12:58 UTC (permalink / raw)
  To: David Kastrup; +Cc: stephen, monnier, emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: stephen@xemacs.org,  monnier@IRO.UMontreal.CA,  emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 12:30:21 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: David Kastrup <dak@gnu.org>
> >
> >> If we want different semantics for case-fold-search in binary buffers,
> >> then the solution is setting a buffer-local setting of case-fold-search
> >> when opening a buffer intended to be manipulated in a binary way.
> >> 
> >> But the unibyte setting of the buffer should not affect normal character
> >> and string operation semantics.  It is a buffer implementation detail
> >> that should not really have a visible effect apart from making some
> >> buffer operations impossible.
> >
> > But if case-fold-search is set to nil in unibyte buffers, and (as we
> > know) the buffer-local value of case-fold-search does affect functions
> > that compare text, either because they consult case-fold-search
> > directly or because they consult the buffer-local case-table, then the
> > unibyte setting does affect the semantics, albeit indirectly.
> 
> No, it doesn't.  Correlation is not causation.

But in this case, it is: they both stem from the same cause.

> > Not that I disagree with you, but why does it matter whether some code
> > makes a buffer unibyte or sets its case-fold-search, to achieve that
> > goal?  In both cases, that something tells Emacs to ignore case
> > conversion, it just uses 2 different ways of saying that.  If we are
> > not going to abolish unibyte buffers, how is the difference important?
> 
> Because it makes things predictable.  I can take a look at the setting
> of case-fold-search in order to figure out what will happen regarding
> the case folding of searches.  If I want them to occur, I can set the
> variable, and if I don't want them to occur, I can clear that variable.

The same is true about the unibyte flag.




* Re: Unibyte characters, strings, and buffers
  2014-03-29 12:58                     ` Eli Zaretskii
@ 2014-03-29 13:15                       ` David Kastrup
  0 siblings, 0 replies; 103+ messages in thread
From: David Kastrup @ 2014-03-29 13:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: stephen@xemacs.org,  monnier@IRO.UMontreal.CA,  emacs-devel@gnu.org
>> Date: Sat, 29 Mar 2014 12:30:21 +0100
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> >> From: David Kastrup <dak@gnu.org>
>> >
>> >> If we want different semantics for case-fold-search in binary buffers,
>> >> then the solution is setting a buffer-local setting of case-fold-search
>> >> when opening a buffer intended to be manipulated in a binary way.
>> >> 
>> >> But the unibyte setting of the buffer should not affect normal character
>> >> and string operation semantics.  It is a buffer implementation detail
>> >> that should not really have a visible effect apart from making some
>> >> buffer operations impossible.
>> >
>> > But if case-fold-search is set to nil in unibyte buffers, and (as we
>> > know) the buffer-local value of case-fold-search does affect functions
>> > that compare text, either because they consult case-fold-search
>> > directly or because they consult the buffer-local case-table, then the
>> > unibyte setting does affect the semantics, albeit indirectly.
>> 
>> No, it doesn't.  Correlation is not causation.
>
> But in this case, it is: they both stem from the same cause.

That's just word games, and pretty bad ones at that.  Not interested.

>> > Not that I disagree with you, but why does it matter whether some
>> > code makes a buffer unibyte or sets its case-fold-search, to
>> > achieve that goal?  In both cases, that something tells Emacs to
>> > ignore case conversion, it just uses 2 different ways of saying
>> > that.  If we are not going to abolish unibyte buffers, how is the
>> > difference important?
>> 
>> Because it makes things predictable.  I can take a look at the
>> setting of case-fold-search in order to figure out what will happen
>> regarding the case folding of searches.  If I want them to occur, I
>> can set the variable, and if I don't want them to occur, I can clear
>> that variable.
>
> The same is true about the unibyte flag.

So then we have two competing settings.  How does that make things
predictable?

I think that there is nothing missing for reasonable people to come to a
decision by now, so there is nothing to be gained from me participating
further in this absurd spectacle.

-- 
David Kastrup




* Re: Unibyte characters, strings, and buffers
       [not found]                         ` <83ha6hduzz.fsf@gnu.org>
@ 2014-03-29 14:30                           ` Andreas Schwab
  2014-03-29 14:47                             ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 14:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Not just aref, I think: we currently pass SSDATA(s) directly to libc
> I/O functions in some places.

Which part of "on a lower level" did you miss?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29 14:30                           ` Andreas Schwab
@ 2014-03-29 14:47                             ` Eli Zaretskii
  0 siblings, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 14:47 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel

> From: Andreas Schwab <schwab@linux-m68k.org>
> Cc: stephen@xemacs.org,  monnier@IRO.UMontreal.CA,  emacs-devel@gnu.org
> Date: Sat, 29 Mar 2014 15:30:54 +0100
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Not just aref, I think: we currently pass SSDATA(s) directly to libc
> > I/O functions in some places.
> 
> Which part of "on a lower level" did you miss?

I didn't.




* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:06               ` Andreas Schwab
  2014-03-29 11:12                 ` Eli Zaretskii
@ 2014-03-29 15:37                 ` Stephen J. Turnbull
  2014-03-29 15:55                   ` David Kastrup
  2014-03-29 15:58                   ` Andreas Schwab
  1 sibling, 2 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29 15:37 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Eli Zaretskii, monnier, emacs-devel

Andreas Schwab writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
 > 
 > > *sigh*  No, it's about unibyte being a premature pessimization.
 > 
 > Unibyte is a pure space optimisation.

It may be a space optimization, but it's hardly pure.  Else this
discussion wouldn't be happening.  And `string-as-unibyte' exposes the
internal representation of strings to Lisp.

 > Everything else should work as if all bytes in the range 128-255
 > are decoded in the eight-bit charset.

There seem to be conflicting opinions about that, and I would
certainly disagree as there are scads of European charsets that
happily fit into bytes.  I see no reason why character operations
(such as case conversion) shouldn't work transparently on bytes in GR
interpreted as the corresponding Latin-1 (or any ISO Latin) charset --
with a little extra metadata in (internal unibyte) buffers and strings
to indicate the charset implied.  (This charset is independent of the
various coding systems associated with buffers; it only says how to
interpret a byte as a character in operations on characters in
buffers.)


* Re: Unibyte characters, strings, and buffers
  2014-03-29 15:37                 ` Stephen J. Turnbull
@ 2014-03-29 15:55                   ` David Kastrup
  2014-03-29 16:28                     ` Stephen J. Turnbull
  2014-03-30  0:24                     ` Richard Stallman
  2014-03-29 15:58                   ` Andreas Schwab
  1 sibling, 2 replies; 103+ messages in thread
From: David Kastrup @ 2014-03-29 15:55 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Andreas Schwab writes:
>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > 
>  > > *sigh*  No, it's about unibyte being a premature pessimization.
>  > 
>  > Unibyte is a pure space optimisation.
>
> It may be a space optimization, but it's hardly pure.  Else this
> discussion wouldn't be happening.  And `string-as-unibyte' exposes the
> internal representation of strings to Lisp.
>
>  > Everything else should work as if all bytes in the range 128-255
>  > are decoded in the eight-bit charset.
>
> There seem to be conflicting opinions about that, and I would
> certainly disagree as there are scads of European charsets that
> happily fit into bytes.

That's not what unibyte buffers are for.  They are for byte streams, not
characters.  You would not want to edit a unibyte buffer, for example,
by inserting text and stuff.

Now for byte stream manipulation, code points other than 0..255 are a
nuisance.  Certainly a larger nuisance than having to clear
case-fold-search if you really want to do a byte search.

> I see no reason why character operations (such as case conversion)
> shouldn't work transparently on bytes in GR interpreted as the
> corresponding Latin-1 (or any ISO Latin) charset -- with a little
> extra metadata in (internal unibyte) buffers and strings to indicate
> the charset implied.  (This charset is independent of the various
> coding systems associated with buffers; it only says how to interpret
> a byte as a character in operations on characters in buffers.)

We have that "extra metadata", it is the unibyte flag.  But I consider
it a mistake to use it for anything but "character codes in this buffer
happen to range from 0..255 rather than 0..1000000 or whatever".

And since Unicode 128..255 happens to be the latin-1 plane where the
latin-1 plane is defined at all, this will mean that the result will
behave like the latin-1 plane.

Exactly because Emacs has _one_ underlying character set which happens
to be Unicode.

Which does not mean that it would be a good idea to use unibyte
buffers/strings for actual text that happens to be Latin-1 only.

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-03-29 15:37                 ` Stephen J. Turnbull
  2014-03-29 15:55                   ` David Kastrup
@ 2014-03-29 15:58                   ` Andreas Schwab
  2014-03-29 16:35                     ` Stephen J. Turnbull
  1 sibling, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 15:58 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> There seem to be conflicting opinions about that, and I would
> certainly disagree as there are scads of European charsets that
> happily fit into bytes.

Unibyte strings are about raw bytes, not characters.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:22               ` Eli Zaretskii
@ 2014-03-29 16:03                 ` Stephen J. Turnbull
  2014-03-31 15:22                   ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29 16:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel

Eli Zaretskii writes:

 > I'm quite sure you can also describe the fine details of the
 > implementation, as long as you don't describe that by posting the
 > actual code.

No, that's not necessarily the case.  At least in the U.S., the
criteria are expressiveness, originality, and fixation in a medium.
Email is such a medium.  Obviously, design can be original.  Design
decisions are rarely dictated by the one feasible way to do it, and if
not, design is an expressive act and subject to copyright.

I don't know if Richard is still so cautious, but the above reasoning
is why would-be contributors of work-alike software to GNU are advised
to use different algorithms and data structures from the original in
their implementations.

 > AFAIU, copyright protects only the form, not the ideas.  Ideas can
 > be described and discussed at any level of detail, because
 > implementation of those same ideas by another person will never,
 > except by improbable accident, be so close to the original as to be
 > suspected of copying.

Unfortunately, many cases that some observers believe involve
independent invention in fact were resolved in favor of the plaintiff
on the basis that the appearance was sufficiently similar, and the
defendant couldn't prove non-copying.[1]  Your "probability" argument
doesn't hold up.

Footnotes: 
[1]  Copyright infringement is a tort, not a crime, here.  Criminal
infringement puts the burden of proof squarely on the prosecutor.
Civil cases, however, are based on the "preponderance of evidence".

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 11:12                 ` Eli Zaretskii
@ 2014-03-29 16:11                   ` Stephen J. Turnbull
  0 siblings, 0 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29 16:11 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Andreas Schwab, monnier, emacs-devel

Eli Zaretskii writes:

 > I think [unibyte] is (or at least was) also a speed optimization.

It is.  Random access to position N in a multibyte buffer is O(N) on
average, and O(log N) with a position cache as used in XEmacs and, I
believe, in GNU Emacs too (I haven't looked at GNU Emacs's
implementation of buffer movement since about v22, though).  This slows
down mbox-based MUAs like VM and RMail quite a bit if people use 8-bit
or binary content-transfer-encodings in their messages.
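
To make the O(N) cost concrete, here is a minimal Python sketch, not
Emacs code.  It assumes plain UTF-8 as a stand-in for the multibyte
internal representation, and the function name char_to_byte_offset is
made up for illustration.  Without a position cache, finding the N-th
character means scanning the bytes from the start:

```python
def char_to_byte_offset(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th character (0-based) in UTF-8 data."""
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character; any other byte
        # starts a new one.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")


# "a" is 1 byte, "à" is 2 bytes, "€" is 3 bytes, so "z" starts at byte 6.
sample = "aà€z".encode("utf-8")
print(char_to_byte_offset(sample, 3))
```

In a unibyte buffer the same lookup is a direct index, because character
position and byte position coincide.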

 > Reading or writing a huge buffer full of eight-bit characters might
 > be significantly slower if they are in their multibyte
 > representation.  Perhaps we should measure that.

This isn't true (Ben did measurements, as have the Python folks).
Coding systems are way faster than I/O.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 15:55                   ` David Kastrup
@ 2014-03-29 16:28                     ` Stephen J. Turnbull
  2014-03-29 17:00                       ` David Kastrup
  2014-03-29 17:08                       ` Andreas Schwab
  2014-03-30  0:24                     ` Richard Stallman
  1 sibling, 2 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29 16:28 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > That's not what unibyte buffers are for.  They are for byte
 > streams, not characters.  You would not want to edit a unibyte
 > buffer, for example, by inserting text and stuff.

I beg to differ.  I would like to edit RFC 822 headers for HTTP, SMTP,
and other such wire protocols.  This is precisely the use case that
convinced van Rossum to restore %-formatting for bytes in Python 3.5
(to be released in about 18 months).
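
For illustration, a small Python sketch of that use case, assuming
Python 3.5 or later where %-formatting on bytes is available.  The
helper name header_line and the header values are invented for the
example:

```python
def header_line(name: bytes, value: bytes) -> bytes:
    """Build one wire-protocol header line without leaving the bytes type."""
    # %-formatting on bytes objects; %s accepts bytes-like arguments.
    return b"%s: %s\r\n" % (name, value)


print(header_line(b"Host", b"example.org"))
```

The ASCII "words" of the protocol are edited directly as bytes; no
decoding or encoding step is involved.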

 > We have that "extra metadata", it is the unibyte flag.

Yes, I know, but my point is that it should be purely for use of the
internal implementation, and probably restricted to the C level.

 > But I consider it a mistake to use it for anything but "character
 > codes in this buffer happen to range from 0..255 rather than
 > 0..1000000 or whatever".

I sympathize, though I think it's overkill for Emacs to have separate
bytes and text types visible at the Lisp level.  FWIW, that's a big
step toward the design approach taken by Python 3, which has both
bytes and text, but you can't mix them without an explicit encoding or
decoding step, and the internal encoding of text is not exposed to
Python functions at all.
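
A minimal sketch of that Python 3 behaviour, with made-up sample data:
mixing the two types raises TypeError, and an explicit decoding or
encoding step with a named codec is needed to cross between them.

```python
raw = b"caf\xe9"          # bytes: no character interpretation implied

try:
    raw + "!"             # bytes + str is rejected outright
    mixed = True
except TypeError:
    mixed = False

text = raw.decode("latin-1")         # explicit decoding step
combined = text + "!"                # now both operands are text
encoded = combined.encode("latin-1") # explicit encoding step back to bytes

print(mixed, combined, encoded)
```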

 > And since Unicode 128..255 happens to be the latin-1 plane where the
 > latin-1 plane is defined as all, this will mean that the result will
 > behave like the latin-1 plane.

That's not necessarily true.  It just requires a slightly more complex
design, which would be appropriate for Emacsen (as compared to Python).

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 15:58                   ` Andreas Schwab
@ 2014-03-29 16:35                     ` Stephen J. Turnbull
  2014-03-29 17:06                       ` Andreas Schwab
  0 siblings, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-29 16:35 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Eli Zaretskii, monnier, emacs-devel

Andreas Schwab writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
 > 
 > > There seem to be conflicting opinions about that, and I would
 > > certainly disagree as there are scads of European charsets that
 > > happily fit into bytes.
 > 
 > Unibyte strings are about raw bytes, not characters.

Obviously false, since bytes 0-127 are evidently interpreted as ASCII
at need.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 16:28                     ` Stephen J. Turnbull
@ 2014-03-29 17:00                       ` David Kastrup
  2014-03-30  2:05                         ` Stephen J. Turnbull
  2014-03-29 17:08                       ` Andreas Schwab
  1 sibling, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-29 17:00 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:

[...]

>  > And since Unicode 128..255 happens to be the latin-1 plane where
>  > the latin-1 plane is defined as all, this will mean that the result
>  > will behave like the latin-1 plane.
>
> That's not necessarily true.

Sure.  It depends on whether you value your users' sanity.

> It just requires a slightly more complex design, which would be
> appropriate for Emacsen (as compared to Python).

If the "slightly more complexity" hits in unexpected places, it's going
to end up a liability.  Having more than one charset to work with if
characters themselves don't contain a charset specification is affecting
a load of stuff that can then conceivably work in more than one way.

Unicode meaningfully uses values 128..255; bytes meaningfully use values
128..255.  When one wants to work without surprises in both cases,
converting strings to characters will use 128..255 in either case.

Differentiating is, of course, possible.  One reasonably cute choice
would be mapping bytes (as opposed to characters) 128..255 to integers
-128..-1.  But if you are talking about case-fold-search semantics,
you'll actually need to remap 0..127 as well (they are more relevant
than 128..255).  And then things get really ugly.
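
The "cute choice" can be sketched in a few lines of Python.  This is
purely illustrative and not something Emacs does; the function names
are hypothetical:

```python
def byte_to_code(b: int) -> int:
    """Map a byte 0..255 to an integer code, sending 128..255 to -128..-1."""
    return b - 256 if b >= 128 else b


def code_to_byte(c: int) -> int:
    """Inverse mapping: negative codes go back to bytes 128..255."""
    return c + 256 if c < 0 else c


print(byte_to_code(0xE9), byte_to_code(ord("A")))
```

Note that bytes 0..127 still map to the same integers as the ASCII
characters, which is exactly the part that would also need remapping
for case folding to tell bytes and characters apart.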

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29  9:23             ` Stephen J. Turnbull
                                 ` (3 preceding siblings ...)
  2014-03-29 11:06               ` Andreas Schwab
@ 2014-03-29 17:01               ` Nathan Trapuzzano
  2014-03-29 17:08                 ` Nathan Trapuzzano
  2014-03-29 17:16                 ` David Kastrup
  4 siblings, 2 replies; 103+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 17:01 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> What is relevant is how to represent byte streams in Emacs.  The
> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
> characters.  It is *extremely* convenient if the first 128 of those
> bytes correspond to the ASCII coded character set, because so many
> wire protocols use ASCII "words" syntactically.  The other 128 don't
> matter much, so why not just use the extremely convenient Latin-1 set
> for them?

Sorry if someone brought this up already, but one reason raw bytes
shouldn't be represented as Latin-1 characters is that the "raw
bytes"-ness would be lost when writing them back to disk if the stream
also contained characters outside the Latin-1 range.

For example, say we decode a stream of raw bytes as utf8, but that the
stream contains some non-utf8 sequences.  IIUC, Emacs will interpret
those as "raw bytes", so that when it goes to encode the string to write
it back, they will be written back verbatim.  Whereas, if they had been
interpreted as Latin-1 characters, they would get written back as the
UTF8 equivalents.  Hence you have the odd situation where you can decode
and then encode and end up with a different string.
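
The effect is easy to reproduce outside Emacs.  In this Python sketch
(sample byte chosen arbitrarily), a stray byte that is not valid UTF-8
is treated as a Latin-1 character and then written out as UTF-8:

```python
stray = b"\xe9"                        # not a valid UTF-8 sequence by itself

as_latin1 = stray.decode("latin-1")    # interpreted as the character "é"
rewritten = as_latin1.encode("utf-8")  # "é" encodes to two bytes in UTF-8

print(stray, rewritten, stray == rewritten)
```

The byte that was read is not the byte sequence that gets written back.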

Someone brought up Python in another post.  Python (version 3 at least)
does the same thing when, e.g., interpreting filenames.  If you pass a
string (_not_ bytes) to os.listdir, but the contents of the directory
can't all be decoded as utf-8, it will return strings (_not_ bytes)
where the non-utf8 sequences are Python-specific "characters" (in the
Unicode private use areas I believe) representing "raw bytes",
i.e. entities to be written back to the disk as the same raw sequences
that were read therefrom.
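
A short Python 3 sketch of that mechanism, using the "surrogateescape"
error handler directly (the file name is invented).  In CPython the
stand-in code points are lone surrogates in the range U+DC80..U+DCFF
rather than private-use characters:

```python
name = b"caf\xe9.txt"     # 0xE9 is not valid UTF-8 in this position

decoded = name.decode("utf-8", errors="surrogateescape")
restored = decoded.encode("utf-8", errors="surrogateescape")

# The undecodable byte 0xE9 became the lone surrogate U+DCE9.
print(hex(ord(decoded[3])), restored == name)
```

Decoding followed by encoding yields the original bytes, which is the
round-trip property described above.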

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 16:35                     ` Stephen J. Turnbull
@ 2014-03-29 17:06                       ` Andreas Schwab
  0 siblings, 0 replies; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 17:06 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Andreas Schwab writes:
>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > 
>  > > There seem to be conflicting opinions about that, and I would
>  > > certainly disagree as there are scads of European charsets that
>  > > happily fit into bytes.
>  > 
>  > Unibyte strings are about raw bytes, not characters.
>
> Obviously false, since bytes 0-127 are evidently interpreted as ASCII
> at need.

That does not contradict my statement.

Andreas.




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 16:28                     ` Stephen J. Turnbull
  2014-03-29 17:00                       ` David Kastrup
@ 2014-03-29 17:08                       ` Andreas Schwab
  1 sibling, 0 replies; 103+ messages in thread
From: Andreas Schwab @ 2014-03-29 17:08 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: David Kastrup, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> I beg to differ.  I would like to edit RFC 822 headers for HTTP, SMTP,
> and other such wire protocols.

Nothing stops you from editing eight-bit characters.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 17:01               ` Nathan Trapuzzano
@ 2014-03-29 17:08                 ` Nathan Trapuzzano
  2014-03-29 17:18                   ` David Kastrup
  2014-03-29 17:16                 ` David Kastrup
  1 sibling, 1 reply; 103+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 17:08 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> For example, say we decode a stream of raw bytes as utf8, but that the
> stream contains some non-utf8 sequences.

Of course, most programming languages would simply refuse to decode by,
e.g., throwing an exception.  But that's not really appropriate for an
editor.  On one hand, you need some way to distinguish between
characters and bytes, even if the distinction's not made by the type
system; on the other hand, an _editor_ of all things should be able to
deal with both kinds at the same time without the distinction being
lost, and Emacs does a tremendous job at this IMO.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 17:01               ` Nathan Trapuzzano
  2014-03-29 17:08                 ` Nathan Trapuzzano
@ 2014-03-29 17:16                 ` David Kastrup
  1 sibling, 0 replies; 103+ messages in thread
From: David Kastrup @ 2014-03-29 17:16 UTC (permalink / raw)
  To: emacs-devel

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>> What is relevant is how to represent byte streams in Emacs.  The
>> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
>> characters.  It is *extremely* convenient if the first 128 of those
>> bytes correspond to the ASCII coded character set, because so many
>> wire protocols use ASCII "words" syntactically.  The other 128 don't
>> matter much, so why not just use the extremely convenient Latin-1 set
>> for them?
>
> Sorry if someone brought this up already, but one reason raw bytes
> shouldn't be represented as Latin-1 characters is that the "raw
> bytes"-ness would be lost when writing them back to disk if the stream
> also contained characters outside the Latin-1 range.

No.

> For example, say we decode a stream of raw bytes as utf8, but that the
> stream contains some non-utf8 sequences.  IIUC, Emacs will interpret
> those as "raw bytes", so that when it goes to encode the string to write
> it back, they will be written back verbatim.

"Raw bytes" here are represented as particular characters outside of the
Unicode range.  They are representable in multibyte buffers.  They never
were representable in unibyte buffers.  While it is conceivable to map
characters 128..255 in unibyte strings/buffers to the respective
character codes outside of the Unicode range, that would render
programmatic manipulation of bytes strenuous.

> Whereas, if they had been interpreted as Latin-1 characters, they
> would get written back as the UTF8 equivalents.  Hence you have the
> odd situation where you can decode and then encode and end up with a
> different string.

No, you can't unless you decode into a unibyte buffer, and then all bets
are off regarding reencoding.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 17:08                 ` Nathan Trapuzzano
@ 2014-03-29 17:18                   ` David Kastrup
  2014-03-29 17:33                     ` Nathan Trapuzzano
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-29 17:18 UTC (permalink / raw)
  To: emacs-devel

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> Nathan Trapuzzano <nbtrap@nbtrap.com> writes:
>
>> For example, say we decode a stream of raw bytes as utf8, but that the
>> stream contains some non-utf8 sequences.
>
> Of course, most programming languages would simply refuse to decode by,
> e.g., throwing an exception.  But that's not really appropriate for an
> editor.  On one hand, you need some way to distinguish between
> characters and bytes, even if the distinction's not made by the type
> system; on the other hand, an _editor_ of all things should be able to
> deal with both kinds at the same time without the distinction being
> lost, and Emacs does a tremendous job at this IMO.

_De_coding into a _unibyte_ buffer is a lossy operation by definition
since a unibyte buffer cannot hold the full set of values that
_de_coding delivers.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 17:18                   ` David Kastrup
@ 2014-03-29 17:33                     ` Nathan Trapuzzano
  2014-03-30  0:24                       ` Richard Stallman
  0 siblings, 1 reply; 103+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 17:33 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> _De_coding into a _unibyte_ buffer is a lossy operation by definition
> since a unibyte buffer cannot hold the full set of values that
> _de_coding delivers.

I know.  I was responding to what seemed to be a suggestion to just
conflate Latin-1 characters with raw bytes.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings and buffers
  2014-03-29  6:40                   ` Eli Zaretskii
@ 2014-03-29 18:57                     ` Paul Eggert
  2014-03-29 19:46                       ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Paul Eggert @ 2014-03-29 18:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii wrote:
> I suggested it only for char-equal.

That would make char-equal incompatible with upcase, downcase, etc. when 
in a unibyte buffer, which would be incoherent.

Unless you're also suggesting that upcase, downcase, etc. should be 
no-ops in unibyte buffers?  But then they would be incompatible with 
searching.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings and buffers
  2014-03-28 18:52               ` Eli Zaretskii
  2014-03-28 19:21                 ` Paul Eggert
  2014-03-28 20:23                 ` Stefan Monnier
@ 2014-03-29 19:34                 ` Stefan Monnier
  2 siblings, 0 replies; 103+ messages in thread
From: Stefan Monnier @ 2014-03-29 19:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Paul Eggert, emacs-devel

> I suggested a solution: ignore case-fold-search in unibyte buffers.

As mentioned in your original message, char-equal should not pay
attention to the current buffer.


        Stefan



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings and buffers
  2014-03-29 18:57                     ` Paul Eggert
@ 2014-03-29 19:46                       ` Eli Zaretskii
  0 siblings, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-29 19:46 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

> Date: Sat, 29 Mar 2014 11:57:53 -0700
> From: Paul Eggert <eggert@cs.ucla.edu>
> Cc: emacs-devel@gnu.org
> 
> Eli Zaretskii wrote:
> > I suggested it only for char-equal.
> 
> That would make char-equal incompatible with upcase, downcase, etc. when 
> in a unibyte buffer, which would be incoherent.
> 
> Unless you're also suggesting that upcase, downcase, etc. should be 
> no-ops in unibyte buffers?  But then they would be incompatible with 
> searching.

So now _you_ are looking for a perfect solution?



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 15:55                   ` David Kastrup
  2014-03-29 16:28                     ` Stephen J. Turnbull
@ 2014-03-30  0:24                     ` Richard Stallman
  2014-03-30  3:32                       ` Stefan Monnier
  1 sibling, 1 reply; 103+ messages in thread
From: Richard Stallman @ 2014-03-30  0:24 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Is there any need, nowadays, for a unibyte character to imply
a character set?

Originally unibyte buffers were meant as a backward compatibility
feature for old Emacs versions in which all buffers were unibyte.
Nowadays, I think we use unibyte buffers mainly (perhaps exclusively)
for buffers whose contents are largely not characters at all.
For those buffers, there is no reason to interpret the contents
as characters in any particular way.  We could consider them as
bytes, and nothing else.

This means converting those bytes to characters could be done by
explicit operations where you would specify what sort of conversion
you want.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 17:33                     ` Nathan Trapuzzano
@ 2014-03-30  0:24                       ` Richard Stallman
  2014-03-30  8:38                         ` Andreas Schwab
  0 siblings, 1 reply; 103+ messages in thread
From: Richard Stallman @ 2014-03-30  0:24 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: dak, emacs-devel

Maybe we should implement decoding unibyte text to produce multibyte text.

* A function could decode text from a unibyte buffer and put it in
  another buffer which is multibyte.

* A function could decode a whole unibyte buffer
  into the same buffer, and mark it as multibyte.

For encoding, vice versa.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-29 17:00                       ` David Kastrup
@ 2014-03-30  2:05                         ` Stephen J. Turnbull
  2014-03-30  9:01                           ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-30  2:05 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > It just requires a slightly more complex design, which would be
 > > appropriate for Emacsen (as compared to Python).
 > 
 > If the "slightly more complexity" hits in unexpected places, it's going
 > to end up a liability.  Having more than one charset to work with if
 > characters themselves don't contain a charset specification is affecting
 > a load of stuff that can then conceivably work in more than one
 > way.

I'm a little smarter than that.  The design I have in mind would be
transparent.  Maybe it wouldn't work; maybe it would be inefficient.
But one thing it wouldn't do is present a charset other than Unicode
to Lisp.



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30  0:24                     ` Richard Stallman
@ 2014-03-30  3:32                       ` Stefan Monnier
  2014-03-30 15:13                         ` Richard Stallman
  0 siblings, 1 reply; 103+ messages in thread
From: Stefan Monnier @ 2014-03-30  3:32 UTC (permalink / raw)
  To: Richard Stallman; +Cc: David Kastrup, emacs-devel

> For those buffers, there is no reason to interpret the contents
> as characters in any particular way.  We could consider them as
> bytes, and nothing else.

That's pretty much what we do nowadays already.


        Stefan



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30  0:24                       ` Richard Stallman
@ 2014-03-30  8:38                         ` Andreas Schwab
  2014-03-30 15:12                           ` Richard Stallman
  0 siblings, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-30  8:38 UTC (permalink / raw)
  To: rms; +Cc: Nathan Trapuzzano, dak, emacs-devel

Richard Stallman <rms@gnu.org> writes:

> * A function could decode text from a unibyte buffer and put it in
>   another buffer which is multibyte.
>
> * A function could decode a whole unibyte buffer
>   into the same buffer, and mark it as multibyte.

That's what decode-coding-region provides (except for changing the
multibyte flag).

Andreas.




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30  2:05                         ` Stephen J. Turnbull
@ 2014-03-30  9:01                           ` David Kastrup
  2014-03-30 12:13                             ` Stephen J. Turnbull
  2014-03-30 14:25                             ` Andreas Schwab
  0 siblings, 2 replies; 103+ messages in thread
From: David Kastrup @ 2014-03-30  9:01 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>  > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>
>  > > It just requires a slightly more complex design, which would be
>  > > appropriate for Emacsen (as compared to Python).
>  > 
>  > If the "slightly more complexity" hits in unexpected places, it's going
>  > to end up a liability.  Having more than one charset to work with if
>  > characters themselves don't contain a charset specification is affecting
>  > a load of stuff that can then conceivably work in more than one
>  > way.
>
> I'm a little smarter than that.

Building on smartness is relying on a limited resource.  It's not always
easy to find wingmen (pun intended but unworkable).

> The design I have in mind would be transparent.

I don't think it gets much more transparent than "unibyte flag only
marks the valid Unicode-in-Emacs character range".  I'm for the range
0..255, Andreas for something like 0..127 U 4194176..4194303 which
I find cumbersome for little return.
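
For reference, the two candidate ranges can be written down as a small
Python sketch.  The offset is simply derived from the numbers quoted
above (4194176 - 128), and the function names are hypothetical:

```python
RAW_BYTE_OFFSET = 4194176 - 128   # 0x3FFF00, derived from the range above


def byte_as_latin1_range(b: int) -> int:
    """Unibyte value read as a code in 0..255."""
    return b


def byte_as_raw_byte_range(b: int) -> int:
    """Unibyte value read as ASCII (0..127) or a raw-byte code above Unicode."""
    return b if b < 128 else b + RAW_BYTE_OFFSET


print(byte_as_latin1_range(0xE9), byte_as_raw_byte_range(0xE9))
```

Both mappings agree on 0..127; they differ only in where 128..255 land.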

> Maybe it wouldn't work; maybe it would be inefficient.  But one thing
> it wouldn't do is present a charset other than Unicode to Lisp.

Neither does the above.  Abolishing unibyte just means that
buffers/strings have only one possible character range.  That does not
really give any "transparency" per se from the Lisp level.  The
interesting level is the C level.  You need a byte stream representation
in C at some point anyway, and not being able to call this
representation either "string" or "buffer" may be neat in some manners
but will end up cumbersome in others.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30  9:01                           ` David Kastrup
@ 2014-03-30 12:13                             ` Stephen J. Turnbull
  2014-03-30 14:25                             ` Andreas Schwab
  1 sibling, 0 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-03-30 12:13 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > I don't think it gets much more transparent than "unibyte flag only
 > marks the valid Unicode-in-Emacs character range".  I'm for the
 > range 0..255,

It's easy to be more transparent in that case: no unibyte flag.
However, that delays detection of out-of-range characters to encoding
rather than the insert step.

 > Andreas for something like 0..127 U 4194176..4194303 which
 > I find cumbersome for little return.

Agreed.  If bytes are going to be non-characters, having a half-ASCII
type is just going to cause surprises when US English apps get
internationalized.

 > > Maybe it wouldn't work; maybe it would be inefficient.  But one
 > > thing it wouldn't do is present a charset other than Unicode to
 > > Lisp.
 > 
 > Neither does the above.  Abolishing unibyte just means that
 > buffers/strings have only one possible character range.

That's not really true.  Encoding and decoding will still constrain
ranges; as pointed out above, it delays detection on the one hand, on
the other avoids spurious errors when the user really does want to add
characters outside of the prespecified range for some reason.

 > That does not really give any "transparency" per se from the Lisp
 > level.

I disagree, based primarily on the experience of XEmacs that we can do
everything (with characters and bytes) that Emacs does[1], without
randomly injecting new bugs due to lack of unibyte that I can recall.
(Other bugs, yes, but bugs due to adapting code that used unibyte to
XEmacs where there is no unibyte, no.)

 > The interesting level is the C level.  You need a byte stream
 > representation in C at some point anyway, and not being able to
 > call this representation either "string" or "buffer" may be neat in
 > some manners but will end up cumbersome in others.

I don't see why you need that, actually.  Of course you need C level
streams for I/O, but I don't see why it needs to persist past decoding
into a buffer or string.

Footnotes: 
[1]  OK, we don't have a representation of "undecodable bytes".  But
that's not conceptually hard, just tedious enough that nobody's done
it yet.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30  9:01                           ` David Kastrup
  2014-03-30 12:13                             ` Stephen J. Turnbull
@ 2014-03-30 14:25                             ` Andreas Schwab
  2014-03-30 15:05                               ` David Kastrup
  1 sibling, 1 reply; 103+ messages in thread
From: Andreas Schwab @ 2014-03-30 14:25 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> I don't think it gets much more transparent than "unibyte flag only
> marks the valid Unicode-in-Emacs character range".  I'm for the range
> 0..255, Andreas for something like 0..127 U 4194176..4194303 which
> I find cumbersome for little return.

Before decoding there is no charset information yet, so using anything
other than the eight-bit charset would be wrong.  After decoding, the
eight-bit charset is used only for undecodable bytes.  That preserves
the distinction between encoded and decoded strings/buffers (except for
the uninteresting trivial ASCII decoding) in a world without unibyte
flag.

Andreas.




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30 14:25                             ` Andreas Schwab
@ 2014-03-30 15:05                               ` David Kastrup
  2014-03-30 15:39                                 ` Andreas Schwab
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-03-30 15:05 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: emacs-devel

Andreas Schwab <schwab@linux-m68k.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> I don't think it gets much more transparent than "unibyte flag only
>> marks the valid Unicode-in-Emacs character range".  I'm for the range
>> 0..255, Andreas for something like 0..127 U 4194176..4194303 which
>> I find cumbersome for little return.
>
> Before decoding there is no charset information yet, so using anything
> other than the eight-bit charset would be wrong.

When "right" does not buy you anything but trouble, why bother?

> After decoding, the eight-bit charset is used only for undecodable
> bytes.  That preserves the distinction between encoded and decoded
> strings/buffers (except for the uninteresting trivial ASCII decoding)
> in a world without unibyte flag.

The "uninteresting trivial ASCII" listens to case-fold-search just as
much as the latin-1 code page does.  So being "right" for half of the
coding range does not really buy anything.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: Unibyte characters, strings, and buffers
  2014-03-30  8:38                         ` Andreas Schwab
@ 2014-03-30 15:12                           ` Richard Stallman
  0 siblings, 0 replies; 103+ messages in thread
From: Richard Stallman @ 2014-03-30 15:12 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: nbtrap, dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > * A function could decode text from a unibyte buffer and put it in
    >   another buffer which is multibyte.
    >
    > * A function could decode a whole unibyte buffer
    >   into the same buffer, and mark it as multibyte.

    That's what decode-coding-region provides (except for changing the
    multibyte flag).

That "except" is the crucial point.  Currently we need to access both
unibyte text and multibyte text with the same setting of the multibyte
flag.  These two functions might eliminate the need for that.
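
The first of the two functions quoted above could be sketched with the
optional DESTINATION argument of `decode-coding-region'.  This is only
an illustration: `my-decode-buffer-into' and the buffer name are
hypothetical, and the sketch assumes new buffers are multibyte by
default.

```elisp
;; Hypothetical helper, not an existing Emacs function.
(defun my-decode-buffer-into (source coding-system)
  "Decode the bytes of unibyte buffer SOURCE using CODING-SYSTEM.
Return a new multibyte buffer holding the decoded text.
SOURCE itself is left unchanged."
  (let ((dest (generate-new-buffer "*decoded*")))
    (with-current-buffer source
      ;; Passing a buffer as DESTINATION inserts the decoded text
      ;; there instead of replacing the region in SOURCE.
      (decode-coding-region (point-min) (point-max) coding-system dest))
    dest))
```

With a helper like this, the code that reads the bytes and the code that
works on the decoded text each see a buffer with the appropriate
multibyte setting.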

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.


* Re: Unibyte characters, strings, and buffers
  2014-03-30  3:32                       ` Stefan Monnier
@ 2014-03-30 15:13                         ` Richard Stallman
  0 siblings, 0 replies; 103+ messages in thread
From: Richard Stallman @ 2014-03-30 15:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > For those buffers, there is no reason to interpret the contents
    > as characters in any particular way.  We could consider them as
    > bytes, and nothing else.

    That's pretty much what we do nowadays already.

If we make that 100% true, we could disconnect the multibyte flag
from operations (including case conversion) that pertain to text
rather than bytes.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.


* Re: Unibyte characters, strings, and buffers
  2014-03-30 15:05                               ` David Kastrup
@ 2014-03-30 15:39                                 ` Andreas Schwab
  0 siblings, 0 replies; 103+ messages in thread
From: Andreas Schwab @ 2014-03-30 15:39 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> The "uninteresting trivial ASCII" listens to case-fold-search just as
> much as the latin-1 code page does.  So being "right" for half of the
> coding range does not really buy anything.

It doesn't matter, undecoded is just a brief intermediate state most of
the time.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Unibyte characters, strings, and buffers
  2014-03-29 16:03                 ` Stephen J. Turnbull
@ 2014-03-31 15:22                   ` Eli Zaretskii
  2014-04-01  3:36                     ` Stephen J. Turnbull
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-03-31 15:22 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Sun, 30 Mar 2014 01:03:15 +0900
> Cc: dancol@dancol.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org
> 
> Eli Zaretskii writes:
> 
>  > AFAIU, copyright protects only the form, not the ideas.  Ideas can
>  > be described and discussed at any level of detail, because
>  > implementation of those same ideas by another person will never,
>  > except by improbable accident, be so close to the original as to be
>  > suspected of copying.
> 
> Unfortunately, many cases that some observers believe involve
> independent invention in fact were resolved in favor of the plaintiff
> on the basis that the appearance was sufficiently similar, and the
> defendant couldn't prove non-copying.  Your "probability" argument
> doesn't hold up.

Please show your references for that.  IANAL, but just by reading
related stuff on the Internet, I arrive at the opposite conclusion.
For example, here are citations from the last part of
http://en.wikipedia.org/wiki/Structure,_sequence_and_organization,
which seem to uphold my understanding and contradict yours:

  Competitors may create programs that provide essentially the same
  functionality as a protected program as long as they do not copy the
  code. The trend has been for courts to say that even if there are
  non-literal SSO similarities, there must be proof of copying. Some
  relevant court decisions allow for reverse-engineering to discover
  ideas that are not subject to copyright within a protected
  program. The ideas can be implemented in a competing program as long
  as the developers do not copy the original expression. With a clean
  room design approach one team of engineers derives a functional
  specification from the original code, and then a second team uses
  that specification to design and build the new code.
  [...]
  The judge [in the Oracle v Google case] asked for [both Google and
  Oracle] to comment on a ruling by the European Court of Justice in a
  similar case that found "Neither the functionality of a computer
  program nor the programming language and the format of data files
  used in a computer program in order to exploit certain of its
  functions constitute a form of expression. Accordingly, they do not
  enjoy copyright protection." On 31 May 2012 the judge ruled that "So
  long as the specific code used to implement a method is different,
  anyone is free under the Copyright Act to write his or her own code
  to carry out exactly the same function or specification of any
  methods used in the Java API."


* Re: Unibyte characters, strings, and buffers
  2014-03-31 15:22                   ` Eli Zaretskii
@ 2014-04-01  3:36                     ` Stephen J. Turnbull
  2014-04-01  7:42                       ` David Kastrup
  2014-04-01 15:16                       ` Eli Zaretskii
  0 siblings, 2 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-04-01  3:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel

Eli Zaretskii writes:

 > Please show your references for that.  IANAL, but just by reading
 > related stuff on the Internet, I arrive at the opposite conclusion.

Hey, I'm perfectly happy to go on that kind of evidence; the projects
I mostly work on don't require assignment and I see no need for it.
But we're talking here about Emacs, which is extremely careful about
these things.





* Re: Unibyte characters, strings, and buffers
  2014-04-01  3:36                     ` Stephen J. Turnbull
@ 2014-04-01  7:42                       ` David Kastrup
  2014-04-01  9:38                         ` Stephen J. Turnbull
  2014-04-01 15:19                         ` Eli Zaretskii
  2014-04-01 15:16                       ` Eli Zaretskii
  1 sibling, 2 replies; 103+ messages in thread
From: David Kastrup @ 2014-04-01  7:42 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > Please show your references for that.  IANAL, but just by reading
>  > related stuff on the Internet, I arrive at the opposite conclusion.
>
> Hey, I'm perfectly happy to go on that kind of evidence; the projects
> I mostly work on don't require assignment and I see no need for it.
> But we're talking here about Emacs, which is extremely careful about
> these things.

Well, I remember a tense moment in XEmacs history where a major past
contributor stated that he would rescind permission to redistribute his
work in XEmacs when XEmacs was going to get relicensed under GPLv3
(I think it was GPLv3 but it may have been some other licensing change
originating at GNU Emacs).  XEmacs developers are on reasonably good
speaking terms to resolve such a conflict.  In particular if one can
point to the FSF as being the "real" guilty party and external to the
project.

Emacs does not have that excuse.

But that's tangential: you don't just have to secure the goodwill of
important contributors.  Given the current laws, you have to secure the
goodwill of the contributors' heirs 90 years or something after their
death, people who are not even born yet.  Good luck with that.

The single biggest deficiency that corporations have over single persons
is that they are immortal.

    Nam Sibyllam quidem Cumis ego ipse oculis meis vidi in ampulla
    pendere, et cum illi pueri dicerent: Σίβυλλα τί θέλεις; respondebat
    illa: ἀποθανεῖν θέλω.

Would it have been Walt Disney's will that many of the motion pictures
of his youth are rotting away and getting irretrievably lost because the
company bearing his name is fighting against legislation allowing them
to be copied (and the costs recuperated by distribution) before they
fall apart?

What would he or other people think if they were told that the future of
our cultural heritage and the laws governing it is determined between
the two major competing power houses of Mickey Mouse and Bugs Bunny
these days?

How sad is that?

At any rate, nobody knows what his heirs will do 90 years after his
death.  But corporations don't really die, and neither do contracts.
And that gives Emacs the best shot we have not to be killed by lawyers a
hundred years from now.  Which makes it free to grow into something
else, like culture should be able to and no longer can.

Well, this mail has definitely grown into something else.

Sue me.

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-04-01  7:42                       ` David Kastrup
@ 2014-04-01  9:38                         ` Stephen J. Turnbull
  2014-04-01 15:19                         ` Eli Zaretskii
  1 sibling, 0 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-04-01  9:38 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > Sue me.

Not me.  I don't agree that it's worth worrying about, but I certainly
don't deny that you and other Emacs contributors have the right to be
concerned, and furthermore, the right to do something about it.

And a wise man once said something along the lines of "Extremism in
the defense of freedom is no vice."  I heartily agree with that, even
when I disagree with some of the extremists.<wink/>





* Re: Unibyte characters, strings, and buffers
  2014-04-01  3:36                     ` Stephen J. Turnbull
  2014-04-01  7:42                       ` David Kastrup
@ 2014-04-01 15:16                       ` Eli Zaretskii
  2014-04-02  4:20                         ` Stephen J. Turnbull
  1 sibling, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-01 15:16 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dancol@dancol.org,
>     monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Tue, 01 Apr 2014 12:36:45 +0900
> 
> Hey, I'm perfectly happy to go on that kind of evidence; the projects
> I mostly work on don't require assignment and I see no need for it.
> But we're talking here about Emacs, which is extremely careful about
> these things.

It would be madness IMO for Emacs to require legal paperwork from
everyone who at some point participated in some design discussion
here, which later got implemented, just because "design is expressive"
and "email is a medium" that fixes that expressiveness.  As a matter
of fact, this is not currently required, which I interpret as an
agreement with my understanding of the fine line that separates design
ideas from actual code.


* Re: Unibyte characters, strings, and buffers
  2014-04-01  7:42                       ` David Kastrup
  2014-04-01  9:38                         ` Stephen J. Turnbull
@ 2014-04-01 15:19                         ` Eli Zaretskii
  1 sibling, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-01 15:19 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Tue, 01 Apr 2014 09:42:05 +0200
> 
> Well, this mail has definitely grown into something else.

Indeed.  To recall, the subject was whether communicating design and
implementation ideas that get implemented by someone else necessarily
makes all the participants of such discussions copyright holders of
the code that is written based on the discussions.  I very much hope
that's not the case, because otherwise we better shut down this list,
and fast.


* Re: Unibyte characters, strings, and buffers
  2014-04-01 15:16                       ` Eli Zaretskii
@ 2014-04-02  4:20                         ` Stephen J. Turnbull
  2014-04-02 17:06                           ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-04-02  4:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel

Eli Zaretskii writes:

 > It would be madness IMO for Emacs to require legal paperwork from
 > everyone who at some point participated in some design discussion
 > here, which later got implemented,

Of course that would be madness.  What you're ignoring is that we're
talking not just about participation in design discussion, but *also*
implementation by a person who is intimately familiar with and
participated in another implementation of the same feature with the same
design that is not assigned, and is highly unlikely to ever be
assigned.

In that case if there were enough similarity that the FSF were taken
to court and the case not dismissed immediately, the "it's just an
accident" argument would not fly in court because it would be easy to
show that I know a lot about the XEmacs implementation, and I
personally would undoubtedly be at best greatly inconvenienced by
being called to testify, at worst liable for damages (remember, in
that case the FSF assignment makes me liable for FSF's court costs and
damages, and that agreement doesn't contain mitigating circumstances
like "in good faith" or "invited by Eli Z").

No, thank you.


* Re: Unibyte characters, strings, and buffers
  2014-04-02  4:20                         ` Stephen J. Turnbull
@ 2014-04-02 17:06                           ` Eli Zaretskii
  2014-04-03 10:59                             ` David Kastrup
  2014-04-03 13:04                             ` Stephen J. Turnbull
  0 siblings, 2 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-02 17:06 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dancol@dancol.org,
>     monnier@IRO.UMontreal.CA,
>     emacs-devel@gnu.org
> Date: Wed, 02 Apr 2014 13:20:40 +0900
> 
> Eli Zaretskii writes:
> 
> In that case if there were enough similarity that the FSF were taken
> to court and the case not dismissed immediately, the "it's just an
> accident" argument would not fly in court because it would be easy
> to show that I know a lot about the XEmacs implementation, and I
> personally would undoubtedly be at best greatly inconvenienced by
> being called to testify, at worst liable for damages (remember, in
> that case the FSF assignment makes me liable for FSF's court costs
> and damages, and that agreement doesn't contain mitigating
> circumstances like "in good faith" or "invited by Eli Z").
> 
> No, thank you.

My goal is not to convince you to do something you don't want to.

The main issue here, at least for me, is not whether Mr. X wants to
describe an existing implementation -- we obviously cannot do anything
if he doesn't, no matter what his reasons are.  The main issue here
is, once Mr. X _did_ describe such an implementation, is it OK for
someone else, who is not familiar with the actual code, to
re-implement it from scratch, and then submit it to Emacs as their
own, under assigned copyright.  My conclusion from everything I know
and read is that YES, it is OK.

IOW, I'd like to avoid the situation where others here might become
intimidated by what you wrote in a broader sense, and will as a result
refrain from participating in discussions that reveal details of other
implementations, or from assigning their code written based on those
discussions.  That would cause some real damage to Emacs.


* Re: Unibyte characters, strings, and buffers
  2014-04-02 17:06                           ` Eli Zaretskii
@ 2014-04-03 10:59                             ` David Kastrup
  2014-04-03 16:07                               ` Eli Zaretskii
  2014-04-03 13:04                             ` Stephen J. Turnbull
  1 sibling, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-04-03 10:59 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> My goal is not to convince you to do something you don't want to.
>
> The main issue here, at least for me, is not whether Mr. X wants to
> describe an existing implementation -- we obviously cannot do anything
> if he doesn't, no matter what his reasons are.  The main issue here
> is, once Mr. X _did_ describe such an implementation, is it OK for
> someone else, who is not familiar with the actual code, to
> re-implement it from scratch, and then submit it to Emacs as their
> own, under assigned copyright.  My conclusion from everything I know
> and read is that YES, it is OK.
>
> IOW, I'd like to avoid the situation where others here might become
> intimidated by what you wrote in a broader sense, and will as a result
> refrain from participating in discussions that reveal details of other
> implementations, or from assigning their code written based on those
> discussions.  That would cause some real damage to Emacs.

Nobody claimed that the broken copyright system does not lead to a whole
lot of real damage to a whole lot of software development.

<URL:https://en.wikipedia.org/wiki/Sequence,_structure_and_organization>
may be somewhat instructional about some current court practice in the
U.S.A.  Please note that Oracle/Google ruling is unfortunately somewhat
atypical and on appeal (appeal hearing was in December)
<URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/>
and that the FSF would not have been in a position to pay the kind of
legal expenses incurred here.

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-04-02 17:06                           ` Eli Zaretskii
  2014-04-03 10:59                             ` David Kastrup
@ 2014-04-03 13:04                             ` Stephen J. Turnbull
  1 sibling, 0 replies; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-04-03 13:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel

Eli Zaretskii writes:

 > The main issue here, at least for me is, once Mr. X _did_ describe
 > such an implementation, is it OK for someone else, who is not
 > familiar with the actual code, to re-implement it from scratch, and
 > then submit it to Emacs as their own, under assigned copyright.  My
 > conclusion from everything I know and read is that YES, it is OK.

I'd risk it.  But it's not the classic "clean-room" reimplementation
where the behavior of the original in response to various inputs
(vs. "internal structure" etc) is used as a specification
(vs. "design") for the clone.

For Emacs, you'd have to ask an FSF lawyer.





* Re: Unibyte characters, strings, and buffers
  2014-04-03 10:59                             ` David Kastrup
@ 2014-04-03 16:07                               ` Eli Zaretskii
  2014-04-03 16:26                                 ` David Kastrup
  0 siblings, 1 reply; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-03 16:07 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Thu, 03 Apr 2014 12:59:20 +0200
> 
> > IOW, I'd like to avoid the situation where others here might become
> > intimidated by what you wrote in a broader sense, and will as a result
> > refrain from participating in discussions that reveal details of other
> > implementations, or from assigning their code written based on those
> > discussions.  That would cause some real damage to Emacs.
> 
> Nobody claimed that the broken copyright system does not lead to a whole
> lot of real damage to a whole lot of software development.

On this general level, I agree.  However, I only talked about a very
specific situation.  In any case, the system being broken
notwithstanding, we shouldn't see problems where none exist (yet).

> <URL:https://en.wikipedia.org/wiki/Sequence,_structure_and_organization>
> may be somewhat instructional about some current court practice in the
> U.S.A.

That's the URL from which I quoted a few messages ago.

> Please note that Oracle/Google ruling is unfortunately somewhat
> atypical and on appeal (appeal hearing was in December)
> <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/>

Even if you take this article at face value (as opposed to someone
whose interests are unknown reiterating rumors), the conclusion is
that the jury is still out on this issue.  Which is exactly what I wrote:
this issue is not decided yet, and precedents are contradictory.

> and that the FSF would not have been in a position to pay the kind of
> legal expenses incurred here.

If there is a precedent, you don't need to pay any expenses.

Anyway, this all is only relevant if one of those who wrote the
code that was discussed and reimplemented actually sues the FSF.  Since
such code almost always comes from Free Software, I don't think
there's a danger of this.


* Re: Unibyte characters, strings, and buffers
  2014-04-03 16:07                               ` Eli Zaretskii
@ 2014-04-03 16:26                                 ` David Kastrup
  2014-04-03 19:11                                   ` Eli Zaretskii
  0 siblings, 1 reply; 103+ messages in thread
From: David Kastrup @ 2014-04-03 16:26 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/>
>
> Even if you take this article at face value (as opposed to someone
> whose interests are unknown reiterating rumors), the conclusion is
> that the jury is still out on this issue.  Which is exactly what I wrote:
> this issue is not decided yet, and precedents are contradictory.
>
>> and that the FSF would not have been in a position to pay the kind of
>> legal expenses incurred here.
>
> If there is a precedent, you don't need to pay any expenses.

Nonsense.  For most court cases there are precedents that are getting
referenced.  In the U.S., both sides have to pay their own legal
expenses.  Judges _may_ award legal costs to a defendant if the case was
brought forward clearly frivolously and/or vexatiously.  That is very
rarely done.  A successful defense will be expensive even in the rare
case that the case is decided in summary judgment.

> Anyway, this all is only relevant if one of those who wrote the
> code that was discussed and reimplemented actually sues the FSF.  Since
> such code almost always comes from Free Software, I don't think
> there's a danger of this.

If an employer of a non-assigned contributor is sued by the FSF over
infringement of some FSF-copyrighted software, the whole case can get
thrown out of court if the FSF is shown to have "dirty hands", namely to
have incorporated code themselves that is legally under copyright by the
employer.

In the case of XEmacs, we are not necessarily talking about core
developers highly sympathetic to the FSF.  There is no playful element
to the history of the Emacs/XEmacs schism like with the Emacs/vi "editor
wars".

The details of the complex Emacs/XEmacs relation aside, nobody should be
blamed for choosing to err on the safe side.  In particular since the
copyright maximalists are pretty successful in eroding the safe side and
moving the borderlines.

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-04-03 16:26                                 ` David Kastrup
@ 2014-04-03 19:11                                   ` Eli Zaretskii
  2014-04-03 20:03                                     ` David Kastrup
  2014-04-04 11:40                                     ` Richard Stallman
  0 siblings, 2 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-03 19:11 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: emacs-devel@gnu.org
> Date: Thu, 03 Apr 2014 18:26:38 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/>
> >
> > Even if you take this article at face value (as opposed to someone
> > whose interests are unknown reiterating rumors), the conclusion is
> > that the jury is still out on this issue.  Which is exactly what I wrote:
> > this issue is not decided yet, and precedents are contradictory.
> >
> >> and that the FSF would not have been in a position to pay the kind of
> >> legal expenses incurred here.
> >
> > If there is a precedent, you don't need to pay any expenses.
> 
> Nonsense.

You misunderstood.  I meant there would be no need to pay for creating
a precedent where one already exists.

> If an employer of a non-assigned contributor is sued by the FSF over
> infringement of some FSF-copyrighted software, the whole case can get
> thrown out of court if the FSF is shown to have "dirty hands", namely to
> have incorporated code themselves that is legally under copyright by the
> employer.

If you are afraid to get into a road accident, stay inside.

> In the case of XEmacs, we are not necessarily talking about core
> developers highly sympathetic to the FSF.  There is no playful element
> to the history of the Emacs/XEmacs schism like with the Emacs/vi "editor
> wars".

The amount of code borrowed by XEmacs from Emacs is orders of
magnitude larger than the other way around.  So this is a red herring.

> nobody should be blamed for choosing to err on the safe side.

I never blamed anyone.  People should know the true state of affairs,
and then decide for themselves.




* Re: Unibyte characters, strings, and buffers
  2014-04-03 19:11                                   ` Eli Zaretskii
@ 2014-04-03 20:03                                     ` David Kastrup
  2014-04-04  0:48                                       ` Stephen J. Turnbull
  2014-04-04  7:58                                       ` Eli Zaretskii
  2014-04-04 11:40                                     ` Richard Stallman
  1 sibling, 2 replies; 103+ messages in thread
From: David Kastrup @ 2014-04-03 20:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> The amount of code borrowed by XEmacs from Emacs is orders of
> magnitude larger than the other way around.  So this is a red herring.

Magnitude does not really matter with "dirty hands".

At any rate, you _are_ aware that Oracle sued Google for billions of
dollars because of what amounted to 11 lines of code?

They did not prevail at the first trial, but Google did not get attorney
costs back, either, and the whole thing went into appeal with murky
outlook.

>> nobody should be blamed for choosing to err on the safe side.
>
> I never blamed anyone.  People should know the true state of affairs,
> and then decide for themselves.

The true state of affairs is that the U.S. legal and political system
does not leave much leeway for paranoia.  It's as bad as imagination
gets.

-- 
David Kastrup


* Re: Unibyte characters, strings, and buffers
  2014-04-03 20:03                                     ` David Kastrup
@ 2014-04-04  0:48                                       ` Stephen J. Turnbull
  2014-04-04  8:08                                         ` Eli Zaretskii
  2014-04-04  7:58                                       ` Eli Zaretskii
  1 sibling, 1 reply; 103+ messages in thread
From: Stephen J. Turnbull @ 2014-04-04  0:48 UTC (permalink / raw)
  To: David Kastrup; +Cc: Eli Zaretskii, emacs-devel

David Kastrup writes:
 > Eli Zaretskii <eliz@gnu.org> writes:

 > > I never blamed anyone.  People should know the true state of
 > > affairs, and then decide for themselves.

Not in Emacs.  It's not up to the individual contributor, it's a
matter for project policy, ie, RMS as advised by the FSF legal dept.

 > The true state of affairs is that the U.S. legal and political system
 > does not leave much leeway for paranoia.  It's as bad as imagination
 > gets.

Oh, come on, David.  A German writes this in a thread that a resident
of Japan participates in?  Have you no sense of history?

Indeed, the reach of copyright and patent in the U.S. system has gone
way beyond the bounds that even a Milton Friedman can sanction.  But
it's not hard to imagine worse, even in just that limited area of law.

Bottom line: Eli's theoretical assessment of the "typical" risks
involved seems pretty plausible to me.  But in the worst case, things
can get pretty bad, and it's easy to justify "legal paranoia" on the
part of the FSF in managing software freedom of selected critical
projects, including Emacs.


* Re: Unibyte characters, strings, and buffers
  2014-04-03 20:03                                     ` David Kastrup
  2014-04-04  0:48                                       ` Stephen J. Turnbull
@ 2014-04-04  7:58                                       ` Eli Zaretskii
  1 sibling, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-04  7:58 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: emacs-devel@gnu.org
> Date: Thu, 03 Apr 2014 22:03:32 +0200
> 
> >> nobody should be blamed for choosing to err on the safe side.
> >
> > I never blamed anyone.  People should know the true state of affairs,
> > and then decide for themselves.
> 
> The true state of affairs is that the U.S. legal and political system
> does not leave much leeway for paranoia.  It's as bad as imagination
> gets.

Even if they really are after you, it doesn't mean you need to become
paranoid.




* Re: Unibyte characters, strings, and buffers
  2014-04-04  0:48                                       ` Stephen J. Turnbull
@ 2014-04-04  8:08                                         ` Eli Zaretskii
  0 siblings, 0 replies; 103+ messages in thread
From: Eli Zaretskii @ 2014-04-04  8:08 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,
>     emacs-devel@gnu.org
> Date: Fri, 04 Apr 2014 09:48:17 +0900
> 
> David Kastrup writes:
>  > Eli Zaretskii <eliz@gnu.org> writes:
> 
>  > > I never blamed anyone.  People should know the true state of
>  > > affairs, and then decide for themselves.
> 
> Not in Emacs.  It's not up to the individual contributor, it's a
> matter for project policy, ie, RMS as advised by the FSF legal dept.

To some degree, yes.  (Although I hear only deafening silence from
those quarters about these matters.)  But since it is me who signs the
legal papers, and it is me who decides whether some code I submit
under the assignment fits the FSF standards of what can be called "my
original work", then I, too, am a part of this equation, and my
decisions on these matters do count.

> Bottom line: Eli's theoretical assessment of the "typical" risks
> involved seems pretty plausible to me.  But in the worst case, things
> can get pretty bad, and it's easy to justify "legal paranoia" on the
> part of the FSF in managing software freedom of selected critical
> projects, including Emacs.

I agree, FWIW.



* Re: Unibyte characters, strings, and buffers
  2014-04-03 19:11                                   ` Eli Zaretskii
  2014-04-03 20:03                                     ` David Kastrup
@ 2014-04-04 11:40                                     ` Richard Stallman
  1 sibling, 0 replies; 103+ messages in thread
From: Richard Stallman @ 2014-04-04 11:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

A discussion of the general issue of GPL enforcement is outside
of the purpose of emacs-devel.  The FSF studies this with lawyers,
which is the useful way to do it.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.

End of thread (newest message: 2014-04-04 11:40 UTC)

Thread overview: 103+ messages
2014-03-26 19:04 Buffer-local variables affect general-purpose functions Eli Zaretskii
2014-03-26 19:32 ` Paul Eggert
2014-03-26 20:03   ` Eli Zaretskii
2014-03-26 21:50     ` Paul Eggert
2014-03-27 17:42       ` Eli Zaretskii
2014-03-27 18:55         ` Paul Eggert
2014-03-27 14:17 ` Stefan Monnier
2014-03-27 17:17   ` Eli Zaretskii
2014-03-27 21:04     ` Stefan Monnier
2014-03-28  7:11       ` Eli Zaretskii
2014-03-28  7:46         ` Paul Eggert
2014-03-28  8:18           ` Unibyte characters, strings and buffers Eli Zaretskii
2014-03-28 18:42             ` Paul Eggert
2014-03-28 18:52               ` Eli Zaretskii
2014-03-28 19:21                 ` Paul Eggert
2014-03-29  6:40                   ` Eli Zaretskii
2014-03-29 18:57                     ` Paul Eggert
2014-03-29 19:46                       ` Eli Zaretskii
2014-03-28 20:23                 ` Stefan Monnier
2014-03-29 19:34                 ` Stefan Monnier
2014-03-28 14:12         ` Buffer-local variables affect general-purpose functions Stefan Monnier
2014-03-28  3:38     ` Stephen J. Turnbull
2014-03-28  8:51       ` Unibyte characters, strings, and buffers Eli Zaretskii
2014-03-28 10:28         ` Stephen J. Turnbull
2014-03-28 10:58           ` David Kastrup
2014-03-28 11:22             ` Andreas Schwab
2014-03-28 11:34               ` David Kastrup
2014-03-28 11:42             ` Stephen J. Turnbull
2014-03-28 17:29           ` Eli Zaretskii
2014-03-28 17:50             ` David Kastrup
2014-03-28 18:31               ` Eli Zaretskii
2014-03-28 19:25                 ` David Kastrup
2014-03-29  6:43                   ` Eli Zaretskii
2014-03-29  7:23                     ` David Kastrup
2014-03-29  8:24                       ` Eli Zaretskii
2014-03-29  8:40                         ` David Kastrup
2014-03-29  9:25                           ` Eli Zaretskii
2014-03-28 20:27             ` Stefan Monnier
2014-03-29  9:23             ` Stephen J. Turnbull
2014-03-29  9:52               ` Andreas Schwab
2014-03-29 10:48                 ` Eli Zaretskii
2014-03-29 11:00                   ` Andreas Schwab
2014-03-29 11:18                     ` Eli Zaretskii
2014-03-29 11:30                       ` Andreas Schwab
     [not found]                         ` <83ha6hduzz.fsf@gnu.org>
2014-03-29 14:30                           ` Andreas Schwab
2014-03-29 14:47                             ` Eli Zaretskii
2014-03-29 10:42               ` David Kastrup
2014-03-29 11:07                 ` Eli Zaretskii
2014-03-29 11:30                   ` David Kastrup
2014-03-29 12:58                     ` Eli Zaretskii
2014-03-29 13:15                       ` David Kastrup
2014-03-29 10:44               ` Eli Zaretskii
2014-03-29 11:06               ` Andreas Schwab
2014-03-29 11:12                 ` Eli Zaretskii
2014-03-29 16:11                   ` Stephen J. Turnbull
2014-03-29 15:37                 ` Stephen J. Turnbull
2014-03-29 15:55                   ` David Kastrup
2014-03-29 16:28                     ` Stephen J. Turnbull
2014-03-29 17:00                       ` David Kastrup
2014-03-30  2:05                         ` Stephen J. Turnbull
2014-03-30  9:01                           ` David Kastrup
2014-03-30 12:13                             ` Stephen J. Turnbull
2014-03-30 14:25                             ` Andreas Schwab
2014-03-30 15:05                               ` David Kastrup
2014-03-30 15:39                                 ` Andreas Schwab
2014-03-29 17:08                       ` Andreas Schwab
2014-03-30  0:24                     ` Richard Stallman
2014-03-30  3:32                       ` Stefan Monnier
2014-03-30 15:13                         ` Richard Stallman
2014-03-29 15:58                   ` Andreas Schwab
2014-03-29 16:35                     ` Stephen J. Turnbull
2014-03-29 17:06                       ` Andreas Schwab
2014-03-29 17:01               ` Nathan Trapuzzano
2014-03-29 17:08                 ` Nathan Trapuzzano
2014-03-29 17:18                   ` David Kastrup
2014-03-29 17:33                     ` Nathan Trapuzzano
2014-03-30  0:24                       ` Richard Stallman
2014-03-30  8:38                         ` Andreas Schwab
2014-03-30 15:12                           ` Richard Stallman
2014-03-29 17:16                 ` David Kastrup
2014-03-28 18:45           ` Daniel Colascione
2014-03-28 19:35             ` Glenn Morris
2014-03-29 11:17             ` Stephen J. Turnbull
2014-03-29 11:22               ` Eli Zaretskii
2014-03-29 16:03                 ` Stephen J. Turnbull
2014-03-31 15:22                   ` Eli Zaretskii
2014-04-01  3:36                     ` Stephen J. Turnbull
2014-04-01  7:42                       ` David Kastrup
2014-04-01  9:38                         ` Stephen J. Turnbull
2014-04-01 15:19                         ` Eli Zaretskii
2014-04-01 15:16                       ` Eli Zaretskii
2014-04-02  4:20                         ` Stephen J. Turnbull
2014-04-02 17:06                           ` Eli Zaretskii
2014-04-03 10:59                             ` David Kastrup
2014-04-03 16:07                               ` Eli Zaretskii
2014-04-03 16:26                                 ` David Kastrup
2014-04-03 19:11                                   ` Eli Zaretskii
2014-04-03 20:03                                     ` David Kastrup
2014-04-04  0:48                                       ` Stephen J. Turnbull
2014-04-04  8:08                                         ` Eli Zaretskii
2014-04-04  7:58                                       ` Eli Zaretskii
2014-04-04 11:40                                     ` Richard Stallman
2014-04-03 13:04                             ` Stephen J. Turnbull
