unofficial mirror of emacs-devel@gnu.org 
* (aset UNIBYTE-STRING MULTIBYTE-CHAR)
@ 2008-02-13  2:36 Kenichi Handa
  2008-02-13  2:49 ` Stefan Monnier
  2008-02-13 22:01 ` Richard Stallman
  0 siblings, 2 replies; 43+ messages in thread
From: Kenichi Handa @ 2008-02-13  2:36 UTC (permalink / raw)
  To: emacs-devel

Before the unicode merge, this worked:
  (let ((str "a")) (aset str 0 (decode-char 'ucs #x100)))

In emacs-unicode-2 branch, there was a discussion about the
rightness of aset changing the multibyteness of a string,
and I changed the code to signal an error in the above case.

But I got reports that the change breaks some existing
Elisp packages.  I could change the code again to make the
above case work, but that causes another problem in this
case:
  (let ((str "a")) (aset str 0 #xC0))

Currently, it changes STR to the unibyte string "\300" (that
is the same as before unicode merge), but if we allow
changing the string multibyteness, perhaps STR must be
changed to the multibyte string of A-grave "À" because the
character code of A-grave is #xC0.  But that means we lose
an easy way to manipulate raw byte data in a unibyte
string.
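
A minimal sketch for checking both cases on a given build.
`my-aset-report' is a hypothetical helper, not an Emacs function,
and what it returns depends on the Emacs version, so no particular
result is assumed here:

```elisp
;; Hypothetical helper (not part of Emacs): store CHAR at index 0 of a
;; copy of INITIAL and report what happened to the string.
(defun my-aset-report (initial char)
  "Return (:error DATA), (:unibyte STR) or (:multibyte STR)."
  (let ((str (copy-sequence initial)))
    (condition-case err
        (progn
          (aset str 0 char)
          (list (if (multibyte-string-p str) :multibyte :unibyte) str))
      (error (list :error err)))))

;; The two cases from this message.
(my-aset-report "a" (decode-char 'ucs #x100))
(my-aset-report "a" #xC0)
```

The copy keeps the string literal itself from being modified.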

What do you think is the right thing for this matter?

---
Kenichi Handa
handa@ni.aist.go.jp




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13  2:36 (aset UNIBYTE-STRING MULTIBYTE-CHAR) Kenichi Handa
@ 2008-02-13  2:49 ` Stefan Monnier
  2008-02-13  3:48   ` Kenichi Handa
  2008-02-13 22:01 ` Richard Stallman
  1 sibling, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2008-02-13  2:49 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> Before the unicode merge, this worked:
>   (let ((str "a")) (aset str 0 (decode-char 'ucs #x100)))

> In emacs-unicode-2 branch, there was a discussion about the
> rightness of aset changing the multibyteness of a string,
> and I changed the code to signal an error in the above case.

An error sounds right.

> But, I got reports claiming that the change breaks some of
> already existing Elisp packages.  Although changing the

Details?

> What do you think is the right thing for this matter?

aset on strings is fundamentally problematic, so anything that restricts
it further is good in my book (my own local Emacs disallows them
plainly, and I rarely bump into code that needs it).


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13  2:49 ` Stefan Monnier
@ 2008-02-13  3:48   ` Kenichi Handa
  2008-02-13 15:33     ` Stefan Monnier
  0 siblings, 1 reply; 43+ messages in thread
From: Kenichi Handa @ 2008-02-13  3:48 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

In article <jwvir0t8tgm.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Before the unicode merge, this worked:
> >   (let ((str "a")) (aset str 0 (decode-char 'ucs #x100)))

> > In emacs-unicode-2 branch, there was a discussion about the
> > rightness of aset changing the multibyteness of a string,
> > and I changed the code to signal an error in the above case.

> An error sounds right.

For this:
  (let ((str "\300")) (aset str 0 (decode-char 'ucs #x100)))
an error may be ok.  But for the first example, although "a"
is currently treated as a unibyte string, I think it's more
like multibyteness-not-yet-decided, i.e. it's neutral about
the multibyteness.

> > But, I got reports claiming that the change breaks some of
> > already existing Elisp packages.  Although changing the

> Details?

Something like this code:

         (setq result (cons 
                       (let ((str (make-string 1 0)))
                         (aset str 0 (make-char 'japanese-jisx0208 ku ten))

although it's easy to fix it...
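
One possible fix, as a sketch: build the one-character string
directly from the character instead of allocating a string and
storing into it.  `my-jisx0208-string' is a hypothetical name, and
CODE1 and CODE2 stand in for whatever the package derives from ku
and ten:

```elisp
;; Hypothetical replacement for the make-string + aset idiom above.
;; CODE1 and CODE2 are the two position codes passed to `make-char'.
(defun my-jisx0208-string (code1 code2)
  "Return a one-character string for the JIS X 0208 char CODE1 CODE2."
  (string (make-char 'japanese-jisx0208 code1 code2)))

;; Example call; #x30 #x21 is only an illustrative position.
(my-jisx0208-string #x30 #x21)
```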

> > What do you think is the right thing for this matter?

> aset on strings is fundamentally problematic, so anything that restricts
> it further is good in my book (my own local Emacs disallows them
> plainly, and I rarely bump into code that needs it).

What is the fundamental problem?

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13  3:48   ` Kenichi Handa
@ 2008-02-13 15:33     ` Stefan Monnier
  2008-02-13 18:06       ` Stephen J. Turnbull
  2008-02-15  1:39       ` Kenichi Handa
  0 siblings, 2 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-13 15:33 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> Something like this code:

>          (setq result (cons 
>                        (let ((str (make-string 1 0)))
>                          (aset str 0 (make-char 'japanese-jisx0208 ku ten))

That's truly horrendous code.  I see no reason to support it.

> although it's easy to fix it...

Not only is it easy, but the result is more efficient/legible/maintainable.

>> aset on strings is fundamentally problematic, so anything that restricts
>> it further is good in my book (my own local Emacs disallows them
>> plainly, and I rarely bump into code that needs it).

> What is the fundamental problem?

The one you're bumping into: multibyte strings are not arrays, and
treating them as arrays asks for trouble: the performance is not what
one expects, the implementation is complex and ugly, ...
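
A small illustration of the point, assuming the UTF-8-like internal
representation used on the unicode branch: characters occupy
different numbers of bytes, so replacing one character can change
the byte length of the whole string.

```elisp
;; `length' counts characters; `string-bytes' counts bytes in the
;; internal representation.
(let ((ascii (string ?a ?b ?c))
      (mixed (string ?a #x3042 ?c)))   ; #x3042 is HIRAGANA LETTER A
  (list (length ascii) (string-bytes ascii)    ; 3 chars, 3 bytes
        (length mixed) (string-bytes mixed)))  ; 3 chars, 5 bytes
```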

When weighed against the *very* rare cases where aset is used (let
alone the even more rare cases where aset is actually useful and
convenient), the choice is trivial (for me anyway).


        Stefan


PS: I see bindat.el uses string-make-unibyte in a similar way to the
    place where we recently switched to unibyte-string, except that the
    source is an array rather than a list, and I was thinking: wouldn't
    it make sense to allow `apply' to take an array of args rather than
    a list of args?  Especially if it's of the form (apply FUN ARRAY)
    since we then could use ARRAY directly without having to copy the args
    one by one into a new C array.
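
    For comparison, a sketch of what a caller has to write without
    that change: `apply' wants a list as its last argument, so a
    vector of bytes is first copied into a list (here with `append'):

```elisp
;; Convert the vector to a list, then spread it over `unibyte-string'.
(let ((bytes [72 105 33]))
  (apply #'unibyte-string (append bytes nil)))   ; => "Hi!"
```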





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 15:33     ` Stefan Monnier
@ 2008-02-13 18:06       ` Stephen J. Turnbull
  2008-02-13 19:33         ` Stefan Monnier
                           ` (2 more replies)
  2008-02-15  1:39       ` Kenichi Handa
  1 sibling, 3 replies; 43+ messages in thread
From: Stephen J. Turnbull @ 2008-02-13 18:06 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Kenichi Handa, emacs-devel

Stefan Monnier writes:

 > PS: I see bindat.el uses string-make-unibyte in a similar way to the
 >     place where we recently switched to unibyte-string, except that the
 >     source is an array rather than a list, and I was thinking: wouldn't
 >     it make sense to allow `apply' to take an array of args rather than
 >     a list of args?

How Pythonic!






* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 18:06       ` Stephen J. Turnbull
@ 2008-02-13 19:33         ` Stefan Monnier
  2008-02-13 22:49         ` Miles Bader
  2008-02-14  4:42         ` Richard Stallman
  2 siblings, 0 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-13 19:33 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Kenichi Handa, emacs-devel

>> PS: I see bindat.el uses string-make-unibyte in a similar way to the
>> place where we recently switched to unibyte-string, except that the
>> source is an array rather than a list, and I was thinking: wouldn't
>> it make sense to allow `apply' to take an array of args rather than
>> a list of args?

> How Pythonic!

I never used Python and the idea is already used by Elisp's mapcar
and mapconcat which predate Python...
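
For instance, both functions already accept a vector or a string
where a list might be expected:

```elisp
(mapcar #'1+ [1 2 3])                    ; => (2 3 4)
(mapcar #'upcase "abc")                  ; => (65 66 67)
(mapconcat #'char-to-string "abc" "-")   ; => "a-b-c"
```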


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13  2:36 (aset UNIBYTE-STRING MULTIBYTE-CHAR) Kenichi Handa
  2008-02-13  2:49 ` Stefan Monnier
@ 2008-02-13 22:01 ` Richard Stallman
  2008-02-13 23:13   ` Miles Bader
  1 sibling, 1 reply; 43+ messages in thread
From: Richard Stallman @ 2008-02-13 22:01 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

    In emacs-unicode-2 branch, there was a discussion about the
    rightness of aset changing the multibyteness of a string,
    and I changed the code to signal an error in the above case.

    But, I got reports claiming that the change breaks some of
    already existing Elisp packages.

We should investigate what packages these are, what they are doing
that uses this, and how hard it is to fix them.  Based on that
we can decide.  It is ok to break a few things in a way that
is easy to fix.

But it is not terribly hard to make this case work once again, given
that all strings are indirect.  At worst, one can make a new string
with the modified contents, then swap the `data' pointers between the
new string and the old one.
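
The first half of that idea can be sketched in Lisp: build a new
string with the modified contents.  The pointer swap itself is a
C-level step with no Lisp equivalent, so the hypothetical
`my-string-with-char' below returns a fresh string and leaves the
original untouched:

```elisp
;; Hypothetical helper: a non-destructive stand-in for `aset' on strings.
(defun my-string-with-char (str idx char)
  "Return a copy of STR with the character at IDX replaced by CHAR."
  (concat (substring str 0 idx)
          (string char)
          (substring str (1+ idx))))

;; Replacing an ASCII character with a non-ASCII one yields a
;; multibyte result, whatever the multibyteness of the input.
(my-string-with-char "abc" 1 #x3042)
```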

That would be better than breaking lots of packages.






* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 18:06       ` Stephen J. Turnbull
  2008-02-13 19:33         ` Stefan Monnier
@ 2008-02-13 22:49         ` Miles Bader
  2008-02-14  1:11           ` Stephen J. Turnbull
  2008-02-14  4:42         ` Richard Stallman
  2 siblings, 1 reply; 43+ messages in thread
From: Miles Bader @ 2008-02-13 22:49 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > PS: I see bindat.el uses string-make-unibyte in a similar way to the
>  >     place where we recently switched to unibyte-string, except that the
>  >     source is an array rather than a list, and I was thinking: wouldn't
>  >     it make sense to allow `apply' to take an array of args rather than
>  >     a list of args?
>
> How Pythonic!

No need for insults Stephen!

[p.s. Stefan -- great idea, probably will be a lot more efficient than
using a list...]

-Miles

-- 
"Don't just question authority,
Don't forget to question me."
-- Jello Biafra





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 22:01 ` Richard Stallman
@ 2008-02-13 23:13   ` Miles Bader
  0 siblings, 0 replies; 43+ messages in thread
From: Miles Bader @ 2008-02-13 23:13 UTC (permalink / raw)
  To: rms; +Cc: Kenichi Handa, emacs-devel

Richard Stallman <rms@gnu.org> writes:
> But it is not terribly hard to make this case work once again, given
> that all strings are indirect.  At worst, one can make a new string
> with the modified contents, then swap the `data' pointers between the
> new string and the old one.

As Stefan noted, though, the entire idea of using aset to store into a
multibyte string is rather dodgy...I think there's a certain expectation
by users of aset that the operation follows general array behavior, most
notably O(1) complexity, and storing into a multibyte string doesn't
follow that.  If we can discourage this usage without too much fallout,
that seems a lot better in the long run.

-Miles

-- 
Twice, adv. Once too often.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 22:49         ` Miles Bader
@ 2008-02-14  1:11           ` Stephen J. Turnbull
  2008-02-14  1:17             ` Miles Bader
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen J. Turnbull @ 2008-02-14  1:11 UTC (permalink / raw)
  To: Miles Bader; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel

Miles Bader writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:

 > > How Pythonic!

 > No need for insults Stephen!

You're welcome to take insult if you like, but I don't know a higher
compliment in language design, for values of "design" that are verbs.
Isn't half bad for values of "design" that are nouns, for that matter.






* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14  1:11           ` Stephen J. Turnbull
@ 2008-02-14  1:17             ` Miles Bader
  2008-02-14  1:40               ` Stefan Monnier
  2008-02-14  4:20               ` Stephen J. Turnbull
  0 siblings, 2 replies; 43+ messages in thread
From: Miles Bader @ 2008-02-14  1:17 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
>  > > How Pythonic!
>
>  > No need for insults Stephen!
>
> You're welcome to take insult if you like, but I don't know a higher
> compliment in language design

Please tell me you're joking...

-Miles

-- 
Opportunity, n. A favorable occasion for grasping a disappointment.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14  1:17             ` Miles Bader
@ 2008-02-14  1:40               ` Stefan Monnier
  2008-02-14  1:49                 ` Miles Bader
  2008-02-14 18:10                 ` Richard Stallman
  2008-02-14  4:20               ` Stephen J. Turnbull
  1 sibling, 2 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-14  1:40 UTC (permalink / raw)
  To: Miles Bader; +Cc: Stephen J. Turnbull, Kenichi Handa, emacs-devel

>> > > How Pythonic!
>> > No need for insults Stephen!
>> You're welcome to take insult if you like, but I don't know a higher
>> compliment in language design
> Please tell me you're joking...

While I don't consider Python as the best design by far (the lack of
type system rules it out right away), Elisp's dynamic scoping mixed with
buffer-local and frame-local and terminal-local variables is pretty
horrendous,


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14  1:40               ` Stefan Monnier
@ 2008-02-14  1:49                 ` Miles Bader
  2008-02-14 18:10                 ` Richard Stallman
  1 sibling, 0 replies; 43+ messages in thread
From: Miles Bader @ 2008-02-14  1:49 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Stephen J. Turnbull, Kenichi Handa, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:
> While I don't consider Python as the best design by far (the lack of
> type system rules it out right away), Elisp's dynamic scoping mixed with
> buffer-local and frame-local and terminal-local variables is pretty
> horrendous,

Hey I didn't make any claims about elisp's variables (or elisp at all
for that matter...).  Python is far from being the _worst_ language out
there.

Still, the phrase "damning with faint praise" comes to mind...

-Miles

-- 
Infancy, n. The period of our lives when, according to Wordsworth, 'Heaven
lies about us.' The world begins lying about us pretty soon afterward.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14  1:17             ` Miles Bader
  2008-02-14  1:40               ` Stefan Monnier
@ 2008-02-14  4:20               ` Stephen J. Turnbull
  1 sibling, 0 replies; 43+ messages in thread
From: Stephen J. Turnbull @ 2008-02-14  4:20 UTC (permalink / raw)
  To: Miles Bader; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel

Miles Bader writes:
 > "Stephen J. Turnbull" <stephen@xemacs.org> writes:
 > >  > > How Pythonic!
 > >
 > >  > No need for insults Stephen!
 > >
 > > You're welcome to take insult if you like, but I don't know a higher
 > > compliment in language design
 > 
 > Please tell me you're joking...

No, I'm not.  You don't have to like their design goals, but to lack
respect for their success in achieving the ones they've chosen ...
well, go ahead, laugh at the Tao.  That's what it's there for, says
so right on the label.

As for the particular proposal of Stefan's, it *is* very Pythonic.
(It's duck typing for sequences.)  It's not particularly Emacs-Lisp-y,
what with aref and nth that do the same thing but to different types
of sequences, etc., and the dozen or more ways of implementing
dictionaries, all having distinct APIs for accessing properties by
keyword.
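
The split is easy to see side by side.  `elt' is the existing
generic accessor; the type-specific ones signal an error when given
the wrong kind of sequence:

```elisp
(let ((lst '(a b c))
      (vec [a b c]))
  (list (nth 1 lst)    ; lists only
        (aref vec 1)   ; arrays only
        (elt lst 1)    ; any sequence
        (elt vec 1)))  ; => (b b b b)
```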





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 18:06       ` Stephen J. Turnbull
  2008-02-13 19:33         ` Stefan Monnier
  2008-02-13 22:49         ` Miles Bader
@ 2008-02-14  4:42         ` Richard Stallman
  2 siblings, 0 replies; 43+ messages in thread
From: Richard Stallman @ 2008-02-14  4:42 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: handa, monnier, emacs-devel

     > PS: I see bindat.el uses string-make-unibyte in a similar way to the
     >     place where we recently switched to unibyte-string, except that the
     >     source is an array rather than a list, and I was thinking: wouldn't
     >     it make sense to allow `apply' to take an array of args rather than
     >     a list of args?

    How Pythonic!

I see no harm in it.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14  1:40               ` Stefan Monnier
  2008-02-14  1:49                 ` Miles Bader
@ 2008-02-14 18:10                 ` Richard Stallman
  2008-02-14 22:40                   ` David Kastrup
  2008-02-14 23:37                   ` Leo
  1 sibling, 2 replies; 43+ messages in thread
From: Richard Stallman @ 2008-02-14 18:10 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: stephen, handa, emacs-devel, miles

    While I don't consider Python as the best design by far (the lack of
    type system rules it out right away), Elisp's dynamic scoping mixed with
    buffer-local and frame-local and terminal-local variables is pretty
    horrendous,

I think it is elegant.  Dynamic scoping is absolutely essential for an
Emacs-like editor, as explained in the Emacs paper from 1981.
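
A minimal sketch of the kind of dynamic binding meant here; the
variable and function names are hypothetical:

```elisp
(defvar my-fill-width 70
  "Hypothetical user option, declared special by `defvar'.")

(defun my-describe-width ()
  "Report the binding of `my-fill-width' visible to the caller."
  (format "width=%d" my-fill-width))

(list (let ((my-fill-width 40))   ; rebinds for the dynamic extent
        (my-describe-width))
      (my-describe-width))        ; => ("width=40" "width=70")
```

The callee sees the caller's temporary binding without taking it as
an argument.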





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14 18:10                 ` Richard Stallman
@ 2008-02-14 22:40                   ` David Kastrup
  2008-02-15  1:08                     ` Stephen J. Turnbull
  2008-02-15 12:58                     ` Richard Stallman
  2008-02-14 23:37                   ` Leo
  1 sibling, 2 replies; 43+ messages in thread
From: David Kastrup @ 2008-02-14 22:40 UTC (permalink / raw)
  To: rms; +Cc: miles, stephen, handa, Stefan Monnier, emacs-devel

Richard Stallman <rms@gnu.org> writes:

>     While I don't consider Python as the best design by far (the lack
>     of type system rules it out right away), Elisp's dynamic scoping
>     mixed with buffer-local and frame-local and terminal-local
>     variables is pretty horrendous,
>
> I think it is elegant.  Dynamic scoping is absolutely essential for an
> Emacs-like editor, as explained in the Emacs paper from 1981.

Shhhhh.  Stefan is still in Cc, and he'll be sad to hear that his
lexbind branch can't possibly exist.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14 18:10                 ` Richard Stallman
  2008-02-14 22:40                   ` David Kastrup
@ 2008-02-14 23:37                   ` Leo
  2008-02-15 12:59                     ` Richard Stallman
  1 sibling, 1 reply; 43+ messages in thread
From: Leo @ 2008-02-14 23:37 UTC (permalink / raw)
  To: emacs-devel

On 2008-02-14 18:10 +0000, Richard Stallman wrote:
> I think it is elegant.  Dynamic scoping is absolutely essential for an
> Emacs-like editor, as explained in the Emacs paper from 1981.

Actually that paper is misleading nowadays.

-- 
.:  Leo  :.  [ sdl.web AT gmail.com ]  .:  [ GPG Key: 9283AA3F ]  :.

          Use the best OS -- http://www.fedoraproject.org/






* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14 22:40                   ` David Kastrup
@ 2008-02-15  1:08                     ` Stephen J. Turnbull
  2008-02-15  1:17                       ` Miles Bader
  2008-02-15 12:58                     ` Richard Stallman
  1 sibling, 1 reply; 43+ messages in thread
From: Stephen J. Turnbull @ 2008-02-15  1:08 UTC (permalink / raw)
  To: David Kastrup; +Cc: handa, emacs-devel, rms, Stefan Monnier, miles

David Kastrup writes:
 > Richard Stallman <rms@gnu.org> writes:
 > 
 > >     While I don't consider Python as the best design by far (the lack
 > >     of type system rules it out right away), Elisp's dynamic scoping
 > >     mixed with buffer-local and frame-local and terminal-local
 > >     variables is pretty horrendous,
 > >
 > > I think it is elegant.  Dynamic scoping is absolutely essential for an
 > > Emacs-like editor, as explained in the Emacs paper from 1981.
 > 
 > Shhhhh.  Stefan is still in Cc, and he'll be sad to hear that his
 > lexbind branch can't possibly exist.

No problema, amigos.  From the cited paper (SIGOA 1981, 2:1-2):

    It is not necessary for dynamic scope to be the *only* scope rule
    available, just useful for it to be available.

(Isn't it Miles who is maintaining the lexbind branch?  But he's
there, too. :-)





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  1:08                     ` Stephen J. Turnbull
@ 2008-02-15  1:17                       ` Miles Bader
  2008-02-15  7:27                         ` David Kastrup
  0 siblings, 1 reply; 43+ messages in thread
From: Miles Bader @ 2008-02-15  1:17 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: handa, rms, Stefan Monnier, emacs-devel

"Stephen J. Turnbull" <stephen@xemacs.org> writes:
> (Isn't it Miles who is maintaining the lexbind branch?  But he's
> there, too. :-)

Yes, the lexbind branch is mine...

-Miles

-- 
Philosophy, n. A route of many roads leading from nowhere to nothing.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 15:33     ` Stefan Monnier
  2008-02-13 18:06       ` Stephen J. Turnbull
@ 2008-02-15  1:39       ` Kenichi Handa
  2008-02-15  4:27         ` Stefan Monnier
                           ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Kenichi Handa @ 2008-02-15  1:39 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

In article <jwvwsp8vpx8.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Something like this code:
> >          (setq result (cons 
> >                        (let ((str (make-string 1 0)))
> >                          (aset str 0 (make-char 'japanese-jisx0208 ku ten))

> That's truly horrendous code.  I see no reason to support it.

> > although it's easy to fix it...

> Not only is it easy, but the result is more efficient/legible/maintainable.

Even if the code is very bad, it worked in Emacs 22.  If
it doesn't work in Emacs 23, it's a regression.

>>> aset on strings is fundamentally problematic, so anything that restricts
>>> it further is good in my book (my own local Emacs disallows them
>>> plainly, and I rarely bump into code that needs it).

> > What is the fundamental problem?

> The one you're bumping into: multibyte strings are not arrays, and
> treating them as arrays asks for trouble: the performance is not what
> one expects, the implementation is complex and ugly, ...

The problem here is that (make-string 1 ?a) is a unibyte
string, but "a" generated by buffer-substring on a multibyte
buffer is a multibyte string.  The result of concatenating
them is also multibyte.  So, the multibyteness of strings is
difficult to predict.  If we are going to inhibit aset on
multibyte strings, I think we should inhibit aset on any
strings to avoid a further confusion.
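
A sketch for observing this directly; `my-multibyteness-demo' is a
hypothetical name.  If the description above holds on the build
being tested, it returns (nil t t):

```elisp
;; Hypothetical demo: the same one-character text "a", obtained three ways.
(defun my-multibyteness-demo ()
  "Return the multibyteness of a made, a buffer, and a concatenated string."
  (let* ((made (make-string 1 ?a))
         (from-buffer (with-temp-buffer
                        (set-buffer-multibyte t)
                        (insert "a")
                        (buffer-substring (point-min) (point-max))))
         (joined (concat made from-buffer)))
    (list (multibyte-string-p made)
          (multibyte-string-p from-buffer)
          (multibyte-string-p joined))))

(my-multibyteness-demo)
```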

> When weighed against the *very* rare cases where aset is used (let
> alone the even more rare cases where aset is actually useful and
> convenient), the choice is trivial (for me anyway).

Then, shouldn't we start the experiment of inhibiting aset
on strings just now?

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  1:39       ` Kenichi Handa
@ 2008-02-15  4:27         ` Stefan Monnier
  2008-02-15  8:42         ` Eli Zaretskii
  2008-02-16  5:53         ` Richard Stallman
  2 siblings, 0 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-15  4:27 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> Even if the code is very bad, it worked in Emacs 22.
> If it doesn't work in Emacs 23, it's a regression.

If it makes them improve their code, it's an ... improvement.

> The problem here is that (make-string 1 ?a) is a unibyte
> string, but "a" generated by buffer-substring on a multibyte
> buffer is a multibyte string.  The result of concatenating
> them is also multibyte.  So, the multibyteness of strings is
> difficult to predict.

Indeed.  To work around this problem, my locally hacked Emacs
distinguishes between unibyte strings (byte-length < 0), multibyte
strings (byte-length > char-length), and "anybyte" strings (byte-length
= char-length).
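
The three-way split can be approximated at the Lisp level.  This is
only a sketch: stock Emacs stores no such tag, the hack described
above keeps it in the byte-length field, and `my-string-byteness'
is a hypothetical name that simply treats any all-ASCII string as
"anybyte":

```elisp
;; Hypothetical classifier approximating the unibyte / multibyte /
;; "anybyte" distinction from Lisp.
(defun my-string-byteness (str)
  "Classify STR as `anybyte', `multibyte' or `unibyte'."
  (let ((all-ascii t))
    (dotimes (i (length str))
      (when (> (aref str i) 127)
        (setq all-ascii nil)))
    (cond (all-ascii 'anybyte)
          ((multibyte-string-p str) 'multibyte)
          (t 'unibyte))))

(mapcar #'my-string-byteness
        (list "abc" (string #x3042) (unibyte-string #xC0)))
;; => (anybyte multibyte unibyte)
```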

> If we are going to inhibit aset on multibyte strings,

We can just inhibit aset if it requires changing the string's
byte-length or multibyte-ness.

> I think we should inhibit aset on any strings to avoid
> a further confusion.

And that's indeed what my locally hacked Emacs does.

>> When weighed against the *very* rare cases where aset is used (let
>> alone the even more rare cases where aset is actually useful and
>> convenient), the choice is trivial (for me anyway).

> Then, shouldn't we start the experiment of inhibiting aset
> on strings just now?

But I do not think we're ready for that.  Maybe 10 years from now...


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  1:17                       ` Miles Bader
@ 2008-02-15  7:27                         ` David Kastrup
  0 siblings, 0 replies; 43+ messages in thread
From: David Kastrup @ 2008-02-15  7:27 UTC (permalink / raw)
  To: Miles Bader; +Cc: Stephen J. Turnbull, handa, rms, Stefan Monnier, emacs-devel

Miles Bader <miles@gnu.org> writes:

> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
>> (Isn't it Miles who is maintaining the lexbind branch?  But he's
>> there, too. :-)
>
> Yes, the lexbind branch is mine...

Oops.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  1:39       ` Kenichi Handa
  2008-02-15  4:27         ` Stefan Monnier
@ 2008-02-15  8:42         ` Eli Zaretskii
  2008-02-15  8:53           ` Miles Bader
  2008-02-16  5:53         ` Richard Stallman
  2 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2008-02-15  8:42 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: monnier, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Date: Fri, 15 Feb 2008 10:39:01 +0900
> Cc: emacs-devel@gnu.org
> 
> The problem here is that (make-string 1 ?a) is a unibyte
> string, but "a" generated by buffer-substring on a multibyte
> buffer is a multibyte string.

Why should (make-string 1 ?a) produce a unibyte string?  Why can't it
produce a multibyte string instead?

More generally, how about if we make sure _all_ string-producing
primitives return multibyte strings, and unibyte strings can only be
produced by a few specialized ones which have "-unibyte-" in their
name?





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  8:42         ` Eli Zaretskii
@ 2008-02-15  8:53           ` Miles Bader
  2008-02-16 12:55             ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Miles Bader @ 2008-02-15  8:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, monnier, Kenichi Handa

Eli Zaretskii <eliz@gnu.org> writes:
> Why should (make-string 1 ?a) produce a unibyte string?  Why can't it
> produce a multibyte string instead?
>
> More generally, how about if we make sure _all_ string-producing
> primitives return multibyte strings, and unibyte strings can only be
> produced by a few specialized ones which have "-unibyte-" in their
> name?

Why?  That doesn't seem to help with the issue being discussed (Stefan's
solution seems pretty good though)....

-Miles

-- 
Infancy, n. The period of our lives when, according to Wordsworth, 'Heaven
lies about us.' The world begins lying about us pretty soon afterward.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14 22:40                   ` David Kastrup
  2008-02-15  1:08                     ` Stephen J. Turnbull
@ 2008-02-15 12:58                     ` Richard Stallman
  1 sibling, 0 replies; 43+ messages in thread
From: Richard Stallman @ 2008-02-15 12:58 UTC (permalink / raw)
  To: David Kastrup; +Cc: miles, stephen, handa, monnier, emacs-devel

    > I think it is elegant.  Dynamic scoping is absolutely essential for an
    > Emacs-like editor, as explained in the Emacs paper from 1981.

    Shhhhh.  Stefan is still in Cc, and he'll be sad to hear that his
    lexbind branch can't possibly exist.

The lexbind branch does have dynamic scoping.
Without that it would not work at all.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-14 23:37                   ` Leo
@ 2008-02-15 12:59                     ` Richard Stallman
  0 siblings, 0 replies; 43+ messages in thread
From: Richard Stallman @ 2008-02-15 12:59 UTC (permalink / raw)
  To: Leo; +Cc: emacs-devel

    Actually that paper is misleading nowadays.

I am willing to listen to arguments to that effect.






* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  1:39       ` Kenichi Handa
  2008-02-15  4:27         ` Stefan Monnier
  2008-02-15  8:42         ` Eli Zaretskii
@ 2008-02-16  5:53         ` Richard Stallman
  2008-02-16 14:33           ` Stefan Monnier
  2 siblings, 1 reply; 43+ messages in thread
From: Richard Stallman @ 2008-02-16  5:53 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: monnier, emacs-devel

      If we are going to inhibit aset on
    multibyte strings, I think we should inhibit aset on any
    strings to avoid a further confusion.

I think someone should try making it work.
The way I suggested should not be terribly hard.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15  8:53           ` Miles Bader
@ 2008-02-16 12:55             ` Eli Zaretskii
  0 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2008-02-16 12:55 UTC (permalink / raw)
  To: Miles Bader; +Cc: emacs-devel, monnier, handa

> From: Miles Bader <miles.bader@necel.com>
> Cc: Kenichi Handa <handa@m17n.org>, monnier@iro.umontreal.ca,
>         emacs-devel@gnu.org
> Date: Fri, 15 Feb 2008 17:53:50 +0900
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> > Why should (make-string 1 ?a) produce a unibyte string?  Why can't it
> > produce a multibyte string instead?
> >
> > More generally, how about if we make sure _all_ string-producing
> > primitives return multibyte strings, and unibyte strings can only be
> > produced by a few specialized ones which have "-unibyte-" in their
> > name?
> 
> Why?

Because unibyte strings are evil, and shouldn't be needed in Emacs,
except in a few very specialized situations.

> That doesn't seem to help with the issue being discussed (Stefan's
> solution seems pretty good though)....

The fact that (make-string 1 ?a) produces a unibyte string was
mentioned as one problem.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-16  5:53         ` Richard Stallman
@ 2008-02-16 14:33           ` Stefan Monnier
  2008-02-17 20:29             ` Richard Stallman
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2008-02-16 14:33 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel, Kenichi Handa

>       If we are going to inhibit aset on multibyte strings, I think we
>     should inhibit aset on any strings to avoid a further confusion.

> I think someone should try making it work.
> The way I suggested should not be terribly hard.

The problem is the following: while it can be made to work, it will be
inefficient.  If we just make it work, the callers will never get to
know that they're doing things in a terribly inefficient way.  The real
fix is to change the caller.

BTW, I suggest the patch below to fix one such caller.


        Stefan


--- orig/src/casefiddle.c
+++ mod/src/casefiddle.c
@@ -75,23 +76,18 @@
       return obj;
     }
 
-  if (STRINGP (obj))
+  if (!STRINGP (obj))
+    wrong_type_argument (Qchar_or_string_p, obj);
+  else if (STRING_UNIBYTE (obj))
     {
-      int multibyte = STRING_MULTIBYTE (obj);
-      int i, i_byte, len;
-      int size = SCHARS (obj);
+      EMACS_INT i;
+      EMACS_INT size = SCHARS (obj);
 
       obj = Fcopy_sequence (obj);
-      for (i = i_byte = 0; i < size; i++, i_byte += len)
+      for (i = 0; i < size; i++)
 	{
-	  if (multibyte)
-	    c = STRING_CHAR_AND_LENGTH (SDATA (obj) + i_byte, 0, len);
-	  else
-	    {
-	      c = SREF (obj, i_byte);
-	      len = 1;
-	      MAKE_CHAR_MULTIBYTE (c);
-	    }
+	  c = SREF (obj, i);
+	  MAKE_CHAR_MULTIBYTE (c);
 	  c1 = c;
 	  if (inword && flag != CASE_CAPITALIZE_UP)
 	    c = DOWNCASE (c);
@@ -102,24 +98,51 @@
 	    inword = (SYNTAX (c) == Sword);
 	  if (c != c1)
 	    {
-	      if (! multibyte)
-		{
-		  MAKE_CHAR_UNIBYTE (c);
-		  SSET (obj, i_byte, c);
-		}
-	      else if (ASCII_CHAR_P (c1) && ASCII_CHAR_P (c))
-		SSET (obj, i_byte,  c);
-	      else
-		{
-		  Faset (obj, make_number (i), make_number (c));
-		  i_byte += CHAR_BYTES (c) - len;
-		}
+	      MAKE_CHAR_UNIBYTE (c);
+	      if (c < 0 || c > 255)
+		error ("Non-unibyte char in unibyte string");
+	      SSET (obj, i, c);
 	    }
 	}
       return obj;
     }
+  else
+    {
+      EMACS_INT i, i_byte, len;
+      EMACS_INT size = SCHARS (obj);
+      USE_SAFE_ALLOCA;
+      unsigned char *dst, *o;
+      /* Over-allocate by 12%: this is a minor overhead, but should be
+	 sufficient in 99.999% of the cases to avoid a reallocation.  */
+      EMACS_INT o_size = SBYTES (obj) + SBYTES (obj) / 8 + MAX_MULTIBYTE_LENGTH;
+      SAFE_ALLOCA (dst, void *, o_size);
+      o = dst;
 
-  wrong_type_argument (Qchar_or_string_p, obj);
+      for (i = i_byte = 0; i < size; i++, i_byte += len)
+	{
+	  if ((o - dst) + MAX_MULTIBYTE_LENGTH > o_size)
+	    { /* Not enough space for the next char: grow the destination.  */
+	      unsigned char *old_dst = dst;
+	      o_size += o_size;	/* Probably overkill, but extremely rare.  */
+	      SAFE_ALLOCA (dst, void *, o_size);
+	      bcopy (old_dst, dst, o - old_dst);
+	      o = dst + (o - old_dst);
+	    }
+	  c = STRING_CHAR_AND_LENGTH (SDATA (obj) + i_byte, 0, len);
+	  if (inword && flag != CASE_CAPITALIZE_UP)
+	    c = DOWNCASE (c);
+	  else if (!UPPERCASEP (c)
+		   && (!inword || flag != CASE_CAPITALIZE_UP))
+	    c = UPCASE1 (c);
+	  if ((int) flag >= (int) CASE_CAPITALIZE)
+	    inword = (SYNTAX (c) == Sword);
+	  o += CHAR_STRING (c, o);
+	}
+      eassert (o - dst <= o_size);
+      obj = make_multibyte_string (dst, size, o - dst);
+      SAFE_FREE ();
+      return obj;
+    }
 }
 
 DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
@@ -329,10 +352,10 @@
   return Qnil;
 }
 \f
-Lisp_Object
+static Lisp_Object
 operate_on_word (arg, newpoint)
      Lisp_Object arg;
-     int *newpoint;
+     EMACS_INT *newpoint;
 {
   Lisp_Object val;
   int farend;
@@ -358,7 +381,7 @@
      Lisp_Object arg;
 {
   Lisp_Object beg, end;
-  int newpoint;
+  EMACS_INT newpoint;
   XSETFASTINT (beg, PT);
   end = operate_on_word (arg, &newpoint);
   casify_region (CASE_UP, beg, end);
@@ -373,7 +396,7 @@
      Lisp_Object arg;
 {
   Lisp_Object beg, end;
-  int newpoint;
+  EMACS_INT newpoint;
   XSETFASTINT (beg, PT);
   end = operate_on_word (arg, &newpoint);
   casify_region (CASE_DOWN, beg, end);
@@ -390,7 +413,7 @@
      Lisp_Object arg;
 {
   Lisp_Object beg, end;
-  int newpoint;
+  EMACS_INT newpoint;
   XSETFASTINT (beg, PT);
   end = operate_on_word (arg, &newpoint);
   casify_region (CASE_CAPITALIZE, beg, end);
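
As a rough illustration of the buffer strategy the multibyte branch of
the patch above uses (start with about 12% slack, double in the rare
case that is not enough), here is a standalone toy in plain C.  It is
not Emacs source: encode_utf8, encode_all and MAX_CHAR_BYTES are names
made up for this sketch, and it uses malloc/free where the patch uses
SAFE_ALLOCA.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHAR_BYTES 4   /* longest UTF-8 sequence this sketch emits */

/* Encode code point CP as UTF-8 at OUT; return the number of bytes
   written.  OUT must have room for MAX_CHAR_BYTES bytes.  */
size_t
encode_utf8 (unsigned long cp, unsigned char *out)
{
  assert (cp <= 0x10FFFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Encode the N code points in CPS into a freshly allocated buffer.
   SRC_BYTES is the byte size of the original string and only sizes the
   first allocation.  The byte count is stored in *OUT_BYTES.  Returns
   NULL if allocation fails; the caller frees the result.  */
unsigned char *
encode_all (const unsigned long *cps, size_t n, size_t src_bytes,
            size_t *out_bytes)
{
  /* Over-allocate by one eighth, plus room for one more character.  */
  size_t cap = src_bytes + src_bytes / 8 + MAX_CHAR_BYTES;
  size_t used = 0;
  unsigned char *dst = malloc (cap);

  if (!dst)
    return NULL;
  for (size_t i = 0; i < n; i++)
    {
      if (used + MAX_CHAR_BYTES > cap)
        {
          /* Not enough room for the next character: double the buffer
             and copy what has been written so far.  */
          unsigned char *bigger = malloc (cap * 2);
          if (!bigger)
            {
              free (dst);
              return NULL;
            }
          memcpy (bigger, dst, used);
          free (dst);
          dst = bigger;
          cap *= 2;
        }
      used += encode_utf8 (cps[i], dst + used);
    }
  *out_bytes = used;
  return dst;
}
```

Each character is written exactly once, so the whole conversion stays
linear in the string length, which is presumably the point of not
going through Faset for every changed character.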





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-16 14:33           ` Stefan Monnier
@ 2008-02-17 20:29             ` Richard Stallman
  2008-02-18  1:15               ` Stefan Monnier
  0 siblings, 1 reply; 43+ messages in thread
From: Richard Stallman @ 2008-02-17 20:29 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, handa

    >       If we are going to inhibit aset on multibyte strings, I think we
    >     should inhibit aset on any strings to avoid a further confusion.

    > I think someone should try making it work.
    > The way I suggested should not be terribly hard.

    The problem is the following: while it can be made to work, it will be
    inefficient.

That inefficiency may or may not be important in any given context.
Fixing it in casefiddle is definitely desirable.
But is it worth breaking all such packages just so that they
will optimize an operation that might not use much of the time anyway?





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-17 20:29             ` Richard Stallman
@ 2008-02-18  1:15               ` Stefan Monnier
  2008-02-18  4:00                 ` Kenichi Handa
  2008-02-18 17:31                 ` Richard Stallman
  0 siblings, 2 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-18  1:15 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel, handa

>> If we are going to inhibit aset on multibyte strings, I think we
>> should inhibit aset on any strings to avoid a further confusion.

>> I think someone should try making it work.
>> The way I suggested should not be terribly hard.

>     The problem is the following: while it can be made to work, it will be
>     inefficient.

> That inefficiency may or may not be important in any given context.
> Fixing it in casefiddle is definitely desirable.
> But is it worth breaking all such packages just so that they
> will optimize an operation that might not use much of the time anyway?

Why work around the problem in `aset' if it isn't worth fixing in the
original code?  Especially since implicit conversion of a unibyte-string
to multibyte is generally a bug in itself (since there are as many ways
to do that as there are coding systems).


        Stefan
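
To make the "as many ways as there are coding systems" point concrete,
here is a small standalone C sketch (not Emacs code; both function
names are invented for it).  It decodes the same byte under two
single-byte charsets.  The ISO-8859-5 decoder is deliberately partial
and only covers ASCII plus the contiguous Cyrillic block.

```c
#include <assert.h>

/* Decode one byte as ISO-8859-1: every byte maps to the code point
   with the same numeric value.  */
unsigned long
decode_latin1 (unsigned char b)
{
  return b;
}

/* Decode one byte as ISO-8859-5, handling only ASCII and the block
   0xB0..0xEF, which maps to U+0410..U+044F.  Any other byte yields
   U+FFFD because this sketch does not carry the full table.  */
unsigned long
decode_iso8859_5_partial (unsigned char b)
{
  if (b < 0x80)
    return b;
  if (b >= 0xB0 && b <= 0xEF)
    return 0x0410 + (unsigned long) (b - 0xB0);
  return 0xFFFD;
}
```

Byte 0xE5 comes out as U+00E5 (a-ring) under the first decoder and as
U+0445 (Cyrillic small letter ha) under the second, while an ASCII
byte such as 0x61 is the same under both.  A unibyte string alone does
not say which reading was meant.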





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-18  1:15               ` Stefan Monnier
@ 2008-02-18  4:00                 ` Kenichi Handa
  2008-02-18 17:31                 ` Richard Stallman
  1 sibling, 0 replies; 43+ messages in thread
From: Kenichi Handa @ 2008-02-18  4:00 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: rms, emacs-devel

In article <jwvskzrgj6d.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > That inefficiency may or may not be important in any given context.
> > Fixing it in casefiddle is definitely desirable.
> > But is it worth breaking all such packages just so that they
> > will optimize an operation that might not use much of the time anyway?

> Why work around the problem in `aset' if it isn't worth fixing in the
> original code?

But you wrote:

> > Then, shouldn't we start the experiment of inhibiting aset
> > on strings just now?
> 
> But I do not think we're ready for that.  Maybe 10 years from now...

I want to avoid treating non-ASCII chars different from
ASCII.  Then, the only solution is to make aset work well
for multibyte characters.

> Especially since implicit conversion of a unibyte-string
> to multibyte is generally a bug in itself (since there are as many ways
> to do that as there are coding systems).

It's not a bug but I agree it's a very bad feature.  But for
the case (aset (string ?a) 0 MULTIBYTE-CHAR), I think it's
better to treat "a" as neutral, or in your terminology
"anybyte".

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-18  1:15               ` Stefan Monnier
  2008-02-18  4:00                 ` Kenichi Handa
@ 2008-02-18 17:31                 ` Richard Stallman
  1 sibling, 0 replies; 43+ messages in thread
From: Richard Stallman @ 2008-02-18 17:31 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel, handa

    Why work around the problem in `aset' if it isn't worth fixing in the
    original code?

I don't understand you.  I don't think I suggested "working around"
the string `aset' problem.  I suggested an easy way to fix the bug and
make `aset' work in all cases.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
@ 2008-04-15  7:11 Kenichi Handa
  2008-04-15 15:52 ` Stefan Monnier
  0 siblings, 1 reply; 43+ messages in thread
From: Kenichi Handa @ 2008-04-15  7:11 UTC (permalink / raw)
  To: emacs-devel; +Cc: kazu

The discussion of this problem has been suspended for a long time.
I'd like to settle it.

I wrote:

> In article <jwvskzrgj6d.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > > That inefficiency may or may not be important in any given context.
> > > Fixing it in casefiddle is definitely desirable.
> > > But is it worth breaking all such packages just so that they
> > > will optimize an operation that might not use much of the time anyway?

> > Why work around the problem in `aset' if it isn't worth fixing in the
> > original code?

> But you wrote:

> > > Then, shouldn't we start the experiment of inhibiting aset
> > > on strings just now?
> > 
> > But I do not think we're ready for that.  Maybe 10 years from now...

> I want to avoid treating non-ASCII chars different from
> ASCII.  Then, the only solution is to make aset work well
> for multibyte characters.

The attached simple change does the work.  May I install it?

---
Kenichi Handa
handa@ni.aist.go.jp


*** lisp.h.~1.617.~	2008-04-01 15:12:13.000000000 +0900
--- lisp.h	2008-04-15 15:42:52.000000000 +0900
***************
*** 725,730 ****
--- 725,737 ----
        (STR) = empty_unibyte_string;  \
      else XSTRING (STR)->size_byte = -1; } while (0)
  
+ /* Mark STR as a multibyte string.  Assure that STR contains only
+    ASCII characters in advance.  */
+ #define STRING_SET_MULTIBYTE(STR)  \
+   do { if (EQ (STR, empty_unibyte_string))  \
+       (STR) = empty_multibyte_string;  \
+     else XSTRING (STR)->size_byte = XSTRING (STR)->size; } while (0)
+ 
  /* Get text properties.  */
  #define STRING_INTERVALS(STR)  (XSTRING (STR)->intervals + 0)
  

*** data.c.~1.290.~	2008-03-27 20:16:37.000000000 +0900
--- data.c	2008-04-15 15:42:31.000000000 +0900
***************
*** 2093,2099 ****
        CHECK_NUMBER (newelt);
  
        if (XINT (newelt) >= 0 && ! SINGLE_BYTE_CHAR_P (XINT (newelt)))
! 	args_out_of_range (array, newelt);
        SSET (array, idxval, XINT (newelt));
      }
  
--- 2093,2109 ----
        CHECK_NUMBER (newelt);
  
        if (XINT (newelt) >= 0 && ! SINGLE_BYTE_CHAR_P (XINT (newelt)))
! 	{
! 	  int i;
! 
! 	  for (i = SBYTES (array) - 1; i >= 0; i--)
! 	    if (SREF (array, i) >= 0x80)
! 	      args_out_of_range (array, newelt);
! 	  /* ARRAY is an ASCII string.  Convert it to a multibyte
! 	     string, and try `aset' again.  */
! 	  STRING_SET_MULTIBYTE (array);
! 	  return Faset (array, idx, newelt);
! 	}
        SSET (array, idxval, XINT (newelt));
      }
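
The rule this data.c change implements can be modelled in a few lines
of standalone C.  This is a toy under stated assumptions, not the
Emacs implementation: struct toy_string and both functions are
invented here, and the actual multibyte store that the patch reaches
by re-entering Faset is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a Lisp string: raw bytes plus a multibyte flag.  */
struct toy_string
{
  unsigned char *data;
  size_t nbytes;
  int multibyte;
};

/* Return 1 if every byte of S is below 0x80.  */
int
toy_ascii_only_p (const struct toy_string *s)
{
  for (size_t i = 0; i < s->nbytes; i++)
    if (s->data[i] >= 0x80)
      return 0;
  return 1;
}

/* Model of storing character C at index IDX of the unibyte string S.
   Returns 0 if C was stored as a plain byte, 1 if S was promoted to
   multibyte (the store itself is omitted in this sketch), and -1 where
   the patched code would signal args-out-of-range.  */
int
toy_aset_unibyte (struct toy_string *s, size_t idx, unsigned long c)
{
  assert (!s->multibyte && idx < s->nbytes);
  if (c < 0x100)
    {
      /* C fits in one byte: store it, the string stays unibyte.  */
      s->data[idx] = (unsigned char) c;
      return 0;
    }
  if (!toy_ascii_only_p (s))
    return -1;
  /* ASCII bytes read the same in both representations, so flipping
     the flag is enough.  */
  s->multibyte = 1;
  return 1;
}
```

One consequence falls straight out of the model: once any byte at or
above 0x80 has been stored in the unibyte string, a later attempt to
store a character above 0xFF is rejected rather than promoted.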
  





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-04-15  7:11 Kenichi Handa
@ 2008-04-15 15:52 ` Stefan Monnier
  2008-04-17  1:13   ` Kenichi Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2008-04-15 15:52 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: kazu, emacs-devel

>> > > That inefficiency may or may not be important in any given context.
>> > > Fixing it in casefiddle is definitely desirable.
>> > > But is it worth breaking all such packages just so that they
>> > > will optimize an operation that might not use much of the time anyway?

>> > Why work around the problem in `aset' if it isn't worth fixing in the
>> > original code?

>> But you wrote:

>> > > Then, shouldn't we start the experiment of inhibiting aset
>> > > on strings just now?
>> > 
>> > But I do not think we're ready for that.  Maybe 10 years from now...

>> I want to avoid treating non-ASCII chars different from
>> ASCII.  Then, the only solution is to make aset work well
>> for multibyte characters.

> The attached simple change does the work.  May I install it?

I guess it's OK.  It's pretty ugly in terms of code, but in terms of
behavior it more or less matches the behavior of what I use (where
I distinguish between unibyte/anybyte/multibyte).


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-04-15 15:52 ` Stefan Monnier
@ 2008-04-17  1:13   ` Kenichi Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Kenichi Handa @ 2008-04-17  1:13 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: kazu, emacs-devel

In article <jwvr6d7ce37.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> I want to avoid treating non-ASCII chars different from
>>> ASCII.  Then, the only solution is to make aset work well
>>> for multibyte characters.

> > The attached simple change does the work.  May I install it?

> I guess it's OK.  It's pretty ugly in terms of code, but in terms of
> behavior it more or less matches the behavior of what I use (where
> I distinguish between unibyte/anybyte/multibyte).

Ok, I've just installed it.

---
Kenichi Handa
handa@ni.aist.go.jp





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
@ 2008-05-07 19:31 Harald Hanche-Olsen
  2008-05-14  6:54 ` Harald Hanche-Olsen
  0 siblings, 1 reply; 43+ messages in thread
From: Harald Hanche-Olsen @ 2008-05-07 19:31 UTC (permalink / raw)
  To: emacs-devel; +Cc: eliz

This works as it should in the latest CVS:

(setq foo (make-string 4 ?a))
(aset foo 1 ?€) ; <= that's a euro sign

But this fails:

(setq foo (make-string 4 ?a))
(aset foo 1 ?å)
(aset foo 1 ?€) ; => Error: args out of range

The problem seems to lie in these lines (2095-2107) from data.c:

      if (XINT (newelt) >= 0 && ! SINGLE_BYTE_CHAR_P (XINT (newelt)))
	{
	  int i;

	  for (i = SBYTES (array) - 1; i >= 0; i--)
	    if (SREF (array, i) >= 0x80)
	      args_out_of_range (array, newelt);
	  /* ARRAY is an ASCII string.  Convert it to a multibyte
	     string, and try `aset' again.  */
	  STRING_SET_MULTIBYTE (array);
	  return Faset (array, idx, newelt);
	}
      SSET (array, idxval, XINT (newelt));

I am sure the test for members >= 0x80 is there for a good reason, but
it clearly screws up this case and makes the fix rather less useful
than it should have been. I don't know emacs internals well enough to
suggest a fix.

And yes, this did bite in real life: It caused mew to choke on a
malformed spam email. No disaster obviously, but inconvenient.

- Harald

PS. My apologies for messing up threading; I wasn't on the list when
the message I am responding to was posted on 2008-04-15, so I don't
know its message-id.





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-05-07 19:31 Harald Hanche-Olsen
@ 2008-05-14  6:54 ` Harald Hanche-Olsen
  2008-05-14 12:22   ` Stefan Monnier
  0 siblings, 1 reply; 43+ messages in thread
From: Harald Hanche-Olsen @ 2008-05-14  6:54 UTC (permalink / raw)
  To: emacs-devel

My message on this topic of a week ago elicited no responses, so I did
a little more research on my own (which I should have done in the
first place, maybe). This time I hope to see some discussion:

+ Harald Hanche-Olsen <hanche@math.ntnu.no>:

> This works as it should in the latest CVS:
> 
> (setq foo (make-string 4 ?a))
> (aset foo 1 ?€) ; <= that's a euro sign
> 
> But this fails:
> 
> (setq foo (make-string 4 ?a))
> (aset foo 1 ?å)
> (aset foo 1 ?€) ; => Error: args out of range

I went back in the mail archives and read the whole thread (it was in
February and April this year), and I realize that the whole idea of
changing a unibyte string into a multibyte one on the fly in order to
support aset on them is somewhat controversial. Be that as it may, the
above example shows that the fix put in by Kenichi Handa does not fix
it right. Moreover, it is clear from the commit message that he was
well aware of this limitation at the time:

Working file: data.c
revision 1.291
date: 2008-04-17 03:10:58 +0200;  author: handa;  state: Exp;  lines: +11 -1;  commitid: yW6gyKxwbZ4EPoZs;
(Faset): Allow setting a multibyte character in an
ASCII-only unibyte string.

It seems to me that in order to get it right, one has to reallocate
the data in the case of a non-ASCII-only unibyte string, using
code like what is already there for the case when aset replaces an
ASCII character with a non-ASCII one (which will increase the byte
count of the string). The end result will be ugly and inefficient, but
I see no other way if we are going to lay this one to rest.

Comments?

- Harald





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-05-14  6:54 ` Harald Hanche-Olsen
@ 2008-05-14 12:22   ` Stefan Monnier
  2008-05-14 12:50     ` Harald Hanche-Olsen
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2008-05-14 12:22 UTC (permalink / raw)
  To: Harald Hanche-Olsen; +Cc: emacs-devel

> My message on this topic of a week ago elicited no responses, so I did
> a little more research on my own (which I should have done in the
> first place, maybe). This time I hope to see some discussion:

> + Harald Hanche-Olsen <hanche@math.ntnu.no>:

>> This works as it should in the latest CVS:
>> 
>> (setq foo (make-string 4 ?a))
>> (aset foo 1 ?€) ; <= that's a euro sign
>> 
>> But this fails:
>> 
>> (setq foo (make-string 4 ?a))
>> (aset foo 1 ?å)
>> (aset foo 1 ?€) ; => Error: args out of range

Show us the real code that bumped into the problem and I'll tell you
how to do it so as to avoid the risk of such problems.


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-05-14 12:22   ` Stefan Monnier
@ 2008-05-14 12:50     ` Harald Hanche-Olsen
  2008-05-15  1:18       ` Stefan Monnier
  0 siblings, 1 reply; 43+ messages in thread
From: Harald Hanche-Olsen @ 2008-05-14 12:50 UTC (permalink / raw)
  To: monnier; +Cc: emacs-devel

+ Stefan Monnier <monnier@IRO.UMontreal.CA>:

> > + Harald Hanche-Olsen <hanche@math.ntnu.no>:
> 
> >> This works as it should in the latest CVS:
> >> 
> >> (setq foo (make-string 4 ?a))
> >> (aset foo 1 ?€) ; <= that's a euro sign
> >> 
> >> But this fails:
> >> 
> >> (setq foo (make-string 4 ?a))
> >> (aset foo 1 ?å)
> >> (aset foo 1 ?€) ; => Error: args out of range
> 
> Show us the real code that bumped into the problem and I'll tell you
> how to do it so as to avoid the risk of such problems.

You'd have to tell the author of mew (http://mew.org/), Kazu Yamamoto.
Actually, I have a one line patch to mew that fixes the problem, but
he seems unwilling to apply it.

Now don't get me wrong: I am not asking for a change in emacs to fix a
problem in mew. I am suggesting a change in emacs for the sake of
robustness: I think that if the problem of inserting multibyte
characters in unibyte strings is worth fixing at all, it is worth
fixing so it works in all cases. Otherwise, why bother? I do
understand the arguments against fixing it, but the current situation
where it will often work, but fail sometimes does not seem good to me.

But at least, it's documented, I see that now:

  4.4 Modifying Strings
  =====================

  The most basic way to alter the contents of an existing string is with
  `aset' (*note Array Functions::).  `(aset STRING IDX CHAR)' stores CHAR
  into STRING at index IDX.  Each character occupies one or more bytes,
  and if CHAR needs a different number of bytes from the character
  already present at that index, `aset' signals an error.

That last bit actually seems to be outdated: An error is not ALWAYS
signaled in the indicated situation, only sometimes.

Anyway, the code you're asking for (in case you're really curious):
In mew-header.el

(defun mew-addrstr-parse-syntax-list (str sep addrp &optional depth allow-spc)
  (when str
    (let* ((i 0) (len (length str))
	   (par-cnt 0) (tmp-cnt 0) (sep-cnt 0)
	   (tmp (mew-make-string len))
	   c ret prevc)
      (catch 'max
	(while (< i len)
	  (setq c (aref str i)) ; <= problem occurs here
	  ... deleted ...)))))

My one-line fix consists of changing the definition (elsewhere)

(defun mew-make-string (len)
  (make-string len ?a))

into one that makes a multibyte string at the outset.

(I like mew (a lot), so I am willing to put up with its various
idiosyncrasies (and there are some).)

- Harald





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-05-14 12:50     ` Harald Hanche-Olsen
@ 2008-05-15  1:18       ` Stefan Monnier
  2008-05-15  6:11         ` Harald Hanche-Olsen
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2008-05-15  1:18 UTC (permalink / raw)
  To: Harald Hanche-Olsen; +Cc: emacs-devel

> Now don't get me wrong: I am not asking for a change in Emacs to fix
> a problem in Mew.  I am suggesting a change in Emacs for the sake of
> robustness: I think that if the problem of inserting multibyte
> characters in unibyte strings is worth fixing at all, it is worth
> fixing so it works in all cases.  Otherwise, why bother? I do
> understand the arguments against fixing it, but the current situation
> where it will often work, but fail sometimes does not seem good to me.

I don't claim that Mew does things wrong.  I just want to see more
examples to better understand the context and try to figure out what's
the right way to fix the problem.  Notice that in your example,

   (setq foo (make-string 4 ?a))
   (aset foo 1 ?å)
   (aset foo 1 ?€) ; => Error: args out of range

the problem comes from the fact that now that we use Unicode, ?å = 229.
So this integer is also the code of a byte, which is why the first aset
succeeds.  Maybe the better answer is for `make-string' to always create
multibyte strings, just like `string' now does.

In any case if you stay far away from `aset on strings' your life will
be generally better, the birds will sing and the sun will shine.

>   The most basic way to alter the contents of an existing string is with
>   `aset' (*note Array Functions::).  `(aset STRING IDX CHAR)' stores CHAR
>   into STRING at index IDX.  Each character occupies one or more bytes,
>   and if CHAR needs a different number of bytes from the character
>   already present at that index, `aset' signals an error.

> That last bit actually seems to be outdated: An error is not ALWAYS
> signaled in the indicated situation, only sometimes.

I hope the text is correct, if not, please report it as a bug.

> (defun mew-addrstr-parse-syntax-list (str sep addrp &optional depth allow-spc)
>   (when str
>     (let* ((i 0) (len (length str))
> 	   (par-cnt 0) (tmp-cnt 0) (sep-cnt 0)
> 	   (tmp (mew-make-string len))
> 	   c ret prevc)
>       (catch 'max
> 	(while (< i len)
> 	  (setq c (aref str i)) ; <= problem occurs here
> 	  ... deleted ...)))))

Hmm... I don't see any `aset'.


        Stefan





* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-05-15  1:18       ` Stefan Monnier
@ 2008-05-15  6:11         ` Harald Hanche-Olsen
  0 siblings, 0 replies; 43+ messages in thread
From: Harald Hanche-Olsen @ 2008-05-15  6:11 UTC (permalink / raw)
  To: monnier; +Cc: emacs-devel

+ Stefan Monnier <monnier@iro.umontreal.ca>:

> I just want to see more
> examples to better understand the context and try to figure out what's
> the right way to fix the problem.  Notice that in your example,
> 
>    (setq foo (make-string 4 ?a))
>    (aset foo 1 ?å)
>    (aset foo 1 ?€) ; => Error: args out of range
> 
> the problem comes from the fact that now that we use Unicode, ?å = 229.
> So this integer is also the code of a byte, which is why the first aset
> succeeds.

Right. Or perhaps more accurately, it is why the first aset succeeds
without automagically converting foo to a multibyte string.

> Maybe the better answer is for `make-string' to always create
> multibyte strings, just like `string' now does.

Hmm. Except it doesn't, quite:

(multibyte-string-p (string ?a ?b ?c ?d)) => nil
(multibyte-string-p (string ?a ?b ?c ?å)) => t

It seems to be the presence of non-ASCII that triggers the creation of
a multibyte string, even though in this case a unibyte string could
also hold the result. In fact, the current behaviours of string and
make-string are quite similar:

(multibyte-string-p (make-string 3 ?a)) => nil
(multibyte-string-p (make-string 3 ?å)) => t
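
The four results above are consistent with a very simple rule, which
can be written down as a standalone C predicate.  This only models the
observed outputs; it is a guess at the rule, not the code Emacs runs,
and result_is_multibyte is a name made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if a string built from the N characters in CHARS would come
   out multibyte under the behaviour observed above, that is, if any
   character is outside the ASCII range.  */
int
result_is_multibyte (const unsigned long *chars, size_t n)
{
  assert (chars != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (chars[i] >= 0x80)
      return 1;
  return 0;
}
```

Under this rule a-ring (code 229) forces a multibyte result even though
229 would also fit in a single byte, which matches the two `t' results.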

> In any case if you stay far away from `aset on strings' your life will
> be generally better, the birds will sing and the sun will shine.

8) I am willing to believe that.

> >   The most basic way to alter the contents of an existing string is with
> >   `aset' (*note Array Functions::).  `(aset STRING IDX CHAR)' stores CHAR
> >   into STRING at index IDX.  Each character occupies one or more bytes,
> >   and if CHAR needs a different number of bytes from the character
> >   already present at that index, `aset' signals an error.
> 
> > That last bit actually seems to be outdated: An error is not ALWAYS
> > signaled in the indicated situation, only sometimes.
> 
> I hope the text is correct, if not, please report it as a bug.

Okay. I'll run it past you here first, though, since my understanding
of multibyte strings is still patchy. This succeeds and returns "€a€":

(let ((str (make-string 3 ?€)))
  (aset str 1 ?a)
  str)

If I am not mistaken ?€ needs two bytes (or more?) while ?a needs one,
right? And since two (or more) is different from one, the above text
claims that aset signals an error? Or is my understanding wrong? There
is code in aset to shuffle the contents of a multibyte string around
in case of a size mismatch, however:

      if (prev_bytes != new_bytes)
	{
	  /* We must relocate the string data.  */
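
For the byte counts in question, and for what "relocating the string
data" amounts to, here is a standalone C sketch.  It assumes plain
UTF-8 lengths for these three characters and is not the Faset code;
utf8_len and replace_bytes are names invented for the illustration.

```c
#include <assert.h>
#include <string.h>

/* Number of bytes code point CP occupies in UTF-8.  */
size_t
utf8_len (unsigned long cp)
{
  assert (cp <= 0x10FFFF);
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Replace the OLD_LEN bytes at byte offset POS of BUF with the NEW_LEN
   bytes at REPL.  BUF holds *USED bytes and has room for CAP.  The tail
   after the replaced character is shifted left or right as needed.
   Returns 1 and updates *USED on success, 0 if the result would not
   fit in CAP bytes.  */
int
replace_bytes (unsigned char *buf, size_t *used, size_t cap,
               size_t pos, size_t old_len,
               const unsigned char *repl, size_t new_len)
{
  assert (pos + old_len <= *used);
  size_t new_used = *used - old_len + new_len;

  if (new_used > cap)
    return 0;
  /* Move everything after the old character to its new position.  */
  memmove (buf + pos + new_len, buf + pos + old_len,
           *used - pos - old_len);
  memcpy (buf + pos, repl, new_len);
  *used = new_used;
  return 1;
}
```

With these lengths the euro sign takes three bytes, a-ring two and `a'
one, so replacing the middle euro sign of three by `a' shrinks the
data from nine bytes to seven and moves the last character two bytes
to the left.  Every such replacement touches the whole tail of the
string.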

> > (defun mew-addrstr-parse-syntax-list (str sep addrp &optional depth allow-spc)
> >   (when str
> >     (let* ((i 0) (len (length str))
> > 	   (par-cnt 0) (tmp-cnt 0) (sep-cnt 0)
> > 	   (tmp (mew-make-string len))
> > 	   c ret prevc)
> >       (catch 'max
> > 	(while (< i len)
> > 	  (setq c (aref str i)) ; <= problem occurs here
> > 	  ... deleted ...)))))
> 
> Hmm... I don't see any `aset'.

Rats. Not enough caffeine, too much work. The deleted code is a big
(cond ...), about 80 lines long, that I didn't want to burden the list
with (it performs parsing after all). I assure you that it contains
(aset tmp tmp-cnt c) in multiple places.

It could have achieved the same result by consing up a list of the
characters and using (apply 'string (nreverse char-list)), or perhaps by
appending chars to a temporary buffer, but it didn't.

- Harald





Thread overview: 43+ messages
2008-02-13  2:36 (aset UNIBYTE-STRING MULTIBYTE-CHAR) Kenichi Handa
2008-02-13  2:49 ` Stefan Monnier
2008-02-13  3:48   ` Kenichi Handa
2008-02-13 15:33     ` Stefan Monnier
2008-02-13 18:06       ` Stephen J. Turnbull
2008-02-13 19:33         ` Stefan Monnier
2008-02-13 22:49         ` Miles Bader
2008-02-14  1:11           ` Stephen J. Turnbull
2008-02-14  1:17             ` Miles Bader
2008-02-14  1:40               ` Stefan Monnier
2008-02-14  1:49                 ` Miles Bader
2008-02-14 18:10                 ` Richard Stallman
2008-02-14 22:40                   ` David Kastrup
2008-02-15  1:08                     ` Stephen J. Turnbull
2008-02-15  1:17                       ` Miles Bader
2008-02-15  7:27                         ` David Kastrup
2008-02-15 12:58                     ` Richard Stallman
2008-02-14 23:37                   ` Leo
2008-02-15 12:59                     ` Richard Stallman
2008-02-14  4:20               ` Stephen J. Turnbull
2008-02-14  4:42         ` Richard Stallman
2008-02-15  1:39       ` Kenichi Handa
2008-02-15  4:27         ` Stefan Monnier
2008-02-15  8:42         ` Eli Zaretskii
2008-02-15  8:53           ` Miles Bader
2008-02-16 12:55             ` Eli Zaretskii
2008-02-16  5:53         ` Richard Stallman
2008-02-16 14:33           ` Stefan Monnier
2008-02-17 20:29             ` Richard Stallman
2008-02-18  1:15               ` Stefan Monnier
2008-02-18  4:00                 ` Kenichi Handa
2008-02-18 17:31                 ` Richard Stallman
2008-02-13 22:01 ` Richard Stallman
2008-02-13 23:13   ` Miles Bader
  -- strict thread matches above, loose matches on Subject: below --
2008-04-15  7:11 Kenichi Handa
2008-04-15 15:52 ` Stefan Monnier
2008-04-17  1:13   ` Kenichi Handa
2008-05-07 19:31 Harald Hanche-Olsen
2008-05-14  6:54 ` Harald Hanche-Olsen
2008-05-14 12:22   ` Stefan Monnier
2008-05-14 12:50     ` Harald Hanche-Olsen
2008-05-15  1:18       ` Stefan Monnier
2008-05-15  6:11         ` Harald Hanche-Olsen
