* (aset UNIBYTE-STRING MULTIBYTE-CHAR)
@ 2008-02-13 2:36 Kenichi Handa
  2008-02-13 2:49 ` Stefan Monnier
  2008-02-13 22:01 ` Richard Stallman
  0 siblings, 2 replies; 43+ messages in thread
From: Kenichi Handa @ 2008-02-13 2:36 UTC (permalink / raw)
  To: emacs-devel

Before the unicode merge, this worked:

  (let ((str "a")) (aset str 0 (decode-char 'ucs #x100)))

In the emacs-unicode-2 branch, there was a discussion about whether
it is right for aset to change the multibyteness of a string, and I
changed the code to signal an error in the above case.

But I got reports claiming that the change breaks some existing
Elisp packages. I could change the current code again to make the
above code work, but that causes another problem in this case:

  (let ((str "a")) (aset str 0 #xC0))

Currently, it changes STR to the unibyte string "\300" (the same as
before the unicode merge), but if we allow changing the string's
multibyteness, perhaps STR must be changed to the multibyte string of
A-grave "À", because the character code of A-grave is #xC0. But that
means we lose an easy way to manipulate raw byte data in a unibyte
string.

What do you think is the right thing to do here?

---
Kenichi Handa
handa@ni.aist.go.jp

^ permalink raw reply [flat|nested] 43+ messages in thread
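The two cases in the message above can be probed on any given Emacs build with a small sketch like the following. `my-aset-report` is a hypothetical helper, not part of Emacs. It works on a copy so the string literal is left alone, and it makes no claim about which outcome a particular Emacs version produces. It assumes an Emacs in which character codes are Unicode code points, so `#x100` stands in for `(decode-char 'ucs #x100)`.

```elisp
;; Hypothetical helper for illustration only (not an Emacs function).
(defun my-aset-report (str idx char)
  "Store CHAR at IDX in a copy of STR and describe what happened.
Return a plist with the multibyteness before and after and the
resulting byte length, or (:error SYMBOL) if `aset' signals."
  (let ((copy (copy-sequence str)))
    (condition-case err
        (progn
          (aset copy idx char)
          (list :multibyte-before (multibyte-string-p str)
                :multibyte-after (multibyte-string-p copy)
                :bytes (string-bytes copy)))
      (error (list :error (car err))))))

;; The three cases raised in the thread; results depend on the Emacs version.
(my-aset-report "a" 0 #x100)     ; ASCII-only unibyte string, multibyte char
(my-aset-report "a" 0 #xC0)      ; raw byte \300, or A-grave?
(my-aset-report "\300" 0 #x100)  ; unibyte string that holds a raw byte
```

Comparing `:multibyte-before` with `:multibyte-after` shows whether `aset` silently changed the string's representation or refused the store.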
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 2:36 (aset UNIBYTE-STRING MULTIBYTE-CHAR) Kenichi Handa @ 2008-02-13 2:49 ` Stefan Monnier 2008-02-13 3:48 ` Kenichi Handa 2008-02-13 22:01 ` Richard Stallman 1 sibling, 1 reply; 43+ messages in thread From: Stefan Monnier @ 2008-02-13 2:49 UTC (permalink / raw) To: Kenichi Handa; +Cc: emacs-devel > Before the unicode merge, this worked: > (let ((str "a")) (aset str 0 (decode-char 'ucs #x100))) > In emacs-unicode-2 branch, there was a discussion about the > rightness of aset changing the multibyteness of a string, > and I changed the code to signal an error in the above case. An error sounds right. > But, I got reports claiming that the change breaks some of > already existing Elisp packages. Although changing the Details? > What do you think is the right thing for this matter? aset on strings is fundamentally problematic, so anything that restricts it further is good in my book (my own local Emacs disallows them plainly, and I rarely bump into code that needs it). Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 2:49 ` Stefan Monnier @ 2008-02-13 3:48 ` Kenichi Handa 2008-02-13 15:33 ` Stefan Monnier 0 siblings, 1 reply; 43+ messages in thread From: Kenichi Handa @ 2008-02-13 3:48 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel In article <jwvir0t8tgm.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: > > Before the unicode merge, this worked: > > (let ((str "a")) (aset str 0 (decode-char 'ucs #x100))) > > In emacs-unicode-2 branch, there was a discussion about the > > rightness of aset changing the multibyteness of a string, > > and I changed the code to signal an error in the above case. > An error sounds right. For this: (let ((str "\300")) (aset str 0 (decode-char 'ucs #x100))) an error may be ok. But for the first example, although "a" is currently treated as a unibyte string, I think it's more like multibyteness-not-yet-decided, i.e. it's neutral about the multibyteness. > > But, I got reports claiming that the change breaks some of > > already existing Elisp packages. Although changing the > Details? Something like this code: (setq result (cons (let ((str (make-string 1 0))) (aset str 0 (make-char 'japanese-jisx0208 ku ten)) although it's easy to fix it... > > What do you think is the right thing for this matter? > aset on strings is fundamentally problematic, so anything that restricts > it further is good in my book (my own local Emacs disallows them > plainly, and I rarely bump into code that needs it). What is the fundamental problem? --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 3:48 ` Kenichi Handa
@ 2008-02-13 15:33 ` Stefan Monnier
  2008-02-13 18:06 ` Stephen J. Turnbull
  2008-02-15 1:39 ` Kenichi Handa
  0 siblings, 2 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-13 15:33 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> Something like this code:
> (setq result (cons
> (let ((str (make-string 1 0)))
> (aset str 0 (make-char 'japanese-jisx0208 ku ten))

That's truly horrendous code. I see no reason to support it.

> although it's easy to fix it...

Not only is it easy, but the result is more
efficient/legible/maintainable.

>> aset on strings is fundamentally problematic, so anything that restricts
>> it further is good in my book (my own local Emacs disallows them
>> plainly, and I rarely bump into code that needs it).

> What is the fundamental problem?

The one you're bumping into: multibyte strings are not arrays, and
treating them as arrays asks for trouble: the performance is not what
one expects, the implementation is complex and ugly, ...

When weighed against the *very* rare cases where aset is used (let
alone the even rarer cases where aset is actually useful and
convenient), the choice is trivial (for me anyway).

        Stefan

PS: I see bindat.el uses string-make-unibyte in a similar way to the
place where we recently switched to unibyte-string, except that the
source is an array rather than a list, and I was thinking: wouldn't
it make sense to allow `apply' to take an array of args rather than
a list of args? Especially if it's of the form (apply FUN ARRAY),
since we could then use ARRAY directly without having to copy the
args one by one into a new C array.

^ permalink raw reply [flat|nested] 43+ messages in thread
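The "easy fix" mentioned above is not spelled out in the thread. One plausible reading, sketched here as an assumption rather than as the fix anyone actually applied, is to build the one-character string directly instead of allocating a placeholder with `make-string` and overwriting it with `aset`. `my-char-as-string` is a hypothetical name, and `#x3042` is an arbitrary non-ASCII character standing in for the result of `(make-char 'japanese-jisx0208 ku ten)`.

```elisp
;; Hypothetical helper for illustration only (not an Emacs function).
(defun my-char-as-string (char)
  "Return a one-character string holding CHAR, without using `aset'."
  (string char))

;; #x3042 (HIRAGANA LETTER A) stands in for the character that the
;; quoted code obtains from `make-char'.
(let ((s (my-char-as-string #x3042)))
  (list (length s) (multibyte-string-p s)))
```

Because the string is created with its final contents, its representation never has to change after the fact, which is the situation the `aset` discussion is trying to avoid.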
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 15:33 ` Stefan Monnier @ 2008-02-13 18:06 ` Stephen J. Turnbull 2008-02-13 19:33 ` Stefan Monnier ` (2 more replies) 2008-02-15 1:39 ` Kenichi Handa 1 sibling, 3 replies; 43+ messages in thread From: Stephen J. Turnbull @ 2008-02-13 18:06 UTC (permalink / raw) To: Stefan Monnier; +Cc: Kenichi Handa, emacs-devel Stefan Monnier writes: > PS: I see bindat.el uses string-make-unibyte is a similar way to the > place where we recently switched to unibyte-string, except that th > source is an array rather than a list, and I was thinking: wouldn't > it make sense to allow `apply' to take an array of args rather than > a list of args? How Pythonic! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 18:06 ` Stephen J. Turnbull @ 2008-02-13 19:33 ` Stefan Monnier 2008-02-13 22:49 ` Miles Bader 2008-02-14 4:42 ` Richard Stallman 2 siblings, 0 replies; 43+ messages in thread From: Stefan Monnier @ 2008-02-13 19:33 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Kenichi Handa, emacs-devel >> PS: I see bindat.el uses string-make-unibyte is a similar way to the >> place where we recently switched to unibyte-string, except that th >> source is an array rather than a list, and I was thinking: wouldn't >> it make sense to allow `apply' to take an array of args rather than >> a list of args? > How Pythonic! I never used Python and the idea is already used by Elisp's mapcar and mapconcat which predate Python... Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
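For readers who want to see the asymmetry being discussed, here is a small sketch. `mapcar` and `mapconcat` accept any sequence, while this sketch assumes `apply` requires its final argument to be a list, so an array has to be converted first. `my-apply-array` is a hypothetical wrapper, not an Emacs function, and it performs exactly the element-by-element copy that the proposal would make unnecessary.

```elisp
;; `mapcar' and `mapconcat' already take vectors as well as lists.
(mapcar #'1+ [1 2 3])                 ; => (2 3 4)
(mapconcat #'identity ["a" "b"] "-")  ; => "a-b"

;; Hypothetical wrapper for illustration only (not an Emacs function).
(defun my-apply-array (fun array)
  "Call FUN with the elements of ARRAY as its arguments.
The array is first copied into a fresh list with `append'."
  (apply fun (append array nil)))

(my-apply-array #'+ [1 2 3])                 ; => 6
(my-apply-array #'unibyte-string [104 105])  ; => "hi"
```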
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 18:06 ` Stephen J. Turnbull 2008-02-13 19:33 ` Stefan Monnier @ 2008-02-13 22:49 ` Miles Bader 2008-02-14 1:11 ` Stephen J. Turnbull 2008-02-14 4:42 ` Richard Stallman 2 siblings, 1 reply; 43+ messages in thread From: Miles Bader @ 2008-02-13 22:49 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > PS: I see bindat.el uses string-make-unibyte is a similar way to the > > place where we recently switched to unibyte-string, except that th > > source is an array rather than a list, and I was thinking: wouldn't > > it make sense to allow `apply' to take an array of args rather than > > a list of args? > > How Pythonic! No need for insults Stephen! [p.s. Stefan -- great idea, probably will be a lot more efficient than using a list...] -Miles -- "Don't just question authority, Don't forget to question me." -- Jello Biafra ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 22:49 ` Miles Bader @ 2008-02-14 1:11 ` Stephen J. Turnbull 2008-02-14 1:17 ` Miles Bader 0 siblings, 1 reply; 43+ messages in thread From: Stephen J. Turnbull @ 2008-02-14 1:11 UTC (permalink / raw) To: Miles Bader; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel Miles Bader writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > How Pythonic! > No need for insults Stephen! You're welcome to take insult if you like, but I don't know a higher compliment in language design, for values of "design" that are verbs. Isn't half bad for values of "design" that are nouns, for that matter. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 1:11 ` Stephen J. Turnbull @ 2008-02-14 1:17 ` Miles Bader 2008-02-14 1:40 ` Stefan Monnier 2008-02-14 4:20 ` Stephen J. Turnbull 0 siblings, 2 replies; 43+ messages in thread From: Miles Bader @ 2008-02-14 1:17 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > How Pythonic! > > > No need for insults Stephen! > > You're welcome to take insult if you like, but I don't know a higher > compliment in language design Please tell me you're joking... -Miles -- Opportunity, n. A favorable occasion for grasping a disappointment. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 1:17 ` Miles Bader @ 2008-02-14 1:40 ` Stefan Monnier 2008-02-14 1:49 ` Miles Bader 2008-02-14 18:10 ` Richard Stallman 2008-02-14 4:20 ` Stephen J. Turnbull 1 sibling, 2 replies; 43+ messages in thread From: Stefan Monnier @ 2008-02-14 1:40 UTC (permalink / raw) To: Miles Bader; +Cc: Stephen J. Turnbull, Kenichi Handa, emacs-devel >> > > How Pythonic! >> > No need for insults Stephen! >> You're welcome to take insult if you like, but I don't know a higher >> compliment in language design > Please tell me you're joking... While I don't consider Python as the best design by far (the lack of type system rules it out right away), Elisp's dynamic scoping mixed with buffer-local and frame-local and terminal-local variables is pretty horrendous, Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 1:40 ` Stefan Monnier @ 2008-02-14 1:49 ` Miles Bader 2008-02-14 18:10 ` Richard Stallman 1 sibling, 0 replies; 43+ messages in thread From: Miles Bader @ 2008-02-14 1:49 UTC (permalink / raw) To: Stefan Monnier; +Cc: Stephen J. Turnbull, Kenichi Handa, emacs-devel Stefan Monnier <monnier@iro.umontreal.ca> writes: > While I don't consider Python as the best design by far (the lack of > type system rules it out right away), Elisp's dynamic scoping mixed with > buffer-local and frame-local and terminal-local variables is pretty > horrendous, Hey I didn't make any claims about elisp's variables (or elisp at all for that matter...). Python is far from being the _worst_ language out there. Still, the phrase "damning with faint praise" comes to mind... -Miles -- Infancy, n. The period of our lives when, according to Wordsworth, 'Heaven lies about us.' The world begins lying about us pretty soon afterward. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 1:40 ` Stefan Monnier 2008-02-14 1:49 ` Miles Bader @ 2008-02-14 18:10 ` Richard Stallman 2008-02-14 22:40 ` David Kastrup 2008-02-14 23:37 ` Leo 1 sibling, 2 replies; 43+ messages in thread From: Richard Stallman @ 2008-02-14 18:10 UTC (permalink / raw) To: Stefan Monnier; +Cc: stephen, handa, emacs-devel, miles While I don't consider Python as the best design by far (the lack of type system rules it out right away), Elisp's dynamic scoping mixed with buffer-local and frame-local and terminal-local variables is pretty horrendous, I think it is elegant. Dynamic scoping is absolutely essential for an Emacs-like editor, as explained in the Emacs paper from 1981. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 18:10 ` Richard Stallman @ 2008-02-14 22:40 ` David Kastrup 2008-02-15 1:08 ` Stephen J. Turnbull 2008-02-15 12:58 ` Richard Stallman 2008-02-14 23:37 ` Leo 1 sibling, 2 replies; 43+ messages in thread From: David Kastrup @ 2008-02-14 22:40 UTC (permalink / raw) To: rms; +Cc: miles, stephen, handa, Stefan Monnier, emacs-devel Richard Stallman <rms@gnu.org> writes: > While I don't consider Python as the best design by far (the lack > of type system rules it out right away), Elisp's dynamic scoping > mixed with buffer-local and frame-local and terminal-local > variables is pretty horrendous, > > I think it is elegant. Dynamic scoping is absolutely essential for an > Emacs-like editor, as explained in the Emacs paper from 1981. Shhhhh. Stefan is still in Cc, and he'll be sad to hear that his lexbind branch can't possibly exist. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 22:40 ` David Kastrup @ 2008-02-15 1:08 ` Stephen J. Turnbull 2008-02-15 1:17 ` Miles Bader 2008-02-15 12:58 ` Richard Stallman 1 sibling, 1 reply; 43+ messages in thread From: Stephen J. Turnbull @ 2008-02-15 1:08 UTC (permalink / raw) To: David Kastrup; +Cc: handa, emacs-devel, rms, Stefan Monnier, miles David Kastrup writes: > Richard Stallman <rms@gnu.org> writes: > > > While I don't consider Python as the best design by far (the lack > > of type system rules it out right away), Elisp's dynamic scoping > > mixed with buffer-local and frame-local and terminal-local > > variables is pretty horrendous, > > > > I think it is elegant. Dynamic scoping is absolutely essential for an > > Emacs-like editor, as explained in the Emacs paper from 1981. > > Shhhhh. Stefan is still in Cc, and he'll be sad to hear that his > lexbind branch can't possibly exist. No problema, amigos. From the cited paper (SIGOA 1981, 2:1-2): It is not necessary for dynamic scope to be the *only* scope rule available, just useful for it to be available. (Isn't it Miles who is maintaining the lexbind branch? But he's there, too. :-) ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-15 1:08 ` Stephen J. Turnbull @ 2008-02-15 1:17 ` Miles Bader 2008-02-15 7:27 ` David Kastrup 0 siblings, 1 reply; 43+ messages in thread From: Miles Bader @ 2008-02-15 1:17 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: handa, rms, Stefan Monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > (Isn't it Miles who is maintaining the lexbind branch? But he's > there, too. :-) Yes, the lexbind branch is mine... -Miles -- Philosophy, n. A route of many roads leading from nowhere to nothing. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-15 1:17 ` Miles Bader @ 2008-02-15 7:27 ` David Kastrup 0 siblings, 0 replies; 43+ messages in thread From: David Kastrup @ 2008-02-15 7:27 UTC (permalink / raw) To: Miles Bader; +Cc: Stephen J. Turnbull, handa, rms, Stefan Monnier, emacs-devel Miles Bader <miles@gnu.org> writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: >> (Isn't it Miles who is maintaining the lexbind branch? But he's >> there, too. :-) > > Yes, the lexbind branch is mine... Oops. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 22:40 ` David Kastrup 2008-02-15 1:08 ` Stephen J. Turnbull @ 2008-02-15 12:58 ` Richard Stallman 1 sibling, 0 replies; 43+ messages in thread From: Richard Stallman @ 2008-02-15 12:58 UTC (permalink / raw) To: David Kastrup; +Cc: miles, stephen, handa, monnier, emacs-devel > I think it is elegant. Dynamic scoping is absolutely essential for an > Emacs-like editor, as explained in the Emacs paper from 1981. Shhhhh. Stefan is still in Cc, and he'll be sad to hear that his lexbind branch can't possibly exist. The lexbind branch does have dynamic scoping. Without that it would not work at all. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 18:10 ` Richard Stallman 2008-02-14 22:40 ` David Kastrup @ 2008-02-14 23:37 ` Leo 2008-02-15 12:59 ` Richard Stallman 1 sibling, 1 reply; 43+ messages in thread From: Leo @ 2008-02-14 23:37 UTC (permalink / raw) To: emacs-devel On 2008-02-14 18:10 +0000, Richard Stallman wrote: > I think it is elegant. Dynamic scoping is absolutely essential for an > Emacs-like editor, as explained in the Emacs paper from 1981. Actually that paper is misleading nowadays. -- .: Leo :. [ sdl.web AT gmail.com ] .: [ GPG Key: 9283AA3F ] :. Use the best OS -- http://www.fedoraproject.org/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 23:37 ` Leo @ 2008-02-15 12:59 ` Richard Stallman 0 siblings, 0 replies; 43+ messages in thread From: Richard Stallman @ 2008-02-15 12:59 UTC (permalink / raw) To: Leo; +Cc: emacs-devel Actually that paper is misleading nowadays. I am willing to listen to arguments to that effect. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-14 1:17 ` Miles Bader 2008-02-14 1:40 ` Stefan Monnier @ 2008-02-14 4:20 ` Stephen J. Turnbull 1 sibling, 0 replies; 43+ messages in thread From: Stephen J. Turnbull @ 2008-02-14 4:20 UTC (permalink / raw) To: Miles Bader; +Cc: Kenichi Handa, Stefan Monnier, emacs-devel Miles Bader writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > > How Pythonic! > > > > > No need for insults Stephen! > > > > You're welcome to take insult if you like, but I don't know a higher > > compliment in language design > > Please tell me you're joking... No, I'm not. You don't have to like their design goals, but to lack respect for their success in achieving the ones they've chosen ... well, go ahead, laugh at the Tao. That's what it's there for, says so right on the label. As for the particular proposal of Stefan's, it *is* very Pythonic. (It's duck typing for sequences.) It's not particularly Emacs-Lisp-y, what with aref and nth that do the same thing but to different types of sequences, etc., and the dozen or more ways of implementing dictionaries, all having distinct APIs for accessing properties by keyword. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 18:06 ` Stephen J. Turnbull 2008-02-13 19:33 ` Stefan Monnier 2008-02-13 22:49 ` Miles Bader @ 2008-02-14 4:42 ` Richard Stallman 2 siblings, 0 replies; 43+ messages in thread From: Richard Stallman @ 2008-02-14 4:42 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: handa, monnier, emacs-devel > PS: I see bindat.el uses string-make-unibyte is a similar way to the > place where we recently switched to unibyte-string, except that th > source is an array rather than a list, and I was thinking: wouldn't > it make sense to allow `apply' to take an array of args rather than > a list of args? How Pythonic! I see no harm in it. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-13 15:33 ` Stefan Monnier
  2008-02-13 18:06 ` Stephen J. Turnbull
@ 2008-02-15 1:39 ` Kenichi Handa
  2008-02-15 4:27 ` Stefan Monnier
  ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Kenichi Handa @ 2008-02-15 1:39 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

In article <jwvwsp8vpx8.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

> > Something like this code:
> > (setq result (cons
> > (let ((str (make-string 1 0)))
> > (aset str 0 (make-char 'japanese-jisx0208 ku ten))

> That's truly horrendous code. I see no reason to support it.

> > although it's easy to fix it...

> Not only it's easy but the result is more efficient/legible/maintainable.

Even if the code is very bad, it worked in Emacs 22.
If it doesn't work in Emacs 23, it's a regression.

>>> aset on strings is fundamentally problematic, so anything that restricts
>>> it further is good in my book (my own local Emacs disallows them
>>> plainly, and I rarely bump into code that needs it).

> > What is the fundamental problem?

> The one you're bumping into: multibyte strings are not arrays and
> treating them like ones asks for trouble: the performance is not the one
> expected, the implementation is complex and ugly, ...

The problem here is that (make-string 1 ?a) is a unibyte
string, but "a" generated by buffer-substring on a multibyte
buffer is a multibyte string. The result of concatenating
them is also multibyte. So the multibyteness of a string is
difficult to predict.

If we are going to inhibit aset on multibyte strings, I
think we should inhibit aset on all strings to avoid
further confusion.

> When weighed against the *very* rare cases where aset is used (let
> alone the even more rare cases where aset is actually useful and
> convenient), the choice is trivial (for me anyway).

Then shouldn't we start the experiment of inhibiting aset
on strings right now?

---
Kenichi Handa
handa@ni.aist.go.jp

^ permalink raw reply [flat|nested] 43+ messages in thread
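The point about unpredictable multibyteness can be checked with a short sketch. `my-multibyteness-demo` is a hypothetical name. The sketch assumes that temporary buffers are multibyte, which is the usual default. Going by the description in the message above, one would expect the first flag to be nil and the next two to be non-nil, even though all three strings print identically.

```elisp
;; Hypothetical helper for illustration only (not an Emacs function).
(defun my-multibyteness-demo ()
  "Return multibyteness flags for three ways of producing an \"a\".
The result is a list: (MAKE-STRING-FLAG BUFFER-FLAG CONCAT-FLAG SAME-TEXT)."
  (let* ((from-make-string (make-string 1 ?a))
         ;; Assumes the temporary buffer is multibyte (the usual default).
         (from-buffer (with-temp-buffer
                        (insert "a")
                        (buffer-substring (point-min) (point-max))))
         (joined (concat from-make-string from-buffer)))
    (list (multibyte-string-p from-make-string)
          (multibyte-string-p from-buffer)
          (multibyte-string-p joined)
          (string= from-make-string from-buffer))))

(my-multibyteness-demo)
```

The last element confirms that the strings compare equal as text, so the difference is purely one of internal representation.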
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-15 1:39 ` Kenichi Handa
@ 2008-02-15 4:27 ` Stefan Monnier
  2008-02-15 8:42 ` Eli Zaretskii
  2008-02-16 5:53 ` Richard Stallman
  2 siblings, 0 replies; 43+ messages in thread
From: Stefan Monnier @ 2008-02-15 4:27 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> Even if the code is very bad, it worked in Emacs 22.
> If it doesn't work in Emacs 23, it's a regression.

If it makes them improve their code, it's an ... improvement.

> The problem here is that (make-string 1 ?a) is a unibyte
> string, but "a" generated by buffer-substring on a multibyte
> buffer is a multibyte string. The result of concatenating
> them is also multibyte. So the multibyteness of a string is
> difficult to predict.

Indeed. To work around this problem, my locally hacked Emacs
distinguishes between unibyte strings (byte-length < 0), multibyte
strings (byte-length > char-length), and "anybyte" strings
(byte-length = char-length).

> If we are going to inhibit aset on multibyte strings,

We can just inhibit aset if it requires changing the string's
byte-length or multibyte-ness.

> I think we should inhibit aset on all strings to avoid
> further confusion.

And that's indeed what my locally hacked Emacs does.

>> When weighed against the *very* rare cases where aset is used (let
>> alone the even more rare cases where aset is actually useful and
>> convenient), the choice is trivial (for me anyway).

> Then shouldn't we start the experiment of inhibiting aset
> on strings right now?

But I do not think we're ready for that. Maybe 10 years from now...

        Stefan

^ permalink raw reply [flat|nested] 43+ messages in thread
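The three-way distinction and the "inhibit aset if it would change the byte length or multibyteness" rule can be approximated at the Lisp level. This is only a sketch under stated assumptions: the message describes a change inside the C string representation of a locally modified Emacs, whereas the code below merely inspects ordinary strings. Both function names are hypothetical, and treating every ASCII-only string as "anybyte" is an interpretation of the message, not something it states.

```elisp
;; Hypothetical helpers for illustration only (not Emacs functions).
(defun my-string-kind (str)
  "Classify STR as `anybyte', `multibyte' or `unibyte'.
Assumption: an ASCII-only string counts as `anybyte' whatever its flag."
  (let ((ascii-only t))
    (dotimes (i (length str))
      (when (> (aref str i) 127)
        (setq ascii-only nil)))
    (cond (ascii-only 'anybyte)
          ((multibyte-string-p str) 'multibyte)
          (t 'unibyte))))

(defun my-aset-keeps-layout-p (str idx char)
  "Return non-nil if storing CHAR at IDX keeps STR's byte length and kind."
  (let ((kind (my-string-kind str)))
    (cond ((eq kind 'unibyte) (< char 256))   ; CHAR must fit in one raw byte
          ((eq kind 'anybyte) (< char 128))   ; only ASCII keeps bytes = chars
          (t (= (string-bytes (char-to-string (aref str idx)))
                (string-bytes (char-to-string char)))))))

(my-string-kind "abc")                            ; => anybyte
(my-string-kind "\300")                           ; => unibyte
(my-aset-keeps-layout-p "a" 0 #x100)              ; => nil
(my-aset-keeps-layout-p (string #x100) 0 #x101)   ; => t
```

Under this reading, the first example in the thread, storing `#x100` into "a", is exactly a store that does not keep the layout, which is why it either has to signal an error or convert the string.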
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-15 1:39 ` Kenichi Handa 2008-02-15 4:27 ` Stefan Monnier @ 2008-02-15 8:42 ` Eli Zaretskii 2008-02-15 8:53 ` Miles Bader 2008-02-16 5:53 ` Richard Stallman 2 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2008-02-15 8:42 UTC (permalink / raw) To: Kenichi Handa; +Cc: monnier, emacs-devel > From: Kenichi Handa <handa@m17n.org> > Date: Fri, 15 Feb 2008 10:39:01 +0900 > Cc: emacs-devel@gnu.org > > The problem here is that (make-string 1 ?a) is a unibyte > string, but "a" generated by buffer-substring on a multibyte > buffer is a multibyte string. Why should (make-string 1 ?a) produce a unibyte string? Why can't it produce a multibyte string instead? More generally, how about if we make sure _all_ string-producing primitives return multibyte strings, and unibyte strings can only be produced by a few specialized ones which have "-unibyte-" in their name? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-15 8:42 ` Eli Zaretskii @ 2008-02-15 8:53 ` Miles Bader 2008-02-16 12:55 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Miles Bader @ 2008-02-15 8:53 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel, monnier, Kenichi Handa Eli Zaretskii <eliz@gnu.org> writes: > Why should (make-string 1 ?a) produce a unibyte string? Why can't it > produce a multibyte string instead? > > More generally, how about if we make sure _all_ string-producing > primitives return multibyte strings, and unibyte strings can only be > produced by a few specialized ones which have "-unibyte-" in their > name? Why? That doesn't seem to help with the issue being discussed (Stefan's solution seems pretty good though).... -Miles -- Infancy, n. The period of our lives when, according to Wordsworth, 'Heaven lies about us.' The world begins lying about us pretty soon afterward. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-15 8:53 ` Miles Bader @ 2008-02-16 12:55 ` Eli Zaretskii 0 siblings, 0 replies; 43+ messages in thread From: Eli Zaretskii @ 2008-02-16 12:55 UTC (permalink / raw) To: Miles Bader; +Cc: emacs-devel, monnier, handa > From: Miles Bader <miles.bader@necel.com> > Cc: Kenichi Handa <handa@m17n.org>, monnier@iro.umontreal.ca, > emacs-devel@gnu.org > Date: Fri, 15 Feb 2008 17:53:50 +0900 > > Eli Zaretskii <eliz@gnu.org> writes: > > Why should (make-string 1 ?a) produce a unibyte string? Why can't it > > produce a multibyte string instead? > > > > More generally, how about if we make sure _all_ string-producing > > primitives return multibyte strings, and unibyte strings can only be > > produced by a few specialized ones which have "-unibyte-" in their > > name? > > Why? Because unibyte strings are evil, and shouldn't be needed in Emacs, except in a few very specialized situations. > That doesn't seem to help with the issue being discussed (Stefan's > solution seems pretty good though).... The fact that (make-string 1 ?a) produces a unibyte string was mentioned as one problem. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-15 1:39 ` Kenichi Handa 2008-02-15 4:27 ` Stefan Monnier 2008-02-15 8:42 ` Eli Zaretskii @ 2008-02-16 5:53 ` Richard Stallman 2008-02-16 14:33 ` Stefan Monnier 2 siblings, 1 reply; 43+ messages in thread From: Richard Stallman @ 2008-02-16 5:53 UTC (permalink / raw) To: Kenichi Handa; +Cc: monnier, emacs-devel If we are going to inhibit aset on multibyte strings, I think we should inhibit aset on any strings to avoid a further confusion. I think someone should try making it work. The way I suggested should not be terribly hard. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR)
  2008-02-16 5:53 ` Richard Stallman
@ 2008-02-16 14:33 ` Stefan Monnier
  2008-02-17 20:29 ` Richard Stallman
  0 siblings, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2008-02-16 14:33 UTC (permalink / raw)
  To: rms; +Cc: emacs-devel, Kenichi Handa

>     If we are going to inhibit aset on multibyte strings, I think we
>     should inhibit aset on any strings to avoid a further confusion.

> I think someone should try making it work.
> The way I suggested should not be terribly hard.

The problem is the following: while it can be made to work, it will be
inefficient. If we just make it work, the callers will never get to
know that they're doing things in a terribly inefficient way.

The real fix is to change the caller. BTW, I suggest the patch below
to fix one such caller.

        Stefan

--- orig/src/casefiddle.c
+++ mod/src/casefiddle.c
@@ -75,23 +76,18 @@
       return obj;
     }
 
-  if (STRINGP (obj))
+  if (!STRINGP (obj))
+    wrong_type_argument (Qchar_or_string_p, obj);
+  else if (STRING_UNIBYTE (obj))
     {
-      int multibyte = STRING_MULTIBYTE (obj);
-      int i, i_byte, len;
-      int size = SCHARS (obj);
+      EMACS_INT i;
+      EMACS_INT size = SCHARS (obj);
 
       obj = Fcopy_sequence (obj);
-      for (i = i_byte = 0; i < size; i++, i_byte += len)
+      for (i = 0; i < size; i++)
         {
-          if (multibyte)
-            c = STRING_CHAR_AND_LENGTH (SDATA (obj) + i_byte, 0, len);
-          else
-            {
-              c = SREF (obj, i_byte);
-              len = 1;
-              MAKE_CHAR_MULTIBYTE (c);
-            }
+          c = SREF (obj, i);
+          MAKE_CHAR_MULTIBYTE (c);
           c1 = c;
           if (inword && flag != CASE_CAPITALIZE_UP)
             c = DOWNCASE (c);
@@ -102,24 +98,51 @@
           inword = (SYNTAX (c) == Sword);
           if (c != c1)
             {
-              if (! multibyte)
-                {
-                  MAKE_CHAR_UNIBYTE (c);
-                  SSET (obj, i_byte, c);
-                }
-              else if (ASCII_CHAR_P (c1) && ASCII_CHAR_P (c))
-                SSET (obj, i_byte, c);
-              else
-                {
-                  Faset (obj, make_number (i), make_number (c));
-                  i_byte += CHAR_BYTES (c) - len;
-                }
+              MAKE_CHAR_UNIBYTE (c);
+              if (c < 0 || c > 255)
+                error ("Non-unibyte char in unibyte string");
+              SSET (obj, i, c);
             }
         }
       return obj;
     }
+  else
+    {
+      EMACS_INT i, i_byte, len;
+      EMACS_INT size = SCHARS (obj);
+      USE_SAFE_ALLOCA;
+      unsigned char *dst, *o;
+      /* Over-allocate by 12%: this is a minor overhead, but should be
+         sufficient in 99.999% of the cases to avoid a reallocation.  */
+      EMACS_INT o_size = SBYTES (obj) + SBYTES (obj) / 8 + MAX_MULTIBYTE_LENGTH;
+      SAFE_ALLOCA (dst, void *, o_size);
+      o = dst;
 
-  wrong_type_argument (Qchar_or_string_p, obj);
+      for (i = i_byte = 0; i < size; i++, i_byte += len)
+        {
+          if ((o - dst) + MAX_MULTIBYTE_LENGTH > o_size)
+            { /* Not enough space for the next char: grow the destination.  */
+              unsigned char *old_dst = dst;
+              o_size += o_size; /* Probably overkill, but extremely rare.  */
+              SAFE_ALLOCA (dst, void *, o_size);
+              bcopy (old_dst, dst, o - old_dst);
+              o = dst + (o - old_dst);
+            }
+          c = STRING_CHAR_AND_LENGTH (SDATA (obj) + i_byte, 0, len);
+          if (inword && flag != CASE_CAPITALIZE_UP)
+            c = DOWNCASE (c);
+          else if (!UPPERCASEP (c)
+                   && (!inword || flag != CASE_CAPITALIZE_UP))
+            c = UPCASE1 (c);
+          if ((int) flag >= (int) CASE_CAPITALIZE)
+            inword = (SYNTAX (c) == Sword);
+          o += CHAR_STRING (c, o);
+        }
+      eassert (o - dst <= o_size);
+      obj = make_multibyte_string (dst, size, o - dst);
+      SAFE_FREE ();
+      return obj;
+    }
 }
 
 DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
@@ -329,10 +352,10 @@
   return Qnil;
 }
 \f
-Lisp_Object
+static Lisp_Object
 operate_on_word (arg, newpoint)
      Lisp_Object arg;
-     int *newpoint;
+     EMACS_INT *newpoint;
 {
   Lisp_Object val;
   int farend;
@@ -358,7 +381,7 @@
      Lisp_Object arg;
 {
   Lisp_Object beg, end;
-  int newpoint;
+  EMACS_INT newpoint;
   XSETFASTINT (beg, PT);
   end = operate_on_word (arg, &newpoint);
   casify_region (CASE_UP, beg, end);
@@ -373,7 +396,7 @@
      Lisp_Object arg;
 {
   Lisp_Object beg, end;
-  int newpoint;
+  EMACS_INT newpoint;
   XSETFASTINT (beg, PT);
   end = operate_on_word (arg, &newpoint);
   casify_region (CASE_DOWN, beg, end);
@@ -390,7 +413,7 @@
      Lisp_Object arg;
 {
   Lisp_Object beg, end;
-  int newpoint;
+  EMACS_INT newpoint;
   XSETFASTINT (beg, PT);
   end = operate_on_word (arg, &newpoint);
   casify_region (CASE_CAPITALIZE, beg, end);

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-16 14:33 ` Stefan Monnier @ 2008-02-17 20:29 ` Richard Stallman 2008-02-18 1:15 ` Stefan Monnier 0 siblings, 1 reply; 43+ messages in thread From: Richard Stallman @ 2008-02-17 20:29 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel, handa > If we are going to inhibit aset on multibyte strings, I think we > should inhibit aset on any strings to avoid a further confusion. > I think someone should try making it work. > The way I suggested should not be terribly hard. The problem is the following: while it can be made to work, it will be inefficient. That inefficiency may or may not be important in any given context. Fixing it in casefiddle is definitely desirable. But is it worth breaking all such packages just so that they will optimize an operation that might not use much of the time anyway? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-17 20:29 ` Richard Stallman @ 2008-02-18 1:15 ` Stefan Monnier 2008-02-18 4:00 ` Kenichi Handa 2008-02-18 17:31 ` Richard Stallman 0 siblings, 2 replies; 43+ messages in thread From: Stefan Monnier @ 2008-02-18 1:15 UTC (permalink / raw) To: rms; +Cc: emacs-devel, handa >> If we are going to inhibit aset on multibyte strings, I think we >> should inhibit aset on any strings to avoid a further confusion. >> I think someone should try making it work. >> The way I suggested should not be terribly hard. > The problem is the following: while it can be made to work, it will be > inefficient. > That inefficiency may or may not be important in any given context. > Fixing it in casefiddle is definitely desirable. > But is it worth breaking all such packages just so that they > will optimize an operation that might not use much of the time anyway? Why work around the problem in `aset' if it isn't worth fixing in the original code? Especially since implicit conversion of a unibyte-string to multibyte is generally a bug in itself (since there are as many ways to do that as there are coding systems). Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-18 1:15 ` Stefan Monnier @ 2008-02-18 4:00 ` Kenichi Handa 2008-02-18 17:31 ` Richard Stallman 1 sibling, 0 replies; 43+ messages in thread From: Kenichi Handa @ 2008-02-18 4:00 UTC (permalink / raw) To: Stefan Monnier; +Cc: rms, emacs-devel In article <jwvskzrgj6d.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: > > That inefficiency may or may not be important in any given context. > > Fixing it in casefiddle is definitely desirable. > > But is it worth breaking all such packages just so that they > > will optimize an operation that might not use much of the time anyway? > Why work around the problem in `aset' if it isn't worth fixing in the > original code? But you wrote: > > Then, shouldn't we start the experiment of inhibiting aset > > on strings just now? > > But I do not think we're ready for that. Maybe 10 years from now... I want to avoid treating non-ASCII chars differently from ASCII. Then, the only solution is to make aset work well for multibyte characters. > Especially since implicit conversion of a unibyte-string > to multibyte is generally a bug in itself (since there are as many ways > to do that as there are coding systems). It's not a bug but I agree it's a very bad feature. But for the case (aset (string ?a) 0 MULTIBYTE-CHAR), I think it's better to treat "a" as neutral, or in your terminology "anybyte". --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-18 1:15 ` Stefan Monnier 2008-02-18 4:00 ` Kenichi Handa @ 2008-02-18 17:31 ` Richard Stallman 1 sibling, 0 replies; 43+ messages in thread From: Richard Stallman @ 2008-02-18 17:31 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel, handa Why work around the problem in `aset' if it isn't worth fixing in the original code? I don't understand you. I don't think I suggested "working around" the string `aset' problem. I suggested an easy way to fix the bug and make `aset' work in all cases. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 2:36 (aset UNIBYTE-STRING MULTIBYTE-CHAR) Kenichi Handa 2008-02-13 2:49 ` Stefan Monnier @ 2008-02-13 22:01 ` Richard Stallman 2008-02-13 23:13 ` Miles Bader 1 sibling, 1 reply; 43+ messages in thread From: Richard Stallman @ 2008-02-13 22:01 UTC (permalink / raw) To: Kenichi Handa; +Cc: emacs-devel In emacs-unicode-2 branch, there was a discussion about the rightness of aset changing the multibyteness of a string, and I changed the code to signal an error in the above case. But, I got reports claiming that the change breaks some of already existing Elisp packages. We should investigate what packages these are, what they are doing that uses this, and how hard it is to fix them. Based on that we can decide. It is ok to break a few things in a way that is easy to fix. But it is not terribly hard to make this case work once again, given that all strings are indirect. At worst, one can make a new string with the modified contents, then swap the `data' pointers between the new string and the old one. That would be better than breaking lots of packages. ^ permalink raw reply [flat|nested] 43+ messages in thread
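[Editor's sketch] The "make a new string, then swap the `data' pointers" idea above can be illustrated with a toy model. This is a minimal sketch, not Emacs code: `indirect_string` and `replace_contents` are hypothetical names, and the struct is not the layout of `struct Lisp_String`. It only shows why indirection lets the contents change size while every existing reference stays valid.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an indirect string header: callers hold a pointer to
   this struct, while the bytes live in a separately allocated buffer.
   Hypothetical; not the real Emacs string layout.  */
struct indirect_string
{
  size_t size;          /* length in characters */
  size_t size_byte;     /* length in bytes */
  unsigned char *data;  /* heap buffer, NUL-terminated */
};

/* Replace the contents of S with a copy of the NBYTES bytes at NEW_DATA,
   which encode NCHARS characters.  The header address never changes, so
   every existing reference to S sees the new contents.
   Return 0 on success, -1 if memory allocation failed.  */
int
replace_contents (struct indirect_string *s, const unsigned char *new_data,
                  size_t nchars, size_t nbytes)
{
  assert (s != NULL && new_data != NULL && nchars <= nbytes);

  unsigned char *fresh = malloc (nbytes + 1);
  if (!fresh)
    return -1;
  memcpy (fresh, new_data, nbytes);
  fresh[nbytes] = '\0';

  /* Swap in the new buffer, then release the old one.  */
  unsigned char *old = s->data;
  s->data = fresh;
  s->size = nchars;
  s->size_byte = nbytes;
  free (old);
  return 0;
}
```

Note that the copy is linear in the length of the string, which is the cost the O(1) objection elsewhere in the thread is about.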
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-02-13 22:01 ` Richard Stallman @ 2008-02-13 23:13 ` Miles Bader 0 siblings, 0 replies; 43+ messages in thread From: Miles Bader @ 2008-02-13 23:13 UTC (permalink / raw) To: rms; +Cc: Kenichi Handa, emacs-devel Richard Stallman <rms@gnu.org> writes: > But it is not terribly hard to make this case work once again, given > that all strings are indirect. At worst, one can make a new string > with the modified contents, then swap the `data' pointers between the > new string and the old one. As Stefan noted, though, the entire idea of using aset to store into a multibyte string is rather dodgy...I think there's a certain expectation by users of aset that the operation follows general array behavior, most notably O(1) complexity, and storing into a multibyte string doesn't follow that. If we can discourage this usage without too much fallout, that seems a lot better in the long run. -Miles -- Twice, adv. Once too often. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) @ 2008-04-15 7:11 Kenichi Handa 2008-04-15 15:52 ` Stefan Monnier 0 siblings, 1 reply; 43+ messages in thread From: Kenichi Handa @ 2008-04-15 7:11 UTC (permalink / raw) To: emacs-devel; +Cc: kazu The discussion on this problem has been suspended for long. I'd like to settle it. I wrote: > In article <jwvskzrgj6d.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: > > > That inefficiency may or may not be important in any given context. > > > Fixing it in casefiddle is definitely desirable. > > > But is it worth breaking all such packages just so that they > > > will optimize an operation that might not use much of the time anyway? > > Why work around the problem in `aset' if it isn't worth fixing in the > > original code? > But you wrote: > > > Then, shouldn't we start the experiment of inhibiting aset > > > on strings just now? > > > > But I do not think we're ready for that. Maybe 10 years from now... > I want to avoid treating non-ASCII chars different from > ASCII. Then, the only solution is to make aset work well > for multibyte characters. The attached simple change does the work. May I install it? --- Kenichi Handa handa@ni.aist.go.jp *** lisp.h.~1.617.~ 2008-04-01 15:12:13.000000000 +0900 --- lisp.h 2008-04-15 15:42:52.000000000 +0900 *************** *** 725,730 **** --- 725,737 ---- (STR) = empty_unibyte_string; \ else XSTRING (STR)->size_byte = -1; } while (0) + /* Mark STR as a multibyte string. Assure that STR contains only + ASCII characters in advance. */ + #define STRING_SET_MULTIBYTE(STR) \ + do { if (EQ (STR, empty_unibyte_string)) \ + (STR) = empty_multibyte_string; \ + else XSTRING (STR)->size_byte = XSTRING (STR)->size; } while (0) + /* Get text properties. 
*/ #define STRING_INTERVALS(STR) (XSTRING (STR)->intervals + 0) *** data.c.~1.290.~ 2008-03-27 20:16:37.000000000 +0900 --- data.c 2008-04-15 15:42:31.000000000 +0900 *************** *** 2093,2099 **** CHECK_NUMBER (newelt); if (XINT (newelt) >= 0 && ! SINGLE_BYTE_CHAR_P (XINT (newelt))) ! args_out_of_range (array, newelt); SSET (array, idxval, XINT (newelt)); } --- 2093,2109 ---- CHECK_NUMBER (newelt); if (XINT (newelt) >= 0 && ! SINGLE_BYTE_CHAR_P (XINT (newelt))) ! { ! int i; ! ! for (i = SBYTES (array) - 1; i >= 0; i--) ! if (SREF (array, i) >= 0x80) ! args_out_of_range (array, newelt); ! /* ARRAY is an ASCII string. Convert it to a multibyte ! string, and try `aset' again. */ ! STRING_SET_MULTIBYTE (array); ! return Faset (array, idx, newelt); ! } SSET (array, idxval, XINT (newelt)); } ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-04-15 7:11 Kenichi Handa @ 2008-04-15 15:52 ` Stefan Monnier 2008-04-17 1:13 ` Kenichi Handa 0 siblings, 1 reply; 43+ messages in thread From: Stefan Monnier @ 2008-04-15 15:52 UTC (permalink / raw) To: Kenichi Handa; +Cc: kazu, emacs-devel >> > > That inefficiency may or may not be important in any given context. >> > > Fixing it in casefiddle is definitely desirable. >> > > But is it worth breaking all such packages just so that they >> > > will optimize an operation that might not use much of the time anyway? >> > Why work around the problem in `aset' if it isn't worth fixing in the >> > original code? >> But you wrote: >> > > Then, shouldn't we start the experiment of inhibiting aset >> > > on strings just now? >> > >> > But I do not think we're ready for that. Maybe 10 years from now... >> I want to avoid treating non-ASCII chars different from >> ASCII. Then, the only solution is to make aset work well >> for multibyte characters. > The attached simple change does the work. May I install it? I guess it's OK. It's pretty ugly in terms of code, but in terms of behavior it more or less matches the behavior of what I use (where I distinguish between unibyte/anybyte/multibyte), Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-04-15 15:52 ` Stefan Monnier @ 2008-04-17 1:13 ` Kenichi Handa 0 siblings, 0 replies; 43+ messages in thread From: Kenichi Handa @ 2008-04-17 1:13 UTC (permalink / raw) To: Stefan Monnier; +Cc: kazu, emacs-devel In article <jwvr6d7ce37.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes: >>> I want to avoid treating non-ASCII chars different from >>> ASCII. Then, the only solution is to make aset work well >>> for multibyte characters. > > The attached simple change does the work. May I install it? > I guess it's OK. It's pretty ugly in terms of code, but in terms of > behavior it more or less matches the behavior of what I use (where > I distinguish between unibyte/anybyte/multibyte), Ok, I've just installed it. --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) @ 2008-05-07 19:31 Harald Hanche-Olsen 2008-05-14 6:54 ` Harald Hanche-Olsen 0 siblings, 1 reply; 43+ messages in thread From: Harald Hanche-Olsen @ 2008-05-07 19:31 UTC (permalink / raw) To: emacs-devel; +Cc: eliz This works as it should in the latest CVS: (setq foo (make-string 4 ?a)) (aset foo 1 ?€) ; <= that's a euro sign But this fails: (setq foo (make-string 4 ?a)) (aset foo 1 ?å) (aset foo 1 ?€) ; => Error: args out of range The problem seems to lie in these lines (2095-2107) from data.c: if (XINT (newelt) >= 0 && ! SINGLE_BYTE_CHAR_P (XINT (newelt))) { int i; for (i = SBYTES (array) - 1; i >= 0; i--) if (SREF (array, i) >= 0x80) args_out_of_range (array, newelt); /* ARRAY is an ASCII string. Convert it to a multibyte string, and try `aset' again. */ STRING_SET_MULTIBYTE (array); return Faset (array, idx, newelt); } SSET (array, idxval, XINT (newelt)); I am sure the test for members >= 0x80 is there for a good reason, but it clearly screws up this case and makes the fix rather less useful than it should have been. I don't know emacs internals well enough to suggest a fix. And yes, this did bite in real life: It caused mew to choke on a malformed spam email. No disaster obviously, but inconvenient. - Harald PS. My apologies for messing up threading; I wasn't on the list when the message I am responding to was posted on 2008-04-15, so I don't know its message-id. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-05-07 19:31 Harald Hanche-Olsen @ 2008-05-14 6:54 ` Harald Hanche-Olsen 2008-05-14 12:22 ` Stefan Monnier 0 siblings, 1 reply; 43+ messages in thread From: Harald Hanche-Olsen @ 2008-05-14 6:54 UTC (permalink / raw) To: emacs-devel My message on this topic of a week ago elicited no responses, so I did a little more research on my own (which I should have done in the first place, maybe). This time I hope to see some discussion: + Harald Hanche-Olsen <hanche@math.ntnu.no>: > This works as it should in the latest CVS: > > (setq foo (make-string 4 ?a)) > (aset foo 1 ?€) ; <= that's a euro sign > > But this fails: > > (setq foo (make-string 4 ?a)) > (aset foo 1 ?å) > (aset foo 1 ?€) ; => Error: args out of range I went back in the mail archives and read the whole thread (it was in February and April this year), and I realize that the whole idea of changing a unibyte string into a multibyte one on the fly in order to support aset on them is somewhat controversial. Be that as it may, the above example shows that the fix put in by Kenichi Handa does not fix it right. Moreover, it is clear from the commit message that he was well aware of this limitation at the time: Working file: data.c revision 1.291 date: 2008-04-17 03:10:58 +0200; author: handa; state: Exp; lines: +11 -1; commitid: yW6gyKxwbZ4EPoZs; (Faset): Allow setting a multibyte character in an ASCII-only unibyte string. It seems to me that in order to get it right, one has to reallocate the data in the case of a non-ASCII-only unibyte string, using code like what is already there for the case when aset replaces an ASCII character with a non-ASCII one (which will increase the byte count of the string). The end result will be ugly and inefficient, but I see no other way if we are going to lay this one to rest. Comments? - Harald ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-05-14 6:54 ` Harald Hanche-Olsen @ 2008-05-14 12:22 ` Stefan Monnier 2008-05-14 12:50 ` Harald Hanche-Olsen 0 siblings, 1 reply; 43+ messages in thread From: Stefan Monnier @ 2008-05-14 12:22 UTC (permalink / raw) To: Harald Hanche-Olsen; +Cc: emacs-devel > My message on this topic of a week ago elicited no responses, so I did > a little more research on my own (which I should have done in the > first place, maybe). This time I hope to see some discussion: > + Harald Hanche-Olsen <hanche@math.ntnu.no>: >> This works as it should in the latest CVS: >> >> (setq foo (make-string 4 ?a)) >> (aset foo 1 ?€) ; <= that's a euro sign >> >> But this fails: >> >> (setq foo (make-string 4 ?a)) >> (aset foo 1 ?å) >> (aset foo 1 ?€) ; => Error: args out of range Show us the real code that bumped into the problem and I'll tell you how to do it so as to avoid the risk of such problems. Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-05-14 12:22 ` Stefan Monnier @ 2008-05-14 12:50 ` Harald Hanche-Olsen 2008-05-15 1:18 ` Stefan Monnier 0 siblings, 1 reply; 43+ messages in thread From: Harald Hanche-Olsen @ 2008-05-14 12:50 UTC (permalink / raw) To: monnier; +Cc: emacs-devel + Stefan Monnier <monnier@IRO.UMontreal.CA>: > > + Harald Hanche-Olsen <hanche@math.ntnu.no>: > > >> This works as it should in the latest CVS: > >> > >> (setq foo (make-string 4 ?a)) > >> (aset foo 1 ?€) ; <= that's a euro sign > >> > >> But this fails: > >> > >> (setq foo (make-string 4 ?a)) > >> (aset foo 1 ?å) > >> (aset foo 1 ?€) ; => Error: args out of range > > Show us the real code that bumped into the problem and I'll tell you > how to do it so as to avoid the risk of such problems. You'd have to tell the author of mew (http://mew.org/), Kazu Yamamoto. Actually, I have a one line patch to mew that fixes the problem, but he seems unwilling to apply it. Now don't get me wrong: I am not asking for a change in emacs to fix a problem in mew. I am suggesting a change in emacs for the sake of robustness: I think that if the problem of inserting multibyte characters in unibyte strings is worth fixing at all, it is worth fixing so it works in all cases. Otherwise, why bother? I do understand the arguments against fixing it, but the current situation where it will often work, but fail sometimes does not seem good to me. But at least, it's documented, I see that now: 4.4 Modifying Strings ===================== The most basic way to alter the contents of an existing string is with `aset' (*note Array Functions::). `(aset STRING IDX CHAR)' stores CHAR into STRING at index IDX. Each character occupies one or more bytes, and if CHAR needs a different number of bytes from the character already present at that index, `aset' signals an error. That last bit actually seems to be outdated: An error is not ALWAYS signaled in the indicated situation, only sometimes. 
Anyway, the code you're asking for (in case you're really curious): In mew-header.el (defun mew-addrstr-parse-syntax-list (str sep addrp &optional depth allow-spc) (when str (let* ((i 0) (len (length str)) (par-cnt 0) (tmp-cnt 0) (sep-cnt 0) (tmp (mew-make-string len)) c ret prevc) (catch 'max (while (< i len) (setq c (aref str i)) ; <= problem occurs here ... deleted ...))))) My one-line fix consists of changing the definition (elsewhere) (defun mew-make-string (len) (make-string len ?a)) into one that makes a multibyte string at the outset. (I like mew (a lot), so I am willing to put up with its various idiosyncrasies (and there are some).) - Harald ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-05-14 12:50 ` Harald Hanche-Olsen @ 2008-05-15 1:18 ` Stefan Monnier 2008-05-15 6:11 ` Harald Hanche-Olsen 0 siblings, 1 reply; 43+ messages in thread From: Stefan Monnier @ 2008-05-15 1:18 UTC (permalink / raw) To: Harald Hanche-Olsen; +Cc: emacs-devel > Now don't get me wrong: I am not asking for a change in Emacs to fix > a problem in Mew. I am suggesting a change in Emacs for the sake of > robustness: I think that if the problem of inserting multibyte > characters in unibyte strings is worth fixing at all, it is worth > fixing so it works in all cases. Otherwise, why bother? I do > understand the arguments against fixing it, but the current situation > where it will often work, but fail sometimes does not seem good to me. I don't claim that Mew does things wrong. I just want to see more examples to better understand the context and try to figure out what's the right way to fix the problem. Notice that in your example, (setq foo (make-string 4 ?a)) (aset foo 1 ?å) (aset foo 1 ?€) ; => Error: args out of range the problem comes from the fact that now that we use Unicode, ?å = 229. So this integer is also the code of a byte, which is why the first aset succeeds. Maybe the better answer is for `make-string' to always create multibyte strings, just like `string' now does. In any case if you stay far away from `aset on strings' your life will be generally better, the birds will sing and the sun will shine. > The most basic way to alter the contents of an existing string is with > `aset' (*note Array Functions::). `(aset STRING IDX CHAR)' stores CHAR > into STRING at index IDX. Each character occupies one or more bytes, > and if CHAR needs a different number of bytes from the character > already present at that index, `aset' signals an error. > That last bit actually seems to be outdated: An error is not ALWAYS > signaled in the indicated situation, only sometimes. 
I hope the text is correct, if not, please report it as a bug. > (defun mew-addrstr-parse-syntax-list (str sep addrp &optional depth allow-spc) > (when str > (let* ((i 0) (len (length str)) > (par-cnt 0) (tmp-cnt 0) (sep-cnt 0) > (tmp (mew-make-string len)) > c ret prevc) > (catch 'max > (while (< i len) > (setq c (aref str i)) ; <= problem occurs here > ... deleted ...))))) Hmm... I don't see any `aset'. Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
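[Editor's sketch] The explanation above (229 is also a valid byte value, so the first aset stores a raw byte and the string stays unibyte) can be reproduced with a toy model of the unibyte branch of the patched `Faset`. This is a sketch under assumptions: `toy_string`, `toy_aset_unibyte` and the result enum are hypothetical names, the single-byte test is taken to be "code < 0x100" as the explanation implies, and the model only decides what happens without performing the multibyte store that the real code does on its second pass.

```c
#include <assert.h>
#include <stddef.h>

/* Toy string: size_byte < 0 marks a unibyte string, mirroring the
   size_byte = -1 convention visible in the lisp.h hunk of the patch.
   Hypothetical model with a fixed capacity.  */
struct toy_string
{
  int size;                /* number of characters */
  int size_byte;           /* -1 for unibyte, else number of bytes */
  unsigned char data[16];
};

enum toy_result
{
  TOY_STORED_BYTE,         /* stored as a raw byte, still unibyte */
  TOY_BECAME_MULTIBYTE,    /* ASCII-only string flipped to multibyte */
  TOY_ARGS_OUT_OF_RANGE    /* the error reported in the thread */
};

/* Model of the unibyte branch: store character code C at index IDX.  */
enum toy_result
toy_aset_unibyte (struct toy_string *s, int idx, int c)
{
  assert (s->size_byte < 0 && idx >= 0 && idx < s->size && c >= 0);

  if (c < 0x100)
    {
      /* Fits in one byte: no conversion is attempted.  */
      s->data[idx] = (unsigned char) c;
      return TOY_STORED_BYTE;
    }

  /* A larger code is accepted only if every existing byte is ASCII.  */
  for (int i = s->size - 1; i >= 0; i--)
    if (s->data[i] >= 0x80)
      return TOY_ARGS_OUT_OF_RANGE;

  /* ASCII-only: mark as multibyte; the real code then retries aset.  */
  s->size_byte = s->size;
  return TOY_BECAME_MULTIBYTE;
}
```

Storing 0x20AC into a fresh "aaaa" yields TOY_BECAME_MULTIBYTE, whereas storing 0xE5 first yields TOY_STORED_BYTE and leaves a byte >= 0x80 behind, so a following 0x20AC yields TOY_ARGS_OUT_OF_RANGE, matching the reported sequence.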
* Re: (aset UNIBYTE-STRING MULTIBYTE-CHAR) 2008-05-15 1:18 ` Stefan Monnier @ 2008-05-15 6:11 ` Harald Hanche-Olsen 0 siblings, 0 replies; 43+ messages in thread From: Harald Hanche-Olsen @ 2008-05-15 6:11 UTC (permalink / raw) To: monnier; +Cc: emacs-devel + Stefan Monnier <monnier@iro.umontreal.ca>: > I just want to see more > examples to better understand the context and try to figure out what's > the right way to fix the problem. Notice that in your example, > > (setq foo (make-string 4 ?a)) > (aset foo 1 ?å) > (aset foo 1 ?€) ; => Error: args out of range > > the problem comes from the fact that now that we use Unicode, ?å = 229. > So this integer is also the code of a byte, which is why the first aset > succeeds. Right. Or perhaps more accurately, it is why the first aset succeeds without automagically converting foo to a multibyte string. > Maybe the better answer is for `make-string' to always create > multibyte strings, just like `string' now does. Hmm. Except it doesn't, quite: (multibyte-string-p (string ?a ?b ?c ?d)) => nil (multibyte-string-p (string ?a ?b ?c ?å)) => t It seems to be the presence of non-ASCII that triggers the creation of a multibyte string, even though in this case a unibyte string could also hold the result. In fact, the current behaviours of string and make-string are quite similar: (multibyte-string-p (make-string 3 ?a)) => nil (multibyte-string-p (make-string 3 ?å)) => t > In any case if you stay far away from `aset on strings' your life will > be generally better, the birds will sing and the sun will shine. 8) I am willing to believe that. > > The most basic way to alter the contents of an existing string is with > > `aset' (*note Array Functions::). `(aset STRING IDX CHAR)' stores CHAR > > into STRING at index IDX. Each character occupies one or more bytes, > > and if CHAR needs a different number of bytes from the character > > already present at that index, `aset' signals an error. 
> > > That last bit actually seems to be outdated: An error is not ALWAYS > > signaled in the indicated situation, only sometimes. > > I hope the text is correct, if not, please report it as a bug. Okay. I'll run it past you here first, though, since my understanding of multibyte strings is still patchy. This succeeds and returns "€a€": (let ((str (make-string 3 ?€))) (aset str 1 ?a) str) If I am not mistaken ?€ needs two bytes (or more?) while ?a needs one, right? And since two (or more) is different from one, the above text claims that aset signals an error? Or is my understanding wrong? There is code in aset to shuffle the contents of a multibyte string around in case of a size mismatch, however: if (prev_bytes != new_bytes) { /* We must relocate the string data. */ > > (defun mew-addrstr-parse-syntax-list (str sep addrp &optional depth allow-spc) > > (when str > > (let* ((i 0) (len (length str)) > > (par-cnt 0) (tmp-cnt 0) (sep-cnt 0) > > (tmp (mew-make-string len)) > > c ret prevc) > > (catch 'max > > (while (< i len) > > (setq c (aref str i)) ; <= problem occurs here > > ... deleted ...))))) > > Hmm... I don't see any `aset'. Rats. Not enough caffeine, too much work. The deleted code is a big (cond ...), about 80 lines long, that I didn't want to burden the list with (it performs parsing after all). I assure you that it contains (aset tmp tmp-cnt c) in multiple places. It could have achieved the same result by consing up a list of the characters and using (string (nreverse char-list)), or perhaps by appending chars to a temporary buffer, but it didn't. - Harald ^ permalink raw reply [flat|nested] 43+ messages in thread
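[Editor's sketch] On the "two bytes (or more?)" question: assuming the internal multibyte encoding agrees with UTF-8 for these characters, the byte counts follow from the standard UTF-8 length rule, and the `prev_bytes != new_bytes` test corresponds to a nonzero shift of the bytes after the replaced character. The helper names `utf8_length` and `tail_shift` are hypothetical and do not exist in Emacs.

```c
#include <assert.h>

/* Number of bytes UTF-8 uses for Unicode code point C (0 .. 0x10FFFF).  */
int
utf8_length (unsigned c)
{
  assert (c <= 0x10FFFF);
  if (c < 0x80)
    return 1;
  if (c < 0x800)
    return 2;
  if (c < 0x10000)
    return 3;
  return 4;
}

/* How far the bytes following the replaced character must move when
   OLD_C is overwritten by NEW_C.  Zero means the store can happen in
   place; anything else means the string data must be relocated.  */
int
tail_shift (unsigned old_c, unsigned new_c)
{
  return utf8_length (new_c) - utf8_length (old_c);
}
```

Under that assumption ?a (0x61) takes one byte, ?å (0xE5) two and ?€ (0x20AC) three, so replacing ?€ with ?a shortens the data by two bytes and the tail has to move. That move is the part of a multibyte aset that is not O(1).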