* Buffer-local variables affect general-purpose functions @ 2014-03-26 19:04 Eli Zaretskii 2014-03-26 19:32 ` Paul Eggert 2014-03-27 14:17 ` Stefan Monnier 0 siblings, 2 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-26 19:04 UTC (permalink / raw) To: emacs-devel (See bug#17011 for some context.) In some cases, Emacs uses buffer-local variables in ways that affect operations which might not have anything to do with buffer text. One example, from bug#17011, is this: M-x find-file-literally RET some-file RET M-x set-variable RET case-fold-search RET t RET M-: (char-equal ?à ?À) RET This produces nil, although the characters should compare equal under case-fold-search. Why? Because we are in a unibyte buffer, where values between 128 and 255 are interpreted as eight-bit raw bytes, not as Latin characters, and raw bytes don't have lower/upper-case pairs. Another example, from the same sequence of commands above, is the fact that setting case-fold-search for the buffer affects comparison of characters that don't belong to the buffer, merely because that buffer happens to be current at the moment of comparison. Yet another example is the 'downcase' and 'upcase' functions -- they use case tables local to the current buffer, even when they are applied to characters and strings not from the buffer. This could produce subtle bugs, and is certainly confusing and unexpected, at least to some. The question is: do we want to do something about that? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-26 19:04 Buffer-local variables affect general-purpose functions Eli Zaretskii @ 2014-03-26 19:32 ` Paul Eggert 2014-03-26 20:03 ` Eli Zaretskii 2014-03-27 14:17 ` Stefan Monnier 1 sibling, 1 reply; 103+ messages in thread From: Paul Eggert @ 2014-03-26 19:32 UTC (permalink / raw) To: Eli Zaretskii, emacs-devel Eli Zaretskii wrote: > do we want to do something about that? Yes, and we should start by removing the backwards-compatibility hacks in question. Whether the current buffer is unibyte should not affect the behavior of general-purpose functions on characters. Elisp code that blindly extracts bytes from unibyte buffers or strings, and treats these bytes as characters, is broken anyway. It needs to be fixed to convert bytes to characters (using 'unibyte-char-to-multibyte', say) before it gives them to general-purpose character functions like 'downcase' and 'char-equal'. Years ago, when these backwards-compatibility hacks were put in, it made sense to have them, because unibyte non-ASCII locales were widespread and converting code to multibyte was a hassle. But nowadays the vast majority of non-ASCII usage is multibyte and these hacks cause more trouble than they're worth -- not just core dumps such as Bug#17011, but subtle behavioral problems not easily diagnosed. It's time for the hacks to go. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-26 19:32 ` Paul Eggert @ 2014-03-26 20:03 ` Eli Zaretskii 2014-03-26 21:50 ` Paul Eggert 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-26 20:03 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Wed, 26 Mar 2014 12:32:05 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > > Eli Zaretskii wrote: > > do we want to do something about that? > > Yes, and we should start by removing the backwards-compatibility hacks > in question. Whether the current buffer is unibyte should not affect > the behavior of general-purpose functions on characters. Well, the change in behavior is not limited to unibyte buffers, as I said in my original post. I think the problem is wider. > Elisp code that blindly extracts bytes from unibyte buffers or strings, > and treats these bytes as characters, is broken anyway. It needs to be > fixed to convert bytes to characters (using 'unibyte-char-to-multibyte', > say) before it gives them to general-purpose character functions like > 'downcase' and 'char-equal'. But there should still be a way to compare bytes and strings of bytes in a unibyte buffer, right? So perhaps we should have special functions just for that purpose, and char-equal should signal an error when presented with unibyte non-ASCII values. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-26 20:03 ` Eli Zaretskii @ 2014-03-26 21:50 ` Paul Eggert 2014-03-27 17:42 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Paul Eggert @ 2014-03-26 21:50 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii wrote: > I think the problem is wider. Yes, it is. > But there should still be a way to compare bytes and strings of bytes > in a unibyte buffer, right? Byte-strings vs character-strings shouldn't be a problem, as the string itself tells you whether it's multibyte. The problem is bytes vs characters, as both are modeled as small integers. > So perhaps we should have special > functions just for that purpose, and char-equal should signal an error > when presented with unibyte non-ASCII values. Sorry, I don't follow. How could char-equal know whether 224 is a raw byte or the Latin-1 character 'à'? It'd have to know that, to signal an error in the former case. ^ permalink raw reply [flat|nested] 103+ messages in thread
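The distinction Paul draws here can be checked from Lisp. The following is a minimal sketch, assuming an Emacs recent enough to provide `unibyte-string` (Emacs 23 or later); it is an illustration, not code from the thread. A string records whether it is multibyte, whereas a lone byte and a lone character are both plain integers.

```elisp
;; A string carries its own unibyte/multibyte flag...
(multibyte-string-p (unibyte-string 224))   ; nil: a string holding one raw byte
(multibyte-string-p (string #xE0))          ; non-nil: the one-character string "à"

;; ...but a single byte and a single character are the same kind of object.
(integerp ?a)                               ; t: a character is just an integer
(eq 224 #xE0)                               ; t: nothing marks 224 as "byte" or "char"
```

This is why a function such as char-equal cannot tell from its arguments alone which interpretation the caller intends.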
* Re: Buffer-local variables affect general-purpose functions 2014-03-26 21:50 ` Paul Eggert @ 2014-03-27 17:42 ` Eli Zaretskii 2014-03-27 18:55 ` Paul Eggert 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-27 17:42 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Wed, 26 Mar 2014 14:50:52 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > How could char-equal know whether 224 is a raw byte or the Latin-1 > character 'à'? The same way it "knows" today. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-27 17:42 ` Eli Zaretskii @ 2014-03-27 18:55 ` Paul Eggert 0 siblings, 0 replies; 103+ messages in thread From: Paul Eggert @ 2014-03-27 18:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On 03/27/2014 10:42 AM, Eli Zaretskii wrote: >> How could char-equal know whether 224 is a raw byte or the Latin-1 >> character 'à'? > The same way it "knows" today. So (char-equal ?x ?à) would signal an error in a unibyte buffer (because ?à < 256), and (char-equal ?x ?α) would return nil (because 255 < ?α)? That doesn't sound right, but most likely I'm misunderstanding the proposal. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-26 19:04 Buffer-local variables affect general-purpose functions Eli Zaretskii 2014-03-26 19:32 ` Paul Eggert @ 2014-03-27 14:17 ` Stefan Monnier 2014-03-27 17:17 ` Eli Zaretskii 1 sibling, 1 reply; 103+ messages in thread From: Stefan Monnier @ 2014-03-27 14:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > M-x find-file-literally RET some-file RET > M-x set-variable RET case-fold-search RET t RET > M-: (chars-equal ?à ?À) RET > This produces nil, although the characters should compare equal under > case-fold-search. Why? Because we are in a unibyte buffer, where > values between 128 and 255 are interpreted as eight-bit raw bytes, not > as Latin characters, and raw bytes don't have lower/upper-case pairs. I agree with Paul on this one: this should be fixed to disregard unibyte setting. `char-equal' compares chars, not bytes (use `eq' for bytes). It's an old backward compatibility hack that should go. > Another example, from the same sequence of commands above, is the fact > that setting case-fold-search for the buffer affects comparison of > characters that don't belong to the buffer, merely because that buffer > happens to be current at the moment of comparison. IIUC this is the kind of problem you really want to talk about in this thread, and yes, it's a problem. Usually case-fold-search is let-bound rather than set buffer-locally, but we have similar problems with syntax-tables, case-tables, etc... > The question is: do we want to do something about that? Not sure. It's hard to find all occurrences of this problem. And I don't think we can find a "general" solution: each case might be best solved in a different way. Furthermore the right solution will sometimes (often?) be to throw away the current functionality and replace it with something different. But we can definitely try to solve it on a case-by-case basis. 
> Yet another example is 'downcase' and 'upcase' functions -- they use > case tables local to the current buffer, even when the functions they > are applied to characters and strings not from the buffer. The solution here is simple: throw away buffer-local case-tables. AFAICT, set-case-table is used at only one place: in with-case-table. % grep set-case-table **/*.el emacs-lisp/cl-lib.el:;; (gv-define-simple-setter current-case-table set-case-table) subr.el: (progn (set-case-table ,table) subr.el: (set-case-table ,old-case-table)))))) So the only use of set-case-table is in with-case-table. % grep with-case-table **/*.el emacs-lisp/lisp-mode.el: "eval-and-compile" "eval-when-compile" "with-case-table" leim/quail/sisheng.el: (with-case-table (standard-case-table) mail/smtpmail.el: (with-case-table ascii-case-table ;Why? subr.el:(defmacro with-case-table (table &rest body) And the only uses of with-case-table are in lisp/leim/quail/sisheng.el (where it sets the standard case table, so it should have no effect) and in lisp/mail/smtpmail.el (where it uses ascii-case-table but should only apply it to ASCII text, so it could just as well use the standard case table). And then we can use the Unicode 'case tables' as recently discussed. Patch for that welcome on trunk. > This could produce subtle bugs, and is certainly confusing and > unexpected, at least by some. Agreed. Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
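Two of the idioms Stefan mentions can be sketched as follows. This is a hedged illustration only, using ASCII text so that the outcome does not depend on the unibyte/multibyte questions debated in this thread: let-binding case-fold-search instead of setting it buffer-locally, and using with-case-table (the subr.el macro found by the grep above) to install a case table temporarily.

```elisp
;; Let-binding keeps the setting local to this piece of code,
;; whatever the current buffer's value happens to be.
(let ((case-fold-search nil))
  (char-equal ?a ?A))                 ; nil: case is significant here

;; `with-case-table' installs TABLE for the duration of BODY
;; and restores the previous case table afterwards.
(with-case-table (standard-case-table)
  (downcase "HTML"))                  ; "html"
```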
* Re: Buffer-local variables affect general-purpose functions 2014-03-27 14:17 ` Stefan Monnier @ 2014-03-27 17:17 ` Eli Zaretskii 2014-03-27 21:04 ` Stefan Monnier 2014-03-28 3:38 ` Stephen J. Turnbull 0 siblings, 2 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-27 17:17 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@IRO.UMontreal.CA> > Cc: emacs-devel@gnu.org > Date: Thu, 27 Mar 2014 10:17:23 -0400 > > > M-x find-file-literally RET some-file RET > > M-x set-variable RET case-fold-search RET t RET > > M-: (chars-equal ?à ?À) RET > > > This produces nil, although the characters should compare equal under > > case-fold-search. Why? Because we are in a unibyte buffer, where > > values between 128 and 255 are interpreted as eight-bit raw bytes, not > > as Latin characters, and raw bytes don't have lower/upper-case pairs. > > I agree with Paul on this one: this should be fixed to disregard > unibyte setting. `char-equal' compares chars, not bytes (use `eq' > for bytes). > It's an old backward compatibility hack that should go. Paul seemed to say something more broad: that _all_ behaviors specific to unibyte buffers should go away. Do you agree? Anyway, what should replace those hacks? Arbitrarily interpreting raw bytes as Latin characters is not TRT, IMO. Actually, in the above case, we could simply make char-equal disregard case-fold-search in unibyte buffers -- that would give you and Paul what you want, but also keep backward compatibility (except for ASCII characters). > > The question is: do we want to do something about that? > > Not sure. It's hard to find all occurrences of this problem. > And I don't think we can find a "general" solution: each case might be > best solved in a different way. Furthermore the right solution will > sometimes (often?) be to throw away the current functionality and > replace it with something different. 
Maybe so, but something like (with-buffer-defaults BODY) might be the solution, and should be easy enough to implement. Or maybe some other way of telling primitives: don't apply buffer-specific behavior to this code. > % grep with-case-table **/*.el > emacs-lisp/lisp-mode.el: "eval-and-compile" "eval-when-compile" "with-case-table" > leim/quail/sisheng.el: (with-case-table (standard-case-table) > mail/smtpmail.el: (with-case-table ascii-case-table ;Why? > subr.el:(defmacro with-case-table (table &rest body) > > And the only uses of with-case-table are in lisp/leim/quail/sisheng.el > (where it sets the standard case table, so it should have no effect) and > in lisp/mail/smtpmail.el (where it uses ascii-case-table but should only > apply it to ASCII text, so it could just as well use the standard case > table). > > And then we can use the Unicode 'case tables' as recently discussed. > Patch for that welcome on trunk. OK. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-27 17:17 ` Eli Zaretskii @ 2014-03-27 21:04 ` Stefan Monnier 2014-03-28 7:11 ` Eli Zaretskii 2014-03-28 3:38 ` Stephen J. Turnbull 1 sibling, 1 reply; 103+ messages in thread From: Stefan Monnier @ 2014-03-27 21:04 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > Paul seemed to say something more broad: that _all_ behaviors specific > to unibyte buffers should go away. Do you agree? Too broad to answer. I think this needs to be decided on a case-by-case basis. > Anyway, what should replace those hacks? Arbitrarily interpreting raw > bytes as Latin characters is not TRT, IMO. I think it is: char-equal compares *chars*, not *bytes*. IOW it's a bug to pass bytes to it. > Maybe so, but something like > (with-buffer-defaults BODY) > might be the solution, and should be easy enough to implement. > Or maybe some other way of telling primitives: don't apply > buffer-specific behavior to this code. That might be a valid option, but in any case it's incompatible and the incompatibility will have different consequences for different uses, so we're back to "case-by-case basis". Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-27 21:04 ` Stefan Monnier @ 2014-03-28 7:11 ` Eli Zaretskii 2014-03-28 7:46 ` Paul Eggert 2014-03-28 14:12 ` Buffer-local variables affect general-purpose functions Stefan Monnier 0 siblings, 2 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-28 7:11 UTC (permalink / raw) To: Stefan Monnier; +Cc: emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: emacs-devel@gnu.org > Date: Thu, 27 Mar 2014 17:04:45 -0400 > > > Anyway, what should replace those hacks? Arbitrarily interpreting raw > > bytes as Latin characters is not TRT, IMO. > > I think it is: char-equal compares *chars*, not *bytes*. IOW it's a bug > to pass bytes to it. How to compare bytes, then? Anyway, we don't have a way of distinguishing between characters and bytes, unless we look on something besides the arguments themselves. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-28 7:11 ` Eli Zaretskii @ 2014-03-28 7:46 ` Paul Eggert 2014-03-28 8:18 ` Unibyte characters, strings and buffers Eli Zaretskii 2014-03-28 14:12 ` Buffer-local variables affect general-purpose functions Stefan Monnier 1 sibling, 1 reply; 103+ messages in thread From: Paul Eggert @ 2014-03-28 7:46 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii wrote: > How to compare bytes, then? It depends on what kind of comparison one wants. Simplest is to use '='. To ignore case and treat bytes 128-255 as Latin-1 characters, use 'downcase' first. To ignore case and treat bytes 128-255 as uninterpreted bit patterns, use 'unibyte-char-to-multibyte' before downcasing. Etc. > we don't have a way of distinguishing between characters and > bytes, unless we look on something besides the arguments themselves. Yes, that's right. ^ permalink raw reply [flat|nested] 103+ messages in thread
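The three comparisons Paul lists can be written out as below. This is only a sketch of his description; the variable names b1 and b2 are made up for illustration. No results are shown for the second and third forms because, as this thread discusses, the outcome of 'downcase' can depend on the current buffer and its case table.

```elisp
(let ((b1 192) (b2 224))      ; the Latin-1 byte values of À and à
  (list
   ;; 1. Exact comparison of the two byte values (nil here: 192 is not 224).
   (= b1 b2)
   ;; 2. Ignore case, treating 128-255 as Latin-1 characters.
   (= (downcase b1) (downcase b2))
   ;; 3. Ignore case, treating 128-255 as uninterpreted raw bytes:
   ;;    convert each byte to its raw-byte character before downcasing.
   (= (downcase (unibyte-char-to-multibyte b1))
      (downcase (unibyte-char-to-multibyte b2)))))
```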
* Re: Unibyte characters, strings and buffers 2014-03-28 7:46 ` Paul Eggert @ 2014-03-28 8:18 ` Eli Zaretskii 2014-03-28 18:42 ` Paul Eggert 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-28 8:18 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel (I retitled the subject, because the unibyte issue is sufficiently different from what I originally raised.) > Date: Fri, 28 Mar 2014 00:46:01 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > Eli Zaretskii wrote: > > How to compare bytes, then? > > It depends on what kind of comparison one wants. Simplest is to use > '='. To ignore case and treat bytes 128-255 as Latin-1 characters, use > 'downcase' first. To ignore case and treat bytes 128-255 as > uninterpreted bit patterns, use 'unibyte-char-to-multibyte' before > downcasing. Etc. > > > we don't have a way of distinguishing between characters and > > bytes, unless we look on something besides the arguments themselves. > > Yes, that's right. Which is why your suggestions above will not necessarily DTRT. Arbitrary interpretation of bytes 128-255 as Latin-1 is not guaranteed to be correct, and therefore 'downcase' will sometimes produce unexpected results, unless we can make sure, somehow, that raw bytes will never be exposed to Lisp as having these values. Unless you show a practical way towards the latter goal, what you suggest will just replace one set of subtly buggy behaviors with another (in which case I vote for what we already have, because that one is at least well known and passed some test of time). ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-28 8:18 ` Unibyte characters, strings and buffers Eli Zaretskii @ 2014-03-28 18:42 ` Paul Eggert 2014-03-28 18:52 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Paul Eggert @ 2014-03-28 18:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Emacs development discussions On 03/28/2014 01:18 AM, Eli Zaretskii wrote: > what you suggest will just > replace one set of subtly buggy behaviors with another Code that blithely passes bytes in the range 128-255 to char-equal is *already* buggy. Although the proposed change wouldn't fix those bugs, it'd fix others, so it'd be a win. Plus, the change is simpler and easier to explain than what we have now, and that is a long-term win. I'm afraid what I'm hearing is "although it's broken, unless we come up with a perfect solution we shouldn't do anything". I'd rather fix this particular problem now, even if it's not practical to fix all the related problems now. We don't need to slay the entire unibyte dragon to fix the relatively minor issue of comparing characters. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-28 18:42 ` Paul Eggert @ 2014-03-28 18:52 ` Eli Zaretskii 2014-03-28 19:21 ` Paul Eggert ` (2 more replies) 0 siblings, 3 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-28 18:52 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Fri, 28 Mar 2014 11:42:16 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: Emacs development discussions <emacs-devel@gnu.org> > > On 03/28/2014 01:18 AM, Eli Zaretskii wrote: > > what you suggest will just > > replace one set of subtly buggy behaviors with another > > Code that blithly passes bytes in the range 128-255 to char-equal is > *already* buggy. There's nothing wrong with those bytes, certainly not when they stand for Latin-1 characters. > Although the proposed change wouldn't fix those bugs, it'd fix > others, so it'd be a win. How is it a win, when it actually _adds_ bugs? E.g., under your proposal, (char-equal 192 224) will yield non-nil when case-fold-search is non-nil. > Plus, the change is simpler and easier to explain than what we have now, > and that is a long-term win. I don't see how it is simpler or easier to explain. It replaces one lopsided interpretation of 128-255 values with another. > I'm afraid what I'm hearing is "although it's broken, unless we come up > with a perfect solution we shouldn't do anything". I don't know where you heard that. I certainly didn't say anything like that. > I'd rather fix this particular problem now, even if it's not > practical to fix all the related problems now. I suggested a solution: ignore case-fold-search in unibyte buffers. I think that's a greater win. > We don't need to slay the entire unibyte dragon to fix the > relatively minor issue of comparing characters. I agree. But then you are responding in a wrong thread ;-) ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-28 18:52 ` Eli Zaretskii @ 2014-03-28 19:21 ` Paul Eggert 2014-03-29 6:40 ` Eli Zaretskii 2014-03-28 20:23 ` Stefan Monnier 2014-03-29 19:34 ` Stefan Monnier 2 siblings, 1 reply; 103+ messages in thread From: Paul Eggert @ 2014-03-28 19:21 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel >> Code that blithely passes bytes in the range 128-255 to char-equal is >> *already* buggy. > There's nothing wrong with those bytes, certainly not when they stand > for Latin-1 characters. Sure, and if they stand for Latin-1 characters the proposed change will do the right thing. > How is it a win, when it actually _adds_ bugs? E.g., under your > proposal, (char-equal 192 224) will yield non-nil when > case-fold-search is non-nil. That's not a bug, since À and à are the same character, ignoring case. As I understand it, the scenario you're worried about is that someone is visiting a unibyte buffer and is doing a case-folded search involving non-ASCII bytes and doesn't want these bytes to match their Latin-1 case-folded counterparts. This scenario is not common enough to worry about. Changing the behavior for this rare case is a cost, I suppose, but it's outweighed by the benefit of simplifying char-equal and fixing its semantics to be a bit saner. >> Plus, the change is simpler and easier to explain than what we have now, >> and that is a long-term win. > I don't see how it is simpler or easier to explain. It replaces one > lopsided interpretation of 128-255 values with another. It's simpler because it decouples the rules for char-equal from the question of whether the current buffer is multibyte. Separation of concerns is a win. > I suggested a solution: ignore case-fold-search in unibyte buffers. Sorry, I didn't see that suggestion. It would be better than what we have now for char-equal, but it would have undesirable side effects elsewhere. 
When I type find-file-literally to visit a buffer in raw-text form, it's more convenient if I can type C-s h t m l (or whatever) and find "HTML". I'd rather not lose that capability. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-28 19:21 ` Paul Eggert @ 2014-03-29 6:40 ` Eli Zaretskii 2014-03-29 18:57 ` Paul Eggert 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 6:40 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Fri, 28 Mar 2014 12:21:04 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > CC: emacs-devel@gnu.org > > > I suggested a solution: ignore case-fold-search in unibyte buffers. > > Sorry, I didn't see that suggestion. It would be better than what we > have now for char-equal, but it would have undesirable side effects > elsewhere. I suggested it only for char-equal. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-29 6:40 ` Eli Zaretskii @ 2014-03-29 18:57 ` Paul Eggert 2014-03-29 19:46 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Paul Eggert @ 2014-03-29 18:57 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii wrote: > I suggested it only for char-equal. That would make char-equal incompatible with upcase, downcase, etc. when in a unibyte buffer, which would be incoherent. Unless you're also suggesting that upcase, downcase, etc. should be no-ops in unibyte buffers? But then they would be incompatible with searching. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-29 18:57 ` Paul Eggert @ 2014-03-29 19:46 ` Eli Zaretskii 0 siblings, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 19:46 UTC (permalink / raw) To: Paul Eggert; +Cc: emacs-devel > Date: Sat, 29 Mar 2014 11:57:53 -0700 > From: Paul Eggert <eggert@cs.ucla.edu> > Cc: emacs-devel@gnu.org > > Eli Zaretskii wrote: > > I suggested it only for char-equal. > > That would make char-equal incompatible with upcase, downcase, etc. when > in a unibyte buffer, which would be incoherent. > > Unless you're also suggesting that upcase, downcase, etc. should be > no-ops in unibyte buffers? But then they would be incompatible with > searching. So now _you_ are looking for a perfect solution? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-28 18:52 ` Eli Zaretskii 2014-03-28 19:21 ` Paul Eggert @ 2014-03-28 20:23 ` Stefan Monnier 2014-03-29 19:34 ` Stefan Monnier 2 siblings, 0 replies; 103+ messages in thread From: Stefan Monnier @ 2014-03-28 20:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Paul Eggert, emacs-devel > How is it a win, when it actually _adds_ bugs? E.g., under your > proposal, (char-equal 192 224) will yield non-nil when > case-fold-search is non-nil. Non-nil is the right answer. Doesn't sound like a bug to me. Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings and buffers 2014-03-28 18:52 ` Eli Zaretskii 2014-03-28 19:21 ` Paul Eggert 2014-03-28 20:23 ` Stefan Monnier @ 2014-03-29 19:34 ` Stefan Monnier 2 siblings, 0 replies; 103+ messages in thread From: Stefan Monnier @ 2014-03-29 19:34 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Paul Eggert, emacs-devel > I suggested a solution: ignore case-fold-search in unibyte buffers. As mentioned in your original message, char-equal should not pay attention to the current buffer. Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Buffer-local variables affect general-purpose functions 2014-03-28 7:11 ` Eli Zaretskii 2014-03-28 7:46 ` Paul Eggert @ 2014-03-28 14:12 ` Stefan Monnier 1 sibling, 0 replies; 103+ messages in thread From: Stefan Monnier @ 2014-03-28 14:12 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel > How to compare bytes, then? As mentioned in my first mention of the problem: `eq'. > Anyway, we don't have a way of distinguishing between characters and > bytes, unless we look on something besides the arguments themselves. `char-equal' has something to distinguish: the fact that we call `char-equal' instead of `eq' is just the info needed to decide that the arguments are chars rather than bytes. Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
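Stefan's point, that the choice of function is itself the signal, can be sketched with ASCII values. This is an illustrative snippet, assuming the standard case table is in effect: 'eq' compares the integers as they are, while 'char-equal' compares them as characters and honors case-fold-search.

```elisp
(let ((case-fold-search t))
  (list (eq ?a ?A)            ; nil: 97 and 65 are different integers
        (char-equal ?a ?A)))  ; t: equal as characters when case is ignored
```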
* Re: Buffer-local variables affect general-purpose functions 2014-03-27 17:17 ` Eli Zaretskii 2014-03-27 21:04 ` Stefan Monnier @ 2014-03-28 3:38 ` Stephen J. Turnbull 2014-03-28 8:51 ` Unibyte characters, strings, and buffers Eli Zaretskii 1 sibling, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-28 3:38 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stefan Monnier, emacs-devel Eli Zaretskii writes: > Paul seemed to say something more broad: that _all_ behaviors specific > to unibyte buffers should go away. Do you agree? Yes, please. XEmacs has never had the unibyte hack with Mule, and never has had much trouble with that. It also has never had an instance of the \201 bug since Mule was declared stable -- where Emacs has had *many* regressions. It's arguable that there are performance implications, but simply aliasing the binary codec to latin1-unix has *never* caused a bug in handling binary files -- all bugs are due to autodetection errors, not the buffer representation. I don't recall a case where a programmer "did something stupid" with a character function that technically is inappropriate for true binary (eg, upcase) -- invariably they were doing something like upcasing all the HTML tags as they came off the wire. Ie, the stream was a binary protocol where all of the syntax was represented with ASCII bytes, and therefore "readable words". If the performance implications bother you, then a buffer representation like http://www.python.org/dev/peps/pep-0393/ may be useful. You could do that halfway, as well (ie, buffers containing pure Latin1 text or binary text would be represented as a flat buffer of bytes, buffers containing scalars >= 256 would be represented as UTF-8b, or whatever the hack for representing undecodable bytes currently is). > Anyway, what should replace those hacks? Arbitrarily interpreting raw > bytes as Latin characters is not TRT, IMO. 
Python has a bytes/character distinction, but they have completely separate implementations. Emacs doesn't need that, unless you want to compete with the P-languages as a web framework platform. OTOH Emacs' unibyte buffer toggle is a design bug, pure and simple, and it should be backed up against a wall and immersed in insecticide. If you stick to the interpretation that bytes contain non-negative integers less than 256, you won't have a problem in practice if you think of them as the first 256 Unicode characters, but choose not to use functions that make sense only with characters. Python actually implements many polymorphic functions (ie, they can be interpreted as bytes->bytes or characters->characters, etc) by converting bytes to characters as Latin-1, then using the character implementation of the function. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 3:38 ` Stephen J. Turnbull @ 2014-03-28 8:51 ` Eli Zaretskii 2014-03-28 10:28 ` Stephen J. Turnbull 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-28 8:51 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Date: Fri, 28 Mar 2014 12:38:10 +0900 > Cc: Stefan Monnier <monnier@IRO.UMontreal.CA>, emacs-devel@gnu.org > > Eli Zaretskii writes: > > > Paul seemed to say something more broad: that _all_ behaviors specific > > to unibyte buffers should go away. Do you agree? > > Yes, please. XEmacs has never had the unibyte hack with Mule, and > never has had much trouble with that. It also has never had an > instance of the \201 bug since Mule was declared stable -- where Emacs > has had *many* regressions. Let's not talk about Emacs 20 vintage problems, that is not useful. Likewise examples from XEmacs, since the differences in this area between Emacs and XEmacs are substantial, and that precludes useful comparison. > It's arguable that there are performance implications, but simply > aliasing the binary codec to latin1-unix has *never* caused a bug in > handling binary files -- all bugs are due to autodetection errors, > not the buffer representation. Forget about performance, there are real problems unrelated to that which need to be solved, and I don't see how can you avoid them by treating raw bytes as Latin-1 characters. Let me explain. First, we must have a way to have buffer "text" that represents a stream of bytes, not some human-readable text. (Just as a random example, a buffer visiting an mbox file, from which you decode portions into another buffer for display.) Agreed? In such unibyte buffers, we need a way to represent raw bytes, which are parts of as yet un-decoded byte sequences that represent encoded characters. 
We cannot represent each such byte as a Latin-1 character, because Latin-1 characters are stored inside Emacs as 2-byte sequences of their UTF-8 encoding. If you interpret bytes as Latin-1 characters, functions like string-bytes will return wrong results for those raw bytes. Agreed? So here you have already at least 2 valid reasons why Emacs must be able to support raw bytes that are distinguishable from Latin-1 characters that have the same byte values, and why we must have buffers that hold such raw bytes. If we want to get rid of unibyte, Someone(TM) should present a complete practical solution to those two problems (and a few others), otherwise, this whole discussion leads nowhere. ("Practical" means that suggestions to introduce a character data type are out of scope, or at least belong to an entirely different discussion.) > OTOH Emacs' unibyte buffer toggle is a design bug, pure and simple, > and it should be backed up against a wall and immersed in > insecticide. I might even agree with you about the toggle. But eliminating the toggle doesn't solve the bigger issue, see above. > If you stick to the interpretation that bytes contain non-negative > integers less than 256, you won't have a problem in practice if you > think them as the first 256 Unicode characters, but choose not to use > functions that make sense only with characters. What do you mean by "choose"? Lisp code is used by many programmers out there; sometimes, they aren't even aware if the buffer they work on is unibyte, or what that means. Even when they are aware, they just want Emacs to DTRT, for their own value of "RT". Unless each one of those programmers "chooses" not to use the problematic functions, we are back at square one. And what does "choose not to use" mean, anyway? How do you choose not to use 'insert', for example? what do you use instead? 
The issue at hand is how do you pull the trick, in practice, of doing TRT with the legitimate use cases where Emacs needs to manipulate raw bytes. > Python actually implements many polymorphic functions (ie, they can > be interpreted as bytes->bytes or characters->characters, etc) by > converting bytes to characters as Latin-1, then using the character > implementation of the function. As long as Emacs exposes the character values to Lisp programs as simple integers, I don't think we can take this path. ^ permalink raw reply [flat|nested] 103+ messages in thread
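The size argument in the message above (a raw byte reinterpreted as a Latin-1 character takes two bytes in a UTF-8-based representation) can be checked with a small model. This is only a sketch in Python, not Emacs code; the helper name `utf8_size_if_latin1` and the sample payload are made up for illustration.

```python
# Model of the size argument: a byte >= 0x80 that is stored as a Latin-1
# *character* in a UTF-8-based representation occupies two bytes, not one.

def utf8_size_if_latin1(raw: bytes) -> int:
    """Return the storage size if every byte of RAW is reinterpreted as a
    Latin-1 character and then stored as UTF-8 (hypothetical helper)."""
    return len(raw.decode("latin-1").encode("utf-8"))

# One ASCII byte followed by three bytes in the 0x80..0xFF range.
payload = bytes([0x41, 0xE0, 0xC0, 0xFF])

on_disk = len(payload)                      # size of the byte stream itself
as_latin1_text = utf8_size_if_latin1(payload)

print(on_disk, as_latin1_text)
```

The two numbers differ as soon as any byte is 0x80 or above, which is the discrepancy a byte-counting function would report if raw bytes were stored as Latin-1 characters.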
* Re: Unibyte characters, strings, and buffers 2014-03-28 8:51 ` Unibyte characters, strings, and buffers Eli Zaretskii @ 2014-03-28 10:28 ` Stephen J. Turnbull 2014-03-28 10:58 ` David Kastrup ` (2 more replies) 0 siblings, 3 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-28 10:28 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Eli Zaretskii writes: > Let's not talk about Emacs 20 vintage problems, If they were *only* Emacs 20 vintage, this thread wouldn't exist. > Likewise examples from XEmacs, since the differences in this area > between Emacs and XEmacs are substantial, and that precludes useful > comparison. "It works fine" isn't useful information? XEmacs has *two* reasons to want to change its internal representation. (1) A Unicode representation, especially UTF-8, would allow all autosave files to be readable by other programs. (2) A PEP 393-like representation would be way faster for big buffers and strings. Bytes-character confusion is just plain not an issue, not for anybody, not at all. > First, we must have a way to have buffer "text" that represents a > stream of bytes, not some human-readable text. (Just as a random > example, a buffer visiting an mbox file, from which you decode > portions into another buffer for display.) Agreed? No, I disagree. XEmacs/MULE has never had such a feature, yet we can run all Emacs programs without changing the buffer representation (modulo inability to represent all Unicode characters properly, but the JIT charsets are plenty good enough in practice). > In such unibyte buffers, we need a way to represent raw bytes, which > are parts of as yet un-decoded byte sequences that represent encoded > characters. Again, I disagree. Unibyte is a design mistake, and unnecessary. XEmacs proves it -- we use (essentially) the same code in many applications (VM, Gnus for two mbox-using examples) as GNU Emacs does. The variations for XEmacs and Emacs are due to extents vs. 
overlays and such-like, not due to buffer representation. For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as no-ops forever, and as far as I can tell nobody's ever needed to worry about it (of course, maybe the folks who use those are just more clued than the poor user in my next paragraph). I agree that having a way to represent "undecodable bytes" in a string or buffer is extremely convenient. XEmacs's lack of this capability is surely a deficiency (Hi, David K!) But this is a completely different issue from unibyte buffers. Emacs doesn't need unibyte buffers to perform its work, and if they are desirable on the grounds of space or time efficiency, they should be opaque to Lisp. > We cannot represent each such byte as a Latin-1 character, because > Latin-1 characters are stored inside Emacs as 2-byte sequences of > their UTF-8 encoding. If you interpret bytes as Latin-1 > characters, functions like string-bytes will return wrong results > for those raw bytes. Agreed? No, I still disagree. `(defun string-bytes (&rest junk) (error))', and live happily ever after. You don't need `string-bytes' unless you've exposed internal representation to Lisp, then you desperately need it to write correct code (which some users won't be able to do anyway without help, cf. https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk). So *don't expose internal representation* (and the hammer marks on users' foreheads will disappear in due time, and the headaches even faster!) > So here you have already at least 2 valid reasons No, *you* have them. XEmacs works perfectly well without them, using code written for Emacs. > If we want to get rid of unibyte, Someone(TM) should present a > complete practical solution to those two problems (and a few > others), otherwise, this whole discussion leads nowhere. Complete practical solution: "They are non-problems, forget about them, and rewrite any code that implies you need to remember them." 
Fortunately for me, I am *intimately* familiar with XEmacs internals, and therefore RMS won't let me write this code for Emacs. :-) > > If you stick to the interpretation that bytes contain non-negative > > integers less than 256, you won't have a problem in practice if you > > think them as the first 256 Unicode characters, but choose not to use > > functions that make sense only with characters. > > What do you mean by "choose"? Lisp code is used by many programmers > out there; sometimes, they aren't even aware if the buffer they work > on is unibyte, or what that means. Which is precisely why we're having this thread. If there were *no* Lisp-visible unibyte buffers or strings, it couldn't possibly matter. > Even when they are aware, they just want Emacs to DTRT, for their > own value of "RT". Too bad for them, as long as Emacs has unibyte buffers. They have to be aware, and write code correctly for the mode of the buffer. Viz. the poor serial port programmer in comp.emacs. In XEmacs, they don't have to; they just use an appropriate network-coding-system, and it just works. That may not be *obvious* to a programmer coming from a different background (say, Python) who expects there to be both byte streams and text streams, but since there's no other way to do it, it's not hard to get it right. > And what does "choose not to use" mean, anyway? How do you choose not > to use 'insert', for example? what do you use instead? Of course you use `insert'. What I'm saying is that if you don't want to trash a binary buffer where each byte is represented by an ISO-8859-1 character in internal representation, you need to avoid (1) coding-system-for-write other than 'binary (in XEmacs, aliased to 'iso-8859-1-unix), and (2) functions that mutate characters using properties of characters that bytes don't have (e.g., upcase). That's really all there is to it.
> The issue at hand is how do you pull the trick, in practice, of > doing TRT with the legitimate use cases where Emacs needs to > manipulate raw bytes. Follow the Nike advice: Just Do It. Works fine, I assure you. I can understand that you're worried by this: > As long as Emacs exposes the character values to Lisp programs as > simple integers, I don't think we can take this path. ... but I'm not really sure why not. I'll grant that after drinking the Ben Wing Kool-Aid the idea of Emacsen without a character type gives me hives, but that's because arbitrary integers, if decomposed into byte- sized fields and inserted into a buffer, can become non-characters and crash XEmacs. But surely you have a function like `char-int-p'[1] that is used (implicitly by `insert') to prevent non-characters (in Emacs, 0xFFFF and surrogates would be examples, I suppose) from being inserted in buffers. Otherwise you'd have crashes all over the place, I would imagine. Since you don't, you must be doing something to prevent arbitrary integers from getting inserted. It seems to me that the only real issue, given that you have a way in Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does) is what to do if somebody reads in data as 'binary, then proceeds to insert non-Latin-1 characters in the buffer. I can think of three possibilities: (1) don't allow it without changing the buffer's output codec, (2) treat the existing characters as Latin-1, or (3) convert all the existing "bytes" to undecodable bytes representation. XEmacs implicitly does (2) ((3) can't be implemented at all, at present). I tend to prefer (1), but ISTR that would not have worked very well with some programs, specifically readmail and VM (whose author had a lot of influence on how XEmacs internals were designed), because they narrowed the buffer and converted wire format (including raw multibyte encodings) to displayed text in-place. 
Footnotes: [1] `char-int-p' is a built-in function (char-int-p OBJECT) Documentation: Return t if OBJECT is an integer that can be converted into a character. See `char-int'. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 10:28 ` Stephen J. Turnbull @ 2014-03-28 10:58 ` David Kastrup 2014-03-28 11:22 ` Andreas Schwab 2014-03-28 11:42 ` Stephen J. Turnbull 2014-03-28 17:29 ` Eli Zaretskii 2014-03-28 18:45 ` Daniel Colascione 2 siblings, 2 replies; 103+ messages in thread From: David Kastrup @ 2014-03-28 10:58 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > I agree that having a way to represent "undecodable bytes" in a string > or buffer is extremely convenient. XEmacs's lack of this capability > is surely a deficiency (Hi, David K!) Doing this in an utf-8 based internal coding is somewhat doable by employing non-utf-8 sequences. Either using code points above the Unicode code range (2^20 + something, requiring 4 bytes), or by using non-minimal encodings (since the minimal ones are two bytes, requiring 3 bytes). Either way, the size increases significantly. > But this is a completely different issue from unibyte buffers. Emacs > doesn't need unibyte buffers to perform its work, and if they are > desirable on the grounds of space or time efficiency, they should be > opaque to Lisp. Well, Emacs is more following the non-opaque philosophy (XEmacs, in contrast, has even an opaque character type and several other ones). That has the advantage that you can use all sorts of available tools as long as they don't break. It has the disadvantage that the question "what is the right behavior for x?" needs to be answered quite more often since you can't take the "x does not apply to y anyway" route out as often. > > We cannot [...] > > No, I still disagree. Sure, everything is actually "We cannot efficiently" rather than "We cannot". But we still changed buffer positions from byte counts (as in early Emacs 20) to character counts. Efficiency took a dive but the alternatives were just too horrible API-wise. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 10:58 ` David Kastrup @ 2014-03-28 11:22 ` Andreas Schwab 2014-03-28 11:34 ` David Kastrup 2014-03-28 11:42 ` Stephen J. Turnbull 1 sibling, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-28 11:22 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup <dak@gnu.org> writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > >> I agree that having a way to represent "undecodable bytes" in a string >> or buffer is extremely convenient. XEmacs's lack of this capability >> is surely a deficiency (Hi, David K!) > > Doing this in an utf-8 based internal coding is somewhat doable by > employing non-utf-8 sequences. Either using code points above the > Unicode code range (2^20 + something, requiring 4 bytes), or by using > non-minimal encodings (since the minimal ones are two bytes, requiring 3 > bytes). Either way, the size increases significantly. Emacs uses U3fff80-U3fffff for raw 8-bit bytes, internally represented by 2 bytes. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 11:22 ` Andreas Schwab @ 2014-03-28 11:34 ` David Kastrup 0 siblings, 0 replies; 103+ messages in thread From: David Kastrup @ 2014-03-28 11:34 UTC (permalink / raw) To: emacs-devel Andreas Schwab <schwab@linux-m68k.org> writes: > David Kastrup <dak@gnu.org> writes: > >> "Stephen J. Turnbull" <stephen@xemacs.org> writes: >> >>> I agree that having a way to represent "undecodable bytes" in a string >>> or buffer is extremely convenient. XEmacs's lack of this capability >>> is surely a deficiency (Hi, David K!) >> >> Doing this in an utf-8 based internal coding is somewhat doable by >> employing non-utf-8 sequences. Either using code points above the >> Unicode code range (2^20 + something, requiring 4 bytes), or by using >> non-minimal encodings (since the minimal ones are two bytes, requiring 3 >> bytes). Either way, the size increases significantly. > > Emacs uses U3fff80-U3fffff for raw 8-bit bytes, internally represented > by 2 bytes. Well, I forgot the non-minimal encodings for 0x00-0x7f, namely two-byte sequences starting with 0xc0 or 0xc1 and ending with 0x80-0xbf. Those would still fit the representation invariants. Are those the two-byte encodings used for "raw 0x80 to 0xff"? -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
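The layout asked about above can be sketched as follows. The thread itself only states that raw 8-bit bytes occupy U3fff80-U3fffff and take 2 bytes internally; the exact byte pattern used here (lead byte 0xC0 or 0xC1, trail byte 0x80-0xBF) is the conjecture from the question, treated as an assumption. The function name `raw_byte_internal` is hypothetical, and the code is a Python model rather than anything taken from the Emacs sources.

```python
def raw_byte_internal(b: int) -> bytes:
    """Model the assumed 2-byte form of raw byte B (0x80..0xFF): an overlong
    sequence with lead byte 0xC0/0xC1 and a trail byte in 0x80..0xBF."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need the special form")
    lead = 0xC0 | ((b >> 6) & 0x01)   # 0xC0 for 0x80..0xBF, 0xC1 for 0xC0..0xFF
    trail = 0x80 | (b & 0x3F)         # low six bits of the byte
    return bytes([lead, trail])

raw_form = raw_byte_internal(0xE0)          # raw byte 0xE0
latin1_form = "\u00e0".encode("utf-8")      # the character U+00E0 in UTF-8

# Both forms are two bytes long, but they differ in the lead byte.
print(raw_form.hex(), latin1_form.hex())

# A strict UTF-8 decoder rejects the overlong form, so under this assumed
# layout it cannot collide with any well-formed character.
try:
    raw_form.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Under this assumption a raw byte costs the same two bytes as a Latin-1 character but stays distinguishable from it at the storage level.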
* Re: Unibyte characters, strings, and buffers 2014-03-28 10:58 ` David Kastrup 2014-03-28 11:22 ` Andreas Schwab @ 2014-03-28 11:42 ` Stephen J. Turnbull 1 sibling, 0 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-28 11:42 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > But this is a completely different issue from unibyte buffers. Emacs > > doesn't need unibyte buffers to perform its work, and if they are > > desirable on the grounds of space or time efficiency, they should be > > opaque to Lisp. > > Well, Emacs is more following the non-opaque philosophy (XEmacs, in > contrast, has even an opaque character type and several other > ones). Those are irrelevant to my point, though. The problem here is that unibyte buffers are a second representation of a single type (the buffer). "Mr. Foot, meet Mr. Bullet, I'm sure you'll get along fine!" > That has the advantage that you can use all sorts of available tools as > long as they don't break. In this case, it's like being offered the hammer head and the handle separately. I'll say one thing for that approach, though -- now you have *two* excellent ways to give yourself a headache, with two different (musical?) sounds when you drum on your crown! > It has the disadvantage that the question "what is the right behavior > for x?" needs to be answered quite more often since you can't take the > "x does not apply to y anyway" route out as often. The right behavior here is for a unibyte buffer to do *exactly* the same thing that a multibyte buffer would. In which case you have a single (opaque) type, as far as users can tell. > Efficiency took a dive but the alternatives were just too horrible > API-wise. Unibyte buffer is just too horrible API-wise. My advice is: nuke it. Steve ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 10:28 ` Stephen J. Turnbull 2014-03-28 10:58 ` David Kastrup @ 2014-03-28 17:29 ` Eli Zaretskii 2014-03-28 17:50 ` David Kastrup ` (2 more replies) 2014-03-28 18:45 ` Daniel Colascione 2 siblings, 3 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-28 17:29 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: monnier@IRO.UMontreal.CA, > emacs-devel@gnu.org > Date: Fri, 28 Mar 2014 19:28:56 +0900 > > Eli Zaretskii writes: > > > Let's not talk about Emacs 20 vintage problems, > > If they were *only* Emacs 20 vintage, this thread wouldn't exist. This thread is about different issues. > > Likewise examples from XEmacs, since the differences in this area > > between Emacs and XEmacs are substantial, and that precludes useful > > comparison. > > "It works fine" isn't useful information? No, because it describes a very different implementation. > > First, we must have a way to have buffer "text" that represents a > > stream of bytes, not some human-readable text. (Just as a random > > example, a buffer visiting an mbox file, from which you decode > > portions into another buffer for display.) Agreed? > > No, I disagree. Then I guess you will have to suggest how to implement this without unibyte buffers. > > In such unibyte buffers, we need a way to represent raw bytes, which > > are parts of as yet un-decoded byte sequences that represent encoded > > characters. > > Again, I disagree. Unibyte is a design mistake, and unnecessary. Then what do you call a buffer whose "text" is encoded? > XEmacs proves it -- we use (essentially) the same code in many > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does. I asked you not to bring XEmacs into the discussion, because I cannot talk intelligently about its implementation. If you insist on doing that, this discussion is futile from my POV. 
> For heaven's sake, we've had `buffer-as-{multi,uni}-byte defined as > no-ops forever I wasn't talking about those functions. I was talking about the need to have unibyte buffers and strings. > I agree that having a way to represent "undecodable bytes" in a string > or buffer is extremely convenient. XEmacs's lack of this capability > is surely a deficiency (Hi, David K!) But this is a completely > different issue from unibyte buffers. How is it different? What would be the encoding of a buffer that contains raw bytes? > > We cannot represent each such byte as a Latin-1 character, because > > Latin-1 characters are stored inside Emacs as 2-byte sequences of > > their UTF-8 encoding. If you interpret bytes as Latin-1 > > characters, functions like string-bytes will return wrong results > > for those raw bytes. Agreed? > > No, I still disagree. > > `(defun string-bytes (&rest junk) (error))', and live happily ever > after. But that's ridiculous: a raw byte is just a single byte, so string-bytes should return a meaningful value for a string of such bytes. > You don't need `string-bytes' unless you've exposed internal > representation to Lisp, then you desperately need it to write correct > code (which some users won't be able to do anyway without help, cf. > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk). So > *don't expose internal representation* (and the hammer marks on users' > foreheads will disappear in due time, and the headaches even faster!) How else would you know how many bytes will a string take on disk? > > So here you have already at least 2 valid reasons > > No, *you* have them. XEmacs works perfectly well without them, using > code written for Emacs. XEmacs also works "perfectly well" without bidi and other stuff. That doesn't help at all in this discussion. 
> > If we want to get rid of unibyte, Someone(TM) should present a > > complete practical solution to those two problems (and a few > > others), otherwise, this whole discussion leads nowhere. > > Complete practical solution: "They are non-problems, forget about > them, and rewrite any code that implies you need to remember them." That's a slogan, not a solution. > Fortunately for me, I am *intimately* familiar with XEmacs internals, > and therefore RMS won't let me write this code for Emacs. :-) Then perhaps you shouldn't be part of this discussion. > > > If you stick to the interpretation that bytes contain non-negative > > > integers less than 256, you won't have a problem in practice if you > > > think them as the first 256 Unicode characters, but choose not to use > > > functions that make sense only with characters. > > > > What do you mean by "choose"? Lisp code is used by many programmers > > out there; sometimes, they aren't even aware if the buffer they work > > on is unibyte, or what that means. > > Which is precisely why we're having this thread. If there were *no* > Lisp-visible unibyte buffers or strings, it couldn't possibly matter. And if I had $5M in my bank account, I'd probably be elsewhere enjoying myself. IOW, how are "if there were no..." arguments useful? > > Even when they are aware, they just want Emacs to DTRT, for their > > own value of "RT". > > Too bad for them, as long as Emacs has unibyte buffers. They have to > be aware, and write code correctly for the mode of the buffer. > Viz. the poor serial port programmer in comp.emacs. > > In XEmacs, they don't have to; they just use an appropriate > network-coding-system, and it just works. This is not a discussion about whose model is better, Emacs or XEmacs. This is a discussion of whether and how we can remove unibyte buffers, strings, and characters from Emacs. You must start by understanding how they are used in Emacs 24, and then suggest practical ways to change that.
Saying "look at XEmacs" doesn't help, because we can't, and you know it. I explicitly asked not to bring these arguments into the discussion, and yet you still insist on doing precisely that. > > And what does "choose not to use" mean, anyway? How do you choose not > > to use 'insert', for example? what do you use instead? > > Of course you use `insert'. In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers and characters. If you use it, you get what it does. > What I'm saying is that if you don't want to trash a binary buffer > where each byte is represented by an ISO-8859-1 character in > internal representation, you need to avoid (1) > coding-system-for-write other than 'binary (in XEmacs, aliased to > 'iso-8859-1-unix), and (2) functions that mutate characters using > properties of characters that bytes don't have (eg, upcase). That's > really all there is to it. If the buffer is not marked specially, how will I know to avoid those? > But surely you have a function like > `char-int-p'[1] that is used (implicitly by `insert') to prevent > non-characters (in Emacs, 0xFFFF and surrogates would be examples, I > suppose) from being inserted in buffers. Otherwise you'd have crashes > all over the place, I would imagine. Since you don't, you must be > doing something to prevent arbitrary integers from getting inserted. There's char-valid-p, but I don't see how that is relevant to the current discussion. > It seems to me that the only real issue, given that you have a way in > Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does) > is what to do if somebody reads in data as 'binary, then proceeds to > insert non-Latin-1 characters in the buffer. I can think of three > possibilities: (1) don't allow it without changing the buffer's output > codec, (2) treat the existing characters as Latin-1, or (3) convert > all the existing "bytes" to undecodable bytes representation. 
> > XEmacs implicitly does (2) ((3) can't be implemented at all, at > present). Not sure I understand what you describe, but if I do, Emacs does (3). And I still don't see how this is relevant. You are describing a marginally valid use case, while I'm talking about use cases we meet every day, and which must be supported, e.g. when some Lisp wants to decode or encode text by hand. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 17:29 ` Eli Zaretskii @ 2014-03-28 17:50 ` David Kastrup 2014-03-28 18:31 ` Eli Zaretskii 2014-03-28 20:27 ` Stefan Monnier 2014-03-29 9:23 ` Stephen J. Turnbull 2 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-28 17:50 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: "Stephen J. Turnbull" <stephen@xemacs.org> > >> Again, I disagree. Unibyte is a design mistake, and unnecessary. > > Then what do you call a buffer whose "text" is encoded? I can't speak for Stephen, of course, but my impression was he would call it "a bad idea". -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 17:50 ` David Kastrup @ 2014-03-28 18:31 ` Eli Zaretskii 2014-03-28 19:25 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-28 18:31 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Date: Fri, 28 Mar 2014 18:50:02 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: "Stephen J. Turnbull" <stephen@xemacs.org> > > > >> Again, I disagree. Unibyte is a design mistake, and unnecessary. > > > > Then what do you call a buffer whose "text" is encoded? > > I can't speak for Stephen, of course, but my impression was he would > call it "a bad idea". Then what other ideas to use when Lisp code needs to encode or decode text manually? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 18:31 ` Eli Zaretskii @ 2014-03-28 19:25 ` David Kastrup 2014-03-29 6:43 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-28 19:25 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: David Kastrup <dak@gnu.org> >> Date: Fri, 28 Mar 2014 18:50:02 +0100 >> >> Eli Zaretskii <eliz@gnu.org> writes: >> >> >> From: "Stephen J. Turnbull" <stephen@xemacs.org> >> > >> >> Again, I disagree. Unibyte is a design mistake, and unnecessary. >> > >> > Then what do you call a buffer whose "text" is encoded? >> >> I can't speak for Stephen, of course, but my impression was he would >> call it "a bad idea". > > Then what other ideas to use when Lisp code needs to encode or decode > text manually? Redecode right to a "binary" coding system would be my guess. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 19:25 ` David Kastrup @ 2014-03-29 6:43 ` Eli Zaretskii 2014-03-29 7:23 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 6:43 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: emacs-devel@gnu.org > Date: Fri, 28 Mar 2014 20:25:17 +0100 > > >> > Then what do you call a buffer whose "text" is encoded? > >> > >> I can't speak for Stephen, of course, but my impression was he would > >> call it "a bad idea". > > > > Then what other ideas to use when Lisp code needs to encode or decode > > text manually? > > Redecode right to a "binary" coding system would be my guess. Sorry, I don't follow. Can you tell more what that means? The situation I was describing is that I need to do something with undecoded bytes before decoding them, or after encoding them. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 6:43 ` Eli Zaretskii @ 2014-03-29 7:23 ` David Kastrup 2014-03-29 8:24 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-29 7:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: David Kastrup <dak@gnu.org> >> Cc: emacs-devel@gnu.org >> Date: Fri, 28 Mar 2014 20:25:17 +0100 >> >> >> > Then what do you call a buffer whose "text" is encoded? >> >> >> >> I can't speak for Stephen, of course, but my impression was he would >> >> call it "a bad idea". >> > >> > Then what other ideas to use when Lisp code needs to encode or decode >> > text manually? >> >> Redecode right to a "binary" coding system would be my guess. > > Sorry, I don't follow. Can you tell more what that means? It means a buffer where each _character_ has the same value that the no-longer-available unibyte buffer would have in its bytes/characters. > The situation I was describing is that I need to do something with > undecoded bytes before decoding them, or after encoding them. You can do that whether or not the conceptual array of 0..255 characters is internally encoded in unibyte or multibyte encodings. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 7:23 ` David Kastrup @ 2014-03-29 8:24 ` Eli Zaretskii 2014-03-29 8:40 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 8:24 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 08:23:33 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: David Kastrup <dak@gnu.org> > >> Cc: emacs-devel@gnu.org > >> Date: Fri, 28 Mar 2014 20:25:17 +0100 > >> > >> >> > Then what do you call a buffer whose "text" is encoded? > >> >> > >> >> I can't speak for Stephen, of course, but my impression was he would > >> >> call it "a bad idea". > >> > > >> > Then what other ideas to use when Lisp code needs to encode or decode > >> > text manually? > >> > >> Redecode right to a "binary" coding system would be my guess. > > > > Sorry, I don't follow. Can you tell more what that means? > > It means a buffer where each _character_ has the same value that the > no-longer-available unibyte buffer would have in its bytes/characters. This doesn't seem to be a complete description of what is suggested. E.g., just by looking at the values of characters, it is impossible to distinguish between Latin characters below 256 and raw bytes. In a unibyte buffer, we know how to make that distinction, but if there are no unibyte buffers, something else is needed for doing that. > > The situation I was describing is that I need to do something with > > undecoded bytes before decoding them, or after encoding them. > > You can do that whether or not the conceptual array of 0..255 characters > is internally encoded in unibyte or multibyte encodings. What do you mean by "multibyte encodings" in this context? Are you suggesting to store the bytes 128..255 as Latin-1 characters, i.e. using the 2-byte UTF-8 sequences of the corresponding Latin characters? Or are you suggesting something else? 
* Re: Unibyte characters, strings, and buffers 2014-03-29 8:24 ` Eli Zaretskii @ 2014-03-29 8:40 ` David Kastrup 2014-03-29 9:25 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-29 8:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: David Kastrup <dak@gnu.org> >> Cc: emacs-devel@gnu.org >> Date: Sat, 29 Mar 2014 08:23:33 +0100 >> >> Eli Zaretskii <eliz@gnu.org> writes: >> >> >> From: David Kastrup <dak@gnu.org> >> >> Cc: emacs-devel@gnu.org >> >> Date: Fri, 28 Mar 2014 20:25:17 +0100 >> >> >> >> >> > Then what do you call a buffer whose "text" is encoded? >> >> >> >> >> >> I can't speak for Stephen, of course, but my impression was he would >> >> >> call it "a bad idea". >> >> > >> >> > Then what other ideas to use when Lisp code needs to encode or decode >> >> > text manually? >> >> >> >> Redecode right to a "binary" coding system would be my guess. >> > >> > Sorry, I don't follow. Can you tell more what that means? >> >> It means a buffer where each _character_ has the same value that the >> no-longer-available unibyte buffer would have in its bytes/characters. > > This doesn't seem to be a complete description of what is suggested. > E.g., just by looking at the values of characters, it is impossible to > distinguish between Latin characters below 256 and raw bytes. In a > unibyte buffer, we know how to make that distinction, Uh, what? The point of a unibyte buffer is that it does not make the distinction. > but if there are no unibyte buffers, something else is needed for > doing that. >> You can do that whether or not the conceptual array of 0..255 characters >> is internally encoded in unibyte or multibyte encodings. > > What do you mean by "multibyte encodings" in this context? Are you > suggesting to store the bytes 128..255 as Latin-1 characters, > i.e. using the 2-byte UTF-8 sequences of the corresponding Latin > characters? 
That would make the most sense, yes. > Or are you suggesting something else? You could also use the "raw byte" character encodings we use for not losing information when reading not properly formed utf-8 files into a multibyte buffer, but that seems less practical when working with the character codes. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 8:40 ` David Kastrup @ 2014-03-29 9:25 ` Eli Zaretskii 0 siblings, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 9:25 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 09:40:03 +0100 > > >> It means a buffer where each _character_ has the same value that the > >> no-longer-available unibyte buffer would have in its bytes/characters. > > > > This doesn't seem to be a complete description of what is suggested. > > E.g., just by looking at the values of characters, it is impossible to > > distinguish between Latin characters below 256 and raw bytes. In a > > unibyte buffer, we know how to make that distinction, > > Uh, what? The point of a unibyte buffer is that it does not make the > distinction. Yes, it does: it treats every character as a raw byte. So the dilemma is resolved there by definition. How to do that without unibyte buffers remains to be defined, otherwise plans to remove unibyte buffers are impractical. > > but if there are no unibyte buffers, something else is needed for > > doing that. > > >> You can do that whether or not the conceptual array of 0..255 characters > >> is internally encoded in unibyte or multibyte encodings. > > > > What do you mean by "multibyte encodings" in this context? Are you > > suggesting to store the bytes 128..255 as Latin-1 characters, > > i.e. using the 2-byte UTF-8 sequences of the corresponding Latin > > characters? > > That would make the most sense, yes. Then the above distinction is impossible, and all kinds of subtly incorrect behaviors creep in. > > Or are you suggesting something else? > > You could also use the "raw byte" character encodings we use for not > losing information when reading not properly formed utf-8 files into a > multibyte buffer, but that seems less practical when working with the > character codes. Why less practical? 
* Re: Unibyte characters, strings, and buffers 2014-03-28 17:29 ` Eli Zaretskii 2014-03-28 17:50 ` David Kastrup @ 2014-03-28 20:27 ` Stefan Monnier 2014-03-29 9:23 ` Stephen J. Turnbull 2 siblings, 0 replies; 103+ messages in thread From: Stefan Monnier @ 2014-03-28 20:27 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Stephen J. Turnbull, emacs-devel >> Again, I disagree. Unibyte is a design mistake, and unnecessary. > Then what do you call a buffer whose "text" is encoded? I think they call it "a buffer" ;-) More seriously, IIUC they represent bytes 0..7F as ASCII (like we do) and 80..FF as latin-1-ish chars (i.e. occupying two bytes in the internal representation, IIRC). Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 17:29 ` Eli Zaretskii 2014-03-28 17:50 ` David Kastrup 2014-03-28 20:27 ` Stefan Monnier @ 2014-03-29 9:23 ` Stephen J. Turnbull 2014-03-29 9:52 ` Andreas Schwab ` (4 more replies) 2 siblings, 5 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 9:23 UTC (permalink / raw) To: Eli Zaretskii; +Cc: monnier, emacs-devel Eli Zaretskii writes: > This thread is about different issues. *sigh* No, it's about unibyte being a premature pessimization. > > > Likewise examples from XEmacs, since the differences in this area > > > between Emacs and XEmacs are substantial, and that precludes useful > > > comparison. > > > > "It works fine" isn't useful information? > > No, because it describes a very different implementation. Not at all. The implementation of multibyte buffers is very similar. What's different is that Emacs complifusticates matters by also having a separate implementation of unibyte buffers, and then basically making a union out of the two structures called "buffer". XEmacs simply implements binary as a particular coding system in and out of multibyte buffers. > Then I guess you will have to suggest how to implement this without > unibyte buffers. No, I don't. I already told you how to do it: nuke unibyte buffers and use iso-8859-1-unix as the binary codec. Then you're done, except for those applications that actually make the mistake of using unibyte text explicitly. If there are cases where unibyte happens implicitly, and this transformation causes a bug, I think you'll discover unibyte itself was problematic. > > > In such unibyte buffers, we need a way to represent raw bytes, which > > > are parts of as yet un-decoded byte sequences that represent encoded > > > characters. > > > > Again, I disagree. Unibyte is a design mistake, and unnecessary. > > Then what do you call a buffer whose "text" is encoded? "Binary." 
> > XEmacs proves it -- we use (essentially) the same code in many > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does. > > I asked you not to bring XEmacs into the discussion, because I cannot > talk intelligently about its implementation. If you insist on doing > that, this discussion is futile from my POV. The whole point here is that exactly what the XEmacs implementation is, is *irrelevant*. The point is that we implement the same API as GNU Emacs without unibyte buffers or the annoyances and incoherence that come with them. > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as > > no-ops forever > > I wasn't talking about those functions. I was talking about the need > to have unibyte buffers and strings. There is no "need for unibyte." You're simply afraid to throw it away. > How is it different? What would be the encoding of a buffer that > contains raw bytes? Depends. If it's uninterpreted bytes, "binary." If those are undecodable bytes, they'll be the representation of raw bytes that occurred in an otherwise sane encoded stream, and the buffer's encoding will be the nominal encoding of that stream. If you want to ensure sanity of output, then you will use an output encoding that errors on rawbytes, and a program that cleans up those rawbytes in a way appropriate for the application. If you expect the next program in the pipeline to handle them, then you use a variant encoding that just encodes them back to the original undecodable rawbytes. > But that's ridiculous: a raw byte is just a single byte, so > string-bytes should return a meaningful value for a string of such > bytes. `string-bytes' should not exist. As I wrote earlier: > > You don't need `string-bytes' unless you've exposed internal > > representation to Lisp, then you desperately need it to write correct > > code (which some users won't be able to do anyway without help, cf. > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk). 
So > > *don't expose internal representation* (and the hammer marks on users' > > foreheads will disappear in due time, and the headaches even faster!) > > How else would you know how many bytes will a string take on disk? How does `string-bytes' help? You don't know what encoding will be used to write them, and in general it won't be the same number that they take up in the string. If you use iso-8859-1-unix as the coding system, then "bytes on the wire" == "characters in the string". No problema, señor. > > > > So here you have already at least 2 valid reasons > > > > No, *you* have them. XEmacs works perfectly well without them, using > > code written for Emacs. > > XEmacs also works "perfectly well" without bidi and other stuff. That > doesn't help at all in this discussion. You're right: because XEmacs doesn't handle bidi, it's irrelevant to this discussion. Why did *you* bring it up? What is relevant is how to represent byte streams in Emacs. The obvious non-unibyte way is a one-to-one mapping of bytes to Unicode characters. It is *extremely* convenient if the first 128 of those bytes correspond to the ASCII coded character set, because so many wire protocols use ASCII "words" syntactically. The other 128 don't matter much, so why not just use the extremely convenient Latin-1 set for them? > > > If we want to get rid of unibyte, Someone(TM) should present a > > > complete practical solution to those two problems (and a few > > > others), otherwise, this whole discussion leads nowhere. > > > > Complete practical solution: "They are non-problems, forget about > > them, and rewrite any code that implies you need to remember them." > > That's a slogan, not a solution. No, it is a precise high-level design for a solution. The same design that XEmacs uses, and which would be quite straightforward for Emacs to adopt since it already has multibyte buffers of the same power as XEmacs's, though with (currently) a different internal encoding. 
> > Fortunately for me, I am *intimately* familiar with XEmacs internals, > > and therefore RMS won't let me write this code for Emacs. :-) > > Then perhaps you shouldn't be part of this discussion. Since I've been invited to leave, I will. My point is sufficiently well-made for open minds to deal with the details. I'll finish this post on the off chance that somewhere in it will be the key that will unlock yours. > > Which is precisely why we're having this thread. If there were *no* > > Lisp-visible unibyte buffers or strings, it couldn't possibly matter. > > And if I had $5M in my bank account, I'd probably be elsewhere > enjoying myself. IOW, how are "if there were no..." arguments useful? Because they point out that this thread wouldn't have happened with a different design. I consider that design better, after experience with two separate implementations of multibyte only (NEmacs, XEmacs/MULE), an implementation with strict separation of bytes from characters (Python 2 with PEP 383), an implementation with strict separation of bytes from characters and space-efficient character representation (Python 3 with PEPs 383, 393), and one implementation with unibyte (Emacs). The first four work fine dealing with bytes and characters, and there is no confusion. Both Pythons can handle undecodable bytes in encoded streams (i.e., roundtrip). Only GNU Emacs has issues about dealing with unibyte vs. multibyte. > This is not a discussion about whose model is better, Emacs or XEmacs. > This is a discussion of whether and how we can remove unibyte buffers, > strings, and characters from Emacs. You must start by understanding > how they are used in Emacs 24, and then suggest practical ways to > change that. Well, I would have said "tell me about it", but you've asked me to leave, so I won't. I will say nothing you've said so far even hints at issues with simply removing the whole concept of unibyte. 
> In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers > and characters. If you use it, you get what it does. And I'm telling you those subtleties are a *problem* that solve nothing that an Emacs without a unibyte concept can't handle fine. > If the buffer is not marked specially, how will I know to avoid > [inserting non-Latin-1 characters in a "binary" buffer]? All experience with XEmacs says *you* (the human programmer) *won't* have any problem avoiding that. As a programmer, if you're working with a binary protocol, you will be using binary buffers and strings, and byte-sized integers. If you accidentally mix things up, you'll quickly get an encoding error on output (since the binary codec can't output non-Latin-1 Unicode characters). It's just not a problem in practice, and that's not why unibyte was introduced in Emacs anyway. Unibyte was introduced because some folks thought working with variable-width-encoded buffers was too inefficient so they wanted access to a flat buffer of bytes. That's why buffer-as-{uni,multi}byte type punning was included. > > But surely you have a function like `char-int-p'[1] [...] > > There's char-valid-p, but I don't see how that is relevant to the > current discussion. Only insofar as you thought char-int confusion might be an issue. > And I still don't see how this is relevant. You are describing a > marginally valid use case, while I'm talking about use cases we meet > every day, and which must be supported, e.g. when some Lisp wants to > decode or encode text by hand. You use `encode-coding-region' and `decode-coding-region', same as you do now. Do you seriously think that XEmacs doesn't support those use cases? o/o ^ permalink raw reply [flat|nested] 103+ messages in thread
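The claim in the message above that "bytes on the wire" equals "characters in the string" under a Latin-1 codec can be checked with a small sketch. This is only an illustration, not code from the thread: the helper name `my/wire-sizes' is hypothetical, and it uses plain `iso-8859-1' and `utf-8' rather than the `-unix' variants, since end-of-line conversion plays no part here.

```elisp
;; Hypothetical helper (not from the thread): count characters in TEXT
;; and the bytes it occupies once encoded with two coding systems.
(defun my/wire-sizes (text)
  "Return (CHARS LATIN1-BYTES UTF8-BYTES) for TEXT."
  (list (length text)
        ;; `encode-coding-string' returns a unibyte string, so its
        ;; length is the number of bytes that would be written.
        (length (encode-coding-string text 'iso-8859-1))
        (length (encode-coding-string text 'utf-8))))

(my/wire-sizes "se\u00f1or")   ; => (5 5 6)
```

The size on disk depends on the coding system chosen at write time, which is the argument made above against relying on `string-bytes' for that purpose.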
* Re: Unibyte characters, strings, and buffers 2014-03-29 9:23 ` Stephen J. Turnbull @ 2014-03-29 9:52 ` Andreas Schwab 2014-03-29 10:48 ` Eli Zaretskii 2014-03-29 10:42 ` David Kastrup ` (3 subsequent siblings) 4 siblings, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 9:52 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > No, I don't. I already told you how to do it: nuke unibyte buffers > and use iso-8859-1-unix as the binary codec. No, you use raw-text, representing each non-ascii character in the eight-bit charset (this is what string-to-multibyte does). Using latin-1 would lose information. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
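A minimal sketch of the distinction drawn in the message above, assuming Emacs 23 or later, where `string-to-multibyte' maps bytes 0x80..0xFF to characters of the eight-bit charset (#x3FFF80..#x3FFFFF). The helper name `my/byte-forms' is hypothetical.

```elisp
;; Hypothetical helper (not from the thread): show what one byte
;; becomes under the raw-byte convention and under Latin-1 decoding.
(defun my/byte-forms (byte)
  "Return (RAW LATIN1) character codes for BYTE, an integer in 128..255."
  (list
   ;; Raw-byte convention: the result of `string-to-multibyte'.
   (aref (string-to-multibyte (unibyte-string byte)) 0)
   ;; Latin-1 convention: decode the byte as ISO-8859-1 text.
   (aref (decode-coding-string (unibyte-string byte) 'iso-8859-1) 0)))

(my/byte-forms #xC0)   ; => (#x3FFFC0 #xC0)
```

Under the Latin-1 convention the byte becomes indistinguishable from a genuine U+00C0 character, which appears to be the loss of information referred to above; the raw-byte form stays distinct.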
* Re: Unibyte characters, strings, and buffers 2014-03-29 9:52 ` Andreas Schwab @ 2014-03-29 10:48 ` Eli Zaretskii 2014-03-29 11:00 ` Andreas Schwab 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 10:48 UTC (permalink / raw) To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel > From: Andreas Schwab <schwab@linux-m68k.org> > Cc: Eli Zaretskii <eliz@gnu.org>, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 10:52:29 +0100 > > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > No, I don't. I already told you how to do it: nuke unibyte buffers > > and use iso-8859-1-unix as the binary codec. > > No, you use raw-text, representing each non-ascii character in the > eight-bit charset (this is what string-to-multibyte does). Using > latin-1 would lose information. Right. So one direction would be use a normal multibyte buffer where raw bytes are represented as string-to-multibyte does. Emacs already supports that. The next question is what to do with unibyte strings, which are currently widely used for pure-ASCII text. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 10:48 ` Eli Zaretskii @ 2014-03-29 11:00 ` Andreas Schwab 2014-03-29 11:18 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 11:00 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > The next question is what to do with unibyte strings, which are > currently widely used for pure-ASCII text. You do the same, obviously (the representation wouldn't change for pure-ascii, of course). Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:00 ` Andreas Schwab @ 2014-03-29 11:18 ` Eli Zaretskii 2014-03-29 11:30 ` Andreas Schwab 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 11:18 UTC (permalink / raw) To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel > From: Andreas Schwab <schwab@linux-m68k.org> > Date: Sat, 29 Mar 2014 12:00:32 +0100 > Cc: stephen@xemacs.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > > Eli Zaretskii <eliz@gnu.org> writes: > > > The next question is what to do with unibyte strings, which are > > currently widely used for pure-ASCII text. > > You do the same, obviously (the representation wouldn't change for > pure-ascii, of course). OK, so we get rid of unibyte strings as well. Next question: what happens to implementation of encoding? It currently produces raw bytes. Should it produce eight-bit characters instead? If not, who or what will convert raw bytes into eight-bit characters, when they are inserted into a buffer or string, and who or what will convert them back when they are written to a file or sent to a process? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:18 ` Eli Zaretskii @ 2014-03-29 11:30 ` Andreas Schwab [not found] ` <83ha6hduzz.fsf@gnu.org> 0 siblings, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 11:30 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > Next question: what happens to implementation of encoding? It > currently produces raw bytes. Should it produce eight-bit characters > instead? If not, who or what will convert raw bytes into eight-bit > characters, when they are inserted into a buffer or string, and who or > what will convert them back when they are written to a file or sent to > a process? Writing out a character in the eight-bit charset will produce an eight-bit character, and vice-versa. The process is the same, just put on a lower level. The only visible difference will be the value of aref: it will produce values in the range of the eight-bit charset instead of 128-255. The challenge will be to find and fix all such assumptions. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
[parent not found: <83ha6hduzz.fsf@gnu.org>]
* Re: Unibyte characters, strings, and buffers [not found] ` <83ha6hduzz.fsf@gnu.org> @ 2014-03-29 14:30 ` Andreas Schwab 2014-03-29 14:47 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 14:30 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > Not just aref, I think: we currently pass SSDATA(s) directly to libc > I/O functions in some places. Which part of "on a lower level" did you miss? Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 14:30 ` Andreas Schwab @ 2014-03-29 14:47 ` Eli Zaretskii 0 siblings, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 14:47 UTC (permalink / raw) To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel > From: Andreas Schwab <schwab@linux-m68k.org> > Cc: stephen@xemacs.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 15:30:54 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Not just aref, I think: we currently pass SSDATA(s) directly to libc > > I/O functions in some places. > > Which part of "on a lower level" did you miss? I didn't. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 9:23 ` Stephen J. Turnbull 2014-03-29 9:52 ` Andreas Schwab @ 2014-03-29 10:42 ` David Kastrup 2014-03-29 11:07 ` Eli Zaretskii 2014-03-29 10:44 ` Eli Zaretskii ` (2 subsequent siblings) 4 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-29 10:42 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > Eli Zaretskii writes: > > > How is it different? What would be the encoding of a buffer that > > contains raw bytes? > > Depends. If it's uninterpreted bytes, "binary." If those are > undecodable bytes, they'll be the representation of raw bytes that > occurred in an otherwise sane encoded stream, and the buffer's > encoding will be the nominal encoding of that stream. It's worth pointing out that there is no such thing as a "buffer's encoding" in general in Emacs. Buffers are sequences of characters or, in the case of a unibyte buffer, bytes. Encodings come into play for import/export only but they are not an inherent property of the buffer as such but rather, for example, of the file association of the buffer. Emacs has two kinds of internal representation (what one might actually want to call "buffer encoding"): unibyte and multibyte. XEmacs, I think, has only one. The current point of contention is about changing the way of codepoint-based character operations depending on the unibyte state of the current buffer. I consider that an astonishingly bad idea since character and string operations are not tied to a particular buffer. The whole point of MULE from a rather early point of time on was to deal with only a single Unicode-based character set in all of Emacs. Making character operations change meaning based on a buffer's unibyte status means a return to the character set semantics of Emacs 19. 
I am not necessarily of the same opinion as Stephen regarding whether or not abolishing unibyte buffers is a worthwhile goal. But I am pretty sure that "unibyte" should not be bleeding over into character and string operations. A unibyte buffer or unibyte string might error out when trying to insert characters out of the range 0..255. That's an obvious consequence of the buffer's representation. If we want different semantics for case-fold-search in binary buffers, then the solution is setting a buffer-local setting of case-fold-search when opening a buffer intended to be manipulated in a binary way. But the unibyte setting of the buffer should not affect normal character and string operation semantics. It is a buffer implementation detail that should not really have a visible effect apart from making some buffer operations impossible. Whether or not we want to abolish unibyte buffer representations, we don't want this to bleed effects beyond the buffer representation. If something chooses a unibyte buffer representation for some reason, it is the responsibility of the same something to switch character operations and case-fold-search etc to something making sense in the context of its operation. That may well be through some buffer-local setting of case-fold-search etc, but it is not tied to the internal representation of the buffer contents. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
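The suggestion above, that code choosing a binary buffer should itself set case-fold-search, might look like the following sketch. The function name `my/binary-search-p' is hypothetical, and the sketch relies on case-fold-search becoming buffer-local automatically when set.

```elisp
;; Hypothetical helper (not from the thread): search raw bytes with
;; case folding switched off explicitly, independent of representation.
(defun my/binary-search-p (needle bytes)
  "Return t if unibyte string BYTES contains NEEDLE, matching case exactly."
  (with-temp-buffer
    (set-buffer-multibyte nil)    ; the representation choice
    (setq case-fold-search nil)   ; the semantics choice, made separately
    (insert bytes)
    (goto-char (point-min))
    (and (search-forward needle nil t) t)))

(my/binary-search-p "PNG" "\211PNG\r\n")   ; => t
(my/binary-search-p "png" "\211PNG\r\n")   ; => nil
```

The two settings are written as separate statements on purpose: the search behaviour is read off case-fold-search alone, whatever the buffer's internal representation.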
* Re: Unibyte characters, strings, and buffers 2014-03-29 10:42 ` David Kastrup @ 2014-03-29 11:07 ` Eli Zaretskii 2014-03-29 11:30 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 11:07 UTC (permalink / raw) To: David Kastrup; +Cc: stephen, monnier, emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: Eli Zaretskii <eliz@gnu.org>, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 11:42:43 +0100 > > The current point of contention is about changing the way of > codepoint-based character operations depending on the unibyte state of > the current buffer. The point for which this discussion was started was how to get rid of this dependency, in those few places where we have them in Emacs. > I am not necessarily of the same opinion as Stephen regarding whether or > not abolishing unibyte buffers is a worthwhile goal. But I am pretty > sure that "unibyte" should not be bleeding over into character and > string operations. Indeed, and Emacs tries very hard to contain that distinction, so that it doesn't leak out of the internals. Mostly, it succeeds, but sometimes it doesn't. > A unibyte buffer or unibyte string might error out when trying to insert > characters out of the range 0..255. We currently don't do that. Try (insert "xyz") in a unibyte buffer, where "xyz" is some non-ASCII string, and watch the fun. > If we want different semantics for case-fold-search in binary buffers, > then the solution is setting a buffer-local setting of case-fold-search > when opening a buffer intended to be manipulated in a binary way. > > But the unibyte setting of the buffer should not affect normal character > and string operation semantics. It is a buffer implementation detail > that should not really have a visible effect apart from making some > buffer operations impossible. 
But if case-fold-search is set to nil in unibyte buffers, and (as we know) buffer-local value of case-fold-search does affect functions that compare text, either because they consult case-fold-search directly or because they consult buffer-local case-table, then the unibyte setting does affect the semantics, albeit indirectly. > If something chooses a unibyte buffer representation for some reason, it > is the responsibility of the same something to switch character > operations and case-fold-search etc to something making sense in the > context of its operation. That may well be through some buffer-local > setting of case-fold-search etc, but it is not tied to the internal > representation of the buffer contents. Not that I disagree with you, but why does it matter whether some code makes a buffer unibyte or sets its case-fold-search, to achieve that goal? In both cases, that something tells Emacs to ignore case conversion, it just uses 2 different ways of saying that. If we are not going to abolish unibyte buffers, how is the difference important? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:07 ` Eli Zaretskii @ 2014-03-29 11:30 ` David Kastrup 2014-03-29 12:58 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-29 11:30 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: David Kastrup <dak@gnu.org> > >> If we want different semantics for case-fold-search in binary buffers, >> then the solution is setting a buffer-local setting of case-fold-search >> when opening a buffer intended to be manipulated in a binary way. >> >> But the unibyte setting of the buffer should not affect normal character >> and string operation semantics. It is a buffer implementation detail >> that should not really have a visible effect apart from making some >> buffer operations impossible. > > But if case-fold-search is set to nil in unibyte buffers, and (as we > know) buffer-local value of case-fold-search does affect functions > that compare text, either because they consult case-fold-search > directly or because they consult buffer-local case-table, then the > unibyte setting does affect the semantics, albeit indirectly. No, it doesn't. Correlation is not causation. Just because some operations will create a unibyte buffer as well as set a case-fold-search variable does not mean that the unibyte setting of the buffer is the cause of the case-fold-search setting in any meaningful way. >> If something chooses a unibyte buffer representation for some reason, >> it is the responsibility of the same something to switch character >> operations and case-fold-search etc to something making sense in the >> context of its operation. That may well be through some buffer-local >> setting of case-fold-search etc, but it is not tied to the internal >> representation of the buffer contents. 
> > Not that I disagree with you, but why does it matter whether some code > makes a buffer unibyte or sets its case-fold-search, to achieve that > goal? In both cases, that something tells Emacs to ignore case > conversion, it just uses 2 different ways of saying that. If we are > not going to abolish unibyte buffers, how is the difference important? Because it makes things predictable. I can take a look at the setting of case-fold-search in order to figure out what will happen regarding the case folding of searches. If I want them to occur, I can set the variable, and if I don't want them to occur, I can clear that variable. I can perfectly well do that with a let-binding, and it will work throughout the let-binding without having some buffer properties interfere. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
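The predictability argument above rests on case-fold-search being an ordinary dynamically bound variable. A small sketch, with the hypothetical helper name `my/match-p', assuming the standard case table is in effect:

```elisp
;; Hypothetical helper (not from the thread): control case folding for
;; one operation with a let-binding, as described in the message above.
(defun my/match-p (pattern text fold)
  "Return t if regexp PATTERN matches TEXT, folding case when FOLD is non-nil."
  (let ((case-fold-search fold))   ; dynamic binding, undone on exit
    (and (string-match pattern text) t)))

(my/match-p "emacs" "GNU Emacs" t)     ; => t
(my/match-p "emacs" "GNU Emacs" nil)   ; => nil
```

For ASCII text like this the binding alone decides the outcome; the thread's complaint is that, for characters above 127, the current buffer's unibyte state can change the result as well.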
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:30 ` David Kastrup @ 2014-03-29 12:58 ` Eli Zaretskii 2014-03-29 13:15 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 12:58 UTC (permalink / raw) To: David Kastrup; +Cc: stephen, monnier, emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: stephen@xemacs.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 12:30:21 +0100 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> From: David Kastrup <dak@gnu.org> > > > >> If we want different semantics for case-fold-search in binary buffers, > >> then the solution is setting a buffer-local setting of case-fold-search > >> when opening a buffer intended to be manipulated in a binary way. > >> > >> But the unibyte setting of the buffer should not affect normal character > >> and string operation semantics. It is a buffer implementation detail > >> that should not really have a visible effect apart from making some > >> buffer operations impossible. > > > > But if case-fold-search is set to nil in unibyte buffers, and (as we > > know) buffer-local value of case-fold-search does affect functions > > that compare text, either because they consult case-fold-search > > directly or because they consult buffer-local case-table, then the > > unibyte setting does affect the semantics, albeit indirectly. > > No, it doesn't. Correlation is not causation. But in this case, it is: they both stem from the same cause. > > Not that I disagree with you, but why does it matter whether some code > > makes a buffer unibyte or sets its case-fold-search, to achieve that > > goal? In both cases, that something tells Emacs to ignore case > > conversion, it just uses 2 different ways of saying that. If we are > > not going to abolish unibyte buffers, how is the difference important? > > Because it makes things predictable. 
I can take a look at the setting > of case-fold-search in order to figure out what will happen regarding > the case folding of searches. If I want them to occur, I can set the > variable, and if I don't want them to occur, I can clear that variable. The same is true about the unibyte flag. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 12:58 ` Eli Zaretskii @ 2014-03-29 13:15 ` David Kastrup 0 siblings, 0 replies; 103+ messages in thread From: David Kastrup @ 2014-03-29 13:15 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen, monnier, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: David Kastrup <dak@gnu.org> >> Cc: stephen@xemacs.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org >> Date: Sat, 29 Mar 2014 12:30:21 +0100 >> >> Eli Zaretskii <eliz@gnu.org> writes: >> >> >> From: David Kastrup <dak@gnu.org> >> > >> >> If we want different semantics for case-fold-search in binary buffers, >> >> then the solution is setting a buffer-local setting of case-fold-search >> >> when opening a buffer intended to be manipulated in a binary way. >> >> >> >> But the unibyte setting of the buffer should not affect normal character >> >> and string operation semantics. It is a buffer implementation detail >> >> that should not really have a visible effect apart from making some >> >> buffer operations impossible. >> > >> > But if case-fold-search is set to nil in unibyte buffers, and (as we >> > know) buffer-local value of case-fold-search does affect functions >> > that compare text, either because they consult case-fold-search >> > directly or because they consult buffer-local case-table, then the >> > unibyte setting does affect the semantics, albeit indirectly. >> >> No, it doesn't. Correlation is not causation. > > But in this case, it is: they both stem from the same cause. That's just word games, and pretty bad ones at that. Not interested. >> > Not that I disagree with you, but why does it matter whether some >> > code makes a buffer unibyte or sets its case-fold-search, to >> > achieve that goal? In both cases, that something tells Emacs to >> > ignore case conversion, it just uses 2 different ways of saying >> > that. If we are not going to abolish unibyte buffers, how is the >> > difference important? 
>> >> Because it makes things predictable. I can take a look at the >> setting of case-fold-search in order to figure out what will happen >> regarding the case folding of searches. If I want them to occur, I >> can set the variable, and if I don't want them to occur, I can clear >> that variable. > > The same is true about the unibyte flag. So then we have two competing settings. How does that make things predictable? I think that there is nothing missing for reasonable people to come to a decision by now, so there is nothing to be gained from me participating further in this absurd spectacle. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 9:23 ` Stephen J. Turnbull 2014-03-29 9:52 ` Andreas Schwab 2014-03-29 10:42 ` David Kastrup @ 2014-03-29 10:44 ` Eli Zaretskii 2014-03-29 11:06 ` Andreas Schwab 2014-03-29 17:01 ` Nathan Trapuzzano 4 siblings, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 10:44 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: monnier@IRO.UMontreal.CA, > emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 18:23:17 +0900 > > Eli Zaretskii writes: > > > This thread is about different issues. > > *sigh* No, it's about unibyte being a premature pessimization. *Sigh*, indeed. > > > > Likewise examples from XEmacs, since the differences in this area > > > > between Emacs and XEmacs are substantial, and that precludes useful > > > > comparison. > > > > > > "It works fine" isn't useful information? > > > > No, because it describes a very different implementation. > > Not at all. The implementation of multibyte buffers is very similar. Says you. But I cannot talk intelligently about that, because I don't know the details. And it sounds like you cannot talk about the issue at hand, because you don't know the details of Emacs handling of raw bytes. This discussion is about Emacs's unibyte buffers and strings, so it isn't going to yield any useful insights by you talking about XEmacs implementation without knowing what is Emacs's one, and me the other way around. That is why I asked not to bring the XEmacs implementation into this discussion. > What's different is that Emacs complifusticates matters by also having > a separate implementation of unibyte buffers, and then basically > making a union out of the two structures called "buffer". XEmacs > simply implements binary as a particular coding system in and out of > multibyte buffers. In Emacs, a coding system is only consulted when a buffer is read or written. 
If you also consult it when inserting text into it, or when deciding whether 'downcase' should or shouldn't change the character from the buffer, then you still have unibyte buffers in disguise, you just call them "buffers whose coding system is 'binary'". > > Then I guess you will have to suggest how to implement this without > > unibyte buffers. > > No, I don't. I already told you how to do it: nuke unibyte buffers > and use iso-8859-1-unix as the binary codec. "Codec" is XEmacs terminology, I don't understand what that means in practice, when applied to Emacs. If it means the same as coding system, then how can iso-8859-1-unix byte-stream be decoded into, say, Cyrillic characters (assuming the byte-stream was actually UTF-8 encoded Cyrillic text)? > Then you're done, except for those applications that actually make > the mistake of using unibyte text explicitly. What does "explicitly" mean in this context? Can you show an example of "explicit" vs "implicit" use of unibyte text? > > > > In such unibyte buffers, we need a way to represent raw bytes, which > > > > are parts of as yet un-decoded byte sequences that represent encoded > > > > characters. > > > > > > Again, I disagree. Unibyte is a design mistake, and unnecessary. > > > > Then what do you call a buffer whose "text" is encoded? > > "Binary." That's just a different name. If "binary" buffers are treated differently from any other kind, when processing characters from them, then they are just unibyte buffers in disguise. > > > XEmacs proves it -- we use (essentially) the same code in many > > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does. > > > > I asked you not to bring XEmacs into the discussion, because I cannot > > talk intelligently about its implementation. If you insist on doing > > that, this discussion is futile from my POV. > > The whole point here is that exactly what the XEmacs implementation is > *irrelevant*. 
The point is that we implement the same API as GNU Emacs > without unibyte buffers or the annoyances and incoherence that come > with them. Without knowing the details of the implementation, it is impossible to talk about merits and demerits of each design and implementation. Therefore, bringing into this discussion XEmacs implementation without describing it in all detail does not help. Excuse me, but I don't believe you when you say you have no problems at all in this area, just because you say that. If you want that to count, you will have to delve into the gory details, and then show why and how the problems are avoided. > > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as > > > no-ops forever > > > > I wasn't talking about those functions. I was talking about the need > > to have unibyte buffers and strings. > > There is no "need for unibyte." You're simply afraid to throw it away. I'm not afraid of anything of the kind. This discussion was started in order to try figuring out how to get rid of unibyte. If you want to help, offer specific technical solutions to specific issues we have in Emacs. Copying the XEmacs implementation, even if we were sure it resolves the problem (and I'm not at all sure), is impractical. > > How is it different? What would be the encoding of a buffer that > > contains raw bytes? > > Depends. If it's uninterpreted bytes, "binary." If those are > undecodable bytes, they'll be the representation of raw bytes that > occurred in an otherwise sane encoded stream, and the buffer's > encoding will be the nominal encoding of that stream. If you want to > ensure sanity of output, then you will use an output encoding that > errors on rawbytes, and a program that cleans up those rawbytes in a > way appropriate for the application. If you expect the next program > in the pipeline to handle them, then you use a variant encoding that > just encodes them back to the original undecodable rawbytes. 
That's exactly what Emacs does, so I think you rather agree to what I originally described as requirements and you said you disagreed. > > But that's ridiculous: a raw byte is just a single byte, so > > string-bytes should return a meaningful value for a string of such > > bytes. > > `string-bytes' should not exist. As I wrote earlier: > > > > You don't need `string-bytes' unless you've exposed internal > > > representation to Lisp, then you desperately need it to write correct > > > code (which some users won't be able to do anyway without help, cf. > > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk). So > > > *don't expose internal representation* (and the hammer marks on users' > > > foreheads will disappear in due time, and the headaches even faster!) > > > > How else would you know how many bytes will a string take on disk? > > How does `string-bytes' help? It returns that information. > You don't know what encoding will be used to write them Yes, I do know: the buffer's coding system tells me. And if text is already encoded, then I know no additional encoding will be applied, and whatever string-bytes tells me is it. > If you use iso-8859-1-unix as the coding system, then "bytes on the > wire" == "characters in the string". No problema, señor. Not if you want to recode the string in, say, UTF-8. When you shuffle text from one buffer to another, Emacs does not track which encoding that text came from, so the iso-8859-1-unix information is lost. > > > > So here you have already at least 2 valid reasons > > > > > > No, *you* have them. XEmacs works perfectly well without them, using > > > code written for Emacs. > > > > XEmacs also works "perfectly well" without bidi and other stuff. That > > doesn't help at all in this discussion. > > You're right: because XEmacs doesn't handle bidi, it's irrelevant to > this discussion. Why did *you* bring it up? To show how your way of arguing doesn't help. 
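The byte-count disagreement above can be illustrated with a minimal Python sketch. This is an illustration only, not Emacs code: it assumes Python's "latin-1" codec as a stand-in for the one-to-one iso-8859-1-unix "binary" coding system mentioned in the message, and the variable names are hypothetical. It shows that bytes and characters match one-to-one under Latin-1, and that the count stops matching once the same characters are recoded as UTF-8.

```python
# Illustration only: Python's "latin-1" codec stands in for a one-to-one
# binary coding system; this is not the Emacs implementation.
wire = bytes(range(256))            # every possible byte value once
text = wire.decode("latin-1")       # byte N becomes code point N

# One-to-one mapping: same length, and the round trip is lossless.
assert len(text) == len(wire)
assert text.encode("latin-1") == wire

# Recoding the same characters as UTF-8 changes the size on disk:
# code points 0..127 take one byte, 128..255 take two bytes each.
recoded = text.encode("utf-8")
print(len(wire), len(recoded))      # 256 384
```

So a character count only predicts the size on disk while the text is known to be written back with the same one-to-one coding system.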
> What is relevant is how to represent byte streams in Emacs. The > obvious non-unibyte way is a one-to-one mapping of bytes to Unicode > characters. It is *extremely* convenient if the first 128 of those > bytes correspond to the ASCII coded character set, because so many > wire protocols use ASCII "words" syntactically. The other 128 don't > matter much, so why not just use the extremely convenient Latin-1 set > for them? Because there are situations when the effect of this is not what Lisp programs and users expect. Case folding and case-insensitive search is one of them, although not the only one. > > > > If we want to get rid of unibyte, Someone(TM) should present a > > > > complete practical solution to those two problems (and a few > > > > others), otherwise, this whole discussion leads nowhere. > > > > > > Complete practical solution: "They are non-problems, forget about > > > them, and rewrite any code that implies you need to remember them." > > > > That's a slogan, not a solution. > > No, it is a precise high-level design for a solution. We need a low-level design, not high-level. > > > Fortunately for me, I am *intimately* familiar with XEmacs internals, > > > and therefore RMS won't let me write this code for Emacs. :-) > > > > Then perhaps you shouldn't be part of this discussion. > > Since I've been invited to leave, I will. My point is sufficiently > well-made for open minds to deal with the details. No, it isn't made at all. I tried to explain above why I think so. > > > Which is precisely why we're having this thread. If there were *no* > > > Lisp-visible unibyte buffers or strings, it couldn't possibly matter. > > > > And if I had $5M on my bank account, I'd probably be elsewhere > > enjoying myself. IOW, how are "if there were no..." arguments useful? > > Because they point out that this thread wouldn't have happened with a > different design. But we _are_ with this design, and have been using it for the last 15 years. 
Good luck believing that someone will come and replace the existing design with something radically different. There wasn't a comparable revolution in Emacs since 2001, so I largely doubt that expecting another one any time soon is wise. We don't even have people aboard capable of making such changes. The only practical way of advancing in this area is by low-level changes that don't throw away the high-level design. That is why precisely describing the details of every proposal is so important: without them, any proposal becomes impractical and thus not interesting. > > This is not a discussion about whose model is better, Emacs or XEmacs. > > This is a discussion of whether and how can we remove unibyte buffers, > > strings, and characters from Emacs. You must start by understanding > > how are they used in Emacs 24, and then suggest practical ways to > > change that. > > Well, I would have said "tell me about it" And I would have replied "sorry, I have no time for that". The sources are there to be studied, and you are welcome to ask questions about stuff you don't understand just by looking at the sources. There cannot be any useful discussion of these matters without thorough understanding of how Emacs stores characters and raw bytes in its buffers, and where and how the unibyte nuisance comes into play. > I will say nothing you've said so far even hints at issues with > simply removing the whole concept of unibyte. I started by describing some basic requirements that lead to unibyte. You refuse to even acknowledge those requirements. How can we continue a useful discussion when we don't even agree about the basics? To convince me, you need first to take my view of the issue, something that you refuse to do. I cannot begin to explain "the issues" to you if you don't even agree with my starting point. > > In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers > > and characters. If you use it, you get what it does. 
> > And I'm telling you those subtleties are a *problem* that solves > nothing that an Emacs without a unibyte concept can't handle fine. You keep saying that, but without the details (which you cannot or won't provide), these are just slogans with little technical value. > > If the buffer is not marked specially, how will I know to avoid > > [inserting non-Latin-1 characters in a "binary" buffer]? > > All experience with XEmacs says *you* (the human programmer) *won't* > have any problem avoiding that. As a programmer, if you're working > with a binary protocol, you will be using binary buffers and strings, > and byte-sized integers. If you accidentally mix things up, you'll > quickly get an encoding error on output (since the binary codec can't > output non-Latin-1 Unicode characters). On this level, it sounds like XEmacs does things exactly like Emacs does, it just calls them differently. If so, you have the same problems; e.g., what will 'downcase-word' do in a "binary" buffer, when it sees a "character" whose value is 192? > It's just not a problem in practice, and that's not why unibyte was > introduced in Emacs anyway. Unibyte was introduced because some folks > thought working with variable-width-encoded buffers was too > inefficient so they wanted access to a flat buffer of bytes. That's > why buffer-as-{uni,multi}byte type punning was included. Maybe so, but we are now 15 years after that, so history is only marginally important. What _is_ important is how to get rid of the issues we have, without a complete redesign. > > And I still don't see how this is relevant. You are describing a > > marginally valid use case, while I'm talking about use cases we meet > > every day, and which must be supported, e.g. when some Lisp wants to > > decode or encode text by hand. > > You use `encode-coding-region' and `decode-coding-region', same as you > do now. Do you seriously think that XEmacs doesn't support those use > cases? 
"Support" doesn't mean "there're no issues". Emacs supports them as well, you know. That fact in itself doesn't help at all in this discussion, because we all know (I hope) that at this "slogan level" things work very well for quite some time. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 9:23 ` Stephen J. Turnbull ` (2 preceding siblings ...) 2014-03-29 10:44 ` Eli Zaretskii @ 2014-03-29 11:06 ` Andreas Schwab 2014-03-29 11:12 ` Eli Zaretskii 2014-03-29 15:37 ` Stephen J. Turnbull 2014-03-29 17:01 ` Nathan Trapuzzano 4 siblings, 2 replies; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 11:06 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > *sigh* No, it's about unibyte being a premature pessimization. Unibyte is a pure space optimisation. Everything else should work as if all bytes in the range 128-255 are decoded in the eight-bit charset. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:06 ` Andreas Schwab @ 2014-03-29 11:12 ` Eli Zaretskii 2014-03-29 16:11 ` Stephen J. Turnbull 2014-03-29 15:37 ` Stephen J. Turnbull 1 sibling, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 11:12 UTC (permalink / raw) To: Andreas Schwab; +Cc: stephen, monnier, emacs-devel > From: Andreas Schwab <schwab@linux-m68k.org> > Cc: Eli Zaretskii <eliz@gnu.org>, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 12:06:31 +0100 > > Unibyte is a pure space optimisation. I think it is (or at least was) also a speed optimization. Reading or writing a huge buffer full of eight-bit characters might be significantly slower if they are in their multibyte representation. Perhaps we should measure that. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:12 ` Eli Zaretskii @ 2014-03-29 16:11 ` Stephen J. Turnbull 0 siblings, 0 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 16:11 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Andreas Schwab, monnier, emacs-devel Eli Zaretskii writes: > I think [unibyte] is (or at least was) also a speed optimization. It is. Random access to position N in a multibyte buffer is O(N) on average, and O(log N) with a position cache as used in XEmacs and, I believe, in GNU Emacs too (haven't looked at GNU Emacs's implementation of buffer movement since about v22, though). This slows down mbox-based MUAs like VM and RMail quite a bit if people use 8-bit or binary content-transfer-encodings in their messages. > Reading or writing a huge buffer full of eight-bit characters might > be significantly slower if they are in their multibyte > representation. Perhaps we should measure that. This isn't true (Ben did measurements, as have the Python folks). Coding systems are way faster than I/O. ^ permalink raw reply [flat|nested] 103+ messages in thread
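The random-access cost described above can be sketched in Python over UTF-8 bytes. This is a minimal illustration under stated assumptions, not the cache design of XEmacs or GNU Emacs: the names `char_to_byte_linear` and `CheckpointCache` are hypothetical, and the cache is a simple fixed-stride checkpoint table, which bounds the scan by the stride rather than giving the O(log N) behaviour mentioned in the message.

```python
def char_to_byte_linear(buf: bytes, n: int) -> int:
    """Byte offset of character n in UTF-8 data, found by a full scan."""
    count = 0
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:          # not a continuation byte: a new character
            if count == n:
                return i
            count += 1
    if count == n:                      # position just past the last character
        return len(buf)
    raise IndexError(n)


class CheckpointCache:
    """Hypothetical position cache: remembers the byte offset of every
    `step`-th character so a lookup scans at most `step` characters."""

    def __init__(self, buf: bytes, step: int = 64):
        self.buf = buf
        self.step = step
        self.offsets = []
        count = 0
        for i, b in enumerate(buf):
            if (b & 0xC0) != 0x80:
                if count % step == 0:
                    self.offsets.append(i)
                count += 1
        self.nchars = count

    def char_to_byte(self, n: int) -> int:
        if not 0 <= n <= self.nchars:
            raise IndexError(n)
        if n == self.nchars:
            return len(self.buf)
        i = self.offsets[n // self.step]    # nearest checkpoint at or before n
        for _ in range(n % self.step):      # walk forward one character at a time
            i += 1
            while (self.buf[i] & 0xC0) == 0x80:
                i += 1
        return i


data = "a\u00f1\u20acb".encode("utf-8")     # 1 + 2 + 3 + 1 bytes
print(char_to_byte_linear(data, 3))         # 6
print(CheckpointCache(data, step=2).char_to_byte(3))  # 6
```

In a fixed-width (unibyte) buffer the same lookup is a plain index, which is the speed argument being made.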
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:06 ` Andreas Schwab 2014-03-29 11:12 ` Eli Zaretskii @ 2014-03-29 15:37 ` Stephen J. Turnbull 2014-03-29 15:55 ` David Kastrup 2014-03-29 15:58 ` Andreas Schwab 1 sibling, 2 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 15:37 UTC (permalink / raw) To: Andreas Schwab; +Cc: Eli Zaretskii, monnier, emacs-devel Andreas Schwab writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > *sigh* No, it's about unibyte being a premature pessimization. > > Unibyte is a pure space optimisation. It may be a space optimization, but it's hardly pure. Else this discussion wouldn't be happening. And `string-as-unibyte' exposes the internal representation of strings to Lisp. > Everything else should work as if all bytes in the range 128-255 > are decoded in the eight-bit charset. There seem to be conflicting opinions about that, and I would certainly disagree as there are scads of European charsets that happily fit into bytes. I see no reason why character operations (such as case conversion) shouldn't work transparently on bytes in GR interpreted as the corresponding Latin-1 (or any ISO Latin) charset -- with a little extra metadata in (internal unibyte) buffers and strings to indicate the charset implied. (This charset is independent of the various coding systems associated with buffers; it only says how to interpret a byte as a character in operations on characters in buffers.) ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 15:37 ` Stephen J. Turnbull @ 2014-03-29 15:55 ` David Kastrup 2014-03-29 16:28 ` Stephen J. Turnbull 2014-03-30 0:24 ` Richard Stallman 2014-03-29 15:58 ` Andreas Schwab 1 sibling, 2 replies; 103+ messages in thread From: David Kastrup @ 2014-03-29 15:55 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > Andreas Schwab writes: > > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > > > *sigh* No, it's about unibyte being a premature pessimization. > > > > Unibyte is a pure space optimisation. > > It may be a space optimization, but it's hardly pure. Else this > discussion wouldn't be happening. And `string-as-unibyte' exposes the > internal representation of strings to Lisp. > > > Everything else should work as if all bytes in the range 128-255 > > are decoded in the eight-bit charset. > > There seem to be conflicting opinions about that, and I would > certainly disagree as there are scads of European charsets that > happily fit into bytes. That's not what unibyte buffers are for. They are for byte streams, not characters. You would not want to edit a unibyte buffer, for example, by inserting text and stuff. Now for byte stream manipulation, code points other than 0..255 are a nuisance. Certainly a larger nuisance than having to clear case-fold-search if you really want to do a byte search. > I see no reason why character operations (such as case conversion) > shouldn't work transparently on bytes in GR interpreted as the > corresponding Latin-1 (or any ISO Latin) charset -- with a little > extra metadata in (internal unibyte) buffers and strings to indicate > the charset implied. (This charset is independent of the various > coding systems associated with buffers; it only says how to interpret > a byte as a character in operations on characters in buffers.) We have that "extra metadata", it is the unibyte flag. 
But I consider it a mistake to use it for anything but "character codes in this buffer happen to range from 0..255 rather than 0..1000000 or whatever". And since Unicode 128..255 happens to be the latin-1 plane where the latin-1 plane is defined as all, this will mean that the result will behave like the latin-1 plane. Exactly because Emacs has _one_ underlying character set which happens to be Unicode. Which does not mean that it would be a good idea to use unibyte buffers/strings for actual text that happens to be Latin-1 only. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
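Why case operations matter for byte data read as Latin-1 can be shown with a short Python sketch. It is an illustration only (Python's `bytes` and `str` types, not Emacs buffers), and the `payload` value is made up for the example. It contrasts ASCII-only case folding on bytes with Unicode case folding on the same bytes decoded as Latin-1.

```python
payload = bytes([0x41, 0xC0, 0xFF])          # 'A', then two non-ASCII byte values

# Byte-oriented lowercasing only touches ASCII letters.
print(payload.lower())                       # b'a\xc0\xff'

# Reading the same bytes as Latin-1 text also folds 0xC0 to 0xE0,
# which silently changes the binary data.
as_text = payload.decode("latin-1")
print(as_text.lower().encode("latin-1"))     # b'a\xe0\xff'

# Uppercasing can even leave the 0..255 range: U+00FF maps to U+0178.
print(hex(ord("\xff".upper())))              # 0x178
```

Whether the second behaviour is wanted depends on whether the buffer holds text or a byte stream, which is the point under dispute in the thread.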
* Re: Unibyte characters, strings, and buffers 2014-03-29 15:55 ` David Kastrup @ 2014-03-29 16:28 ` Stephen J. Turnbull 2014-03-29 17:00 ` David Kastrup 2014-03-29 17:08 ` Andreas Schwab 2014-03-30 0:24 ` Richard Stallman 1 sibling, 2 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 16:28 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > That's not what unibyte buffers are for. They are for byte > streams, not characters. You would not want to edit a unibyte > buffer, for example, by inserting text and stuff. I beg to differ. I would like to edit RFC 822 headers for HTTP, SMTP, and other such wire protocols. This is precisely the use case that convinced van Rossum to restore %-formatting for bytes in Python 3.5 (to be released in about 18 months). > We have that "extra metadata", it is the unibyte flag. Yes, I know, but my point is that it should be purely for use of the internal implementation, and probably restricted to the C level. > But I consider it a mistake to use it for anything but "character > codes in this buffer happen to range from 0..255 rather than > 0..1000000 or whatever". I sympathize, though I think it's overkill for Emacs to have separate bytes and text types visible at the Lisp level. FWIW, that's a big step toward the design approach taken by Python 3, which has both bytes and text, but you can't mix them without an explicit encoding or decoding step, and the internal encoding of text is not exposed to Python functions at all. > And since Unicode 128..255 happens to be the latin-1 plane where the > latin-1 plane is defined as all, this will mean that the result will > behave like the latin-1 plane. That's not necessarily true. It just requires a slightly more complex design, which would be appropriate for Emacsen (as compared to Python). ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 16:28 ` Stephen J. Turnbull @ 2014-03-29 17:00 ` David Kastrup 2014-03-30 2:05 ` Stephen J. Turnbull 2014-03-29 17:08 ` Andreas Schwab 1 sibling, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-29 17:00 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > David Kastrup writes: [...] > > And since Unicode 128..255 happens to be the latin-1 plane where > > the latin-1 plane is defined as all, this will mean that the result > > will behave like the latin-1 plane. > > That's not necessarily true. Sure. It depends on whether you value your users' sanity. > It just requires a slightly more complex design, which would be > appropriate for Emacsen (as compared to Python). If the "slightly more complexity" hits in unexpected places, it's going to end up a liability. Having more than one charset to work with if characters themselves don't contain a charset specification is affecting a load of stuff that can then conceivably work in more than one way. Unicode meaningfully uses values 128..255, Bytes meaningfully use values 128..255. When one wants to work without surprises in both cases, converting strings to characters will use 128..255 in either case. Differentiating is, of course, possible. One reasonably cute choice would be mapping bytes (as opposed to characters) 128..255 to integers -128..-1. But if you are talking about case-fold-search semantics, you'll actually need to remap 0..127 as well (they are more relevant than 128..255). And then things get really ugly. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 17:00 ` David Kastrup @ 2014-03-30 2:05 ` Stephen J. Turnbull 2014-03-30 9:01 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-30 2:05 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > It just requires a slightly more complex design, which would be > > appropriate for Emacsen (as compared to Python). > > If the "slightly more complexity" hits in unexpected places, it's going > to end up a liability. Having more than one charset to work with if > characters themselves don't contain a charset specification is affecting > a load of stuff that can then conceivably work in more than one > way. I'm a little smarter than that. The design I have in mind would be transparent. Maybe it wouldn't work; maybe it would be inefficient. But one thing it wouldn't do is present a charset other than Unicode to Lisp. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 2:05 ` Stephen J. Turnbull @ 2014-03-30 9:01 ` David Kastrup 2014-03-30 12:13 ` Stephen J. Turnbull 2014-03-30 14:25 ` Andreas Schwab 0 siblings, 2 replies; 103+ messages in thread From: David Kastrup @ 2014-03-30 9:01 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > David Kastrup writes: > > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > > It just requires a slightly more complex design, which would be > > > appropriate for Emacsen (as compared to Python). > > > > If the "slightly more complexity" hits in unexpected places, it's going > > to end up a liability. Having more than one charset to work with if > > characters themselves don't contain a charset specification is affecting > > a load of stuff that can then conceivably work in more than one > > way. > > I'm a little smarter than that. Building on smartness is relying on a limited resource. It's not always easy to find wingmen (pun intended but unworkable). > The design I have in mind would be transparent. I don't think it gets much more transparent than "unibyte flag only marks the valid Unicode-in-Emacs character range". I'm for the range 0..255, Andreas for something like 0..127 U 4194176..4194303 which I find cumbersome for little return. > Maybe it wouldn't work; maybe it would be inefficient. But one thing > it wouldn't do is present a charset other than Unicode to Lisp. Neither does the above. Abolishing unibyte just means that buffers/strings have only one possible character range. That does not really give any "transparency" per se from the Lisp level. The interesting level is the C level. You need a byte stream representation in C at some point anyway, and not being able to call this representation either "string" or "buffer" may be neat in some manners but will end up cumbersome in others. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 9:01 ` David Kastrup @ 2014-03-30 12:13 ` Stephen J. Turnbull 2014-03-30 14:25 ` Andreas Schwab 1 sibling, 0 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-30 12:13 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > I don't think it gets much more transparent than "unibyte flag only > marks the valid Unicode-in-Emacs character range". I'm for the > range 0..255, It's easy to be more transparent in that case: no unibyte flag. However, that delays detection of out-of-range characters to encoding rather than the insert step. > Andreas for something like 0..127 U 4194176..4194303 which > I find cumbersome for little return. Agreed. If bytes are going to be non-characters, having a half-ASCII type is just going to cause surprises when US English apps get internationalized. > > Maybe it wouldn't work; maybe it would be inefficient. But one > > thing it wouldn't do is present a charset other than Unicode to > > Lisp. > > Neither does the above. Abolishing unibyte just means that > buffers/strings have only one possible character range. That's not really true. Encoding and decoding will still constrain ranges; as pointed out above, it delays detection on the one hand, on the other avoids spurious errors when the user really does want to add characters outside of the prespecified range for some reason. > That does not really give any "transparency" per se from the Lisp > level. I disagree, based primarily on the experience of XEmacs that we can do everything (with characters and bytes) that Emacs does[1], without randomly injecting new bugs due to lack of unibyte that I can recall. (Other bugs, yes, but bugs due to adapting code that used unibyte to XEmacs where there is no unibyte, no.) > The interesting level is the C level. 
You need a byte stream > representation in C at some point anyway, and not being able to > call this representation either "string" or "buffer" may be neat in > some manners but will end up cumbersome in others. I don't see why you need that, actually. Of course you need C level streams for I/O, but I don't see why it needs to persist past decoding into a buffer or string. Footnotes: [1] OK, we don't have a representation of "undecodable bytes". But that's not conceptually hard, just tedious enough that nobody's done it yet. ^ permalink raw reply [flat|nested] 103+ messages in thread
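For comparison, a representation of undecodable bytes of the kind the footnote describes exists in Python as the "surrogateescape" error handler. The sketch below is only an analogy, not a description of how Emacs or XEmacs store raw bytes; the sample stream is made up. Stray bytes survive decoding as reserved code points, encode back to the original bytes, and are rejected by a strict encoder.

```python
stream = b"caf\xc3\xa9 \xff\xfe end"         # valid UTF-8 plus two stray bytes

# Undecodable bytes 0xFF and 0xFE become lone surrogates U+DCFF and U+DCFE.
text = stream.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
print(escaped)                               # ['0xdcff', '0xdcfe']

# The matching encoder restores the original byte stream exactly.
assert text.encode("utf-8", errors="surrogateescape") == stream

# A strict encoder refuses them, so bad data is caught on output.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                             # False
```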
* Re: Unibyte characters, strings, and buffers 2014-03-30 9:01 ` David Kastrup 2014-03-30 12:13 ` Stephen J. Turnbull @ 2014-03-30 14:25 ` Andreas Schwab 2014-03-30 15:05 ` David Kastrup 1 sibling, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-30 14:25 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup <dak@gnu.org> writes: > I don't think it gets much more transparent than "unibyte flag only > marks the valid Unicode-in-Emacs character range". I'm for the range > 0..255, Andreas for something like 0..127 U 4194176..4194303 which > I find cumbersome for little return. Before decoding there is no charset information yet, so using anything other than the eight-bit charset would be wrong. After decoding, the eight-bit charset is used only for undecodable bytes. That preserves the distinction between encoded and decoded strings/buffers (except for the uninteresting trivial ASCII decoding) in a world without unibyte flag. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
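The two candidate ranges can be compared with a small Python sketch. The function names are hypothetical, and the offset is derived purely from the numbers quoted in the message (4194176 corresponds to byte 128); nothing here is taken from the Emacs sources.

```python
# Offset derived from the quoted range 4194176..4194303 for bytes 128..255.
EIGHT_BIT_OFFSET = 4194176 - 128


def byte_as_latin1(b: int) -> int:
    """Range 0..255: a byte is the character with the same code."""
    return b


def byte_as_eight_bit(b: int) -> int:
    """Range 0..127 U 4194176..4194303: high bytes get codes of their own."""
    return b if b < 128 else b + EIGHT_BIT_OFFSET


A_GRAVE = ord("\u00c0")                      # 192, a letter with a lowercase pair

# Under the Latin-1 view, byte 0xC0 and the letter are the same code.
print(byte_as_latin1(0xC0) == A_GRAVE)       # True

# Under the eight-bit view they stay distinct, so undecoded data is
# never mistaken for text.
print(byte_as_eight_bit(0xC0) == A_GRAVE)    # False
print(byte_as_eight_bit(0x80), byte_as_eight_bit(0xFF))  # 4194176 4194303

# ASCII is identical in both views, which is the "trivial" overlap.
print(byte_as_eight_bit(0x41) == byte_as_latin1(0x41) == ord("A"))  # True
```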
* Re: Unibyte characters, strings, and buffers 2014-03-30 14:25 ` Andreas Schwab @ 2014-03-30 15:05 ` David Kastrup 2014-03-30 15:39 ` Andreas Schwab 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-30 15:05 UTC (permalink / raw) To: Andreas Schwab; +Cc: emacs-devel Andreas Schwab <schwab@linux-m68k.org> writes: > David Kastrup <dak@gnu.org> writes: > >> I don't think it gets much more transparent than "unibyte flag only >> marks the valid Unicode-in-Emacs character range". I'm for the range >> 0..255, Andreas for something like 0..127 U 4194176..4194303 which >> I find cumbersome for little return. > > Before decoding there is no charset information yet, so using anything > other than the eight-bit charset would be wrong. When "right" does not buy you anything but trouble, why bother? > After decoding, the eight-bit charset is used only for undecodable > bytes. That preserves the distinction between encoded and decoded > strings/buffers (except for the uninteresting trivial ASCII decoding) > in a world without unibyte flag. The "uninteresting trivial ASCII" listens to case-fold-search just as much as the latin-1 code page does. So being "right" for half of the coding range does not really buy anything. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 15:05 ` David Kastrup @ 2014-03-30 15:39 ` Andreas Schwab 0 siblings, 0 replies; 103+ messages in thread From: Andreas Schwab @ 2014-03-30 15:39 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup <dak@gnu.org> writes: > The "uninteresting trivial ASCII" listens to case-fold-search just as > much as the latin-1 code page does. So being "right" for half of the > coding range does not really buy anything. It doesn't matter, undecoded is just a brief intermediate state most of the time. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 16:28 ` Stephen J. Turnbull 2014-03-29 17:00 ` David Kastrup @ 2014-03-29 17:08 ` Andreas Schwab 1 sibling, 0 replies; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 17:08 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: David Kastrup, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > I beg to differ. I would like to edit RFC 822 headers for HTTP, SMTP, > and other such wire protocols. Nothing stops you from editing eight-bit characters. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 15:55 ` David Kastrup 2014-03-29 16:28 ` Stephen J. Turnbull @ 2014-03-30 0:24 ` Richard Stallman 2014-03-30 3:32 ` Stefan Monnier 1 sibling, 1 reply; 103+ messages in thread From: Richard Stallman @ 2014-03-30 0:24 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Is there any need, nowadays, for a unibyte character to imply a character set? Originally unibyte buffers were meant as a backward compatibility feature for old Emacs versions in which all buffers were unibyte. Nowadays, I think we use unibyte buffers mainly (perhaps exclusively) for buffers whose contents are largely not characters at all. For those buffers, there is no reason to interpret the contents as characters in any particular way. We could consider them as bytes, and nothing else. This means converting those bytes to characters could be done by explicit operations where you would specify what sort of conversion you want. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 0:24 ` Richard Stallman @ 2014-03-30 3:32 ` Stefan Monnier 2014-03-30 15:13 ` Richard Stallman 0 siblings, 1 reply; 103+ messages in thread From: Stefan Monnier @ 2014-03-30 3:32 UTC (permalink / raw) To: Richard Stallman; +Cc: David Kastrup, emacs-devel > For those buffers, there is no reason to interpret the contents > as characters in any particular way. We could consider them as > bytes, and nothing else. That's pretty much what we do nowadays already. Stefan ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 3:32 ` Stefan Monnier @ 2014-03-30 15:13 ` Richard Stallman 0 siblings, 0 replies; 103+ messages in thread From: Richard Stallman @ 2014-03-30 15:13 UTC (permalink / raw) To: Stefan Monnier; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > For those buffers, there is no reason to interpret the contents > as characters in any particular way. We could consider them as > bytes, and nothing else. That's pretty much what we do nowadays already. If we make that 100% true, we could disconnect the multibyte flag from operations (including case conversion) that pertain to text rather than bytes. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 15:37 ` Stephen J. Turnbull 2014-03-29 15:55 ` David Kastrup @ 2014-03-29 15:58 ` Andreas Schwab 2014-03-29 16:35 ` Stephen J. Turnbull 1 sibling, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 15:58 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > There seem to be conflicting opinions about that, and I would > certainly disagree as there are scads of European charsets that > happily fit into bytes. Unibyte strings are about raw bytes, not characters. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 15:58 ` Andreas Schwab @ 2014-03-29 16:35 ` Stephen J. Turnbull 2014-03-29 17:06 ` Andreas Schwab 0 siblings, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 16:35 UTC (permalink / raw) To: Andreas Schwab; +Cc: Eli Zaretskii, monnier, emacs-devel Andreas Schwab writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > There seem to be conflicting opinions about that, and I would > > certainly disagree as there are scads of European charsets that > > happily fit into bytes. > > Unibyte strings are about raw bytes, not characters. Obviously false, since bytes 0-127 are evidently interpreted as ASCII at need. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 16:35 ` Stephen J. Turnbull @ 2014-03-29 17:06 ` Andreas Schwab 0 siblings, 0 replies; 103+ messages in thread From: Andreas Schwab @ 2014-03-29 17:06 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > Andreas Schwab writes: > > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > > > > > There seem to be conflicting opinions about that, and I would > > > certainly disagree as there are scads of European charsets that > > > happily fit into bytes. > > > > Unibyte strings are about raw bytes, not characters. > > Obviously false, since bytes 0-127 are evidently interpreted as ASCII > at need. That does not contradict my statement. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 9:23 ` Stephen J. Turnbull ` (3 preceding siblings ...) 2014-03-29 11:06 ` Andreas Schwab @ 2014-03-29 17:01 ` Nathan Trapuzzano 2014-03-29 17:08 ` Nathan Trapuzzano 2014-03-29 17:16 ` David Kastrup 4 siblings, 2 replies; 103+ messages in thread From: Nathan Trapuzzano @ 2014-03-29 17:01 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > What is relevant is how to represent byte streams in Emacs. The > obvious non-unibyte way is a one-to-one mapping of bytes to Unicode > characters. It is *extremely* convenient if the first 128 of those > bytes correspond to the ASCII coded character set, because so many > wire protocols use ASCII "words" syntactically. The other 128 don't > matter much, so why not just use the extremely convenient Latin-1 set > for them? Sorry if someone brought this up already, but one reason raw bytes shouldn't be represented as Latin-1 characters is that the "raw bytes"-ness would be lost when writing them back to disk if the stream also contained characters outside the Latin-1 range. For example, say we decode a stream of raw bytes as utf8, but that the stream contains some non-utf8 sequences. IIUC, Emacs will interpret those as "raw bytes", so that when it goes to encode the string to write it back, they will be written back verbatim. Whereas, if they had been interpreted as Latin-1 characters, they would get written back as the UTF8 equivalents. Hence you have the odd situation where you can decode and then encode and end up with a different string. Someone brought up Python in another post. Python (version 3 at least) does the same thing when, e.g., interpreting filenames. 
If you pass a string (_not_ bytes) to os.listdir, but the contents of the directory can't all be decoded as utf-8, it will return strings (_not_ bytes) where the non-utf8 sequences are Python-specific "characters" (in the Unicode private use areas I believe) representing "raw bytes", i.e. entities to be written back to the disk as the same raw sequences that were read therefrom. ^ permalink raw reply [flat|nested] 103+ messages in thread
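The round-trip argument above can be checked directly in Python 3. The mechanism there is the `surrogateescape` error handler (PEP 383); it maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF rather than to a private-use code point. The sketch below is only an illustration: the helper names and the sample bytes are made up for the example, and `conflate_with_latin1` models the "just treat raw bytes as Latin-1 characters" idea, not anything Emacs or Python actually does.

```python
# Illustrative sketch; helper names and sample data are hypothetical.

def decode_preserving(raw: bytes) -> str:
    """Decode as UTF-8, keeping undecodable bytes as lone surrogates (PEP 383)."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_preserving(text: str) -> bytes:
    """Encode as UTF-8, turning escaped surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def conflate_with_latin1(raw: bytes) -> str:
    """Decode as UTF-8, but represent undecodable bytes as Latin-1 characters."""
    out = []
    for ch in decode_preserving(raw):
        cp = ord(ch)
        # U+DC80..U+DCFF are the escaped bytes 0x80..0xFF.
        out.append(chr(cp - 0xDC00) if 0xDC80 <= cp <= 0xDCFF else ch)
    return "".join(out)


# Valid UTF-8 for "café", then a stray 0xE9 byte that is not valid UTF-8 here.
raw = b"caf\xc3\xa9 \xe9\n"

# With a distinct "raw byte" representation, decode followed by encode is lossless.
assert encode_preserving(decode_preserving(raw)) == raw

# With the Latin-1 conflation, the stray byte is written back as UTF-8 for U+00E9.
assert conflate_with_latin1(raw).encode("utf-8") == b"caf\xc3\xa9 \xc3\xa9\n"
```

The second assertion is the "decode and then encode and end up with a different string" case: once the stray byte has become an ordinary é, nothing distinguishes it from the é that was really in the text.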
* Re: Unibyte characters, strings, and buffers 2014-03-29 17:01 ` Nathan Trapuzzano @ 2014-03-29 17:08 ` Nathan Trapuzzano 2014-03-29 17:18 ` David Kastrup 2014-03-29 17:16 ` David Kastrup 1 sibling, 1 reply; 103+ messages in thread From: Nathan Trapuzzano @ 2014-03-29 17:08 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: Eli Zaretskii, monnier, emacs-devel Nathan Trapuzzano <nbtrap@nbtrap.com> writes: > For example, say we decode a stream of raw bytes as utf8, but that the > stream contains some non-utf8 sequences. Of course, most programming languages would simply refuse to decode by, e.g., throwing an exception. But that's not really appropriate for an editor. On one hand, you need some way to distinguish between characters and bytes, even if the distinction's not made by the type system; on the other hand, an _editor_ of all things should be able to deal with both kinds at the same time without the distinction being lost, and Emacs does a tremendous job at this IMO. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 17:08 ` Nathan Trapuzzano @ 2014-03-29 17:18 ` David Kastrup 2014-03-29 17:33 ` Nathan Trapuzzano 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-03-29 17:18 UTC (permalink / raw) To: emacs-devel Nathan Trapuzzano <nbtrap@nbtrap.com> writes: > Nathan Trapuzzano <nbtrap@nbtrap.com> writes: > >> For example, say we decode a stream of raw bytes as utf8, but that the >> stream contains some non-utf8 sequences. > > Of course, most programming languages would simply refuse to decode by, > e.g., throwing an exception. But that's not really appropriate for an > editor. On one hand, you need some way to distinguish between > characters and bytes, even if the distinction's not made by the type > system; on the other hand, an _editor_ of all things should be able to > deal with both kinds at the same time without the distinction being > lost, and Emacs does a tremendous job at this IMO. _De_coding into a _unibyte_ buffer is a lossy operation by definition since a unibyte buffer cannot hold the full set of values that _de_coding delivers. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 17:18 ` David Kastrup @ 2014-03-29 17:33 ` Nathan Trapuzzano 2014-03-30 0:24 ` Richard Stallman 0 siblings, 1 reply; 103+ messages in thread From: Nathan Trapuzzano @ 2014-03-29 17:33 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup <dak@gnu.org> writes: > _De_coding into a _unibyte_ buffer is a lossy operation by definition > since a unibyte buffer cannot hold the full set of values that > _de_coding delivers. I know. I was responding to what seemed to be a suggestion to just conflate Latin-1 characters with raw bytes. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 17:33 ` Nathan Trapuzzano @ 2014-03-30 0:24 ` Richard Stallman 2014-03-30 8:38 ` Andreas Schwab 0 siblings, 1 reply; 103+ messages in thread From: Richard Stallman @ 2014-03-30 0:24 UTC (permalink / raw) To: Nathan Trapuzzano; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] Maybe we should implement decoding unibyte text to produce multibyte text. * A function could decode text from a unibyte buffer and put it in another buffer which is multibyte. * A function could decode a whole unibyte buffer into the same buffer, and mark it as multibyte. For encoding, vice versa. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 0:24 ` Richard Stallman @ 2014-03-30 8:38 ` Andreas Schwab 2014-03-30 15:12 ` Richard Stallman 0 siblings, 1 reply; 103+ messages in thread From: Andreas Schwab @ 2014-03-30 8:38 UTC (permalink / raw) To: rms; +Cc: Nathan Trapuzzano, dak, emacs-devel Richard Stallman <rms@gnu.org> writes: > * A function could decode text from a unibyte buffer and put it in > another buffer which is multibyte. > > * A function could decode a whole unibyte buffer > into the same buffer, and mark it as multibyte. That's what decode-coding-region provides (except for changing the multibyte flag). Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-30 8:38 ` Andreas Schwab @ 2014-03-30 15:12 ` Richard Stallman 0 siblings, 0 replies; 103+ messages in thread From: Richard Stallman @ 2014-03-30 15:12 UTC (permalink / raw) To: Andreas Schwab; +Cc: nbtrap, dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] > * A function could decode text from a unibyte buffer and put it in > another buffer which is multibyte. > > * A function could decode a whole unibyte buffer > into the same buffer, and mark it as multibyte. That's what decode-coding-region provides (except for changing the multibyte flag). That "except" is the crucial point. Currently we need to access both unibyte text and multibyte text with the same setting of the multibyte flag. These two functions might eliminate the need for that. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 17:01 ` Nathan Trapuzzano 2014-03-29 17:08 ` Nathan Trapuzzano @ 2014-03-29 17:16 ` David Kastrup 1 sibling, 0 replies; 103+ messages in thread From: David Kastrup @ 2014-03-29 17:16 UTC (permalink / raw) To: emacs-devel Nathan Trapuzzano <nbtrap@nbtrap.com> writes: > "Stephen J. Turnbull" <stephen@xemacs.org> writes: > >> What is relevant is how to represent byte streams in Emacs. The >> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode >> characters. It is *extremely* convenient if the first 128 of those >> bytes correspond to the ASCII coded character set, because so many >> wire protocols use ASCII "words" syntactically. The other 128 don't >> matter much, so why not just use the extremely convenient Latin-1 set >> for them? > > Sorry if someone brought this up already, but one reason raw bytes > shouldn't be represented as Latin-1 characters is that the "raw > bytes"-ness would be lost when writing them back to disk if the stream > also contained characters outside the Latin-1 range. No. > For example, say we decode a stream of raw bytes as utf8, but that the > stream contains some non-utf8 sequences. IIUC, Emacs will interpret > those as "raw bytes", so that when it goes to encode the string to write > it back, they will be written back verbatim. "Raw bytes" here are represented as particular characters outside of the Unicode range. They are representable in multibyte buffers. They never were representable in unibyte buffers. While it is conceivable to map characters 128..255 in unibyte strings/buffers to the respective character codes outside of the Unicode range, that would render programmatic manipulation of bytes strenuous. > Whereas, if they had been interpreted as Latin-1 characters, they > would get written back as the UTF8 equivalents. Hence you have the > odd situation where you can decode and then encode and end up with a > different string. 
No, you can't unless you decode into a unibyte buffer, and then all bets are off regarding reencoding. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 10:28 ` Stephen J. Turnbull 2014-03-28 10:58 ` David Kastrup 2014-03-28 17:29 ` Eli Zaretskii @ 2014-03-28 18:45 ` Daniel Colascione 2014-03-28 19:35 ` Glenn Morris 2014-03-29 11:17 ` Stephen J. Turnbull 2 siblings, 2 replies; 103+ messages in thread From: Daniel Colascione @ 2014-03-28 18:45 UTC (permalink / raw) To: Stephen J. Turnbull, Eli Zaretskii; +Cc: monnier, emacs-devel On 03/28/2014 03:28 AM, Stephen J. Turnbull wrote: > Fortunately for me, I am *intimately* familiar with XEmacs internals, > and therefore RMS won't let me write this code for Emacs. :-) What now? People who have contributed to XEmacs can't contribute to Emacs? ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 18:45 ` Daniel Colascione @ 2014-03-28 19:35 ` Glenn Morris 2014-03-29 11:17 ` Stephen J. Turnbull 1 sibling, 0 replies; 103+ messages in thread From: Glenn Morris @ 2014-03-28 19:35 UTC (permalink / raw) To: Daniel Colascione; +Cc: emacs-devel Daniel Colascione wrote: > What now? People who have contributed to XEmacs can't contribute to Emacs? Of course they can; subject to the same conditions as anyone else. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-28 18:45 ` Daniel Colascione 2014-03-28 19:35 ` Glenn Morris @ 2014-03-29 11:17 ` Stephen J. Turnbull 2014-03-29 11:22 ` Eli Zaretskii 1 sibling, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 11:17 UTC (permalink / raw) To: Daniel Colascione; +Cc: Eli Zaretskii, monnier, emacs-devel Daniel Colascione writes: > On 03/28/2014 03:28 AM, Stephen J. Turnbull wrote: > > Fortunately for me, I am *intimately* familiar with XEmacs internals, > > and therefore RMS won't let me write this code for Emacs. :-) > > What now? People who have contributed to XEmacs can't contribute to Emacs? Not a problem, when put that way. However, I'm familiar with a specific implementation of the ideas that I describe. That implementation is not FSF-assigned, and therefore anything I write is tainted with the fear of copyright infringement if I claim it's mine but it looks like Ben's or Martin's. It would be possible, but somebody would have to spend a lot of time studying XEmacs and confirming nothing I wrote was an echo of code I'd studied. Then they'd be tainted by that knowledge .... What I can do freely is discuss design in general terms, and that's what I've done. This is all in the guidelines for reimplementers of non-GNU software. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:17 ` Stephen J. Turnbull @ 2014-03-29 11:22 ` Eli Zaretskii 2014-03-29 16:03 ` Stephen J. Turnbull 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-29 11:22 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: Eli Zaretskii <eliz@gnu.org>, > monnier@IRO.UMontreal.CA, > emacs-devel@gnu.org > Date: Sat, 29 Mar 2014 20:17:59 +0900 > > What I can do freely is discuss design in general terms I'm quite sure you can also describe the fine details of the implementation, as long as you don't describe that by posting the actual code. AFAIU, copyright protects only the form, not the ideas. Ideas can be described and discussed at any level of detail, because implementation of those same ideas by another person will never, except by improbable accident, be so close to the original as to be suspected of copying. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 11:22 ` Eli Zaretskii @ 2014-03-29 16:03 ` Stephen J. Turnbull 2014-03-31 15:22 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-03-29 16:03 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel Eli Zaretskii writes: > I'm quite sure you can also describe the fine details of the > implementation, as long as you don't describe that by posting the > actual code. No, that's not necessarily the case. At least in the U.S., the criteria are expressiveness, originality, and fixed in a medium. Email is such a medium. Obviously, design can be original. Design decisions are rarely dictated by the one feasible way to do it, and if not, design is an expressive act and subject to copyright. I don't know if Richard is still so cautious, but the above reasoning is why would-be contributors to GNU of work-alike software are advised to use different algorithms and data structures from the original in their implementations. > AFAIU, copyright protects only the form, not the ideas. Ideas can > be described and discussed at any level of detail, because > implementation of those same ideas by another person will never, > except by improbable accident, be so close to the original as to be > suspected of copying. Unfortunately, many cases that some observers believe involve independent invention in fact were resolved in favor of the plaintiff on the basis that the appearance was sufficiently similar, and the defendant couldn't prove non-copying.[1] Your "probability" argument doesn't hold up. Footnotes: [1] Copyright infringement is a tort, not a crime, here. Criminal infringement puts the burden of proof squarely on the prosecutor. Civil cases, however, are based on the "preponderance of evidence". ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-29 16:03 ` Stephen J. Turnbull @ 2014-03-31 15:22 ` Eli Zaretskii 2014-04-01 3:36 ` Stephen J. Turnbull 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-03-31 15:22 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Date: Sun, 30 Mar 2014 01:03:15 +0900 > Cc: dancol@dancol.org, monnier@IRO.UMontreal.CA, emacs-devel@gnu.org > > Eli Zaretskii writes: > > > AFAIU, copyright protects only the form, not the ideas. Ideas can > > be described and discussed at any level of detail, because > > implementation of those same ideas by another person will never, > > except by improbable accident, be so close to the original as to be > > suspected of copying. > > Unfortunately, many cases that some observers believe involve > independent invention in fact were resolved in favor of the plaintiff > on the basis that the appearance was sufficiently similar, and the > defendent couldn't prove non-copying. Your "probability" argument > doesn't hold up. Please show your references for that. IANAL, but just by reading related stuff on the Internet, I arrive to the opposite conclusion. For example, here are citations from the last part of http://en.wikipedia.org/wiki/Structure,_sequence_and_organization, which seem to uphold my understanding and contradict yours: Competitors may create programs that provide essentially the same functionality as a protected program as long as they do not copy the code. The trend has been for courts to say that even if there are non-literal SSO similarities, there must be proof of copying. Some relevant court decisions allow for reverse-engineering to discover ideas that are not subject to copyright within a protected program. The ideas can be implemented in a competing program as long as the developers do not copy the original expression. 
With a clean room design approach one team of engineers derives a functional specification from the original code, and then a second team uses that specification to design and build the new code. [...] The judge [in the Oracle v Google case] asked for [both Google and Oracle] to comment on a ruling by the European Court of Justice in a similar case that found "Neither the functionality of a computer program nor the programming language and the format of data files used in a computer program in order to exploit certain of its functions constitute a form of expression. Accordingly, they do not enjoy copyright protection." On 31 May 2012 the judge ruled that "So long as the specific code used to implement a method is different, anyone is free under the Copyright Act to write his or her own code to carry out exactly the same function or specification of any methods used in the Java API." ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-03-31 15:22 ` Eli Zaretskii @ 2014-04-01 3:36 ` Stephen J. Turnbull 2014-04-01 7:42 ` David Kastrup 2014-04-01 15:16 ` Eli Zaretskii 0 siblings, 2 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-04-01 3:36 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel Eli Zaretskii writes: > Please show your references for that. IANAL, but just by reading > related stuff on the Internet, I arrive to the opposite conclusion. Hey, I'm perfectly happy to go on that kind of evidence; the projects I mostly work on don't require assignment and I see no need for it. But we're talking here about Emacs, which is extremely careful about these things. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-01 3:36 ` Stephen J. Turnbull @ 2014-04-01 7:42 ` David Kastrup 2014-04-01 9:38 ` Stephen J. Turnbull 2014-04-01 15:19 ` Eli Zaretskii 2014-04-01 15:16 ` Eli Zaretskii 1 sibling, 2 replies; 103+ messages in thread From: David Kastrup @ 2014-04-01 7:42 UTC (permalink / raw) To: emacs-devel "Stephen J. Turnbull" <stephen@xemacs.org> writes: > Eli Zaretskii writes: > > > Please show your references for that. IANAL, but just by reading > > related stuff on the Internet, I arrive to the opposite conclusion. > > Hey, I'm perfectly happy to go on that kind of evidence; the projects > I mostly work on don't require assignment and I see no need for it. > But we're talking here about Emacs, which is extremely careful about > these things. Well, I remember a tense moment in XEmacs history where a major past contributor stated that he would rescind permission to redistribute his work in XEmacs when XEmacs was going to get relicensed under GPLv3 (I think it was GPLv3 but it may have been some other licensing change originating at GNU Emacs). XEmacs developers are on reasonably good speaking terms to resolve such a conflict. In particular if one can point to the FSF as being the "real" guilty party and external to the project. Emacs does not have that excuse. But that's tangential: you don't just have to secure the goodwill of important contributors. Given the current laws, you have to secure the goodwill of the contributors' heirs 90 years or something after their death, people who are not even born yet. Good luck with that. The single biggest deficiency that corporations have over single persons is that they are immortal. Nam Sibyllam quidem Cumis ego ipse oculis meis vidi in ampulla pendere, et cum illi pueri dicerent: Σίβυλλα τί θέλεις; respondebat illa: ἀποθανεῖν θέλω. 
Would it have been Walt Disney's will that many of the motion pictures of his youth are rotting away and getting irretrievably lost because the company bearing his name is fighting against legislation allowing them to be copied (and the costs recuperated by distribution) before they fall apart? What would he or other people think if they were told that the future of our cultural heritage and the laws governing it is determined between the two major competing power houses of Mickey Mouse and Bugs Bunny these days? How sad is that? At any rate, nobody knows what his heirs will do 90 years after his death. But corporations don't really die, and neither do contracts. And that gives Emacs the best shot we have not to be killed by lawyers a hundred years from now. Which makes it free to grow into something else, like culture should be able to and no longer can. Well, this mail has definitely grown into something else. Sue me. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-01 7:42 ` David Kastrup @ 2014-04-01 9:38 ` Stephen J. Turnbull 2014-04-01 15:19 ` Eli Zaretskii 1 sibling, 0 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-04-01 9:38 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel David Kastrup writes: > Sue me. Not me. I don't agree that it's worth worrying about, but I certainly don't deny that you and other Emacs contributors have the right to be concerned, and furthermore, the right to do something about it. And a wise man once said something along the lines of "Extremism in the defense of freedom is no vice." I heartily agree with that, even when I disagree with some of the extremists.<wink/> ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-01 7:42 ` David Kastrup 2014-04-01 9:38 ` Stephen J. Turnbull @ 2014-04-01 15:19 ` Eli Zaretskii 1 sibling, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-04-01 15:19 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Date: Tue, 01 Apr 2014 09:42:05 +0200 > > Well, this mail has definitely grown into something else. Indeed. To recall, the subject was whether communicating design and implementation ideas that get implemented by someone else necessarily makes all the participants of such discussions copyright holders of the code that is written based on the discussions. I very much hope that's not the case, because otherwise we better shut down this list, and fast. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-01 3:36 ` Stephen J. Turnbull 2014-04-01 7:42 ` David Kastrup @ 2014-04-01 15:16 ` Eli Zaretskii 2014-04-02 4:20 ` Stephen J. Turnbull 1 sibling, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-04-01 15:16 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: dancol@dancol.org, > monnier@IRO.UMontreal.CA, > emacs-devel@gnu.org > Date: Tue, 01 Apr 2014 12:36:45 +0900 > > Hey, I'm perfectly happy to go on that kind of evidence; the projects > I mostly work on don't require assignment and I see no need for it. > But we're talking here about Emacs, which is extremely careful about > these things. It would be madness IMO for Emacs to require legal paperwork from everyone who at some point participated in some design discussion here, which later got implemented, just because "design is expressive" and "email is a medium" that fixes that expressiveness. As a matter of fact, this is not currently required, which I interpret as an agreement with my understanding of the fine line that separates design ideas from actual code. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-01 15:16 ` Eli Zaretskii @ 2014-04-02 4:20 ` Stephen J. Turnbull 2014-04-02 17:06 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-04-02 4:20 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel Eli Zaretskii writes: > It would be madness IMO for Emacs to require legal paperwork from > everyone who at some point participated in some design discussion > here, which later got implemented, Of course that would be madness. What you're ignoring is that we're talking not just about participation in design discussion, but *also* implementation by a person who is intimately familiar with and participated in another implementation of the same feature with the same design that is not assigned, and is highly unlikely to ever be assigned. In that case if there were enough similarity that the FSF were taken to court and the case not dismissed immediately, the "it's just an accident" argument would not fly in court because it would be easy to show that I know a lot about the XEmacs implementation, and I personally would undoubtedly be at best greatly inconvenienced by being called to testify, at worst liable for damages (remember, in that case the FSF assignment makes me liable for FSF's court costs and damages, and that agreement doesn't contain mitigating circumstances like "in good faith" or "invited by Eli Z"). No, thank you. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-02 4:20 ` Stephen J. Turnbull @ 2014-04-02 17:06 ` Eli Zaretskii 2014-04-03 10:59 ` David Kastrup 2014-04-03 13:04 ` Stephen J. Turnbull 0 siblings, 2 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-04-02 17:06 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dancol, monnier, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: dancol@dancol.org, > monnier@IRO.UMontreal.CA, > emacs-devel@gnu.org > Date: Wed, 02 Apr 2014 13:20:40 +0900 > > Eli Zaretskii writes: > > In that case if there were enough similarity that the FSF were taken > to court and the case not dismissed immediately, the "it's just an > accident" argument would not fly in court because it would be easy > to show that I know a lot about the XEmacs implementation, and I > personally would undoubtedly be at best greatly inconvenienced by > being called to testify, at worst liable for damages (remember, in > that case the FSF assignment makes me liable for FSF's court costs > and damages, and that agreement doesn't contain mitigating > circumstances like "in good faith" or "invited by Eli Z"). > > No, thank you. My goal is not to convince you to do something you don't want to. The main issue here, at least for me, is not whether Mr. X wants to describe an existing implementation -- we obviously cannot do anything if he doesn't, no matter what are his reasons. The main issue here is, once Mr. X _did_ describe such an implementation, is it OK for someone else, who is not familiar with the actual code, to re-implement it from scratch, and then submit it to Emacs as their own, under assigned copyright. My conclusion from everything I know and read is that YES, it is OK. 
IOW, I'd like to avoid the situation where others here might become intimidated by what you wrote in a broader sense, and will as a result refrain from participating in discussions that reveal details of other implementations, or from assigning their code written based on those discussions. That would cause some real damage to Emacs. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-02 17:06 ` Eli Zaretskii @ 2014-04-03 10:59 ` David Kastrup 2014-04-03 16:07 ` Eli Zaretskii 2014-04-03 13:04 ` Stephen J. Turnbull 1 sibling, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-04-03 10:59 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > My goal is not to convince you to do something you don't want to. > > The main issue here, at least for me, is not whether Mr. X wants to > describe an existing implementation -- we obviously cannot do anything > if he doesn't, no matter what are his reasons. The main issue here > is, once Mr. X _did_ describe such an implementation, is it OK for > someone else, who is not familiar with the actual code, to > re-implement it from scratch, and then submit it to Emacs as their > own, under assigned copyright. My conclusion from everything I know > and read is that YES, it is OK. > > IOW, I'd like to avoid the situation where others here might become > intimidated by what you wrote in a broader sense, and will as result > refrain from participating in discussions that reveal details of other > implementations, or from assigning their code written based on those > discussions. That would cause some real damage to Emacs. Nobody claimed that the broken copyright system does not lead to a whole lot of real damage to a whole lot of software development. <URL:https://en.wikipedia.org/wiki/Sequence,_structure_and_organization> may be somewhat instructional about some current court practice in the U.S.A. Please note that Oracle/Google ruling is unfortunately somewhat atypical and on appeal (appeal hearing was in December) <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/> and that the FSF would not have been in a position to pay the kind of legal expenses incurred here. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 10:59 ` David Kastrup @ 2014-04-03 16:07 ` Eli Zaretskii 2014-04-03 16:26 ` David Kastrup 0 siblings, 1 reply; 103+ messages in thread From: Eli Zaretskii @ 2014-04-03 16:07 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Date: Thu, 03 Apr 2014 12:59:20 +0200 > > > IOW, I'd like to avoid the situation where others here might become > > intimidated by what you wrote in a broader sense, and will as result > > refrain from participating in discussions that reveal details of other > > implementations, or from assigning their code written based on those > > discussions. That would cause some real damage to Emacs. > > Nobody claimed that the broken copyright system does not lead to a whole > lot of real damage to a whole lot of software development. On this general level, I agree. However, I only talked about a very specific situation. In any case, the system being broken notwithstanding, we shouldn't see problems where none exist (yet). > <URL:https://en.wikipedia.org/wiki/Sequence,_structure_and_organization> > may be somewhat instructional about some current court practice in the > U.S.A. That's the URL from which I quoted a few messages ago. > Please note that Oracle/Google ruling is unfortunately somewhat > atypical and on appeal (appeal hearing was in December) > <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/> Even if you take this article at face value (as opposed to someone whose interests are unknown reiterating rumors), the conclusion is that jury is still out in this issue. Which is exactly what I wrote: this issue is not decided yet, and precedents are contradictory. > and that the FSF would not have been in a position to pay the kind of > legal expenses incurred here. If there is a precedent, you don't need to pay any expenses. 
Anyway, all this is only relevant if one of those who wrote the code that was discussed and reimplemented actually sues the FSF. Since such code almost always comes from Free Software, I don't think there's a danger of this. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 16:07 ` Eli Zaretskii @ 2014-04-03 16:26 ` David Kastrup 2014-04-03 19:11 ` Eli Zaretskii 0 siblings, 1 reply; 103+ messages in thread From: David Kastrup @ 2014-04-03 16:26 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/> > > Even if you take this article at face value (as opposed to someone > whose interests are unknown reiterating rumors), the conclusion is > that jury is still out in this issue. Which is exactly what I wrote: > this issue is not decided yet, and precedents are contradictory. > >> and that the FSF would not have been in a position to pay the kind of >> legal expenses incurred here. > > If there is a precedent, you don't need to pay any expenses. Nonsense. For most court cases there are precedents that are getting referenced. In the U.S., both sides have to pay their own legal expenses. Judges _may_ award legal costs to a defendant if the case was brought forward clearly frivolously and/or vexatiously. That is very rarely done. A successful defense will be expensive even in the rare case that the case is decided in summary judgment. > Anyway, this all is only relevant if someone of those who wrote the > code that was discussed and reimplemented actually sue the FSF. Since > such code almost always comes from Free Software, I don't think > there's a danger of this. If an employer of a non-assigned contributor is sued by the FSF over infringement of some FSF-copyrighted software, the whole case can get thrown out of court if the FSF is shown to have "dirty hands", namely to have incorporated code themselves that is legally under copyright by the employer. In the case of XEmacs, we are not necessarily talking about core developers highly sympathetic to the FSF. 
There is no playful element to the history of the Emacs/XEmacs schism like with the Emacs/vi "editor wars". The details of the complex Emacs/XEmacs relation aside, nobody should be blamed for choosing to err on the safe side. In particular since the copyright maximalists are pretty successful in eroding the safe side and moving the borderlines. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 16:26 ` David Kastrup @ 2014-04-03 19:11 ` Eli Zaretskii 2014-04-03 20:03 ` David Kastrup 2014-04-04 11:40 ` Richard Stallman 0 siblings, 2 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-04-03 19:11 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: emacs-devel@gnu.org > Date: Thu, 03 Apr 2014 18:26:38 +0200 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> <URL:http://arstechnica.com/tech-policy/2013/12/googles-copyright-win-against-oracle-is-in-danger-on-appeal/> > > > > Even if you take this article at face value (as opposed to someone > > whose interests are unknown reiterating rumors), the conclusion is > > that jury is still out in this issue. Which is exactly what I wrote: > > this issue is not decided yet, and precedents are contradictory. > > > >> and that the FSF would not have been in a position to pay the kind of > >> legal expenses incurred here. > > > > If there is a precedent, you don't need to pay any expenses. > > Nonsense. You misunderstood. I meant there would be no need to pay for creating a precedent where one already exists. > If an employer of a non-assigned contributor is sued by the FSF over > infringement of some FSF-copyrighted software, the whole case can get > thrown out of court if the FSF is shown to have "dirty hands", namely to > have incorporated code themselves that is legally under copyright by the > employer. If you are afraid to get into a road accident, stay inside. > In the case of XEmacs, we are not necessarily talking about core > developers highly sympathetic to the FSF. There is no playful element > to the history of the Emacs/XEmacs schism like with the Emacs/vi "editor > wars". The amount of code borrowed by XEmacs from Emacs is orders of magnitude larger than the other way around. So this is a red herring. > nobody should be blamed for choosing to err on the safe side. I never blamed anyone. 
People should know the true state of affairs, and then decide for themselves. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 19:11 ` Eli Zaretskii @ 2014-04-03 20:03 ` David Kastrup 2014-04-04 0:48 ` Stephen J. Turnbull 2014-04-04 7:58 ` Eli Zaretskii 2014-04-04 11:40 ` Richard Stallman 1 sibling, 2 replies; 103+ messages in thread From: David Kastrup @ 2014-04-03 20:03 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > The amount of code borrowed by XEmacs from Emacs is orders of > magnitude larger than the other way around. So this is a red herring. Magnitude does not really matter with "dirty hands". At any rate, you _are_ aware that Oracle sued Google for billions of dollars because of what amounted to 11 lines of code? They did not prevail at the first trial, but Google did not get attorney costs back, either, and the whole thing went into appeal with murky outlook. >> nobody should be blamed for choosing to err on the safe side. > > I never blamed anyone. People should know the true state of affairs, > and then decide for themselves. The true state of affairs is that the U.S. legal and political system does not leave much leeway for paranoia. It's as bad as imagination gets. -- David Kastrup ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 20:03 ` David Kastrup @ 2014-04-04 0:48 ` Stephen J. Turnbull 2014-04-04 8:08 ` Eli Zaretskii 2014-04-04 7:58 ` Eli Zaretskii 1 sibling, 1 reply; 103+ messages in thread From: Stephen J. Turnbull @ 2014-04-04 0:48 UTC (permalink / raw) To: David Kastrup; +Cc: Eli Zaretskii, emacs-devel David Kastrup writes: > Eli Zaretskii <eliz@gnu.org> writes: > > I never blamed anyone. People should know the true state of > > affairs, and then decide for themselves. Not in Emacs. It's not up to the individual contributor, it's a matter for project policy, ie, RMS as advised by the FSF legal dept. > The true state of affairs is that the U.S. legal and political system > does not leave much leeway for paranoia. It's as bad as imagination > gets. Oh, come on, David. A German writes this in a thread that a resident of Japan participates in? Have you no sense of history? Indeed, the reach of copyright and patent in the U.S. system has gone way beyond the bounds that even a Milton Friedman can sanction. But it's not hard to imagine worse, even in just that limited area of law. Bottom line: Eli's theoretical assessment of the "typical" risks involved seems pretty plausible to me. But in the worst case, things can get pretty bad, and it's easy to justify "legal paranoia" on the part of the FSF in managing software freedom of selected critical projects, including Emacs. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-04 0:48 ` Stephen J. Turnbull @ 2014-04-04 8:08 ` Eli Zaretskii 0 siblings, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-04-04 8:08 UTC (permalink / raw) To: Stephen J. Turnbull; +Cc: dak, emacs-devel > From: "Stephen J. Turnbull" <stephen@xemacs.org> > Cc: Eli Zaretskii <eliz@gnu.org>, > emacs-devel@gnu.org > Date: Fri, 04 Apr 2014 09:48:17 +0900 > > David Kastrup writes: > > Eli Zaretskii <eliz@gnu.org> writes: > > > > I never blamed anyone. People should know the true state of > > > affairs, and then decide for themselves. > > Not in Emacs. It's not up to the individual contributor, it's a > matter for project policy, ie, RMS as advised by the FSF legal dept. To some degree, yes. (Although I hear only deafening silence from those quarters about these matters.) But since it is me who signs the legal papers, and it is me who decides whether some code I submit under the assignment fits the FSF standards of what can be called "my original work", then I, too, am a part of this equation, and my decisions on these matters do count. > Bottom line: Eli's theoretical assessment of the "typical" risks > involved seems pretty plausible to me. But in the worst case, things > can get pretty bad, and it's easy to justify "legal paranoia" on the > part of the FSF in managing software freedom of selected critical > projects, including Emacs. I agree, FWIW. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 20:03 ` David Kastrup 2014-04-04 0:48 ` Stephen J. Turnbull @ 2014-04-04 7:58 ` Eli Zaretskii 1 sibling, 0 replies; 103+ messages in thread From: Eli Zaretskii @ 2014-04-04 7:58 UTC (permalink / raw) To: David Kastrup; +Cc: emacs-devel > From: David Kastrup <dak@gnu.org> > Cc: emacs-devel@gnu.org > Date: Thu, 03 Apr 2014 22:03:32 +0200 > > >> nobody should be blamed for choosing to err on the safe side. > > > > I never blamed anyone. People should know the true state of affairs, > > and then decide for themselves. > > The true state of affairs is that the U.S. legal and political system > does not leave much leeway for paranoia. It's as bad as imagination > gets. Even if they really are after you, it doesn't mean you need to become paranoid. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-03 19:11 ` Eli Zaretskii 2014-04-03 20:03 ` David Kastrup @ 2014-04-04 11:40 ` Richard Stallman 1 sibling, 0 replies; 103+ messages in thread From: Richard Stallman @ 2014-04-04 11:40 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dak, emacs-devel [[[ To any NSA and FBI agents reading my email: please consider ]]] [[[ whether defending the US Constitution against all enemies, ]]] [[[ foreign or domestic, requires you to follow Snowden's example. ]]] A discussion of the general issue of GPL enforcement is outside of the purpose of emacs-devel. The FSF studies this with lawyers, which is the useful way to do it. -- Dr Richard Stallman President, Free Software Foundation 51 Franklin St Boston MA 02110 USA www.fsf.org www.gnu.org Skype: No way! That's nonfree (freedom-denying) software. Use Ekiga or an ordinary phone call. ^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: Unibyte characters, strings, and buffers 2014-04-02 17:06 ` Eli Zaretskii 2014-04-03 10:59 ` David Kastrup @ 2014-04-03 13:04 ` Stephen J. Turnbull 1 sibling, 0 replies; 103+ messages in thread From: Stephen J. Turnbull @ 2014-04-03 13:04 UTC (permalink / raw) To: Eli Zaretskii; +Cc: dancol, monnier, emacs-devel Eli Zaretskii writes: > The main issue here, at least for me is, once Mr. X _did_ describe > such an implementation, is it OK for someone else, who is not > familiar with the actual code, to re-implement it from scratch, and > then submit it to Emacs as their own, under assigned copyright. My > conclusion from everything I know and read is that YES, it is OK. I'd risk it. But it's not the classic "clean-room" reimplementation where the behavior of the original in response to various inputs (vs. "internal structure" etc) is used as a specification (vs. "design") for the clone. For Emacs, you'd have to ask an FSF lawyer. ^ permalink raw reply [flat|nested] 103+ messages in thread